{"id":988,"date":"2025-06-06T18:45:01","date_gmt":"2025-06-06T18:45:01","guid":{"rendered":"https:\/\/web.eecs.umich.edu\/~girasole\/?p=988"},"modified":"2025-06-06T18:45:01","modified_gmt":"2025-06-06T18:45:01","slug":"monarch-attention","status":"publish","type":"post","link":"https:\/\/web.eecs.umich.edu\/~girasole\/?p=988","title":{"rendered":"Monarch Attention"},"content":{"rendered":"\n<p>The attention module in transformer architectures is often the most computation and memory intensive unit. Many researchers have tried different ways to approximate softmax attention in a compute efficient way. We have a new approach that uses the Monarch matrix structure along with variational softmax to quickly and accurately approximate softmax attention in a zero-shot setting. The results are very exciting &#8212; we can significantly decrease the compute and memory requirements while taking at most a small hit to performance. This figure shows the performance versus computation of our &#8220;Monarch-Attention&#8221; method as compared to Flash Attention 2 (listed as &#8220;softmax&#8221;) and other fast approximations. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/web.eecs.umich.edu\/~girasole\/wp-content\/uploads\/2025\/vit_roberta_results.png\" alt=\"\"\/><\/figure>\n\n\n\n<p>See the paper for additional results, including hardware benchmarking against Flash Attention 2 on several sequence lengths.<\/p>\n\n\n\n<p>Can Yaras, Alec S. Xu, Pierre Abillama, Changwoo Lee, Laura Balzano. &#8220;MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention.&#8221;<br><a href=\"https:\/\/arxiv.org\/abs\/2505.18698\">https:\/\/arxiv.org\/abs\/2505.18698<\/a><br><a href=\"https:\/\/github.com\/cjyaras\/monarch-attention\">Code can be found here.<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The attention module in transformer architectures is often the most computation and memory intensive unit. Many researchers have tried different ways to approximate softmax attention in a compute efficient way. 