FlashAttention · Glossary · AI Papers: A Deep Dive

Definition

Plain language

A heavily optimized way to compute attention on GPUs that uses memory more carefully.

As stated in the literature

A fused-kernel implementation of exact attention that reduces HBM traffic by tiling and recomputation, dramatically lowering memory and improving throughput.

Also called: FlashAttention-2

Why it matters: It made long-context transformers practical by removing a memory wall, and is now the default attention kernel in most training stacks.

For example, swapping a standard attention implementation for FlashAttention can cut a long-context training run's memory use and speed it up noticeably without changing what the model computes.

Heard on the show

“But the more interesting comparison is to FlashAttention — the highly optimized, hardware-aware dense attention kernel that's basically the baseline everyone competes against in inference work.”

Episode 036 — Sparse Attention Was the Wrong Frame. Treat It as Geometry Instead.

Mentioned in 2 episodes

Related terms

attention kernel throughput