Glossary · Term

FlashAttention

← all terms

Definition

A heavily optimized way to compute attention on GPUs that uses memory more carefully.

A fused-kernel implementation of exact attention that reduces HBM traffic by tiling and recomputation, dramatically lowering memory and improving throughput.

Also called: FlashAttention-2

Mentioned in 2 episodes

  1. 036
    Sparse Attention Was the Wrong Frame. Treat It as Geometry Instead.
  2. 027
    When AI Agents Build the Serving Stack: A Bet on Bespoke Infrastructure