Definition
A faster way to compute attention on GPUs that gives the same result as standard attention while moving far less data in and out of GPU memory.
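A fused-kernel implementation of exact (not approximate) attention that reduces traffic to GPU high-bandwidth memory (HBM) by computing attention in tiles held in on-chip SRAM and by recomputing attention scores in the backward pass instead of storing them. Because the full N x N attention matrix is never materialized, extra memory grows linearly rather than quadratically with sequence length, and throughput improves because standard attention is limited by memory bandwidth rather than arithmetic.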
See also: FlashAttention-2 (a later version of the algorithm, not a synonym)
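Example
The tiling idea in the definition can be illustrated with a minimal NumPy sketch. It is not the actual GPU kernel: it runs on the CPU, tiles only over keys and values (a real kernel also tiles queries and keeps each tile in on-chip SRAM), and the names tiled_attention, naive_attention, and block are hypothetical, chosen for this illustration. What it does show is the running-softmax trick that lets each block of scores be processed and discarded while still producing the exact result.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference: builds the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    """Exact attention computed one key/value block at a time.

    Only an (n_q, block) slice of scores exists at any moment.
    `block` is an illustrative parameter, not a real library option.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[1]))   # unnormalized output accumulator
    row_max = np.full(n_q, -np.inf)     # running max of scores per query
    row_sum = np.zeros(n_q)             # running softmax denominator

    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = (q @ k_blk.T) * scale       # scores for this block only

        new_max = np.maximum(row_max, s.max(axis=1))
        # Rescale what was accumulated under the old max.
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max[:, None])

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        row_max = new_max

    return out / row_sum[:, None]
```

Comparing tiled_attention against naive_attention on random inputs with np.allclose is a quick way to confirm that the blockwise version is exact up to floating-point rounding, not an approximation.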