Original transformers from 2017 were slow and used too much memory. These innovations are optimizations that let us build models that are bigger, faster, and can handle longer text.
GQA (grouped-query attention) shares keys and values across groups of query heads. Instead of each "brain" having its own memory, several share a smaller pool.
Multi-head attention (MHA): every query head gets its own K/V. Lots of memory!
Multi-query attention (MQA): one K/V shared by all heads. Fast but loses quality
GQA: a few shared K/V groups. Best of both!
4-8× less KV cache memory. For 128K context, saves 100GB+ VRAM.
2-3× faster. Used by Llama 3, Qwen 2, Mistral.
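A minimal numpy sketch of the idea (toy sizes assumed, not any model's real config): 8 query heads, but only 2 K/V heads in the cache, so each group of 4 query heads attends through the same shared keys and values.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Grouped-query attention: groups of query heads share one K/V head."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads          # query heads per shared K/V head

    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)  # cached: n_kv_heads, not n_q_heads
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)  # -> 4x smaller KV cache here

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                      # which shared K/V head this query uses
        scores = (q[:, h] @ k[:, kv].T) / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)
```

The only change from standard multi-head attention is the `kv = h // group` lookup: the K/V tensors shrink from `n_q_heads` to `n_kv_heads` slots, which is exactly where the cache savings come from.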
MLA (multi-head latent attention), from DeepSeek, caches a single compressed "latent" vector per token that is decompressed back into the full keys and values when needed. Think of it like a compressed file that expands on demand.
Standard attention: each token caches separate K and V for every head
MLA: one small latent per token → decompresses to all of them
80% less KV memory | 2× throughput | Matches quality with 18% of parameters
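A toy numpy sketch of the latent trick (sizes and weight names are illustrative assumptions, not DeepSeek's actual configuration): only the small latent `c` is cached; K and V are rebuilt from it at attention time.

```python
import numpy as np

d_model, d_latent = 64, 8                             # assumed toy sizes

rng = np.random.default_rng(0)
w_down = rng.normal(size=(d_model, d_latent)) * 0.1   # compress into the latent
w_uk   = rng.normal(size=(d_latent, d_model)) * 0.1   # decompress latent -> K
w_uv   = rng.normal(size=(d_latent, d_model)) * 0.1   # decompress latent -> V

def mla_cache_and_expand(x):
    """Cache only the tiny latent c; rebuild K and V from it when attending."""
    c = x @ w_down          # (seq, 8)  -- this is the entire per-layer cache
    k = c @ w_uk            # (seq, 64) -- reconstructed on the fly
    v = c @ w_uv            # (seq, 64)
    return c, k, v

x = rng.normal(size=(5, d_model))
c, k, v = mla_cache_and_expand(x)
# cache: 5*8 floats for c vs 2*5*64 for separate K and V -- 16x smaller here
```

The savings fall straight out of the shapes: the cache stores `d_latent` numbers per token instead of `2 * d_model`, and the up-projections run as part of the attention compute.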
MoE (mixture of experts) has many specialized "experts" but activates only a couple at a time. It's like having 8 specialists on staff but only paying for 2 per job.
Only 2 experts activate per token!
Mixtral 8×7B has 45B params but runs like 12B.
Different experts can specialize: math, coding, languages.
Only 2 experts activate = constant compute regardless of total params.
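The routing above can be sketched in a few lines of numpy (toy sizes and random "experts" assumed): a router scores all 8 experts per token, but only the top 2 actually run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 16, 8

# 8 expert FFNs (here just single matrices), but only 2 run per token
expert_w = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts)) * 0.1

def moe_top2(x):
    """Top-2 routing: per token, pick the 2 best experts and mix their outputs."""
    logits = x @ router_w                            # router scores all experts
    out = np.zeros_like(x)
    expert_calls = 0
    for t in range(x.shape[0]):
        top2 = np.argsort(logits[t])[-2:]            # indices of the 2 winners
        g = np.exp(logits[t, top2] - logits[t, top2].max())
        g /= g.sum()                                 # softmax gate over the 2 winners
        for gate, e in zip(g, top2):
            out[t] += gate * (x[t] @ expert_w[e])    # only these 2 matmuls execute
            expert_calls += 1
    return out, expert_calls

x = rng.normal(size=(6, d))
y, n_calls = moe_top2(x)   # 6 tokens × 2 experts = 12 expert calls, not 6 × 8 = 48
```

Compute per token stays fixed at 2 expert evaluations no matter how many experts exist, which is why total parameter count and inference cost decouple.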
Standard attention is a bottleneck: all information must flow through it. Residuals add "shortcuts" that skip the attention step entirely.
Alternating "attention-free" stripes improve long-context by 4×.
Hybrid models that replace most attention with linear state-space models.
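A rough numpy sketch of the hybrid idea, under assumptions of my own choosing (a simple decaying recurrence standing in for the state-space mixer, and a 1-in-4 attention ratio): most layers mix tokens with a linear scan that needs no seq×seq score matrix, and full attention appears only occasionally.

```python
import numpy as np

def linear_mix(x, decay=0.9):
    """Attention-free token mixer: h_t = decay * h_{t-1} + x_t.
    O(seq) time with O(1) state -- no seq×seq attention matrix at all."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt      # running summary of everything seen so far
        out[t] = h
    return out

# a 12-layer hybrid stack: attention on every 4th layer (illustrative ratio)
layers = ["attention" if i % 4 == 0 else "linear" for i in range(12)]
```

Because the linear layers carry a fixed-size state instead of a growing KV cache, stretching the context mostly costs extra time in the few remaining attention layers.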