Original transformers from 2017 were slow and used too much memory. These innovations are optimizations that let us build models that are bigger, faster, and can handle longer text.
GQA (grouped-query attention) shares keys and values across groups of query heads. Instead of each "brain" having its own memory, several share a smaller pool.
Multi-head attention (MHA): every query head gets its own K/V. Lots of memory!
Multi-query attention (MQA): one K/V shared by all heads. Fast but loses quality
GQA: a few shared K/V groups. Best of both!
4-8× less KV cache memory. For 128K context, saves 100GB+ VRAM.
2-3× faster. Used by Llama 3, Qwen 2, Mistral.
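A minimal numpy sketch of the idea (toy sizes assumed, not any model's real config): 8 query heads, but only 2 K/V heads in the cache, so each group of 4 query heads attends through the same shared keys and values.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Grouped-query attention: groups of query heads share one K/V head."""
    seq, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads          # query heads per shared K/V head

    q = (x @ wq).reshape(seq, n_q_heads, d_head)
    k = (x @ wk).reshape(seq, n_kv_heads, d_head)  # cached: n_kv_heads, not n_q_heads
    v = (x @ wv).reshape(seq, n_kv_heads, d_head)  # -> 4x smaller KV cache here

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                      # which shared K/V head this query uses
        scores = (q[:, h] @ k[:, kv].T) / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)
```

The only change from standard multi-head attention is the `kv = h // group` lookup: the K/V tensors shrink from `n_q_heads` to `n_kv_heads` slots, which is exactly where the cache savings come from.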
MLA (multi-head latent attention), from DeepSeek, caches a single compressed "latent" vector per token that is decompressed back into the full keys and values when needed. Think of it like a compressed file that expands on demand.
Standard attention: each token caches separate K and V for every head
MLA: one small latent per token → decompresses to all of them
80% less KV memory | 2× throughput | Matches quality with 18% of parameters
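A toy numpy sketch of the latent trick (sizes and weight names are illustrative assumptions, not DeepSeek's actual configuration): only the small latent `c` is cached; K and V are rebuilt from it at attention time.

```python
import numpy as np

d_model, d_latent = 64, 8                             # assumed toy sizes

rng = np.random.default_rng(0)
w_down = rng.normal(size=(d_model, d_latent)) * 0.1   # compress into the latent
w_uk   = rng.normal(size=(d_latent, d_model)) * 0.1   # decompress latent -> K
w_uv   = rng.normal(size=(d_latent, d_model)) * 0.1   # decompress latent -> V

def mla_cache_and_expand(x):
    """Cache only the tiny latent c; rebuild K and V from it when attending."""
    c = x @ w_down          # (seq, 8)  -- this is the entire per-layer cache
    k = c @ w_uk            # (seq, 64) -- reconstructed on the fly
    v = c @ w_uv            # (seq, 64)
    return c, k, v

x = rng.normal(size=(5, d_model))
c, k, v = mla_cache_and_expand(x)
# cache: 5*8 floats for c vs 2*5*64 for separate K and V -- 16x smaller here
```

The savings fall straight out of the shapes: the cache stores `d_latent` numbers per token instead of `2 * d_model`, and the up-projections run as part of the attention compute.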
MoE (mixture of experts) has many specialized "experts" but activates only a couple at a time. It's like having 8 specialists on staff but only paying for 2 per job.
Only 2 experts activate per token!
Mixtral 8×7B has 45B params but runs like 12B.
Different experts can specialize: math, coding, languages.
Only 2 experts activate = constant compute regardless of total params.
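The routing above can be sketched in a few lines of numpy (toy sizes and random "experts" assumed): a router scores all 8 experts per token, but only the top 2 actually run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 16, 8

# 8 expert FFNs (here just single matrices), but only 2 run per token
expert_w = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts)) * 0.1

def moe_top2(x):
    """Top-2 routing: per token, pick the 2 best experts and mix their outputs."""
    logits = x @ router_w                            # router scores all experts
    out = np.zeros_like(x)
    expert_calls = 0
    for t in range(x.shape[0]):
        top2 = np.argsort(logits[t])[-2:]            # indices of the 2 winners
        g = np.exp(logits[t, top2] - logits[t, top2].max())
        g /= g.sum()                                 # softmax gate over the 2 winners
        for gate, e in zip(g, top2):
            out[t] += gate * (x[t] @ expert_w[e])    # only these 2 matmuls execute
            expert_calls += 1
    return out, expert_calls

x = rng.normal(size=(6, d))
y, n_calls = moe_top2(x)   # 6 tokens × 2 experts = 12 expert calls, not 6 × 8 = 48
```

Compute per token stays fixed at 2 expert evaluations no matter how many experts exist, which is why total parameter count and inference cost decouple.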
Standard attention is a bottleneck: all information must flow through it. Residuals add "shortcuts" that skip the attention step entirely.
Alternating "attention-free" stripes improve long-context by 4×.
Hybrid models that replace most attention with linear state-space models.
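A rough numpy sketch of the hybrid idea, under assumptions of my own choosing (a simple decaying recurrence standing in for the state-space mixer, and a 1-in-4 attention ratio): most layers mix tokens with a linear scan that needs no seq×seq score matrix, and full attention appears only occasionally.

```python
import numpy as np

def linear_mix(x, decay=0.9):
    """Attention-free token mixer: h_t = decay * h_{t-1} + x_t.
    O(seq) time with O(1) state -- no seq×seq attention matrix at all."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt      # running summary of everything seen so far
        out[t] = h
    return out

# a 12-layer hybrid stack: attention on every 4th layer (illustrative ratio)
layers = ["attention" if i % 4 == 0 else "linear" for i in range(12)]
```

Because the linear layers carry a fixed-size state instead of a growing KV cache, stretching the context mostly costs extra time in the few remaining attention layers.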