How the model decides which words matter.
When reading a sentence, you focus on some words more than others. Attention works the same way in AI: it lets each word "look at" every other word to figure out which ones matter for understanding the current word.
When processing an ambiguous word like "bank", the model looks back at all the other words to resolve its meaning from context.
For each word, create a "query": what am I looking for?
Compare that query against every other word's "key": what do I contain?
Focus on the best matches and blend their "values" (meanings) into a new representation.
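The three steps above are scaled dot-product attention. A minimal NumPy sketch, with random vectors standing in for the learned query/key/value projections of a real model:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: match queries to keys, then blend values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # how well each query matches each key
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # softmax: focus on the best matches
    return w @ V                                      # weighted blend of the values

# toy example: 4 words, embedding size 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one context-aware vector per word
```

Each output row is a mixture of the value vectors, weighted by how relevant each word was to the query.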
The model runs multiple attention "heads" in parallel, each examining a different aspect of the sentence. It is like having 8 different readers analyzing the text for grammar, tone, facts, and so on.
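A sketch of the multi-head idea: split the embedding into chunks, let each chunk attend on its own, then stitch the results back together. (Real models apply learned Q/K/V projections per head; here each head just reuses its slice of the input, purely for illustration.)

```python
import numpy as np

def multi_head_attention(X, n_heads):
    """Each head attends over its own slice of the embedding, independently."""
    seq, d = X.shape
    hd = d // n_heads
    heads = []
    for h in range(n_heads):
        Q = K = V = X[:, h * hd:(h + 1) * hd]   # sketch: real models use learned projections
        scores = Q @ K.T / np.sqrt(hd)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1)       # merge the 8 "readers" back together

X = np.random.default_rng(1).normal(size=(4, 64))
print(multi_head_attention(X, 8).shape)  # (4, 64): same shape, richer representation
```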
Multi-Query Attention (MQA): all heads share a single key-value cache. This drastically reduces the memory needed, allowing the model to generate text much faster, with throughput gains reported up to around 10x.
Grouped-Query Attention (GQA): a smart compromise. Query heads are grouped into clusters, each sharing one key-value head. It retains most of the quality of full multi-head attention at close to MQA's speed.
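The memory savings come straight from the arithmetic: the KV cache scales with the number of key-value heads. A back-of-the-envelope sketch, using hypothetical model dimensions (32 query heads, head size 128, 8k context, fp16):

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-layer KV cache: keys + values for every cached token."""
    return 2 * seq_len * n_kv_heads * head_dim * dtype_bytes

seq, hd = 8192, 128
mha = kv_cache_bytes(seq, 32, hd)  # multi-head: every head keeps its own K/V
gqa = kv_cache_bytes(seq, 8, hd)   # grouped: 8 shared K/V heads
mqa = kv_cache_bytes(seq, 1, hd)   # multi-query: one K/V shared by all heads
print(mha // gqa, mha // mqa)      # 4 32: cache shrinks by the grouping factor
```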
Sparse attention: the model doesn't attend to every previous word, only to the important ones or to a local window. This allows massive, book-sized inputs.
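One common sparse pattern is a sliding window: each token attends only to itself and the few tokens just before it, so the work per token stays constant no matter how long the input grows. A minimal mask sketch:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where each token sees only the last `window` tokens."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = True   # self plus the window-1 tokens before it
    return mask

m = sliding_window_mask(6, 3)
print(m.sum(axis=1))  # [1 2 3 3 3 3]: each row attends to at most 3 positions
```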
State Space Models (like Mamba) digest the text in a single pass and discard the raw tokens, keeping only a compressed, fixed-size state. Effectively unbounded context with constant memory.
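The core of the idea is a linear recurrence: fold each token into a fixed-size state, emit an output, and move on. A deliberately tiny scalar sketch with made-up constants A, B, C (real SSMs use learned, input-dependent matrices):

```python
def ssm_scan(xs, A=0.9, B=0.1, C=1.0):
    """Linear state-space recurrence: one pass, constant-size state."""
    h = 0.0
    ys = []
    for x in xs:            # single pass over the sequence
        h = A * h + B * x   # fold the new token into the compressed state
        ys.append(C * h)    # output depends only on the state, not raw history
    return ys

ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
print([round(y, 4) for y in ys])  # [0.1, 0.09, 0.081, 0.0729]
```

Note how the first token's influence persists in the state but decays geometrically: the model "remembers" a summary, not the raw data.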
Evolution of Efficiency