A transformer block is like a single "step" in processing. It has two parts:
1) Attention: words talk to each other and exchange information.
2) FFN: a feed-forward network that processes each word independently to add knowledge.
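Those two parts can be sketched in a few lines of numpy. This is a minimal, illustrative single-head version with made-up toy sizes and no learned norm parameters, not a production implementation:

```python
import numpy as np

def norm(x, eps=1e-5):
    # LayerNorm over the feature dimension (learned scale/shift omitted for brevity)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Part 1: words exchange information via attention weights
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (scores @ v) @ Wo

def ffn(x, W_up, W_down):
    # Part 2: each word is processed independently (expand -> nonlinearity -> compress)
    return np.maximum(x @ W_up, 0) @ W_down

def transformer_block(x, attn_w, ffn_w):
    x = x + attention(norm(x), *attn_w)  # pre-norm residual connection
    x = x + ffn(norm(x), *ffn_w)
    return x

# toy sizes: 4 words, model dim 8, FFN hidden dim 32
rng = np.random.default_rng(0)
d, d_ff, T = 8, 32, 4
attn_w = [rng.normal(0, 0.1, (d, d)) for _ in range(4)]
ffn_w = [rng.normal(0, 0.1, (d, d_ff)), rng.normal(0, 0.1, (d_ff, d))]
x = rng.normal(size=(T, d))
out = transformer_block(x, attn_w, ffn_w)
print(out.shape)  # same shape in, same shape out: (4, 8)
```

Note that the block maps a `(tokens, dim)` array to an array of the same shape, which is what lets you stack many blocks on top of each other.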
The Feed-Forward Network holds the majority of the parameters, roughly 57% of each block (see the table below)! It's simple but powerful:
It expands the representation, processes it, then compresses it back. This is where the model "thinks" about what each word means.
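The expand-then-compress shape is easy to see from the matrix dimensions. A toy-scale sketch (the sizes here are made up for illustration; 7B-class models use something like 4096 → ~11000 → 4096):

```python
import numpy as np

d_model, d_ff = 64, 256  # assumed toy sizes: 4x expansion
rng = np.random.default_rng(1)
W_up = rng.normal(0, 0.02, (d_model, d_ff))    # expand: 64 -> 256
W_down = rng.normal(0, 0.02, (d_ff, d_model))  # compress: 256 -> 64

def ffn(x):
    h = np.maximum(x @ W_up, 0)  # expanded, then passed through a nonlinearity (ReLU here)
    return h @ W_down            # compressed back to the model width

x = rng.normal(size=(1, d_model))  # one word's vector
h = np.maximum(x @ W_up, 0)
print(x.shape, h.shape, ffn(x).shape)  # (1, 64) (1, 256) (1, 64)
```

The wide hidden layer is where the extra capacity lives; the input and output stay at the model width so the FFN slots cleanly into the residual stream.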
Attention routes information to the right places, but the FFN stores the actual knowledge. That's why it needs most of the parameters.
There are two common ways to wire the residual connection and normalization.

Pre-norm (used by most modern models):

```
output = x + attention(norm(x))
```

Post-norm (the original Transformer):

```
output = norm(x + attention(x))
```

| Part | Params (7B model) | % of Block |
|---|---|---|
| Attention | ~67M | ~42% |
| FFN | ~90M | ~57% |
| LayerNorms | ~16K | <1% |
| Total per block | ~157M | 100% |
× 32 layers ≈ 5B parameters just in the blocks of a 7B model! The remaining parameters sit in the token embeddings and the output projection.
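The table's numbers can be checked with back-of-the-envelope arithmetic. The sizes below are assumptions for a 7B-class model (model dim 4096, FFN hidden size 11008, a 2-matrix non-gated FFN, and norms with a scale and shift each); real architectures vary, e.g. gated FFNs use three matrices and count somewhat higher:

```python
d_model, d_ff, n_layers = 4096, 11008, 32  # assumed 7B-class sizes

attn = 4 * d_model * d_model   # Wq, Wk, Wv, Wo projections
ffn = 2 * d_model * d_ff       # up-projection + down-projection
norms = 2 * 2 * d_model        # two norms per block, scale + shift each

per_block = attn + ffn + norms
total = n_layers * per_block

print(f"attention: {attn / 1e6:.0f}M ({attn / per_block:.0%})")
print(f"ffn:       {ffn / 1e6:.0f}M ({ffn / per_block:.0%})")
print(f"norms:     {norms / 1e3:.0f}K")
print(f"per block: {per_block / 1e6:.0f}M, x{n_layers} layers = {total / 1e9:.1f}B")
```

Running this reproduces the ~67M / ~90M / ~16K split from the table and a ~5B total across the 32 blocks.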