A transformer block is like a single "step" in processing. It has two parts:
1) Attention: words talk to each other and exchange information.
2) FFN: a feed-forward network that processes each word independently to add knowledge.
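Those two parts can be sketched in a few lines of numpy. This is a minimal, illustrative single-head version with made-up toy sizes and no learned norm parameters, not a production implementation:

```python
import numpy as np

def norm(x, eps=1e-5):
    # LayerNorm over the feature dimension (learned scale/shift omitted for brevity)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Part 1: words exchange information via attention weights
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (scores @ v) @ Wo

def ffn(x, W_up, W_down):
    # Part 2: each word is processed independently (expand -> nonlinearity -> compress)
    return np.maximum(x @ W_up, 0) @ W_down

def transformer_block(x, attn_w, ffn_w):
    x = x + attention(norm(x), *attn_w)  # pre-norm residual connection
    x = x + ffn(norm(x), *ffn_w)
    return x

# toy sizes: 4 words, model dim 8, FFN hidden dim 32
rng = np.random.default_rng(0)
d, d_ff, T = 8, 32, 4
attn_w = [rng.normal(0, 0.1, (d, d)) for _ in range(4)]
ffn_w = [rng.normal(0, 0.1, (d, d_ff)), rng.normal(0, 0.1, (d_ff, d))]
x = rng.normal(size=(T, d))
out = transformer_block(x, attn_w, ffn_w)
print(out.shape)  # same shape in, same shape out: (4, 8)
```

Note that the block maps a `(tokens, dim)` array to an array of the same shape, which is what lets you stack many blocks on top of each other.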
The Feed-Forward Network holds the majority of the parameters, roughly 57% of each block (see the table below)! It's simple but powerful:
It expands the representation, processes it, then compresses it back. This is where the model "thinks" about what each word means.
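The expand-then-compress shape is easy to see from the matrix dimensions. A toy-scale sketch (the sizes here are made up for illustration; 7B-class models use something like 4096 → ~11000 → 4096):

```python
import numpy as np

d_model, d_ff = 64, 256  # assumed toy sizes: 4x expansion
rng = np.random.default_rng(1)
W_up = rng.normal(0, 0.02, (d_model, d_ff))    # expand: 64 -> 256
W_down = rng.normal(0, 0.02, (d_ff, d_model))  # compress: 256 -> 64

def ffn(x):
    h = np.maximum(x @ W_up, 0)  # expanded, then passed through a nonlinearity (ReLU here)
    return h @ W_down            # compressed back to the model width

x = rng.normal(size=(1, d_model))  # one word's vector
h = np.maximum(x @ W_up, 0)
print(x.shape, h.shape, ffn(x).shape)  # (1, 64) (1, 256) (1, 64)
```

The wide hidden layer is where the extra capacity lives; the input and output stay at the model width so the FFN slots cleanly into the residual stream.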
Attention routes information to the right places, but the FFN stores the actual knowledge. That's why it needs most of the parameters.
There are two common ways to wire the residual connection and normalization.

Pre-norm (used by most modern models):

```
output = x + attention(norm(x))
```

Post-norm (the original Transformer):

```
output = norm(x + attention(x))
```

| Part | Params (7B model) | % of Block |
|---|---|---|
| Attention | ~67M | ~42% |
| FFN | ~90M | ~57% |
| LayerNorms | ~16K | <1% |
| Total per block | ~157M | 100% |
× 32 layers ≈ 5B parameters just in the blocks of a 7B model! The remaining parameters sit in the token embeddings and the output projection.
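The table's numbers can be checked with back-of-the-envelope arithmetic. The sizes below are assumptions for a 7B-class model (model dim 4096, FFN hidden size 11008, a 2-matrix non-gated FFN, and norms with a scale and shift each); real architectures vary, e.g. gated FFNs use three matrices and count somewhat higher:

```python
d_model, d_ff, n_layers = 4096, 11008, 32  # assumed 7B-class sizes

attn = 4 * d_model * d_model   # Wq, Wk, Wv, Wo projections
ffn = 2 * d_model * d_ff       # up-projection + down-projection
norms = 2 * 2 * d_model        # two norms per block, scale + shift each

per_block = attn + ffn + norms
total = n_layers * per_block

print(f"attention: {attn / 1e6:.0f}M ({attn / per_block:.0%})")
print(f"ffn:       {ffn / 1e6:.0f}M ({ffn / per_block:.0%})")
print(f"norms:     {norms / 1e3:.0f}K")
print(f"per block: {per_block / 1e6:.0f}M, x{n_layers} layers = {total / 1e9:.1f}B")
```

Running this reproduces the ~67M / ~90M / ~16K split from the table and a ~5B total across the 32 blocks.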