Three different numbers run into the trillions in every frontier-model launch announcement. They mean different things. This is the decoder.
DeepSeek V4 launches at 1.6T parameters. Anthropic's Mythos is rumoured at 10T parameters. Llama 3 was trained on 15T tokens. GPT-3 had 175B parameters and saw 300B tokens.
Three different quantities. They get reported with the same suffixes (B, T), and they all sound like "size of the model." They are not the same.
A parameter is one learned number — a single weight inside one of the model's matrices. A 1.6T-parameter model has 1.6 trillion such numbers, each adjusted during training to encode some sliver of statistical pattern.
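To see where those counts come from, here's a toy Python tally of the weights in one transformer block. The dimensions are invented for illustration and don't match any particular model.

```python
d_model = 4096            # hidden width (assumed, for illustration)
d_ff = 4 * d_model        # feed-forward width, a common convention

# Attention: four projection matrices (Q, K, V, output), each d_model x d_model.
attention = 4 * d_model * d_model

# Feed-forward: an up-projection and a down-projection.
feed_forward = d_model * d_ff + d_ff * d_model

per_block = attention + feed_forward
print(f"{per_block:,} parameters in one block")        # 201,326,592
print(f"{80 * per_block / 1e9:.1f}B across 80 blocks") # 16.1B, before embeddings
```

Every one of those ~200 million numbers per block is a parameter: a single learned weight.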
A token is one chunk of text — a word, a piece of a word, sometimes a single character. Tokenisers break input strings into tokens, and dataset size is measured in *how many* tokens the model sees during training.
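A toy greedy tokeniser shows the mechanics. The vocabulary here is hand-made for illustration; real tokenisers (BPE, SentencePiece) learn vocabularies of tens of thousands of entries from data, but the shape (text in, token IDs out) is the same.

```python
vocab = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5}

def tokenise(text: str) -> list[int]:
    ids = []
    while text:
        # Greedy longest-match: take the longest vocab entry that
        # prefixes the remaining text.
        match = max((p for p in vocab if text.startswith(p)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers {text[:10]!r}")
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

print(tokenise("unbelievable tokens"))  # [0, 1, 2, 5, 3, 4] -- 6 tokens, 2 words
```

Note that "unbelievable" becomes three tokens: token counts and word counts are related but not interchangeable.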
Mental model: parameters = size of the brain (how many tunable knobs). Training tokens = size of the experience (how much it has read).
For a long time, parameter count was the dominant scaling story — bigger models were better models. That changed around 2022 with the Chinchilla scaling laws, which showed that for a fixed compute budget you want to scale parameters and training tokens together, roughly 20 tokens per parameter at the optimum. Spend the budget on a smaller model trained on more data, and you'd outperform a giant model trained on too little.
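As a back-of-envelope version of that trade-off, here's a sketch using two common rules of thumb: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and D ≈ 20·N at the compute-optimal point. Both are approximations, and the 1e24 FLOP budget below is just an assumed example.

```python
import math

def compute_optimal(flops: float) -> tuple[float, float]:
    """Rough Chinchilla-optimal sizing: C = 6*N*D with D = 20*N."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120*N^2.
    n_params = math.sqrt(flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = compute_optimal(1e24)  # assumed example budget
print(f"~{n / 1e9:.0f}B parameters on ~{d / 1e12:.1f}T tokens")  # ~91B on ~1.8T
```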
That's why headline numbers now come in pairs. "1.6T parameters trained on 15T tokens" tells you something useful. "1.6T parameters" alone tells you about capacity; you can't infer quality from it.
DeepSeek V3 reported 671B parameters. But it only uses about 37B of them per token. That's a Mixture-of-Experts model: the parameters are split into many "experts," and a router picks a small subset to fire for each token. Most of the model is dormant on any given forward pass.
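A toy sketch of that routing, with shapes invented for illustration (real MoE layers route per token inside each transformer block, and DeepSeek V3's actual router differs in the details):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16

# Each expert is its own weight matrix; the router is one small matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                    # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]   # keep only the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only top_k of the n_experts matrices do any work for this token;
    # the other parameters exist but stay dormant on this forward pass.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (16,) -- computed with 2 of 8 experts
```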
This makes a single launch announcement carry three numbers, not one:

- Total parameters: everything stored in the weights (671B for DeepSeek V3).
- Active parameters: the subset that actually fires per token (about 37B).
- Training tokens: how much text it saw during training.
When a paper says "1.6T model, faster than a 70B dense model," check the active count. A 1.6T MoE that activates 37B per token costs about as much compute per token as a 37B dense model, not a 1.6T one. (You still pay for all 1.6T in memory; the saving is in compute.)
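Rough numbers, using the common approximation of about 2 FLOPs per active parameter per generated token (it ignores attention-over-context and other overheads):

```python
def flops_per_token(active_params: float) -> float:
    # Back-of-envelope rule: ~2 FLOPs per active parameter per token.
    return 2 * active_params

print(f"37B dense:          {flops_per_token(37e9):.1e} FLOPs/token")
print(f"1.6T MoE, 37B act.: {flops_per_token(37e9):.1e} FLOPs/token (same)")
print(f"1.6T dense:         {flops_per_token(1.6e12):.1e} FLOPs/token (~43x more)")
```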
When the next "X trillion parameter model" headline lands, the questions to ask in order:
Once you have those four, the headline number is just a flavour note.
The model comparator on the Ideas page uses these four columns. Every comparator number you see across Peregrinations links back to this primer if you click it.