Three different numbers run into the trillions in every frontier-model launch announcement. They mean different things. This is the decoder.
DeepSeek V4 launches at 1.6T parameters. Anthropic's Mythos is rumoured at 10T parameters. Llama 3 was trained on 15T tokens. GPT-3 had 175B parameters and saw 300B tokens.
Three different quantities. They get reported with the same suffixes (B, T), and they all sound like "size of the model." They are not the same.
A parameter is one learned number — a single weight inside one of the model's matrices. A 1.6T-parameter model has 1.6 trillion such numbers, each adjusted during training to encode some sliver of statistical pattern.
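To see where those counts come from, here's a toy Python tally of the weights in one transformer block. The dimensions are invented for illustration and don't match any particular model.

```python
d_model = 4096            # hidden width (assumed, for illustration)
d_ff = 4 * d_model        # feed-forward width, a common convention

# Attention: four projection matrices (Q, K, V, output), each d_model x d_model.
attention = 4 * d_model * d_model

# Feed-forward: an up-projection and a down-projection.
feed_forward = d_model * d_ff + d_ff * d_model

per_block = attention + feed_forward
print(f"{per_block:,} parameters in one block")        # 201,326,592
print(f"{80 * per_block / 1e9:.1f}B across 80 blocks") # 16.1B, before embeddings
```

Every one of those ~200 million numbers per block is a parameter: a single learned weight.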
A token is one chunk of text — a word, a piece of a word, sometimes a single character. Tokenisers break input strings into tokens, and dataset size is measured in *how many* tokens the model sees during training.
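A toy greedy tokeniser shows the mechanics. The vocabulary here is hand-made for illustration; real tokenisers (BPE, SentencePiece) learn vocabularies of tens of thousands of entries from data, but the shape (text in, token IDs out) is the same.

```python
vocab = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5}

def tokenise(text: str) -> list[int]:
    ids = []
    while text:
        # Greedy longest-match: take the longest vocab entry that
        # prefixes the remaining text.
        match = max((p for p in vocab if text.startswith(p)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers {text[:10]!r}")
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

print(tokenise("unbelievable tokens"))  # [0, 1, 2, 5, 3, 4] -- 6 tokens, 2 words
```

Note that "unbelievable" becomes three tokens: token counts and word counts are related but not interchangeable.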
Mental model: parameters = size of the brain (how many tunable knobs). Training tokens = size of the experience (how much it has read).
For a long time, parameter count was the dominant scaling story — bigger models were better models. That changed around 2022 with the Chinchilla scaling laws, which showed that for a fixed compute budget you want to scale parameters and training tokens together, roughly 20 tokens per parameter at the optimum. Spend the budget on a smaller model trained on more data, and you'd outperform a giant model trained on too little.
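As a back-of-envelope version of that trade-off, here's a sketch using two common rules of thumb: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and D ≈ 20·N at the compute-optimal point. Both are approximations, and the 1e24 FLOP budget below is just an assumed example.

```python
import math

def compute_optimal(flops: float) -> tuple[float, float]:
    """Rough Chinchilla-optimal sizing: C = 6*N*D with D = 20*N."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120*N^2.
    n_params = math.sqrt(flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

n, d = compute_optimal(1e24)  # assumed example budget
print(f"~{n / 1e9:.0f}B parameters on ~{d / 1e12:.1f}T tokens")  # ~91B on ~1.8T
```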
That's why headline numbers now come in pairs. "1.6T parameters trained on 15T tokens" tells you something useful. "1.6T parameters" alone tells you about capacity; you can't infer quality from it.
DeepSeek V3 reported 671B parameters. But it only uses about 37B of them per token. That's a Mixture-of-Experts model: the parameters are split into many "experts," and a router picks a small subset to fire for each token. Most of the model is dormant on any given forward pass.
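A toy sketch of that routing, with shapes invented for illustration (real MoE layers route per token inside each transformer block, and DeepSeek V3's actual router differs in the details):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16

# Each expert is its own weight matrix; the router is one small matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                    # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]   # keep only the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only top_k of the n_experts matrices do any work for this token;
    # the other parameters exist but stay dormant on this forward pass.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (16,) -- computed with 2 of 8 experts
```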
This makes a single launch announcement carry three numbers, not one:

- Total parameters: everything stored in the weights (671B for DeepSeek V3).
- Active parameters: the subset that actually fires per token (about 37B).
- Training tokens: how much text it saw during training.
When a paper says "1.6T model, faster than a 70B dense model," check the active count. A 1.6T MoE that activates 37B per token costs about as much compute per token as a 37B dense model, not a 1.6T one. (You still pay for all 1.6T in memory; the saving is in compute.)
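Rough numbers, using the common approximation of about 2 FLOPs per active parameter per generated token (it ignores attention-over-context and other overheads):

```python
def flops_per_token(active_params: float) -> float:
    # Back-of-envelope rule: ~2 FLOPs per active parameter per token.
    return 2 * active_params

print(f"37B dense:          {flops_per_token(37e9):.1e} FLOPs/token")
print(f"1.6T MoE, 37B act.: {flops_per_token(37e9):.1e} FLOPs/token (same)")
print(f"1.6T dense:         {flops_per_token(1.6e12):.1e} FLOPs/token (~43x more)")
```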
When the next "X trillion parameter model" headline lands, the questions to ask in order:
Once you have those four, the headline number is just a flavour note.
The model comparator on the Ideas page uses these four columns. Every comparator number you see across Peregrinations links back to this primer if you click it.