How the model decides which words matter.
When reading a sentence, you focus on some words more than others. Attention works the same way in AI: it lets each word "look at" every other word to figure out which ones matter for understanding the current word.
When processing an ambiguous word like "bank", the model looks back at all the other words to resolve its meaning from context.
For each word, create a "query": what am I looking for?
Compare that query against every other word's "key": what do I contain?
Focus on the best matches and blend their "values" (meanings) into a new representation.
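The three steps above are scaled dot-product attention. A minimal NumPy sketch, with random vectors standing in for the learned query/key/value projections of a real model:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: match queries to keys, then blend values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # how well each query matches each key
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # softmax: focus on the best matches
    return w @ V                                      # weighted blend of the values

# toy example: 4 words, embedding size 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one context-aware vector per word
```

Each output row is a mixture of the value vectors, weighted by how relevant each word was to the query.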
The model runs multiple attention "heads" in parallel, each examining a different aspect of the sentence. It is like having 8 different readers analyzing the text for grammar, tone, facts, and so on.
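A sketch of the multi-head idea: split the embedding into chunks, let each chunk attend on its own, then stitch the results back together. (Real models apply learned Q/K/V projections per head; here each head just reuses its slice of the input, purely for illustration.)

```python
import numpy as np

def multi_head_attention(X, n_heads):
    """Each head attends over its own slice of the embedding, independently."""
    seq, d = X.shape
    hd = d // n_heads
    heads = []
    for h in range(n_heads):
        Q = K = V = X[:, h * hd:(h + 1) * hd]   # sketch: real models use learned projections
        scores = Q @ K.T / np.sqrt(hd)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1)       # merge the 8 "readers" back together

X = np.random.default_rng(1).normal(size=(4, 64))
print(multi_head_attention(X, 8).shape)  # (4, 64): same shape, richer representation
```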
Multi-Query Attention (MQA): all heads share a single key-value cache. This drastically reduces the memory needed, allowing the model to generate text much faster, with throughput gains reported up to around 10x.
Grouped-Query Attention (GQA): a smart compromise. Query heads are grouped into clusters, each sharing one key-value head. It retains most of the quality of full multi-head attention at close to MQA's speed.
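The memory savings come straight from the arithmetic: the KV cache scales with the number of key-value heads. A back-of-the-envelope sketch, using hypothetical model dimensions (32 query heads, head size 128, 8k context, fp16):

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-layer KV cache: keys + values for every cached token."""
    return 2 * seq_len * n_kv_heads * head_dim * dtype_bytes

seq, hd = 8192, 128
mha = kv_cache_bytes(seq, 32, hd)  # multi-head: every head keeps its own K/V
gqa = kv_cache_bytes(seq, 8, hd)   # grouped: 8 shared K/V heads
mqa = kv_cache_bytes(seq, 1, hd)   # multi-query: one K/V shared by all heads
print(mha // gqa, mha // mqa)      # 4 32: cache shrinks by the grouping factor
```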
Sparse attention: the model doesn't attend to every previous word, only to the important ones or to a local window. This allows massive, book-sized inputs.
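One common sparse pattern is a sliding window: each token attends only to itself and the few tokens just before it, so the work per token stays constant no matter how long the input grows. A minimal mask sketch:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where each token sees only the last `window` tokens."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = True   # self plus the window-1 tokens before it
    return mask

m = sliding_window_mask(6, 3)
print(m.sum(axis=1))  # [1 2 3 3 3 3]: each row attends to at most 3 positions
```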
State Space Models (like Mamba) digest the text in a single pass and discard the raw tokens, keeping only a compressed, fixed-size state. Effectively unbounded context with constant memory.
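The core of the idea is a linear recurrence: fold each token into a fixed-size state, emit an output, and move on. A deliberately tiny scalar sketch with made-up constants A, B, C (real SSMs use learned, input-dependent matrices):

```python
def ssm_scan(xs, A=0.9, B=0.1, C=1.0):
    """Linear state-space recurrence: one pass, constant-size state."""
    h = 0.0
    ys = []
    for x in xs:            # single pass over the sequence
        h = A * h + B * x   # fold the new token into the compressed state
        ys.append(C * h)    # output depends only on the state, not raw history
    return ys

ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
print([round(y, 4) for y in ys])  # [0.1, 0.09, 0.081, 0.0729]
```

Note how the first token's influence persists in the state but decays geometrically: the model "remembers" a summary, not the raw data.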
Evolution of Efficiency