Tokens to Vectors: How LLMs Process Language

A visual guide to the mathematics behind attention and summarisation

Stage 1 — Tokenisation to Embedding

[Figure: the token "cat" mapped to a 4,096-dimensional embedding vector: 0.234, -0.871, 0.125, 0.456, -0.332, 0.089, plus 4,090 more numbers. Each dimension captures a learned semantic feature.]

Words don't enter the model as text. Each token is immediately converted into a vector — a list of thousands of numbers representing its position in high-dimensional space. These numbers are learned during training and encode relationships between concepts.
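The lookup itself is simple: the model stores a learned table with one row per vocabulary token, and embedding a token is just indexing into that table. A minimal sketch (toy sizes, with random numbers standing in for learned weights; real models use tens of thousands of tokens and thousands of dimensions):

```python
import numpy as np

# Toy vocabulary and embedding table: 3 tokens x 4 dimensions.
# In a real model the table is learned during training.
vocab = {"cat": 0, "dog": 1, "bank": 2}
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(token: str) -> np.ndarray:
    """Tokenisation-to-embedding is just a table lookup by token index."""
    return embedding_table[vocab[token]]

vec = embed("cat")
print(vec.shape)  # (4,) — one number per learned dimension
```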

Stage 2 — Semantic Vector Space

[Figure: a 2-D projection of vector space (dimension 847 vs. dimension 2,103) showing three clusters: animals (cat, dog, pet), finance (money, deposit, loan), and nature (river, water, stream). "Bank" sits between the finance and nature clusters; "democracy" sits far from all three.]

Similar concepts cluster together in vector space. "Cat" and "dog" are close; "cat" and "democracy" are far apart. Ambiguous words like "bank" sit between their possible meanings — until context pulls them toward one cluster or another.
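Closeness in this space is typically measured with cosine similarity. A toy sketch with hand-picked 3-D vectors (illustrative coordinates, not real model weights; the dimensions here loosely stand for "animal-ness", "finance-ness", "nature-ness"):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-picked toy vectors for illustration only.
cat       = np.array([0.9, 0.1, 0.2])
dog       = np.array([0.8, 0.1, 0.3])
democracy = np.array([0.0, 0.2, 0.1])
bank      = np.array([0.1, 0.6, 0.5])  # ambiguous: between finance and nature

print(cosine(cat, dog))        # high: same cluster
print(cosine(cat, democracy))  # low: distant concepts
```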

Stage 3 — Attention Mechanism

Sentence: "The fish swam to the bank of the river" (query token: "bank")

Token    Attention weight
The      0.02
fish     0.15
swam     0.11
to       0.01
the      0.01
bank     (query)
of       0.03
the      0.01
river    0.66

When processing "bank," the attention mechanism computes similarity between bank's Query vector and every other token's Key vector. High similarity = high attention weight. "River" gets 0.66 attention; "the" gets almost none. These weights determine how much each token's Value vector contributes to bank's updated representation.
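A minimal sketch of that computation for a single head, with random vectors standing in for the learned Query, Key, and Value projections (scaled dot-product attention under toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_tokens, d = 9, 8  # "The fish swam to the bank of the river"

# In a real head, K and V come from learned projections of each token's vector.
K = rng.normal(size=(n_tokens, d))   # one Key per token
V = rng.normal(size=(n_tokens, d))   # one Value per token
q = K[5] + 0.1 * rng.normal(size=d)  # Query vector for "bank" (index 5)

scores = K @ q / np.sqrt(d)          # similarity of the Query with every Key
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # softmax: non-negative, sums to 1

updated = weights @ V                # "bank"'s updated representation:
                                     # a weighted sum of all Value vectors
```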

Stage 4 — Contextual Vector Shift

[Figure: the vector for "bank" before and after attention. Initially it sits between the financial meaning and the riverbank meaning; after the weighted contributions of "river" (0.66) and "fish" (0.15), it has shifted toward the riverbank meaning.]

After attention, "bank" no longer sits in the same place in vector space. The weighted contributions of "river" (and "fish" and "swam") have pulled the vector toward the nature cluster. No information was appended to the token; the vector itself was warped, drawn toward contextually relevant tokens as if by gravity.
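That shift can be sketched as a weighted average. Using the attention weights from the example sentence (river 0.66, fish 0.15, with the remaining weight lumped onto "bank" itself), the updated vector lands measurably closer to the nature cluster. The 2-D coordinates below are purely illustrative:

```python
import numpy as np

# Toy 2-D space: axis 0 = "finance-ness", axis 1 = "nature-ness".
nature_cluster = np.array([0.1, 0.9])

bank_initial = np.array([0.5, 0.5])   # ambiguous: midway between clusters
river_value  = np.array([0.1, 0.95])
fish_value   = np.array([0.15, 0.85])

# Weighted sum over Value vectors; remaining 0.19 kept on "bank" itself.
updated = 0.66 * river_value + 0.15 * fish_value + 0.19 * bank_initial

def dist(u: np.ndarray, v: np.ndarray) -> float:
    """Euclidean distance between two points in the toy space."""
    return float(np.linalg.norm(u - v))

print(dist(bank_initial, nature_cluster))  # before attention
print(dist(updated, nature_cluster))       # after: closer to nature
```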