
The Mathematics Behind LLM Thinking - Deep Dive into Transformer Operations
So you read Part 1 and thought “This is cool, but I want to see the actual math”? I get it. I was the same way. The conceptual understanding was fascinating, but I needed to see the equations, the matrix operations, the nitty-gritty details to really understand how this magic works.
Same disclaimer as before: I’m not a machine learning researcher or mathematician. I’m just someone who got really curious about how LLMs work and spent way too much time reading papers and implementing algorithms until things clicked. If you can do basic algebra and aren’t afraid of some matrix multiplication, you can follow along.
Fair warning: we’re about to get technical. Really technical. But if you’re curious about the mathematical foundations that power ChatGPT, Claude, and every other Large Language Model you interact with, buckle up. This is going to be a fun ride.
The Mathematical Foundation
Remember in Part 1 when I mentioned that LLM responses emerge from “mathematical operations”? Well, let me show you exactly what I meant. The transformer architecture we’re exploring was introduced in the seminal “Attention Is All You Need” paper [1], which completely revolutionized how we think about language models.
At its core, every single LLM response - every witty comeback, every helpful explanation, every creative story - is the result of carefully orchestrated matrix operations and non-linear transformations. It’s honestly mind-blowing when you see it laid out mathematically.
Let’s trace through this mathematical journey step by step.
Phase 1: From Text to Tensors
The Challenge: How Do We Represent Words Mathematically?
Before we dive into the math, let’s understand the fundamental problem: computers can only work with numbers, but language is made of words. How do we bridge this gap?
The naive approach might be to assign each word a single number (cat=1, dog=2, etc.), but this loses all semantic meaning. We need a representation that captures relationships: “king” should be closer to “queen” than to “banana.”
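To make "closer" concrete, similarity between vectors is usually measured with the cosine of the angle between them. Here's a minimal sketch using tiny made-up 4-dimensional vectors (real models learn hundreds of dimensions, and these numbers are purely illustrative, not from any actual model):
import math

# Toy vectors, invented for illustration only
king   = [0.80, 0.65, 0.10, 0.05]
queen  = [0.75, 0.70, 0.15, 0.10]
banana = [0.05, 0.10, 0.90, 0.80]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(king, queen))   # ~0.99 - nearly the same direction
print(cosine_similarity(king, banana))  # ~0.19 - very different directions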
Token Embedding Mathematics: The Lookup Table Approach
The solution is surprisingly elegant. We create a massive lookup table (called an embedding matrix) where each word gets represented by a vector of hundreds of numbers. Here’s how it works:
Step 1: The Embedding Matrix
Embedding_Matrix = [
    [0.1, -0.3, 0.7, ...],   # Vector for token 0
    [0.4, 0.2, -0.1, ...],   # Vector for token 1
    [0.8, -0.5, 0.3, ...],   # Vector for token 2
    ...
    [-0.2, 0.6, 0.9, ...]    # Vector for token 151935
]
This matrix has shape [vocab_size, hidden_dim] - in our case [151936, 896]. That's 151,936 different tokens, each represented by 896 numbers.
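Just to appreciate the scale: that single lookup table already holds 151,936 × 896 ≈ 136 million numbers - a sizable chunk of a small model's total parameter count.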
Step 2: The Lookup Operation When we convert tokens to embeddings, we’re performing a simple lookup:
def embed_tokens(token_ids, embedding_matrix):
    # For each token ID, grab its corresponding row from the matrix
    embeddings = []
    for token_id in token_ids:
        embeddings.append(embedding_matrix[token_id])
    return embeddings
Mathematically: E = Embedding_Matrix[token_ids]
Step 3: Concrete Example For “What is the meaning of life?”:
# After tokenization
tokens = [3923, 374, 279, 7438, 315, 2324, 30]
# Embedding lookup
E[0] = Embedding_Matrix[3923] # "What" → [0.1234, -0.5678, 0.9012, ...]
E[1] = Embedding_Matrix[374] # "is" → [0.2345, -0.6789, 0.0123, ...]
E[2] = Embedding_Matrix[279] # "the" → [0.3456, -0.7890, 0.1234, ...]
...
# Final result: [7, 896] tensor (7 tokens, 896 dimensions each)
Why 896 Dimensions? This isn’t arbitrary! The embedding dimension is chosen to balance:
- Expressiveness: More dimensions = more nuanced representations
- Efficiency: Fewer dimensions = faster computation
- Architecture constraints: Must match the model’s hidden size
The 896 dimensions give the model room to encode complex relationships. As a loose mental model (real learned features are messier and spread across many dimensions):
- Dimensions 1-100 might encode “semantic category” (animal, object, action)
- Dimensions 101-200 might encode “emotional valence” (positive, negative, neutral)
- Dimensions 201-300 might encode “grammatical role” (noun, verb, adjective)
- And so on…
The Position Problem: Words Need Context
Here's a critical issue with our embedding approach so far: the lookup gives a word the exact same vector every time it appears - "bank" looks identical in "river bank" and "money bank," and nothing in the embedding says where a word sits in the sentence. But position matters! "The cat chased the dog" means something very different from "The dog chased the cat."
We need to encode not just what each word means, but where it appears in the sequence.
Positional Encoding: RoPE Mathematics
The Old Way: Sinusoidal Encoding Traditional transformers added position information like this:
final_embedding = word_embedding + position_embedding
But this has problems - it doesn’t preserve relative distances well and doesn’t extrapolate to longer sequences.
The New Way: Rotary Position Embedding (RoPE) Modern models use RoPE [2], which is mathematically elegant. Instead of adding position info, we rotate the embedding vectors in high-dimensional space.
The Intuition Behind RoPE Imagine you’re on a clock face. If you’re at 3 o’clock and I’m at 6 o’clock, the angle between us is always 90 degrees, regardless of how we rotate the entire clock. RoPE uses this property - it encodes relative positions as rotations.
The Mathematics For each pair of dimensions in our embedding, we apply a rotation:
import math

def rope_rotation(x, position, dim_index, total_dims):
    # Calculate the rotation frequency for this dimension pair
    freq = 1.0 / (10000 ** (2 * dim_index / total_dims))
    # Calculate the rotation angle
    angle = position * freq
    # Apply a 2D rotation matrix to this pair of dimensions
    cos_angle = math.cos(angle)
    sin_angle = math.sin(angle)
    x_rotated = [
        x[2 * dim_index] * cos_angle - x[2 * dim_index + 1] * sin_angle,
        x[2 * dim_index] * sin_angle + x[2 * dim_index + 1] * cos_angle,
    ]
    return x_rotated
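To see how this single-pair rotation scales up, here's a small sketch (my own helper, not from any library) that applies rope_rotation to every dimension pair of a vector:
def apply_rope(x, position):
    # Rotate each consecutive pair of dimensions by its own frequency
    total_dims = len(x)
    rotated = list(x)
    for dim_index in range(total_dims // 2):
        pair = rope_rotation(x, position, dim_index, total_dims)
        rotated[2 * dim_index] = pair[0]
        rotated[2 * dim_index + 1] = pair[1]
    return rotated

# Same vector, different positions -> different rotations
embedding = [0.1, -0.3, 0.7, 0.2]          # toy 4-dimensional example
print(apply_rope(embedding, position=0))   # position 0: all angles are 0, vector unchanged
print(apply_rope(embedding, position=3))   # position 3: each pair rotated by its own angle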
The Complete RoPE Formula
RoPE(x, pos) = [
    x₁ cos(θ₀) - x₂ sin(θ₀),
    x₁ sin(θ₀) + x₂ cos(θ₀),
    x₃ cos(θ₁) - x₄ sin(θ₁),
    x₃ sin(θ₁) + x₄ cos(θ₁),
    ...
]
Where θᵢ = pos / 10000^(2i/d), and:
- pos is the position in the sequence (0, 1, 2, …)
- i is the dimension pair index (0, 1, 2, …)
- d is the embedding dimension (896 in our case)
Why This Works The key insight is that when we compute attention between positions, the dot product naturally captures the relative distance:
attention_score = RoPE(query, pos_i) · RoPE(key, pos_j)
Due to trigonometric identities, this becomes a function of (pos_i - pos_j) - exactly what we want for relative position encoding!
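You can check this property numerically. The sketch below assumes the apply_rope helper sketched earlier, rotates a toy query and key at two different absolute offsets with the same gap, and shows that the dot products match:
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -0.2, 0.5, 0.1]   # toy query vector
k = [0.4, 0.6, -0.1, 0.2]   # toy key vector

# Same relative distance (3 positions apart), different absolute positions
score_a = dot(apply_rope(q, position=2),  apply_rope(k, position=5))
score_b = dot(apply_rope(q, position=10), apply_rope(k, position=13))
print(score_a, score_b)      # the two scores agree (up to floating-point noise)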
Phase 2: The Attention Mechanism Deep Dive
The Fundamental Problem: How Do We Focus?
When you read “The cat that lived in the house with the red door was sleeping,” your brain automatically knows that “was sleeping” refers to “the cat,” not “the house” or “the door.” This is selective attention - focusing on relevant parts while ignoring irrelevant ones.
Traditional neural networks process each word independently, losing this crucial context. The attention mechanism solves this by letting each word “look at” and “focus on” other words in the sentence.
The Intuition: Queries, Keys, and Values
Think of attention like a database lookup:
- Query: “What am I looking for?” (the current word asking for context)
- Key: “What do I represent?” (each word’s identifier)
- Value: “What information do I contain?” (each word’s actual content)
For each word, we ask: “Given what I’m looking for (query), which other words (keys) are most relevant, and what should I extract from them (values)?”
Multi-Head Attention Mathematics
The core of transformer intelligence lies in the attention mechanism. Let me break down exactly how we derive each step:
Step 1: Creating Queries, Keys, and Values
We start with our embedded sequence X with shape [seq_len, hidden_dim] (in our case [7, 896]).
To create our Q, K, V matrices, we multiply by learned weight matrices:
# These weight matrices are learned during training
W_Q = random_matrix([896, 896]) # Query projection
W_K = random_matrix([896, 896]) # Key projection
W_V = random_matrix([896, 896]) # Value projection
# Linear transformations
Q = X @ W_Q # [7, 896] @ [896, 896] = [7, 896]
K = X @ W_K # [7, 896] @ [896, 896] = [7, 896]
V = X @ W_V # [7, 896] @ [896, 896] = [7, 896]
Why do we need these projections? We could use the embeddings directly, but the projections allow the model to learn different “views” of each word:
- Query projection: “What aspects should I focus on when looking for context?”
- Key projection: “What aspects of me are most important for others to see?”
- Value projection: “What information should I actually contribute?”
Step 2: Multi-Head Reshaping - Why Split Into Multiple “Heads”?
Here’s a key insight: instead of having one massive attention mechanism, we split it into multiple smaller “heads.” Think of it like having multiple experts, each focusing on different aspects of the relationships.
# Split the 896 dimensions into 14 heads of 64 dimensions each
num_heads = 14
head_dim = 896 // 14 # = 64
# Reshape from [seq_len, hidden_dim] to [seq_len, num_heads, head_dim]
Q = Q.reshape([7, 14, 64]) # 7 tokens, 14 heads, 64 dims per head
K = K.reshape([7, 14, 64])
V = V.reshape([7, 14, 64])
Why multiple heads? Each head can learn to focus on different types of relationships:
- Head 1: Subject-verb relationships
- Head 2: Adjective-noun relationships
- Head 3: Long-distance dependencies
- Head 4: Syntactic structure
- etc.
This is like having 14 different specialists analyzing the same sentence from different perspectives.
Step 3: Scaled Dot-Product Attention - The Heart of the Mechanism
Now comes the magic. For each head, we compute attention using the famous formula:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
Let me show you exactly how we derive this step by step:
Step 3a: Compute Similarity Scores
# For each query, compute similarity with all keys
# Q shape: [7, 64], K shape: [7, 64]
scores = Q @ K.transpose(-1, -2) # [7, 64] @ [64, 7] = [7, 7]
This gives us a 7×7 matrix where scores[i][j] tells us how much token i should pay attention to token j.
Step 3b: Scale the Scores
scores = scores / math.sqrt(head_dim) # Divide by √64 = 8
Why do we scale? Without scaling, the dot products can become very large, pushing the softmax into regions where gradients vanish. The √d_k scaling keeps the values in a reasonable range.
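A quick way to see why the scaling matters is to feed the same relative score pattern into softmax at two magnitudes - a rough sketch:
import torch

raw = torch.tensor([2.0, 1.0, 0.0])   # modest score differences
big = raw * 8                         # what unscaled dot products can look like

print(torch.softmax(raw, dim=-1))  # ~[0.67, 0.24, 0.09] - attention stays spread out
print(torch.softmax(big, dim=-1))  # ~[1.00, 0.00, 0.00] - nearly one-hot, gradients vanish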
Step 3c: Apply Causal Mask (for autoregressive models)
# Create a mask so tokens can't see future tokens
mask = torch.triu(torch.full((7, 7), float('-inf')), diagonal=1)  # -inf above the diagonal, 0 elsewhere
scores = scores + mask
This ensures that when generating text, each token can only attend to previous tokens, not future ones.
Step 3d: Convert to Probabilities
attention_weights = torch.softmax(scores, dim=-1) # [7, 7]
Now each row sums to 1, giving us a probability distribution over which tokens to attend to.
Step 3e: Apply Attention to Values
output = attention_weights @ V # [7, 7] @ [7, 64] = [7, 64]
This weighted combination gives us the final attended representation for each token.
Step 4: Concatenate and Project
# Concatenate all heads
multi_head_output = concat(head_outputs, dim=-1) # [seq_len, hidden_dim]
# Final linear projection
attention_output = multi_head_output @ W_O
Why This Works: The Intuition Behind the Math
Here’s what clicked for me when I finally understood attention: it’s essentially computing a weighted average of all positions, where the weights are determined by how “relevant” each position is to the current position.
Think about it this way - when you read the sentence “The cat sat on the mat,” and you’re processing the word “sat,” your brain automatically knows that “cat” is more relevant than “the” or “on.” The attention mechanism does exactly this, but mathematically. The softmax ensures these weights sum to 1, creating a probability distribution.
It’s like the LLM is constantly asking: “Of all the words I’ve seen so far, which ones should I pay attention to right now?”
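Putting Steps 1 through 4 together, here's a compact sketch of the whole multi-head attention computation in PyTorch. The weight matrices are random stand-ins for learned parameters, and this ignores refinements (like grouped-query attention) that real models may add:
import math
import torch

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=14):
    seq_len, hidden_dim = X.shape
    head_dim = hidden_dim // num_heads

    # Step 1: project into queries, keys, values, then split into heads
    Q = (X @ W_Q).reshape(seq_len, num_heads, head_dim).transpose(0, 1)  # [heads, seq, head_dim]
    K = (X @ W_K).reshape(seq_len, num_heads, head_dim).transpose(0, 1)
    V = (X @ W_V).reshape(seq_len, num_heads, head_dim).transpose(0, 1)

    # Steps 3a-3c: scaled similarity scores plus causal mask
    scores = Q @ K.transpose(-1, -2) / math.sqrt(head_dim)               # [heads, seq, seq]
    mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    scores = scores + mask

    # Steps 3d-3e: attention weights, then weighted sum of values
    weights = torch.softmax(scores, dim=-1)
    out = weights @ V                                                    # [heads, seq, head_dim]

    # Step 4: merge heads and apply the output projection
    out = out.transpose(0, 1).reshape(seq_len, hidden_dim)
    return out @ W_O

# Toy usage with random weights standing in for learned ones
X = torch.randn(7, 896)
W_Q, W_K, W_V, W_O = (torch.randn(896, 896) * 0.02 for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # torch.Size([7, 896])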
Phase 3: Feed-Forward Network Mathematics
SwiGLU Activation Function
Modern transformers use SwiGLU activation [3] instead of traditional ReLU:
\[\text{SwiGLU}(x) = \text{Swish}(xW_1 + b_1) \odot (xW_2 + b_2)\]
Where Swish is defined as: \(\text{Swish}(x) = x \cdot \sigma(\beta x) = x \cdot \frac{1}{1 + e^{-\beta x}}\)
The complete feed-forward computation:
# Gate and up projections
gate = X @ W_gate + b_gate # [seq_len, ff_dim]
up = X @ W_up + b_up # [seq_len, ff_dim]
# SwiGLU activation
activated = swish(gate) * up # Element-wise multiplication
# Down projection
ff_output = activated @ W_down + b_down # [seq_len, hidden_dim]
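Here's a minimal runnable version of that feed-forward block in PyTorch (random weights stand in for learned ones; many real implementations drop the biases and use silu, which is Swish with β = 1):
import torch
import torch.nn.functional as F

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = x @ W_gate                 # [seq_len, ff_dim]
    up = x @ W_up                     # [seq_len, ff_dim]
    activated = F.silu(gate) * up     # SwiGLU: Swish(gate) ⊙ up
    return activated @ W_down         # back to [seq_len, hidden_dim]

# Toy usage: ff_dim is usually a few times hidden_dim (4864 here is just an example)
x = torch.randn(7, 896)
W_gate = torch.randn(896, 4864) * 0.02
W_up = torch.randn(896, 4864) * 0.02
W_down = torch.randn(4864, 896) * 0.02
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)  # torch.Size([7, 896])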
Phase 4: Normalization Mathematics
RMSNorm vs LayerNorm
Traditional LayerNorm: \(\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta\)
RMSNorm (used in modern models) [4]: \(\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma\)
Where: \(\text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}\)
RMSNorm is simpler and often performs just as well, requiring fewer computations.
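As a sketch, RMSNorm takes only a few lines in PyTorch (gamma is the learned per-dimension scale; a small epsilon guards against division by zero):
import torch

def rms_norm(x, gamma, eps=1e-6):
    # Root mean square over the hidden dimension
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gamma

x = torch.randn(7, 896)
gamma = torch.ones(896)            # learned in a real model; initialized to 1 here
print(rms_norm(x, gamma).shape)    # torch.Size([7, 896])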
Phase 5: Output Generation Mathematics
Logits Computation
# Final layer normalization
normalized = RMSNorm(final_hidden_state)
# Project to vocabulary size
logits = normalized @ output_projection # [seq_len, vocab_size]
Advanced Sampling Mathematics
Temperature Scaling: \(p_i = \frac{e^{z_i/T}}{\sum_{j=1}^{V} e^{z_j/T}}\)
Where T is temperature and z_i are the raw logits.
Top-K Sampling:
- Sort logits in descending order
- Keep only top K values
- Set others to -∞
- Apply softmax and sample
Nucleus (Top-P) Sampling:
- Sort logits in descending order
- Find smallest set where cumulative probability ≥ P
- Sample from this subset
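Here's a rough sketch of how those three ideas combine in code (the function and parameter names are mine, not tied to any particular library):
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    # Temperature scaling: lower T sharpens the distribution, higher T flattens it
    logits = logits / temperature

    # Top-K: keep only the K highest logits, push the rest to -inf
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_best] = float('-inf')

    # Top-P (nucleus): keep the smallest set of tokens whose cumulative probability >= P
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = cumulative > top_p
    cutoff[1:] = cutoff[:-1].clone()   # keep the token that crosses the threshold
    cutoff[0] = False
    probs[sorted_ids[cutoff]] = 0.0
    probs = probs / probs.sum()

    # Sample one token ID from the filtered distribution
    return torch.multinomial(probs, num_samples=1).item()

# Toy usage with random logits over the 151,936-token vocabulary
logits = torch.randn(151936)
print(sample_next_token(logits))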
Quantization Mathematics
4-bit Quantization (Q4_K_M)
The quantization process converts 32-bit floats to 4-bit integers:
# Quantization (simplified: one scale for the whole tensor)
scale = weights.abs().max() / 7                      # symmetric 4-bit range: -7 to 7
quantized = (weights / scale).round().clamp(-7, 7)
# Dequantization
dequantized = quantized * scale
This achieves roughly 8x compression versus 32-bit floats with minimal quality loss. (Real Q4_K_M quantization works block by block, with a separate scale per block, but the core idea is the same.)
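To get a feel for what "minimal quality loss" means, here's a tiny round-trip experiment on random weights (int8 storage stands in for real 4-bit packing, which stores two values per byte):
import torch

weights = torch.randn(4096)                        # pretend this is one block of weights
scale = weights.abs().max() / 7
quantized = (weights / scale).round().clamp(-7, 7).to(torch.int8)
dequantized = quantized.float() * scale

error = (weights - dequantized).abs().mean()
print(f"mean absolute error: {error:.4f} (weights span roughly ±{weights.abs().max():.2f})")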
Performance Optimization Mathematics
KV Caching
Instead of recomputing attention for all previous tokens:
# First token
K_cache = [K_0]
V_cache = [V_0]
# Second token
K_cache = concat([K_cache, K_1]) # Append new key
V_cache = concat([V_cache, V_1]) # Append new value
# Attention computation reuses cached values
attention_scores = Q_1 @ K_cache.T
This reduces computation from O(n²) to O(n) for each new token.
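Here's a slightly more concrete sketch of that idea in PyTorch (single head, no projections, just to show the cache growing one row per generated token):
import math
import torch

head_dim = 64
K_cache = torch.empty(0, head_dim)   # grows by one row per generated token
V_cache = torch.empty(0, head_dim)

for step in range(5):
    # In a real model these come from projecting the newest token's hidden state
    q_new = torch.randn(1, head_dim)
    k_new = torch.randn(1, head_dim)
    v_new = torch.randn(1, head_dim)

    # Append only the new key/value instead of recomputing all previous ones
    K_cache = torch.cat([K_cache, k_new], dim=0)
    V_cache = torch.cat([V_cache, v_new], dim=0)

    # Attention for the new token against everything cached so far
    scores = (q_new @ K_cache.T) / math.sqrt(head_dim)   # [1, step + 1]
    weights = torch.softmax(scores, dim=-1)
    output = weights @ V_cache                           # [1, head_dim]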
SIMD Vectorization
Modern implementations use SIMD instructions for parallel computation:
# Instead of multiplying element by element in a Python loop:
for i in range(len(vector1)):
    result[i] = vector1[i] * vector2[i]
# Use vectorized operations (e.g. NumPy/torch arrays), which map onto SIMD instructions:
result = vector1 * vector2  # Processes 8+ elements per instruction on typical hardware
The Complete Mathematical Pipeline
Putting it all together, here’s the mathematical flow for one transformer layer:
def transformer_layer(x, layer_weights):
    attention_weights, ff_weights = layer_weights
    # 1. Multi-head attention (pre-norm + residual connection)
    residual = x
    x = RMSNorm(x)
    x = multi_head_attention(x, attention_weights)
    x = x + residual
    # 2. Feed-forward network (pre-norm + residual connection)
    residual = x
    x = RMSNorm(x)
    x = swiglu_ffn(x, ff_weights)
    x = x + residual
    return x

# Complete model
def transformer_model(tokens, all_weights, output_projection):
    x = embedding_lookup(tokens)
    for layer_weights in all_weights:
        x = transformer_layer(x, layer_weights)
    # Final normalization and projection to vocabulary logits
    x = RMSNorm(x)
    logits = x @ output_projection
    return logits
Computational Complexity Analysis
Memory Complexity:
- Model weights: O(layers × hidden_dim²)
- KV cache: O(seq_len × layers × hidden_dim)
- Activations: O(seq_len × hidden_dim)
Time Complexity:
- Attention: O(seq_len² × hidden_dim)
- Feed-forward: O(seq_len × hidden_dim²)
- Total for one forward pass over the sequence: O(seq_len × hidden_dim² + seq_len² × hidden_dim)
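Plugging in the running example's numbers gives a back-of-the-envelope feel for where the work goes (these counts just track the big-O terms above, not exact FLOPs):
seq_len, hidden_dim = 7, 896

attention_ops = seq_len ** 2 * hidden_dim   # 43,904:    the seq_len² term
ffn_ops = seq_len * hidden_dim ** 2         # 5,619,712: the hidden_dim² term
print(attention_ops, ffn_ops)

# At 7 tokens the feed-forward term dominates; once seq_len grows past hidden_dim
# (into the thousands), the quadratic attention term takes over.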
Research Foundations
The mathematical techniques we’ve explored build upon decades of research:
- Attention Mechanism: Introduced in “Attention Is All You Need” [1]
- RoPE: “RoFormer: Enhanced Transformer with Rotary Position Embedding” [2]
- SwiGLU: “GLU Variants Improve Transformer” [3]
- RMSNorm: “Root Mean Square Layer Normalization” [4]
- Quantization: Recent advances in “LLM.int8()” [5] and “QLoRA” [6]
Practical Implementation Insights
Memory Management:
# Gradient checkpointing for memory efficiency
def checkpoint_layer(layer, *args):
return torch.utils.checkpoint.checkpoint(layer, *args)
Numerical Stability:
# Prevent overflow in softmax
def stable_softmax(x):
x_max = x.max(dim=-1, keepdim=True)[0]
exp_x = torch.exp(x - x_max)
return exp_x / exp_x.sum(dim=-1, keepdim=True)
The Beauty of Emergence
Here’s what still amazes me every time I think about it: all this mathematics - these matrix multiplications, attention mechanisms, normalization functions - somehow gives rise to behavior that feels genuinely intelligent and creative.
There’s no “creativity module” or “understanding subroutine.” The LLM’s ability to write poetry, solve problems, and hold conversations emerges purely from the complex interplay of these mathematical operations across billions of parameters. It’s like consciousness arising from the firing of neurons, but in mathematical form.
When I first implemented these algorithms myself and saw them actually generate coherent text, I had this moment of “Holy shit, it actually works!” The math we just walked through - that’s literally what’s happening every time you chat with an LLM.
This mathematical foundation enables the conversational LLMs we explored in Part 1. If you haven’t read that yet, it provides the conceptual framework that makes all this math meaningful. And if you want to dive even deeper, I’m planning future posts on training mathematics, optimization techniques, and how to implement these algorithms efficiently.
References
[1] Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017.
[2] Su, J., et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021.
[3] Shazeer, N. "GLU Variants Improve Transformer." arXiv:2002.05202, 2020.
[4] Zhang, B., & Sennrich, R. "Root Mean Square Layer Normalization." NeurIPS 2019.
[5] Dettmers, T., et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS 2022.
[6] Dettmers, T., et al. "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314, 2023.
Have you implemented any of these algorithms yourself? What aspects of the mathematics do you find most interesting or challenging? And if you made it through all this math - seriously, congratulations! You now understand the mathematical foundations of modern AI better than 99% of people. That’s pretty awesome.
Want to start from the beginning? Check out Part 1: How AI Models Actually Think for the conceptual foundation that makes all this math meaningful.