The Magic Behind the Curtain: How LLMs Actually Generate Text

📅 31/08/2025


Ever wondered what happens when you type “The cat is” and an AI completes it with something like “sitting on the windowsill”? You’re about to peek behind the curtain and see exactly how Large Language Models think, predict, and generate text - one token at a time.

What We’ll Discover

In this hands-on journey, we’ll build our own mini text generator and watch it work in real time. You’ll see:

  • How text becomes numbers (and back again)
  • Why the model considers 49,152 possibilities for every single word
  • How controlled randomness prevents boring, repetitive responses
  • The simple loop that powers every AI conversation

Ready to demystify the magic? Let’s dive in!

Setup

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = 'HuggingFaceTB/SmolLM2-135M-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Tokenization

Tokenization is a fundamental step in text processing and natural language processing (NLP). A tokenizer breaks down text into smaller, meaningful units called “tokens.” Think of it like taking a sentence and splitting it into individual words, or even smaller pieces depending on your needs. For example, the sentence “Hello, world!” might be tokenized into:

  1. [“Hello”, “,”, “world”, “!”] (word-level tokens)
  2. Or even [“Hel”, “lo”, “,”, “wor”, “ld”, “!”] (subword tokens)
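If you’re curious how the tokenizer we just loaded actually splits that sentence, a quick sketch like this shows it (the exact pieces depend on the vocabulary this tokenizer learned, so they may differ from the illustration above):

# see how our tokenizer splits "Hello, world!" into tokens
example_ids = tokenizer("Hello, world!")["input_ids"]
print([tokenizer.decode([token_id]) for token_id in example_ids])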
text = "The cat is "
tokens = tokenizer(text)
tokens
{'input_ids': [504, 2644, 314, 216], 'attention_mask': [1, 1, 1, 1]}
  • input_ids: These are the numerical IDs that represent each token - this is what the model actually processes
  • attention_mask: Tells the model which input positions to pay attention to (it matters when inputs are padded); we can safely ignore it for now
# decode each token ID back into its text form
[tokenizer.decode([token_id]) for token_id in tokens['input_ids']]
['The', ' cat', ' is', ' ']

Token breakdown: 'The', ' cat', ' is', ' ' are the text forms of our four token IDs. Notice that the tokenizer doesn’t just split on spaces. It learns patterns from training data, so spaces become part of the tokens themselves (except for the first word).

Why Subword Tokenization? You might wonder why “cat” becomes ” cat” (with a space). Modern tokenizers use “subword” tokenization - they learn common patterns from millions of texts. A space before a word often signals it’s a separate concept, so the tokenizer treats ” cat” as one unit. This helps the model understand word boundaries and context better than just splitting on spaces.
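To see this in action, here’s a small sketch comparing “cat” with and without a leading space (the exact IDs are whatever this tokenizer happens to assign; we print them only to show that the two inputs map to different tokens):

# "cat" and " cat" are usually different tokens with different IDs
print(tokenizer("cat")["input_ids"])   # no leading space
print(tokenizer(" cat")["input_ids"])  # the leading space is part of the token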

print(tokenizer.special_tokens_map)
print(tokenizer.vocab_size)
{'bos_token': '<|im_start|>', 'eos_token': '<|im_end|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|im_end|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}
49152

The tokenizer has 49,152 unique tokens in its vocabulary.

The Foundation: From Words to Numbers

The Model’s Internal Thinking Process: When you see the logit values below, you’re literally looking at the model’s “thoughts”! Each position in our input gets its own set of predictions. The model isn’t just guessing the next word - it’s considering what could come after EVERY position. But for text generation, we only care about the very last position (after the final token of our prompt).

We have already converted the text into tokens; now we’ll feed them to the model to predict the next token.

import torch
input_tensor = torch.tensor([tokens['input_ids']])  # the model expects a PyTorch tensor as input
op = model(input_tensor)

We’ll focus on the logits for now and ignore the other parts of the output.

op.logits
tensor([[[16.3669,  8.8479, 12.6537,  ..., 10.0735, 13.1624,  5.4319],
         [19.1141, 12.7332, 15.7778,  ..., 20.1228, 19.1875, 10.3858],
         [10.1939,  1.3824,  4.2152,  ..., 13.2681, 12.5120,  2.1723],
         [16.5672,  7.1412, 11.3807,  ..., 14.1004, 13.9356,  7.9265]]],
       grad_fn=<UnsafeViewBackward0>)
op.logits.shape
torch.Size([1, 4, 49152])

Why So Many Numbers? 49,152 might seem like overkill, but remember - the model has to consider EVERY possible token it knows. This includes common words like “happy”, rare words like “sesquipedalian”, numbers, punctuation, and even tokens from other languages. Most will have very low scores, but the model still evaluates them all.
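To get a feel for that variety, here’s a tiny sketch that decodes a handful of arbitrary IDs from the vocabulary (which strings come out depends entirely on this tokenizer, so treat it as exploration rather than a fixed result):

# peek at a few arbitrary entries in the 49,152-token vocabulary
for token_id in [100, 1000, 10000, 40000]:
    print(token_id, repr(tokenizer.decode([token_id])))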

print(f"Logits shape: {op.logits.shape}")
print("Shape breakdown: [Batch_size, Sequence_length, Vocab_size]")
print(f"[{op.logits.shape[0]}, {op.logits.shape[1]}, {op.logits.shape[2]}]")
print(f"- Batch: {op.logits.shape[0]} text(s) processed")
print(f"- Sequence: {op.logits.shape[1]} tokens in input") 
print(f"- Vocab: {op.logits.shape[2]:,} possible next tokens")
Logits shape: torch.Size([1, 4, 49152])
Shape breakdown: [Batch_size, Sequence_length, Vocab_size]
[1, 4, 49152]
- Batch: 1 text(s) processed
- Sequence: 4 tokens in input
- Vocab: 49,152 possible next tokens
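As mentioned earlier, the model produces a full set of scores at every input position, not just the last one. Here’s a small sketch that decodes the greedy pick at each position to make that concrete (the actual predictions depend on the model):

# greedy prediction at every input position, not just the last one
per_position_ids = op.logits[0].argmax(dim=-1)  # one predicted token ID per input token
for inp_id, pred_id in zip(tokens['input_ids'], per_position_ids):
    print(f"after {tokenizer.decode([inp_id])!r} -> predicts {tokenizer.decode([pred_id.item()])!r}")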

Watching the Model Think: From Raw Scores to Probabilities

  1. Converting logits to probabilities using softmax
  2. Demonstrating why we can’t just pick the highest logit every time
  3. A simple example of sampling vs greedy selection
# Extract logits for the last token position (where next token will be predicted)
last_token_logits = op.logits[:, -1, :]  # Shape: [1, 49152]

# Find the token with highest probability (greedy selection)
predicted_token_id = last_token_logits.argmax(dim=-1)  # Gets index of max value

# convert the id to token
next_token = tokenizer.decode(predicted_token_id)

print(f"predicted_token_id : {predicted_token_id.item()}")
print(f"next token : `{next_token}`")
predicted_token_id : 33
next token : `1`
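Before moving on, it helps to see those raw scores as probabilities. Here’s a minimal sketch that applies softmax to the last position’s logits and lists the top few candidates (the exact tokens and probabilities depend on the model):

import torch.nn.functional as F

# convert the last-position logits into probabilities and show the top 5 candidates
probs = F.softmax(last_token_logits[0], dim=-1)
top_probs, top_ids = torch.topk(probs, 5)
for p, tid in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([tid.item()])!r}: {p.item():.3f}")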

The Art of Selection: Why Randomness Matters

So far, we’ve done text to predictions. But here’s the thing - if we always select the token with the highest logit score (greedy selection), our model becomes predictable and boring. It’s like having a conversation with someone who always gives the most obvious response!

This is where sampling comes in. Instead of always picking the #1 choice, large language models introduce some randomness by selecting from the top-K highest scoring tokens, where K is a number we can control.

Two parameters control this sampling (see the quick sketch after this list):

  1. Temperature is like a creativity dial: 0.1 = very predictable, 1.5 = very creative
  2. Top-k means ‘only consider the k most likely tokens’ - it saves computation and improves quality
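Here’s a tiny sketch of what temperature does to a distribution: dividing the logits by a small temperature sharpens it toward the top choice, while a large temperature flattens it (the three logit values below are made up purely for illustration):

import torch
import torch.nn.functional as F

fake_logits = torch.tensor([2.0, 1.0, 0.5])  # made-up scores for three candidate tokens
for temp in [0.1, 1.0, 1.5]:
    probs = F.softmax(fake_logits / temp, dim=-1)
    print(temp, [round(p.item(), 3) for p in probs])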

import torch.nn.functional as F

def generate_next_token(text, temperature=1.0, top_k=50):
    """Simple function to show one step of text generation"""
    # Tokenize input
    tokens = tokenizer(text, return_tensors="pt")
    
    # Get model predictions
    with torch.no_grad():
        outputs = model(**tokens)
    
    # Get logits for next token prediction
    next_token_logits = outputs.logits[0, -1, :] / temperature
    
    # Get top-k most likely tokens
    top_logits, top_indices = torch.topk(next_token_logits, top_k)
    
    # Convert to probabilities and sample 
    probs = F.softmax(top_logits, dim=-1)

    # randomly sample one token from the top-k probabilities
    next_token_idx = torch.multinomial(probs, 1)
    
    next_token_id = top_indices[next_token_idx]
    
    return tokenizer.decode(next_token_id)
[generate_next_token("The cat is", 0.7) for _ in range(4)]
[' very', ' a', ' standing', ' also']

Notice how we get different tokens each time? That’s the beauty of sampling - it prevents boring, repetitive text! Think of temperature as a dial for how random the selection is, while top-k sets the size of the candidate window to sample from.
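To feel the difference, here’s a quick sketch that samples the same prompt at a low and a high temperature (the specific tokens will vary from run to run, but the low-temperature picks should cluster around the most likely continuations):

# low temperature: picks stay close to the most likely tokens
print([generate_next_token("The cat is", temperature=0.2) for _ in range(4)])
# high temperature: picks spread out over more candidates
print([generate_next_token("The cat is", temperature=1.5) for _ in range(4)])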

Putting It All Together: Building Our Generator

To keep generating, we append each new token to the end of the text and feed the whole thing back into the model.

def generate_text(prompt, max_tokens=10):
    current_text = prompt
    for i in range(max_tokens):
        next_token = generate_next_token(current_text, temperature=0.7)
        current_text += next_token
        print(f"Step {i+1}: {current_text}")
    return current_text
generate_text(text, 4)
Step 1: The cat is 1
Step 2: The cat is 10
Step 3: The cat is 100
Step 4: The cat is 100%
'The cat is 100%'
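In practice you rarely write this loop by hand: the transformers library wraps the same predict-sample-append cycle inside model.generate. Here’s a minimal sketch of the equivalent call, with parameter values that simply mirror the ones we used above:

# the built-in equivalent of our manual generation loop
inputs = tokenizer("The cat is ", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=True,   # sample instead of always taking the top token
    temperature=0.7,
    top_k=50,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))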

The Magic Revealed: What We’ve Learned

Congratulations! You’ve just built and understood the core engine that powers every large language model. What seemed like magic is actually a surprisingly elegant process:

The Complete Picture:

  1. Text becomes numbers - Tokenization converts human language into mathematical representations
  2. Pattern recognition at scale - The model evaluates 49,152 possibilities for every single prediction
  3. Controlled randomness - Temperature and top-k sampling prevent boring, repetitive outputs
  4. Iterative generation - This simple loop repeats to create coherent, contextual text

Why This Matters: Every time you chat with ChatGPT, Claude, or any AI assistant, this exact process runs behind the scenes. The model isn’t “thinking” in human terms - it’s performing incredibly sophisticated pattern matching based on billions of text examples it learned from.

The Bigger Picture: This same fundamental process scales from our tiny 135M parameter model to massive systems with hundreds of billions of parameters. The core loop remains the same: predict, sample, add, repeat.

Understanding this gives you insight into why LLMs sometimes hallucinate (they’re optimizing for plausible patterns, not truth), why they can be creative (controlled randomness), and why context matters so much (each prediction builds on everything before it).