SFT from First Principles

From random next token to instruction following
📅 04/02/2026

Open In Colab

Introduction — Why SFT matters in the RLHF pipeline

Building an LLM can be divided into two main stages: pre-training and post-training. During pre-training, the model processes raw text and learns to predict the next token — a vital step in every LLM. But here’s the catch: a pre-trained model is great at completing text, but terrible at following instructions. Ask it a question, and it might just continue your sentence instead of answering!

That’s where post-training comes in. The first step is Supervised Fine-Tuning (SFT), also called instruction fine-tuning. This is where the model learns to follow instructions and respond in a conversational format — absorbing the style, structure, and behavior we expect from a helpful assistant.

This blog is inspired by Chapter 4: Instruction Tuning from Nathan Lambert’s RLHF Book.

Note: To run the notebook in Colab, select a GPU runtime.

The Experiment

Objective: Implement SFT from first principles to understand what happens under the hood.

Model: HuggingFaceTB/SmolLM2-135M — small enough to train on free Colab GPU

We’ll write the key data processing steps ourselves and use HuggingFace’s Trainer for the training loop.

#!pip install -q transformers datasets torch wandb
import torch
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
dtype
torch.bfloat16
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=dtype,
    #low_cpu_mem_usage=True,
    device_map='auto'
    )

Let’s download the dataset. We load only the test split — the full dataset is huge, and waiting on it would spoil the spirit of quick experimentation.

from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="test_sft")
ds
Dataset({
    features: ['prompt', 'prompt_id', 'messages'],
    num_rows: 23110
})

Remove unneeded columns.

ds = ds.remove_columns(['prompt', 'prompt_id'])
ds
Dataset({
    features: ['messages'],
    num_rows: 23110
})
ds = ds.select(range(2000))

# First split: 90% train, 10% temp
split1 = ds.train_test_split(test_size=0.1, seed=42)
train_ds = split1['train']

# Second split: 50/50 on the 10% → 5% val, 5% test
split2 = split1['test'].train_test_split(test_size=0.5, seed=42)
val_ds = split2['train']
test_ds = split2['test']

print(f"Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")
Train: 1800, Val: 100, Test: 100
from datasets import DatasetDict

ds = DatasetDict({
    'train': train_ds,
    'eval': val_ds,
    'test': test_ds
})
ds
DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 1800
    })
    eval: Dataset({
        features: ['messages'],
        num_rows: 100
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 100
    })
})

Chat Template

When working with language models, we need a structured way to format conversational text. Messages are formatted with roles to indicate who is speaking. Common roles include:

  • user — the human’s input/question
  • assistant — the model’s response
  • system — a fixed instruction that guides the model’s behavior throughout the conversation

Special tokens to mark boundaries:

  1. <bos_token> — beginning of sequence
  2. <eos_token> — end of sequence
  3. <pad_token> — pads sequences to uniform length
  4. <|im_start|> — marks the beginning of a message
  5. <|im_end|> — marks the end of a message
tokenizer.special_tokens_map
{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}
assert tokenizer.pad_token is None
assert tokenizer.chat_template is None

Why Skip bos_token and eos_token?

SmolLM2 uses the same token <|endoftext|> for bos_token, eos_token, and unk_token. Using the same token to signal both “start” and “stop” would confuse the model during training.

Instead, we rely on:

  • <|im_start|> — marks the beginning of each turn
  • <|im_end|> — marks the end of each turn (and acts as the stop signal during generation)
  • <|endoftext|> — used only as pad_token to make batches uniform (masked out during training anyway)

This way, the model learns to generate <|im_end|> when it’s done responding.

chat_template = """{% for message in messages %}<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}"""

tokenizer.chat_template = chat_template
tokenizer.pad_token = tokenizer.eos_token
msgs = [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello there. How are you!!!"}]
print(tokenizer.apply_chat_template(msgs, tokenize=False))
<|im_start|>user
Hi<|im_end|>
<|im_start|>assistant
Hello there. How are you!!!<|im_end|>

This format is commonly called ChatML (Chat Markup Language).
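To make the template's behavior concrete, here is a minimal pure-Python rendering of the same ChatML format (no Jinja needed) — a sanity check of what `apply_chat_template` produces, with `render_chatml` being a hypothetical helper name:

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in ChatML, mirroring the Jinja template above."""
    out = ""
    for m in messages:
        # Each turn: role header, content, end-of-turn marker
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn for the model to complete
        out += "<|im_start|>assistant\n"
    return out

msgs = [{"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello there. How are you!!!"}]
print(render_chatml(msgs))
```

This should print exactly the same string as the `apply_chat_template` call above.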

Prompt Masking

The key idea: we want the model to learn to generate assistant responses, not to memorize user messages. So we mask user turns by setting their labels to -100 (PyTorch’s ignore index for cross-entropy loss).

The steps are:

  1. Find where user turns start/end using token markers (<|im_start|>user\n, <|im_start|>assistant\n)
  2. Copy input_ids into a column named labels (the HF Trainer API expects the targets under that name) and replace the user portion with -100

We keep attention_mask unchanged so the model still sees the user input, it just doesn’t train on it.

user_marker = tokenizer.encode("<|im_start|>user\n", add_special_tokens=False)
assistant_marker = tokenizer.encode("<|im_start|>assistant\n", add_special_tokens=False)
print(f"User marker: {user_marker}")
print(f"Assistant marker: {assistant_marker}")
User marker: [1, 4093, 198]
Assistant marker: [1, 520, 9531, 198]

The process function implements the masking logic.

def process(msg: list):
    """
    Args:
        msg: List of message dicts with 'role' and 'content' keys

    Returns:
        Dict with 'input_ids', 'attention_mask', and 'labels'
        (user turns masked with -100)
    """
    # Apply chat template to format messages with role markers
    msg = tokenizer.apply_chat_template(msg, tokenize=False)

    # Tokenize the formatted string
    t = tokenizer(msg)
    size = len(t['input_ids'])

    # Find all positions where user turns begin
    u_len = len(user_marker)
    start = [i for i in range(size - u_len + 1) if t['input_ids'][i:i+u_len] == user_marker]

    # Find all positions where assistant turns begin (end of the masked span)
    a_len = len(assistant_marker)
    end = [i + a_len for i in range(size - a_len + 1) if t['input_ids'][i:i+a_len] == assistant_marker]

    # Copy input_ids to labels — we'll modify this
    labels = t['input_ids'].copy()

    # Mask user turns: replace tokens from user start to assistant start with -100
    for i, j in zip(start, end):
        labels[i:j] = [-100] * (j-i)

    # Return tokenized inputs plus masked labels
    return {
        **t,
        'labels': labels
    }
print(f"Conversation with {len(msgs)} turns:\n")
for m in msgs:
    print(f"[{m['role']}]: {m['content'][:80]}")
Conversation with 2 turns:

[user]: Hi
[assistant]: Hello there. How are you!!!
result = process(msgs)
print(f"input_ids: {len(result['input_ids'])}")
print(f"labels: {len(result['labels'])}")
print(f"masked tokens: {result['labels'].count(-100)}")
input_ids: 19
labels: 19
masked tokens: 10

Prompt masking: user tokens (red) are masked, assistant tokens (green) are trained
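The marker-search logic is easier to see on toy data. This sketch reuses the same scan-and-mask steps as process(), but with made-up token ids (the marker values here are hypothetical, not SmolLM2’s real vocabulary):

```python
# Hypothetical stand-ins for the real marker token ids
user_marker = [1, 50, 9]       # "<|im_start|>user\n"
assistant_marker = [1, 60, 9]  # "<|im_start|>assistant\n"

# user turn: [11, 12] + end marker 2; assistant turn: [21, 22, 23] + end marker 2
input_ids = user_marker + [11, 12, 2] + assistant_marker + [21, 22, 23, 2]
size = len(input_ids)

# Scan for marker positions, exactly as in process()
starts = [i for i in range(size - len(user_marker) + 1)
          if input_ids[i:i+len(user_marker)] == user_marker]
ends = [i + len(assistant_marker) for i in range(size - len(assistant_marker) + 1)
        if input_ids[i:i+len(assistant_marker)] == assistant_marker]

# Mask everything from the user marker up to the end of the assistant marker
labels = input_ids.copy()
for i, j in zip(starts, ends):
    labels[i:j] = [-100] * (j - i)

print(labels)  # only the assistant content (and its <|im_end|>) is trained on
```

Everything up to and including the assistant header is masked; the loss is computed only on the assistant’s tokens.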

Custom Collator

To train on the GPU, every sequence in a batch must have the same length. To achieve this we:

  1. pad input_ids at the end with the padding token
  2. pad labels with -100 rather than the padding token, otherwise the loss would be computed on padding tokens, wasting training signal
  3. pad attention_mask with 0 so the model ignores those positions

We implement this in a data collator, which the Trainer API calls on each batch just before it reaches the model, so it must return PyTorch tensors. Its input is a batch: a list of dicts.

def collator(batch):
    """
    Args:
        batch: List of dicts, each with 'input_ids', 'attention_mask', 'labels'

    Returns:
        Dict of PyTorch tensors ready for the Trainer
    """
    # Find the longest sequence in the batch
    max_len = max(len(ex['input_ids']) for ex in batch)

    input_ids = []
    attention_mask = []
    labels = []

    for ex in batch:
        pad_len = max_len - len(ex['input_ids'])

        # Pad input_ids with pad_token_id
        input_ids.append(ex['input_ids'] + [tokenizer.pad_token_id] * pad_len)

        # Pad attention_mask with 0 (model ignores these positions)
        attention_mask.append(ex['attention_mask'] + [0] * pad_len)

        # Pad labels with -100 (ignored by cross-entropy loss)
        labels.append(ex['labels'] + [-100] * pad_len)

    # Convert to tensors for PyTorch
    return {
        'input_ids': torch.tensor(input_ids),
        'attention_mask': torch.tensor(attention_mask),
        'labels': torch.tensor(labels),
    }
ds = ds.map(lambda x: process(x['messages']), remove_columns=['messages'])
ds
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1800
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 100
    })
})

Vibe check

We test the base model on a few arbitrary prompts to get a feel for its behavior.

lis = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Who are you?"}],
]

def vibe_check(prompt):
    # Format the conversation and open an assistant turn for the model to complete
    inputs = tokenizer.apply_chat_template(
        prompt,
        return_tensors="pt",
        return_dict=True,  # explicit, so we get a dict with input_ids + attention_mask
        add_generation_prompt=True
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            repetition_penalty=1.2,
            eos_token_id=tokenizer.eos_token_id
        )

    resp = tokenizer.decode(output_ids[0], skip_special_tokens=False)
    print(resp)

_= [vibe_check(i) for i in lis]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
what does a mason do for his job in order to keep it clean and safe. he makes sure that there are no dirt or bugs on walls, windows, doors etc.?he also takes care over all materials used inside buildings such as cement ,
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
whose name is not given to them.who do they think this personis a soldier or some other officer of the army?"asked an old woman."No, sir; I am not sure," she replied;"but we were all together when the
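The sampling knobs passed to generate above (temperature, then nucleus/top-p filtering) can be sketched in pure Python over a toy next-token distribution — the probabilities here are made up for illustration:

```python
# Toy next-token distribution (hypothetical probabilities)
probs = {"Paris": 0.5, "London": 0.2, "the": 0.15, "a": 0.1, "zebra": 0.05}

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose cumulative mass >= top_p."""
    items = sorted(probs.items(), key=lambda kv: -kv[1])
    kept, cum = [], 0.0
    for tok, p in items:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break  # nucleus found; everything else is discarded
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalize survivors

print(top_p_filter(probs, 0.9))
```

With top_p=0.9, the long tail ("zebra") is cut off and the remaining probabilities are renormalized before sampling — this is why top-p avoids rare, incoherent tokens while still allowing variety.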

Training

Setup for logging.

import wandb

wandb.login()
wandb: WARNING If you're specifying your api key in code, ensure this code is not shared publicly.

wandb: WARNING Consider setting the WANDB_API_KEY environment variable, or running `wandb login` from the command line.

wandb: [wandb.login()] Using explicit session credentials for https://api.wandb.ai.

wandb: No netrc file found, creating one.

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

wandb: Currently logged in as: tripathysagar08 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
True

Memory optimization notes:

  • gradient_checkpointing=True — trades compute for memory. Instead of storing all intermediate activations during the forward pass, it recomputes them during backpropagation. This reduces GPU memory usage significantly (~30-50%) but makes training ~20-30% slower.

  • per_device_train_batch_size=4 with gradient_accumulation_steps=8 — effective batch size is 4 × 8 = 32. The smaller per-device batch fits in GPU memory, while gradient accumulation simulates a larger batch by accumulating gradients over multiple forward/backward passes before each weight update.
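A quick arithmetic sketch of why accumulation works: averaging per-example gradients over 8 micro-batches of 4 gives the same result as one batch of 32 (toy scalar “gradients” here; the Trainer actually scales the loss before backward, which is equivalent):

```python
# 32 examples' per-example gradients (toy scalars standing in for real grads)
per_example_grads = list(range(32))

# One big batch of 32: average all gradients at once
big_batch_grad = sum(per_example_grads) / 32

# 8 micro-batches of 4: accumulate each micro-batch mean, then average
accum = 0.0
for step in range(8):
    micro = per_example_grads[step * 4:(step + 1) * 4]
    accum += sum(micro) / 4          # like loss.backward() adding into .grad
grad_accum_grad = accum / 8          # normalize by accumulation steps

print(big_batch_grad, grad_accum_grad)  # identical
```

The optimizer step sees the same averaged gradient either way; only peak memory differs.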

from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="./sft_output",
    #max_steps=1,
    num_train_epochs=2,
    per_device_train_batch_size=4, # Further reduced batch size to save memory
    gradient_accumulation_steps=8, # Increased to compensate for smaller batch size
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=12,
    weight_decay=0.1,
    max_grad_norm=1.0,

    logging_steps=10,
    eval_strategy="steps",
    save_strategy="steps",
    eval_steps=10,
    save_steps=10,

    gradient_checkpointing=True, # Added to further reduce memory consumption at the cost of speed

    load_best_model_at_end=True,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    metric_for_best_model="eval_loss",

    save_total_limit=2,

    report_to="wandb", # none or trackio
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds['train'],
    eval_dataset=ds['eval'],
    data_collator=collator,
)
trainer.train()
Tracking run with wandb version 0.24.0
Run data is saved locally in /content/wandb/run-20260204_074310-a5r5z066
[114/114 11:21, Epoch 2/2]
Step Training Loss Validation Loss
10 1.602411 1.565818
20 1.530622 1.508247
30 1.525805 1.487836
40 1.508146 1.478621
50 1.526377 1.473545
60 1.468810 1.470501
70 1.435371 1.469588
80 1.413604 1.468731
90 1.483781 1.468555
100 1.450739 1.468438
110 1.438403 1.468450

There were missing keys in the checkpoint model loaded: ['lm_head.weight'].
TrainOutput(global_step=114, training_loss=1.4891439320748312, metrics={'train_runtime': 688.8018, 'train_samples_per_second': 5.226, 'train_steps_per_second': 0.166, 'total_flos': 4313212071187968.0, 'train_loss': 1.4891439320748312, 'epoch': 2.0})

Result

_= [vibe_check(i) for i in lis]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The following them even anything anywhere anyone found that could make it in a few above, but not any reason for me no equal with my own choices:4. He made people who may certainly said I see nothing far behind us all our answers to be
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
I never take any anything here even feel them through it may make everything that made in various people, but all above a variety of these answers:0 acres out nothing behind the entire head. The following individuals come over 24thurism," no

The best model does not seem to have been loaded despite load_best_model_at_end=True (note the missing lm_head.weight warning above) — possibly a bug. So we reload the model manually from the best checkpoint.

print(trainer.state.best_model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    trainer.state.best_model_checkpoint,
    device_map='auto',
    dtype=dtype)
./sft_output/checkpoint-100
_= [vibe_check(i) for i in lis]
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital city of France, located in southern Europe. It is known for its rich history and stunning architecture. The capital was originally called Paris but has since become synonymous with French culture due to its role as a hub for art, literature, music,
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
I'm a human being. I am beautiful, generous and kind-hearted.?|| 0 stars (1 vote)%||

25 years ago someone suggested that we should start using the word "human" in our name to describe ourselves or others

Why the results are still underwhelming:

  1. Dataset size — 1,800 training samples is tiny for SFT (production models use 10k–100k+)
  2. Model size — 135M parameters is very small; larger models learn faster
  3. Training duration — only 2 epochs

So what did improve? Compare the before/after outputs — the model now tries to answer the question (mentions “Paris”, “France”, “capital”) instead of producing completely unrelated text about masons or soldiers. That’s SFT working! With more data and epochs, responses would become coherent.

Conclusion

In this experiment, we implemented SFT from first principles using the HuggingFace Trainer API. The approach is straightforward: we steer a pre-trained model toward following simple instructions by training on conversation data with masked user turns.

Key takeaways:

  1. SFT is about behavior, not knowledge — with curated data, models absorb different styles and response patterns
  2. Prompt masking matters — we only train on assistant responses, not user inputs
  3. Balance is critical — too little training and the model won’t adapt; too much leads to catastrophic forgetting, where the model loses its pre-trained capabilities and performs worse overall

This is just the first step in the RLHF pipeline. Next comes reward modeling and reinforcement learning to further align the model with human preferences.