Binary Beats Tiered: GRPO Reward Design on GSM8K

A 0.5B model goes from 25% to 50% on grade-school math — and the simpler reward wins.

I wanted to test a hypothesis: to train a SLM(Small language model) to solve basic math problems. To achieve this, I implemented an RL pipeline with SFT followed by RLVR(RL with Verifiable Reward). I picked Qwen2.5-0.5B - small enough to LoRA fine-tune on a free Colab GPU and GSM8K as the training dataset. The process of RLVR requires the model to generate multiple generations, then use a deterministic function that scores each generation. The model then tries to maximize the reward. This post walks through building the full end-to-end RL training and reasoning behind it.

Spoiler alert: the binary reward won. The rest of this post is about why, and what I learned along the way.

The Problem

To teach a Small Model to Reason : GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm from DeepSeek. The core idea: generate multiple responses to the same prompt, score them with a reward function, and update the policy(model) to favor the better ones — no critic model needed. The “group relative” part means rewards are normalized within each group, so the model learns from relative quality, not absolute scores. Simple, elegant, and cheap to run.

But here’s the open question: what reward signal do you actually give it?

Reason behind picking GSM8K as it’s ideal for testing reward design. The answers are just numbers — easy to verify programmatically. The problems require multi-step reasoning, so the model can’t just pattern-match. And everyone benchmarks on it, so the results are comparable.

Pipeline: Base → SFT → GRPO

Setup

Component	Details
Base model	Qwen2.5-0.5B
Algorithm	SFT → GRPO (two-stage)
LoRA	r=32, α=16 (SFT) / r=16, α=32 (GRPO) — higher effective scaling (α/r = 2) during GRPO to amplify reward signal
Train data	1024 GSM8K examples
Eval	660 examples (50% of GSM8K test)
Hardware	Colab T4/L4 GPU

The stack: trl and transformers for SFT and GRPO training, peft for LoRA, torch as the backend, wandb for experiment tracking, and gsm8k_utils — a small custom library wrapping prompt formatting, answer extraction, and the GRPOExperiment class for reproducible runs.

The first step was formatting GSM8K into a chat template — system prompt + user question + assistant response ending with The answer is: {number}. This gives the model a consistent structure to learn from during SFT, and a parseable format to reward during GRPO.

Experiment Iteration

Stage 1: SFT — Teaching the Format

Before throwing RL at this model, it needed to learn the format — how to take a question, reason through it step-by-step, and land on The answer is: {number}. That’s what SFT does here: not teaching math, just teaching manners or following given instructions.

Suspiciously good baseline. The base model scored 24.39% on 660 eval examples before any training. For a 0.5B model, that’s surprisingly high — there’s a decent chance GSM8K leaked into its pre-training data. I noted it and moved on.

Light touch. I fine-tuned with LoRA (r=32, α=16) on just 1024 examples for 1 epoch. More epochs actually hurt — the loss kept dropping but eval accuracy degraded, a classic sign of overfitting/catastrophic forgetting. One epoch was the sweet spot: 24.39% → 31.36%.

Vibe check. The model gets the right answer… and then doesn’t know when to stop:

The answer is: 126.książka
The answer is: 126.książka
The answer is: 126.książka
...

It picked up the format and the reasoning, but its generation quality is terrible — repetition loops, garbage tokens after the answer. This is fine for eval (we just parse the first number), but it shows why we need GRPO: SFT taught the format, now RL needs to teach quality.

Stage 2: Binary vs Tiered Reward

GRPO needs a reward signal to rank its rollouts. I tested two: a carefully tiered partial-credit scheme, and a dead-simple binary (right/wrong).

My hunch: tiered rewards might encourage reward hacking — the model optimizing for easy partial credit rather than actually solving the problem. Let’s see.

Tiered reward (partial credit based on closeness + format bonuses):

| Condition | Reward | |—|—| | Correct answer + correct format | 8.0 | | Correct answer + no format | 3.2 | | Wrong answer + correct format | 1.6 + 1.2 × closeness | | Wrong answer + no format | 1.2 × closeness | | No number extracted | 0.0 |

Where closeness = max(0, 1 - |answer - gold| / max(|gold|, 1)) — ranges from 0 (way off) to 1 (spot on), giving partial credit for being in the right ballpark.
- Pre-GRPO: 25.15% → Post-GRPO: 33.18%
Binary reward (just 1.0 or 0.0):

| Condition | Reward | |—|—| | Correct answer | 1.0 | | Incorrect answer | 0.0 |
- Pre-GRPO: 25.76% → Post-GRPO: 35.91%

(The slightly different baselines — 25.15% vs 25.76% — are from separate runs with different random seeds.)

Binary beat tiered by ~2.7 percentage points (35.91% vs 33.18%). The cleaner signal won — no partial credit to game, just right or wrong.

Stage 3: Hyperparameter Sweep

Binary reward locked in. Next question: what hyperparameters actually matter?

I swept four parameters one-at-a-time against a base config (max_steps=100, max_completion_length=512), 10 runs total on a Colab T4:

Parameter	Values tested	Best	Accuracy
`beta` (KL penalty)	0.01, 0.02, 0.04	0.04	37.42%
`learning_rate`	1e-5, 5e-5, 1e-4	5e-5	35.61%
`num_generations`	8, 16	16	32.58%
`sft_frac`	0.0, 0.5, 1.0	1.0	33.79%

A few things jumped out:

lr=1e-4 was catastrophic — accuracy dropped to 26.82%, below the SFT baseline. Too aggressive.
sft_frac=0.0 also hurt (27.73%) — pure GRPO without any SFT loss mixing destabilized training. Keeping the SFT loss as a regularizer matters.
beta=0.04 was the single biggest winner — a higher KL penalty kept the model from drifting too far from the SFT checkpoint.
num_generations=16 was chosen over 8 despite 8 scoring slightly higher in the sweep (32.58% vs 31.67%). More generations per step means faster iteration through the dataset — the model sees more of the training distribution sooner. The sweep difference was within noise, but 16 scales better with longer training.

Beta Sweep

Best combined config for the final run (200 steps): beta=0.04, lr=5e-5, num_generations=16, sft_frac=1.0.

Stage 4: Multi-Seed Validation

RL training is messy. The same config can land in different spots depending on random seed — LoRA init, data shuffling is fixed for uniformity of the experiment, rollout sampling all introduce variance. One good run doesn’t prove anything.

So I ran the best config (beta=0.04, lr=5e-5, num_generations=16, sft_frac=1.0) for 200 steps across three seeds:

Seed	Accuracy
42	40.00%
1337	37.27%
7	36.06%
Mean ± std	37.78% ± 2.61%

Multi-Seed Validation

That ~5% spread between best and worst seed is real — at 0.5B parameters, you’re at the mercy of the training lottery. But even the worst seed (34.85%) comfortably beats SFT alone (31.36%), so the signal is real. GRPO is consistently helping, even if the magnitude varies.

Why 200 steps? The sweep runs used 100 steps. Doubling it let the best config converge further — and helped separate configs that plateau early from ones that keep climbing.

Stage 5: Scaling Up Training

The sweep used 1024 examples and 100–200 steps. Time to take the best config and throw the full dataset at it.

One accidental discovery: the full run used beta=0.02 instead of the sweep-winning 0.04 — a copy-paste oversight. Despite this suboptimal setting, the model still hit nearly 50%. The right beta might push it even higher.

	Sweep (best seed)	Full run
Train data	1024 examples	~7,473 (full GSM8K)
Steps	200	1,000
Effective samples	~3,200	~32,000 (~4.3 epochs)
Accuracy	40.00%	49.70%

Full Dataset GRPO

That’s a 0.5B model hitting nearly 50% on GSM8K — double the base model’s 24.39%.

Vibe check. The generations are noticeably cleaner now:

In April, Natalia sold 48 clips. In May, she sold 48/2 = 24 clips.
Altogether, 48 + 24 = 72 clips.
The answer is: 72.柃

Still a garbage token at the end (柃 this time — the książka has evolved), but the reasoning chain is coherent, concise, and correct. The model learned to think, even if it hasn’t learned to stop.

Lessons Learned

The double format_prompt bug. My early eval code pre-formatted prompts with the chat template, then the eval function wrapped them again — double-wrapping. The model was being evaluated on garbled inputs, and I didn’t notice until I refactored into gsm8k_utils and the numbers shifted. Lesson: if your eval function formats inputs internally, don’t pre-format them too.

max_length=256 silently ate my best examples. SFT was configured with max_length=256 tokens, but ~15% of training examples (153/1024) exceeded that — the hardest problems with the longest reasoning chains. Exactly the ones I most wanted the model to learn from. Bumping to 512 fixed it. Always check how much of your data survives truncation.

RL variance is no joke at 0.5B. Three seeds, same config: 36.06%, 37.27%, 40.00%. A ~4% spread. At this scale, a single run can easily mislead you. Always validate with multiple seeds before drawing conclusions.

Code

All notebooks, configs, and gsm8k_utils are on GitHub: tripathysagar/rlhf-gsm8k

Training logs on W&B: grpo-gsm8k | grpo-gsm8k-full