Small LLMs Can’t Add—But They Can Learn to Ask
I wanted to test a hypothesis: can a small language model master integer addition through training? Given two n-digit integers, can it learn to handle carries and digit alignment? I chose a base model (not an Instruct variant) because it is trained mostly to predict the next token rather than to follow instructions or solve complex tasks. I picked SmolLM2-135M for this experiment because it is small and efficient enough to run many experiments on a Colab T4.
Spoiler alert: pure chain-of-thought training hit a hard ceiling, but tool use opened a surprising path forward.
Notebooks
| Description | Link |
|---|---|
| COT training experiment | |
| Tool use SFT experiment | |
Chain-Of-Thought Experiments
Data format
As the experiment is small, we can generate synthetic data for addition, which gives us more control. Each example follows this structure:
- pad the shorter number with leading zeros
- initialize carry = 0
- add the columns from the rightmost digit (least significant), including the previous carry
- extract the result digit and the new carry
- repeat the previous step for all digit columns
- if the final carry > 0, it becomes the leading digit

A simple COT for adding 1 and 99 is broken down as: 01 + 99; col1: 1+9+0=10, write 0 carry 1; col2: 0+9+1=10, write 0 carry 1; final carry 1; answer=100
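This generation step maps directly to code. Below is a minimal sketch of a trace generator for the spelled-out format in the example above; the exact string layout is an assumption, and the notebook's trained format (visible in the results later) is more compact.

```python
# A minimal sketch of a CoT trace generator for the spelled-out format above
# (assumed layout; the notebook's trained format is more compact).
def cot_trace(a: int, b: int) -> str:
    sa, sb = str(a), str(b)
    width = max(len(sa), len(sb))
    sa, sb = sa.zfill(width), sb.zfill(width)      # pad the shorter number with leading zeros
    steps, carry = [], 0
    # walk the columns from least significant to most significant
    for col, (da, db) in enumerate(zip(reversed(sa), reversed(sb)), start=1):
        total = int(da) + int(db) + carry
        digit, carry_out = total % 10, total // 10
        steps.append(f"col{col}: {da}+{db}+{carry}={total}, write {digit} carry {carry_out}")
        carry = carry_out
    trace = f"{sa} + {sb}; " + "; ".join(steps)
    if carry > 0:                                  # final carry becomes the leading digit
        trace += f"; final carry {carry}"
    return trace + f"; answer={a + b}"

print(cot_trace(1, 99))
# 01 + 99; col1: 1+9+0=10, write 0 carry 1; col2: 0+9+1=10, write 0 carry 1; final carry 1; answer=100
```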
Training Setup
I used SmolLM2-135M (base model) with the following configuration:
- Epochs: 5
- Batch size: 32
- Learning rate: 5e-4

For data, I generated synthetic addition problems across digit ranges:
- 2-digit: 500 train / 50 val / 50 test
- 3-digit: 6000 train / 550 val / 550 test

I ran two experiments:
1. Train on 1-4 digit combinations, test on 5 digits
2. Train on 1-5 digit combinations, test on 6 digits
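For reference, here is a hedged sketch of what the fine-tuning run could look like with these hyperparameters using the standard transformers Trainer; the notebook may wire this up differently, and `train_ds` / `val_ds` are placeholder tokenized datasets of CoT examples.

```python
# A hedged sketch of the CoT fine-tuning run with the stated hyperparameters.
# train_ds / val_ds are placeholder tokenized datasets (input_ids is enough,
# the collator derives labels for causal LM training).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_nm = "HuggingFaceTB/SmolLM2-135M"
tok = AutoTokenizer.from_pretrained(model_nm)
tok.pad_token = tok.pad_token or tok.eos_token     # ensure a pad token for batching
model = AutoModelForCausalLM.from_pretrained(model_nm)

args = TrainingArguments(
    output_dir="cot-addition",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=5e-4,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```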
In-range Results
Within the training digit range, the model achieved near-perfect accuracy. It learned the COT format correctly:
10000 + 999 = 10000+00999, 0+9=9→9c0, 0+9=9→9c0, 0+9=9→9c0, 0+0=0→0c0, 1+0=1→1c0, →10999
The column-by-column reasoning and carry tracking worked flawlessly for problems within the trained range.
Out-of-range Failure
Testing on numbers beyond the training range revealed a hard ceiling. The model systematically truncated inputs to the maximum trained digit length rather than padding leading zeros:
222222 + 22222 = 22222+22222, 2+2=4→4c0, 2+2=4→4c0, 2+2=4→4c0, 2+2=4→4c0, 2+2=4→4c0, →44444
Expected answer: 244444. The model dropped the leading 2 from 222222, reducing it to a 5-digit problem it knew how to solve.
Key observation: The model perfectly memorized the COT pattern but couldn’t extend it beyond its training distribution.
What worked: The model correctly extracted and aligned digits from the input.
What failed: Following through on the multi-step addition algorithm.
Looking at the failure more carefully: the model did correctly identify both numbers from the input. The breakdown happened at the carry logic, not the parsing. This suggests the model’s strength lies in pattern recognition and extraction, not multi-step computation. Meanwhile, a CPU handles arithmetic trivially. So rather than teaching the model to perform addition, why not teach it to extract the operands and delegate the computation? This is the core idea behind tool use.
Tool Use Experiment
Data Format
Instead of tracing the full addition, the scope of the data is now much narrower: the input a + b is converted to <tool>add(a, b)</tool>. Note: the model's tokenizer has no special token for tool use, so I used raw strings rather than expanding the vocabulary, keeping the embeddings fixed.
To give the data more variety, each example randomly selects one of the following prompt formats:
formats = [
"{a} + {b}",
"Add: {a}, {b}",
"{a} + {b} = ",
"{a}+{b}=?",
"sum of {a} and {b}",
"What is {a} + {b}?",
"{a} plus {b}",
"Calculate {a} + {b}"
]
For the prompt 'Add: 1, 2', the expected output is '<tool>add(1, 2)</tool>'.
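Putting the pieces together, a single training record might be built like this; the field names and record layout are assumptions, not the notebook's exact schema.

```python
import random

# A minimal sketch of building one training record from the formats above; the
# field names and record layout are assumptions, not the notebook's exact schema.
def make_example(a: int, b: int) -> dict:
    prompt = random.choice(formats).format(a=a, b=b)
    completion = f"<tool>add({a}, {b})</tool>"
    return {"prompt": prompt, "completion": completion}

print(make_example(1, 2))
# e.g. {'prompt': 'Add: 1, 2', 'completion': '<tool>add(1, 2)</tool>'}
```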
Training Setup
Same base model (SmolLM2-135M), with slightly adjusted hyperparameters:
- Epochs: 3
- Batch size: 16
- Learning rate: 2e-4
- LR scheduler: cosine

For data, I generated addition problems for all digit combinations from 1-7 digits:
- Train: ~18,340 examples (also worked with just 2,500)
- Validation: ~2,280 examples
- Test: ~2,280 examples
Out-of-distribution testing used numbers in the 8-11 digit range. I used Hugging Face’s SFTTrainer for this experiment, which is designed for supervised fine-tuning on instruction-style data.
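A hedged sketch of that SFTTrainer setup with the hyperparameters above; it assumes a recent trl release, and `train_ds` / `val_ds` are placeholder datasets built from prompt/completion records like the ones sketched earlier (older trl versions may need them merged into a single text column).

```python
# A hedged sketch of the SFTTrainer setup with the stated hyperparameters,
# assuming a recent trl release and placeholder datasets train_ds / val_ds.
from trl import SFTConfig, SFTTrainer

cfg = SFTConfig(
    output_dir="tool-use-addition",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
)
trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",   # SFTTrainer can load the model from its hub id
    args=cfg,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)
trainer.train()
```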
In-range Results
The model achieved 100% accuracy on the in-distribution test set. It learned to extract operands from all prompt formats and emit the correct tool call syntax.
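For illustration, a single prediction can be checked roughly like this; the local checkpoint path is an assumption carried over from the training sketch above, the tokenizer is the unchanged base one, and decoding is greedy.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint; the path "tool-use-addition" is an assumption
# carried over from the training sketch, not the notebook's actual output dir.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = AutoModelForCausalLM.from_pretrained("tool-use-addition")

def predict_tool_call(prompt: str) -> str:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():                          # greedy decoding of the tool call
        out = model.generate(**inputs, max_new_tokens=32, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True).strip()

print(predict_tool_call("Add: 1, 2"))              # expected: <tool>add(1, 2)</tool>
```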
Out-of-range Results
Testing on 8-11 digit numbers (well beyond the 1-7 digit training range):
| Test Category | Accuracy |
|---|---|
| both_large (8-11 + 8-11) | 99.3% (894/900) |
| small_large (1-7 + 8-11) | 99.6% (2091/2100) |
| large_small (8-11 + 1-7) | 99.8% (2096/2100) |
Unlike COT, the model generalized 4+ digits beyond its training distribution with minimal errors.
Surprising Generalizations
The model also handled formats not seen in training:
| Input | Output |
|---|---|
| -12345678 + 876888881 | add(12345678, 876888881) |
| 0.1 + 87654321 | add(0.1, 87654321) |
| 0.1 + 0.24 | add(0.1, 0.24) |
It learned to extract number-like tokens, including decimals—though it drops negative signs.
Failure Analysis
The few errors that occurred follow a distinct pattern: digit duplication. The model adds an extra repeated digit at the end of numbers:
| Prompt | Expected | Got |
|---|---|---|
| 16842957+1773685222=? | add(16842957, 1773685222) | add(16842957, 17736852222) |
| sum of 241527736 and 375562333 | add(241527736, 375562333) | add(241527736, 3755623333) |
| 77043+72111115=? | add(77043, 72111115) | add(77043, 721111115) |
| 3921334444 + 8630895 = | add(3921334444, 8630895) | add(39213344444, 8630895) |
At first glance this looks like a tokenization artifact, but it isn't (checked below); it's an autoregressive stopping problem: during generation, when the model sees a sequence like ...222, it has learned that “more of the same” is likely, and it occasionally overshoots, emitting one digit too many.
from transformers import AutoTokenizer

model_nm = "HuggingFaceTB/SmolLM2-135M"
tok = AutoTokenizer.from_pretrained(model_nm)

# Inspect how the tokenizer splits the operands from the failing cases
failures = ['1773685222', '375562333', '616497833', '72111115']
for num in failures:
    tokens = tok.encode(num)
    decoded = [tok.decode([t]) for t in tokens]
    print(f"{num} → {len(tokens)} tokens: {decoded}")

1773685222 → 10 tokens: ['1', '7', '7', '3', '6', '8', '5', '2', '2', '2']
375562333 → 9 tokens: ['3', '7', '5', '5', '6', '2', '3', '3', '3']
616497833 → 9 tokens: ['6', '1', '6', '4', '9', '7', '8', '3', '3']
72111115 → 8 tokens: ['7', '2', '1', '1', '1', '1', '1', '5']
Notice that failures cluster around numbers with repeated trailing digits. The tokenizer encodes each digit separately (no multi-digit tokens), so this isn’t a tokenization boundary issue.
Conclusion
| Approach | In-range | Out-of-range | Failure Mode |
|---|---|---|---|
| COT | ~100% | 0% generalization | Truncates inputs |
| Tool use | 100% | 99%+ | Occasional digit duplication |
COT requires the model to perform arithmetic: tracking carries across variable-length sequences. Tool use only requires the model to extract and copy numbers into a fixed format. One could extend the dataset to handle other arithmetic operations; if addition can be learned this easily, multiplication or division likely can too. The model even showed some ability to extract decimals, though edge cases like negative numbers weren't handled correctly.
Key insight: let the model delegate computation to the CPU instead of teaching it a complex algorithm. By converting the problem from multi-step reasoning to number extraction, the model generalizes far beyond its training distribution.
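That delegation step is trivial to implement. Here is a minimal sketch of parsing the generated call and letting Python's integer arithmetic do the rest; the regex and helper name are illustrative, not the notebook's exact implementation.

```python
import re

# A minimal sketch of the delegation step: parse the generated tool call and let
# ordinary Python integer arithmetic (the CPU) do the work.
def run_tool_call(generated: str) -> int | None:
    m = re.search(r"<tool>add\((\d+),\s*(\d+)\)</tool>", generated)
    if m is None:
        return None                       # the model failed to emit a well-formed call
    a, b = int(m.group(1)), int(m.group(2))
    return a + b                          # exact for any number of digits

print(run_tool_call("<tool>add(222222, 22222)</tool>"))   # 244444
```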