Small LLMs Can’t Add—But They Can Learn to Ask

A small LM can’t add 6-digit numbers—but it can learn to ask a calculator to do it, and generalize to arbitrary numbers.

I wanted to test a hypothesis: can a small language model master integer addition through training? Given two n-digit integers, can it learn to handle carries and digit alignment? I chose a base model (not an Instruct one) because it is trained mostly on next-token prediction rather than on following instructions or solving complex tasks. I picked SmolLM2-135M for this experiment because it is small and efficient enough to run many experiments on a Colab T4.

Spoiler alert: pure chain-of-thought training hit a hard ceiling, but tool use opened a surprising path forward.

Notebooks

| Description | Link |
| --- | --- |
| COT training experiment | Open In Colab |
| Tool use SFT experiment | Open In Colab |

Chain-Of-Thought Experiments

Data format

Since the experiment is small, we can generate synthetic data for addition, which gives us more control. Each training example traces the addition algorithm with the following structure:

  1. pad the shorter number with leading zeros
  2. initialize the carry to 0
  3. add the digits starting from the rightmost column (least significant digit), including the previous carry
  4. extract the result digit and the new carry
  5. repeat the previous two steps for all digits in the numbers
  6. if the final carry > 0, it becomes the leading digit

A simple COT for the addition of 1 and 99 breaks down as: 01 + 99; col1: 1+9+0=10, write 0 carry 1; col2: 0+9+1=10, write 0 carry 1; final carry 1; answer=100
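For concreteness, here is a minimal sketch of a generator for such traces (the function name and exact string layout are my own; the notebook's implementation may differ):

def addition_cot(a: int, b: int) -> str:
    a_s, b_s = str(a), str(b)
    width = max(len(a_s), len(b_s))
    a_s, b_s = a_s.zfill(width), b_s.zfill(width)        # 1. pad the shorter number with zeros
    parts, carry = [f"{a_s} + {b_s}"], 0                 # 2. init carry = 0
    for col, (da, db) in enumerate(zip(reversed(a_s), reversed(b_s)), start=1):
        total = int(da) + int(db) + carry                # 3. add column digits plus previous carry
        digit, new_carry = total % 10, total // 10       # 4. extract result digit and new carry
        parts.append(f"col{col}: {da}+{db}+{carry}={total}, write {digit} carry {new_carry}")
        carry = new_carry                                # 5. repeat for every column
    parts.append(f"final carry {carry}")                 # 6. leading digit if carry > 0
    parts.append(f"answer={a + b}")
    return "; ".join(parts)

print(addition_cot(1, 99))
# 01 + 99; col1: 1+9+0=10, write 0 carry 1; col2: 0+9+1=10, write 0 carry 1; final carry 1; answer=100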

Training Setup

I used SmolLM2-135M (base model) with the following configuration:

- Epochs: 5
- Batch size: 32
- Learning rate: 5e-4

For data, I generated synthetic addition problems across digit ranges:

- 2-digit: 500 train / 50 val / 50 test
- 3-digit: 6000 train / 550 val / 550 test

I ran two experiments:

  1. Train on 1-4 digit combinations, test on 5 digits
  2. Train on 1-5 digit combinations, test on 6 digits
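One way this run could be wired up, as a sketch using Hugging Face's Trainer and the addition_cot helper above (dataset size and output paths are illustrative, not the notebook's exact code):

import random
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_nm = "HuggingFaceTB/SmolLM2-135M"
tok = AutoTokenizer.from_pretrained(model_nm)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token                        # ensure a pad token exists for batch padding
model = AutoModelForCausalLM.from_pretrained(model_nm)

# Build a small synthetic set of COT traces over 1-4 digit operands.
pairs = [(random.randint(1, 9999), random.randint(1, 9999)) for _ in range(500)]
texts = [f"{a} + {b} = {addition_cot(a, b)}{tok.eos_token}" for a, b in pairs]
train_ds = Dataset.from_dict({"text": texts}).map(lambda ex: tok(ex["text"]), remove_columns=["text"])

args = TrainingArguments(
    output_dir="smollm2-cot-addition",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=5e-4,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),   # causal LM loss over the whole trace
)
trainer.train()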

In-range Results

Within the training digit range, the model achieved near-perfect accuracy. It learned the COT format correctly:

10000 + 999 = 10000+00999, 0+9=9→9c0, 0+9=9→9c0, 0+9=9→9c0, 0+0=0→0c0, 1+0=1→1c0, →10999

The column-by-column reasoning and carry tracking worked flawlessly for problems within the trained range.

Out-of-range Failure

Testing on numbers beyond the training range revealed a hard ceiling. The model systematically truncated inputs to the maximum trained digit length rather than padding with leading zeros:

222222 + 22222 = 22222+22222, 2+2=4→4c0, 2+2=4→4c0, 2+2=4→4c0, 2+2=4→4c0, 2+2=4→4c0, →44444

Expected answer: 244444. The model dropped the leading 2 from 222222, reducing it to a 5-digit problem it knew how to solve.

Key observation: The model perfectly memorized the COT pattern but couldn’t extend it beyond its training distribution.

What worked: The model correctly extracted and aligned digits from the input.
What failed: Following through on the multi-step addition algorithm.

Looking at the failure more carefully: the model did correctly identify both numbers from the input. The breakdown happened at the carry logic, not the parsing. This suggests the model’s strength lies in pattern recognition and extraction, not multi-step computation. Meanwhile, a CPU handles arithmetic trivially. So rather than teaching the model to perform addition, why not teach it to extract the operands and delegate the computation? This is the core idea behind tool use.

Tool Use Experiment

Data Format

from transformers import AutoTokenizer

# Load the base model's tokenizer to inspect how tool-call strings and digits are tokenized.
model_nm = "HuggingFaceTB/SmolLM2-135M"
tok = AutoTokenizer.from_pretrained(model_nm)

Instead of a complete trace through the addition, the scope of the data is now much narrower: an input like a + b is simply converted to the target <tool>add(a, b)</tool>. Note: the model's tokenizer has no special token for tool use, so I used raw strings rather than expanding the vocabulary, keeping the embeddings fixed.

To add more variety to the data, each prompt is randomly drawn from the following formats:

formats = [
    "{a} + {b}",
    "Add: {a}, {b}",
    "{a} + {b} = ",
    "{a}+{b}=?",
    "sum of {a} and {b}",
    "What is {a} + {b}?",
    "{a} plus {b}",
    "Calculate {a} + {b}"
]

For example, the prompt 'Add: 1, 2' expects the output '<tool>add(1, 2)</tool>'.
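A minimal sketch of how such prompt/target pairs might be generated (the function name is mine; the notebook's generation code may differ):

import random

def make_tool_example(max_digits: int = 7) -> dict:
    # Sample two operands with 1 to max_digits digits each.
    a = random.randint(0, 10 ** random.randint(1, max_digits) - 1)
    b = random.randint(0, 10 ** random.randint(1, max_digits) - 1)
    prompt = random.choice(formats).format(a=a, b=b)     # the formats list defined above
    return {"prompt": prompt, "completion": f"<tool>add({a}, {b})</tool>"}

print(make_tool_example())
# e.g. {'prompt': 'sum of 4821 and 907', 'completion': '<tool>add(4821, 907)</tool>'}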

Training Setup

Same base model (SmolLM2-135M), with slightly adjusted hyperparameters:

- Epochs: 3
- Batch size: 16
- Learning rate: 2e-4
- LR scheduler: cosine

For data, I generated addition problems for all digit combinations from 1-7 digits:

- Train: ~18,340 examples (also worked with just 2,500)
- Validation: ~2,280 examples
- Test: ~2,280 examples

Out-of-distribution testing used numbers in the 8-11 digit range. I used Hugging Face’s SFTTrainer (from the TRL library) for this experiment, which is designed for supervised fine-tuning on instruction-style data.
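A rough sketch of that setup, reusing make_tool_example from above and training on a single "text" field (argument names follow recent TRL versions and may need adjusting):

from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# One "text" column per example: the prompt followed by the target tool call.
examples = [make_tool_example() for _ in range(2500)]
train_ds = Dataset.from_dict({"text": [f"{ex['prompt']} {ex['completion']}" for ex in examples]})

cfg = SFTConfig(
    output_dir="smollm2-tool-sft",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    report_to="none",
)

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-135M",   # SFTTrainer can load the model from its name
    args=cfg,
    train_dataset=train_ds,
)
trainer.train()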

In-range Results

The model achieved 100% accuracy on the in-distribution test set. It learned to extract operands from all prompt formats and emit the correct tool call syntax.
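Accuracy here means the model emits the expected tool call for a prompt; a minimal exact-match check might look like the sketch below, where model and tok are the fine-tuned model and its tokenizer and the helper name is mine:

def tool_call_accuracy(examples, max_new_tokens=40):
    # examples: list of {"prompt": ..., "completion": ...} dicts as generated earlier
    correct = 0
    for ex in examples:
        inputs = tok(ex["prompt"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        pred = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        correct += int(pred.strip().startswith(ex["completion"]))   # count only an exact tool call as correct
    return correct / len(examples)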

Out-of-range Results

Testing on 8-11 digit numbers (well beyond the 1-7 digit training range):

| Test category | Accuracy |
| --- | --- |
| both_large (8-11 + 8-11) | 99.3% (894/900) |
| small_large (1-7 + 8-11) | 99.6% (2091/2100) |
| large_small (8-11 + 1-7) | 99.8% (2096/2100) |

Unlike COT, the model generalized 4+ digits beyond its training distribution with minimal errors.
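In actual use, the emitted call still has to be parsed and executed so that the CPU, not the model, does the arithmetic. A minimal sketch, with the regex and helper name being mine and model and tok as above:

import re

def answer(prompt: str) -> int:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Pull the operands out of "<tool>add(a, b)</tool>" and add them in plain Python.
    match = re.search(r"<tool>add\((\d+),\s*(\d+)\)</tool>", completion)
    assert match is not None, "model did not emit a well-formed tool call"
    return int(match.group(1)) + int(match.group(2))

print(answer("What is 12345678 + 98765432?"))   # expected: 111111110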

Surprising Generalizations

The model also handled formats not seen in training:

| Input | Output |
| --- | --- |
| -12345678 + 876888881 | add(12345678, 876888881) |
| 0.1 + 87654321 | add(0.1, 87654321) |
| 0.1 + 0.24 | add(0.1, 0.24) |

It learned to extract number-like tokens, including decimals—though it drops negative signs.

Failure Analysis

The few errors that occurred follow a distinct pattern: digit duplication. The model adds an extra repeated digit at the end of numbers:

| Prompt | Expected | Got |
| --- | --- | --- |
| 16842957+1773685222=? | add(16842957, 1773685222) | add(16842957, 17736852222) |
| sum of 241527736 and 375562333 | add(241527736, 375562333) | add(241527736, 3755623333) |
| 77043+72111115=? | add(77043, 72111115) | add(77043, 721111115) |
| 3921334444 + 8630895 = | add(3921334444, 8630895) | add(39213344444, 8630895) |

This looks like an autoregressive stopping problem: during generation, when the model sees a sequence like ...222, it has learned that “more of the same” is likely, and it occasionally overshoots, emitting one digit too many. To rule out tokenization as the culprit, I checked how the failing numbers are tokenized:

failures = ['1773685222', '375562333', '616497833', '72111115']  # operands from failing predictions

for num in failures:
    tokens = tok.encode(num)
    decoded = [tok.decode([t]) for t in tokens]
    print(f"{num}{len(tokens)} tokens: {decoded}")
1773685222 → 10 tokens: ['1', '7', '7', '3', '6', '8', '5', '2', '2', '2']
375562333 → 9 tokens: ['3', '7', '5', '5', '6', '2', '3', '3', '3']
616497833 → 9 tokens: ['6', '1', '6', '4', '9', '7', '8', '3', '3']
72111115 → 8 tokens: ['7', '2', '1', '1', '1', '1', '1', '5']

Notice that failures cluster around numbers with repeated trailing digits. The tokenizer encodes each digit separately (no multi-digit tokens), so this isn’t a tokenization boundary issue.

Conclusion

| Approach | In-range | Out-of-range | Failure mode |
| --- | --- | --- | --- |
| COT | ~100% | 0% generalization | Truncates inputs |
| Tool use | 100% | 99%+ | Occasional digit duplication |

COT requires the model to perform arithmetic itself, tracking carries across variable-length sequences; tool use only requires it to extract and copy numbers into a fixed format. The model even showed some ability to extract decimals, though edge cases like negative numbers weren’t handled correctly. One could extend the dataset to other arithmetic operations; if addition can be delegated this easily, multiplication or division likely can too.

Key insight: let the model leverage the CPU for computation rather than teaching it complex algorithms. By converting the problem from multi-step reasoning to number extraction, the model generalizes far beyond its training distribution.