NanoGPT Preference Alignment (DPO) for Arithmetic Tasks

This course project explored whether a compact GPT-style model can be preference-aligned to solve arithmetic and one-variable algebra tasks more reliably.

nanogpt-dpo-arithmetic — training.log
$ python train_dpo.py --model nanogpt --task algebra --beta 0.1
[data] loaded 100,000+ preference pairs
[sft ] warm-start checkpoint: ckpt_sft_iter_20000.pt
[dpo ] iter 004000 | loss 0.412 | chosen_logp ↑ | rejected_logp ↓
[eval] arithmetic accuracy: 91.7% | one-variable algebra: 90.4%
[case] solve: 3x + 7 = 22
       reasoning: subtract 7 → 3x = 15 → x = 5
       answer: x = 5 ✓
[base] accuracy ≈ 0% → [aligned] accuracy > 90%
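
The loss value and the chosen_logp / rejected_logp trends in the [dpo ] line can be read against the standard DPO objective. The following is a minimal sketch of that objective for a single preference pair, not the project's train_dpo.py, which is not shown here. The function name dpo_loss and its argument names are hypothetical, and it assumes the reference model is the frozen SFT checkpoint and that each argument is a summed sequence log-probability.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All four inputs are assumed to be summed log-probabilities of a full
    response; the reference values are assumed to come from a frozen model.
    """
    # How much more likely the policy makes each response than the reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # beta scales how strongly the preference margin is enforced.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss starts at log(2).
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))
# Chosen log-prob rises and rejected log-prob falls: the loss drops below log(2).
print(dpo_loss(-4.0, -6.0, -5.0, -5.0) < dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

Under this formulation, a loss below log(2) ≈ 0.693 means the policy already separates chosen from rejected responses more than the reference does, which is consistent with the arrows shown in the log.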

Highlights

  • Built a two-stage fine-tuning workflow: supervised fine-tuning followed by Direct Preference Optimization.
  • Constructed more than 100,000 preference pairs by pairing explicit-reasoning positive samples with negative generations from the base model.
  • Tuned the DPO preference weight (β) and other training hyperparameters in a CUDA environment.
  • Achieved over 90% accuracy on both arithmetic (91.7%) and one-variable algebra (90.4%) tasks, up from roughly 0% for the base model.
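
The pair-construction step in the highlights above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual data pipeline: format_example, build_pairs, the problem ranges, and the filtering rule are all hypothetical, and base_generate stands in for whatever function samples a completion from the base model.

```python
import random


def format_example(a, b, x):
    """Build a prompt and an explicit-reasoning answer for a*x + b = c."""
    c = a * x + b
    prompt = f"solve: {a}x + {b} = {c}"
    chosen = f"subtract {b} → {a}x = {c - b} → x = {x}"
    return prompt, chosen


def build_pairs(n, base_generate, seed=0):
    """Pair reasoning-style positives with base-model negatives.

    base_generate is a hypothetical callable: prompt string -> completion string.
    Up to n pairs are returned; candidates whose base-model output already ends
    with the right answer are skipped (an assumed filtering rule).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = rng.randint(2, 9)      # assumed coefficient range
        b = rng.randint(1, 20)     # assumed constant range
        x = rng.randint(-10, 10)   # assumed solution range
        prompt, chosen = format_example(a, b, x)
        rejected = base_generate(prompt)
        if rejected.strip().endswith(f"x = {x}"):
            continue  # not a useful negative if the base model got it right
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Usage with a stub in place of the base model.
pairs = build_pairs(3, base_generate=lambda prompt: "x = 999")
print(pairs[0]["prompt"])
```

Each resulting record carries the prompt, chosen, and rejected fields that a DPO training loop typically consumes.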