NanoGPT Preference Alignment (DPO) for Arithmetic Tasks
A two-stage pipeline, supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO), that aligns NanoGPT for arithmetic and one-variable algebra reasoning.
$ python train_dpo.py --task algebra
[eval] arithmetic acc: 91.7% | algebra acc: 90.4%
[base] ~0% → [aligned] >90%
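For readers unfamiliar with the DPO stage, the per-example objective can be sketched as below. This is a minimal illustration of the standard DPO loss, not the code in train_dpo.py: the function names, the argument layout (summed sequence log-probabilities from the policy and from a frozen reference model), and the default beta=0.1 are all assumptions made for the example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss (hypothetical helper, beta is an assumed default).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference (SFT) model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2), about 0.6931.
    print(dpo_loss(0.0, 0.0, 0.0, 0.0))
    # Policy favours the correct answer more than the reference: lower loss.
    print(dpo_loss(-2.0, -9.0, -4.0, -6.0))
```

In an arithmetic setting, the chosen response would typically be a correct solution and the rejected one an incorrect solution to the same prompt. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.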
