Portfolio 1

title: "NanoGPT Preference Alignment (DPO) for Arithmetic Tasks"
excerpt: "Implemented Direct Preference Optimization on a NanoGPT pretrained model to solve algebraic equations."
collection: portfolio

Timeline: Oct. 2025 - Nov. 2025
Role: Core Developer (Course Project)

Implemented Direct Preference Optimization (DPO) on a pretrained NanoGPT model to improve its ability to solve arithmetic problems and algebraic equations.
Pipeline: Designed a two-stage fine-tuning pipeline (SFT + DPO) based on the provided training framework; systematically explored weight coefficients and hyperparameter configurations in a CUDA environment.
Data Engineering: Constructed 100,000+ preference pairs; generated positive samples via explicit reasoning steps and negative samples from the base model.
Outcome: Achieved >90% accuracy on arithmetic and one-variable algebra tasks, compared with roughly 0% for the base model.
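
The DPO stage described above can be illustrated with a minimal sketch of the standard DPO objective for a single preference pair. This is not the project's actual code: the names `PreferencePair` and `dpo_loss`, the example strings, and the default `beta=0.1` are all assumptions made for illustration. It uses only the Python standard library and takes summed sequence log-probabilities as plain floats; a real training loop would compute these from the policy model and a frozen reference model (for example, the SFT checkpoint).

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One preference example: a prompt with a preferred and a dispreferred completion."""
    prompt: str
    chosen: str    # e.g. a step-by-step solution ending in the correct answer
    rejected: str  # e.g. an incorrect completion sampled from the base model


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given summed sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative pair (hypothetical strings, not taken from the project's dataset).
pair = PreferencePair(
    prompt="x+3=7, x=?",
    chosen="x = 7 - 3 = 4. The answer is 4.",
    rejected="x = 11",
)

# Hypothetical log-probabilities: the policy equals the reference in the first
# call, and has shifted probability mass toward the chosen answer in the second.
loss_at_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
loss_after_shift = dpo_loss(-2.0, -9.0, -5.0, -5.0)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; it falls as the policy favors the chosen completion more strongly than the reference does. The `beta` coefficient controls how strongly the policy is pushed away from the reference.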


Zhongheng Liu | 刘仲衡