---
title: "NanoGPT Preference Alignment (DPO) for Arithmetic Tasks"
excerpt: "Implemented Direct Preference Optimization on a NanoGPT pretrained model to solve algebraic equations."
collection: portfolio
---
Timeline: Oct. 2025 - Nov. 2025
Role: Core Developer (Course Project)
- Implemented Direct Preference Optimization (DPO) on a NanoGPT pretrained model to improve its accuracy on arithmetic and one-variable algebraic equation solving.
- Pipeline: Designed a two-stage fine-tuning pipeline (SFT followed by DPO) on top of the provided training framework; systematically swept loss-weight coefficients and hyperparameter configurations in a CUDA environment.
- Data Engineering: Constructed 100,000+ preference pairs; generated positive samples via explicit reasoning steps and negative samples from the base model.
- Outcome: Achieved >90% accuracy in arithmetic and one-variable algebra tasks, significantly outperforming the base model (~0% accuracy).
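The DPO stage above optimizes the policy to prefer the reasoning-step (chosen) response over the base-model (rejected) response, relative to a frozen reference model. A minimal sketch of the standard DPO objective in PyTorch is below; the function name, the `beta` value, and the toy log-probabilities are illustrative assumptions, not the project's exact code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a full response
    (chosen = preferred, rejected = dispreferred) under either the
    trainable policy or the frozen reference (SFT) model.
    beta=0.1 is an assumed, commonly used temperature.
    """
    policy_ratio = policy_chosen_logps - policy_rejected_logps
    ref_ratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_ratio - ref_ratio)
    # -log(sigmoid(.)) decreases as the policy's margin for the
    # chosen response grows beyond the reference model's margin.
    return -F.logsigmoid(logits).mean()

# Toy preference pair: the policy slightly prefers the chosen
# response while the reference is indifferent.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
                torch.tensor([-6.0]), torch.tensor([-6.0]))
```

Because the policy margin (2.0 nats) exceeds the reference margin (0.0), the loss falls below log 2, and further widening the margin drives it toward zero.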
