KL Penalty Control via Perturbation for Direct Preference Optimization
arXiv:2502.13177v1 Announce Type: new
Abstract: Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the re…
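For context, the standard DPO objective (from the original DPO formulation) is reproduced below as a minimal reference; here $\beta$ is the fixed coefficient that sets the strength of the implicit KL penalty keeping the policy $\pi_\theta$ close to the reference model $\pi_{\mathrm{ref}}$, and $(x, y_w, y_l)$ are prompt, preferred, and dispreferred responses from the offline preference dataset $\mathcal{D}$:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
$$

Because $\beta$ is fixed across the whole dataset, the effective KL regularization cannot adapt per example, which is the limitation the abstract refers to.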