TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization (Apple Machine Learning Research)
Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between individual tokens.
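To make the bandit framing concrete, here is a minimal sketch contrasting the standard sequence-level DPO loss with a token-weighted variant. The function names, the `beta` default, and the token weights `w_chosen`/`w_rejected` are illustrative assumptions, not the paper's code; the excerpt does not say how TIS-DPO estimates its importance weights, so this sketch simply takes them as inputs.

```python
# A minimal sketch in PyTorch; shapes and weight inputs are
# hypothetical and do not reproduce the paper's implementation.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO: the response is one bandit arm, so per-token
    log-probs of shape (batch, seq_len) collapse into a single
    sequence-level log-ratio per response."""
    chosen = (logp_chosen - ref_logp_chosen).sum(dim=-1)
    rejected = (logp_rejected - ref_logp_rejected).sum(dim=-1)
    # Bradley-Terry-style preference loss on the sequence-level margin.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

def token_weighted_dpo_loss(logp_chosen, logp_rejected,
                            ref_logp_chosen, ref_logp_rejected,
                            w_chosen, w_rejected, beta=0.1):
    """Token-level variant: each token's log-ratio is scaled by an
    importance weight before summation, so tokens no longer count
    equally. w_* are hypothetical (batch, seq_len) weights; how
    TIS-DPO actually derives them is outside this sketch."""
    chosen = (w_chosen * (logp_chosen - ref_logp_chosen)).sum(dim=-1)
    rejected = (w_rejected * (logp_rejected - ref_logp_rejected)).sum(dim=-1)
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

# Smoke test with random per-token log-probs; with uniform weights
# the token-weighted loss reduces to the standard DPO loss.
if __name__ == "__main__":
    b, t = 4, 16
    lp_c, lp_r = -torch.rand(b, t), -torch.rand(b, t)
    rlp_c, rlp_r = -torch.rand(b, t), -torch.rand(b, t)
    ones = torch.ones(b, t)
    print(dpo_loss(lp_c, lp_r, rlp_c, rlp_r))
    print(token_weighted_dpo_loss(lp_c, lp_r, rlp_c, rlp_r, ones, ones))
```

With all weights set to one, the weighted loss coincides with standard DPO, which is exactly the "single arm" behavior the abstract criticizes: every token contributes equally to the sequence-level log-ratio.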