Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

D'Oosterlinck, Karel; Xu, Winnie; Develder, Chris; Demeester, Thomas; Singh, Amanpreet; Potts, Christopher; Kiela, Douwe; Mehri, Shikib

doi:10.1162/tacl_a_00748


dc.contributor.author	D'Oosterlinck, Karel
dc.contributor.author	Xu, Winnie
dc.contributor.author	Develder, Chris
dc.contributor.author	Demeester, Thomas
dc.contributor.author	Singh, Amanpreet
dc.contributor.author	Potts, Christopher
dc.contributor.author	Kiela, Douwe
dc.contributor.author	Mehri, Shikib
dc.contributor.imecauthor	D'Oosterlinck, Karel
dc.contributor.imecauthor	Develder, Chris
dc.contributor.imecauthor	Demeester, Thomas
dc.contributor.orcidimec	D'Oosterlinck, Karel::0000-0003-1695-1014
dc.contributor.orcidimec	Develder, Chris::0000-0003-2707-4176
dc.contributor.orcidimec	Demeester, Thomas::0000-0002-9901-5768
dc.date.accessioned	2025-05-19T10:36:19Z
dc.date.available	2025-05-17T05:45:22Z
dc.date.available	2025-05-19T10:36:19Z
dc.date.issued	2025
dc.description.abstract	Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code and datasets are available.
dc.description.wosFundingText	We thank Kawin Ethayarajh, Eugen Hotaj, and Nathan Lambert for their feedback. We thank Stas Bekman for his help and support. K. D. gratefully acknowledges funding from the FWO Fundamental Research PhD Fellowship (11632223N). We also thank our anonymous reviewers for their valuable comments, which helped improve the clarity and quality of this work.
dc.identifier.doi	10.1162/tacl_a_00748
dc.identifier.issn	2307-387X
dc.identifier.uri	https://imec-publications.be/handle/20.500.12860/45682
dc.publisher	MIT PRESS
dc.source.beginpage	442
dc.source.endpage	460
dc.source.issue	/
dc.source.journal	TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS
dc.source.numberofpages	19
dc.source.volume	13
dc.title	Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
dc.type	Journal article
dspace.entity.type	Publication
Files	Original bundle Name: 8806.pdf Size: 873.91 KB Format: Adobe Portable Document Format Description: Published
Publication available in collections:	Articles

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Date