Publication:

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

dc.contributor.author: D'Oosterlinck, Karel
dc.contributor.author: Xu, Winnie
dc.contributor.author: Develder, Chris
dc.contributor.author: Demeester, Thomas
dc.contributor.author: Singh, Amanpreet
dc.contributor.author: Potts, Christopher
dc.contributor.author: Kiela, Douwe
dc.contributor.author: Mehri, Shikib
dc.contributor.imecauthor: D'Oosterlinck, Karel
dc.contributor.imecauthor: Develder, Chris
dc.contributor.imecauthor: Demeester, Thomas
dc.contributor.orcidimec: D'Oosterlinck, Karel::0000-0003-1695-1014
dc.contributor.orcidimec: Develder, Chris::0000-0003-2707-4176
dc.contributor.orcidimec: Demeester, Thomas::0000-0002-9901-5768
dc.date.accessioned: 2025-05-19T10:36:19Z
dc.date.available: 2025-05-17T05:45:22Z
dc.date.available: 2025-05-19T10:36:19Z
dc.date.issued: 2025
dc.description.abstract: Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. The interaction between model, paired data, and objective makes alignment a complicated procedure, sometimes producing subpar results. We study this and find that (i) preference data gives a better learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to better performance when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. We align Llama-3-8B-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human judgments. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. Our code and datasets are available.
dc.description.wosFundingText: We thank Kawin Ethayarajh, Eugen Hotaj, and Nathan Lambert for their feedback. We thank Stas Bekman for his help and support. K. D. gratefully acknowledges funding from the FWO Fundamental Research PhD Fellowship (11632223N). We also thank our anonymous reviewers for their valuable comments, which helped improve the clarity and quality of this work.
dc.identifier.doi: 10.1162/tacl_a_00748
dc.identifier.issn: 2307-387X
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/45682
dc.publisher: MIT PRESS
dc.source.beginpage: 442
dc.source.endpage: 460
dc.source.issue: /
dc.source.journal: TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS
dc.source.numberofpages: 19
dc.source.volume: 13
dc.title: Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

dc.type: Journal article
dspace.entity.type: Publication
Files

Original bundle

Name: 8806.pdf
Size: 873.91 KB
Format: Adobe Portable Document Format
Description: Published