Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Yafu Li; Xuyang Hu; Xiaoye Qu; Linjie Li; Yu Cheng

arXiv:2501.12895 · cs.CL · January 23, 2025

TL;DR

This paper introduces Test-time Preference Optimization (TPO), a method that aligns large language models with human preferences during inference by iteratively refining responses using textual feedback, without retraining the model.

Contribution

TPO is a novel framework that enables on-the-fly alignment of LLMs with human preferences through textual critiques, eliminating the need for parameter updates.

Findings

01

TPO progressively improves alignment with human preferences on benchmarks covering instruction following, preference alignment, safety, and mathematics.

02

After only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass its aligned counterpart, Llama-3.1-70B-Instruct.

03

TPO scales efficiently with inference search width and depth.

Abstract

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and…
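The loop the abstract describes (score candidates, turn the reward signal into a textual critique, regenerate using that critique) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `tpo_sketch`, its `generate`, `score`, and `critique` callables, and the way `width` and `depth` are used are all assumptions made for illustration, and the toy stubs stand in for real LLM and reward-model calls.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not the paper's API).
Generate = Callable[[str, Optional[str]], str]  # (prompt, feedback) -> response
Score = Callable[[str, str], float]             # (prompt, response) -> reward
Critique = Callable[[str, str, str], str]       # (prompt, chosen, rejected) -> feedback


def tpo_sketch(prompt: str, generate: Generate, score: Score,
               critique: Critique, width: int = 3, depth: int = 2) -> str:
    """Sample `width` responses, refine for `depth` rounds, return the best."""
    # Initial candidates, generated without any feedback.
    pool: List[Tuple[float, str]] = []
    for _ in range(width):
        response = generate(prompt, None)
        pool.append((score(prompt, response), response))

    for _step in range(depth):
        # Pick the highest- and lowest-scoring responses seen so far.
        pool.sort(key=lambda item: item[0], reverse=True)
        chosen, rejected = pool[0][1], pool[-1][1]
        # Translate the numerical preference into a textual critique.
        feedback = critique(prompt, chosen, rejected)
        # Regenerate with the critique; model parameters are never updated.
        for _ in range(width):
            response = generate(prompt, feedback)
            pool.append((score(prompt, response), response))

    return max(pool, key=lambda item: item[0])[1]


# Toy stubs so the sketch runs without any model (purely illustrative).
def toy_generate(prompt: str, feedback: Optional[str]) -> str:
    return f"answer to {prompt}" if feedback is None else feedback + "!"


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))  # pretend longer responses are preferred


def toy_critique(prompt: str, chosen: str, rejected: str) -> str:
    return chosen  # pretend the critique says "build on the chosen response"


best = tpo_sketch("q", toy_generate, toy_score, toy_critique, width=2, depth=2)
```

In this sketch, each refinement round costs `width` generation calls plus one critique call, so the total number of generations is `width * (depth + 1)`. Whether the actual method caches, samples, or prompts in this way is not stated in the abstract.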

Code & Models

Repositories

yafuly/tpo · Official

Videos

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Advanced Text Analysis Techniques