Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

George Kour; Itay Nakash; Ateret Anaby-Tavor; Michal Shmueli-Scheuer

arXiv:2505.19621·cs.AI·May 27, 2025

TL;DR

This paper introduces the POBs benchmark for evaluating LLMs' subjective preferences, opinions, and beliefs, and examines how test-time compute and model updates affect them. It reports only limited improvements from test-time compute and finds concerning biases.

Contribution

The paper develops the POBs benchmark for assessing LLMs' subjective inclinations and analyzes how reasoning, self-reflection, and model updates affect those inclinations.

Findings

01. Test-time compute mechanisms (reasoning and self-reflection) yield only limited improvements in neutrality and consistency.

02. Newer models tend to be less consistent and more biased.

03. The POBs benchmark effectively measures subjective tendencies of LLMs.

Abstract

As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms,…
