Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression
Farima Fatahi Bayat, Xin Liu, H. V. Jagadish, Lu Wang

TL;DR
LITO is a learnable intervention method that selects the intervention intensity suited to each query context and refuses to answer when its predictions are highly uncertain, improving factual accuracy without sacrificing task performance.
Contribution
The paper introduces LITO, an adaptive intervention technique that automatically adjusts the intervention intensity for each query context and outperforms fixed-intensity approaches on truthfulness.
Findings
LITO improves factual accuracy across multiple LLMs and datasets.
LITO maintains task accuracy while increasing truthfulness.
Adaptive intervention outperforms fixed strategies in truthfulness enhancement.
Abstract
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts, which undermines their reliability. To mitigate this issue, inference-time methods steer LLM representations toward the "truthful directions" previously learned for truth elicitation. However, applying these truthful directions with the same intensity fails to generalize across different query contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity tailored to each specific context. LITO explores a sequence of model generations based on increasing levels of intervention intensities. It selects the most accurate response or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness…
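The selection step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: `generate` is a hypothetical callable that returns an answer and a confidence estimate for a given intensity, `threshold` is a hypothetical cutoff, and a simple confidence comparison stands in for whatever learned selection mechanism the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical refusal string; the abstract does not specify the wording.
REFUSAL = "I cannot answer this reliably."


@dataclass
class Candidate:
    intensity: float   # intervention intensity used for this generation
    answer: str        # text produced at that intensity
    confidence: float  # assumed estimate in [0, 1] that the answer is accurate


def select_response(
    question: str,
    generate: Callable[[str, float], Tuple[str, float]],
    intensities: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[str, Optional[float]]:
    """Generate at increasing intensities, then pick one answer or refuse."""
    candidates = []
    for intensity in sorted(intensities):
        answer, confidence = generate(question, intensity)
        candidates.append(Candidate(intensity, answer, confidence))

    # Keep the candidate judged most likely to be accurate.
    best = max(candidates, key=lambda c: c.confidence)

    # Refuse when even the best candidate is too uncertain.
    if best.confidence < threshold:
        return REFUSAL, None
    return best.answer, best.intensity


def toy_generate(question: str, intensity: float) -> Tuple[str, float]:
    """Stand-in for an intervened model; values are made up for the demo."""
    table = {
        0.0: ("Sydney", 0.30),
        5.0: ("Canberra", 0.85),
        10.0: ("Canberra, probably", 0.60),
    }
    return table[intensity]


answer, chosen = select_response(
    "What is the capital of Australia?", toy_generate, [10.0, 0.0, 5.0]
)
```

In this toy run the middle intensity is chosen because its made-up confidence is highest; raising `threshold` above every candidate's confidence makes the function return the refusal string instead.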
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
