Aligning Large Language Models for Faithful Integrity Against Opposing Argument
Yong Zhao, Yang Deng, See-Kiong Ng, Tat-Seng Chua

TL;DR
This paper introduces AFICE, a framework that improves large language models' ability to maintain faithful and trustworthy responses during conversations by estimating response confidence and aligning responses with faithful statements.
Contribution
The paper proposes a novel confidence estimation method and a training framework to enhance LLMs' fidelity and trustworthiness against opposing arguments.
Findings
Significant improvement in maintaining faithful responses.
Enhanced trustworthiness in complex interactive settings.
Effective alignment of LLM responses with faithful statements.
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimates the…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
Methods: ALIGN
