Aligning Large Language Models for Faithful Integrity Against Opposing Argument

Yong Zhao; Yang Deng; See-Kiong Ng; Tat-Seng Chua

arXiv:2501.01336·cs.CL·January 3, 2025

TL;DR

This paper introduces AFICE, a framework that improves large language models' ability to maintain faithful and trustworthy responses during conversations by estimating response confidence and aligning responses with faithful statements.

Contribution

The paper proposes a novel confidence estimation method and a training framework to enhance LLMs' fidelity and trustworthiness against opposing arguments.

Findings

01 Significant improvement in maintaining faithful responses.

02 Enhanced trustworthiness in complex interactive settings.

03 Effective alignment of LLM responses with faithful statements.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimates the…
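
The two behaviors the abstract describes (holding a correct statement against an unfaithful argument, and correcting an incorrect statement when given a faithful one) can be illustrated with a small bookkeeping sketch. This is not the paper's evaluation code or its BCE method; the `Turn` record, the outcome labels, and `integrity_rate` are hypothetical names introduced here only to make the definition concrete.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One challenge episode (hypothetical record, not from the paper)."""
    initial_correct: bool    # was the model's first statement correct?
    argument_faithful: bool  # was the opposing argument itself sound?
    final_correct: bool      # was the model's statement after the argument correct?


def classify(turn: Turn) -> str:
    """Label an episode by how the model reacted to the opposing argument."""
    if turn.initial_correct and not turn.argument_faithful:
        # Correct statement challenged by an unfaithful argument.
        return "held" if turn.final_correct else "misled"
    if not turn.initial_correct and turn.argument_faithful:
        # Incorrect statement challenged by a faithful argument.
        return "corrected" if turn.final_correct else "stubborn"
    # Other combinations are outside the two cases the abstract names.
    return "other"


def integrity_rate(turns: list[Turn]) -> float:
    """Share of relevant episodes where the model held or corrected itself."""
    labels = [classify(t) for t in turns]
    relevant = [label for label in labels if label != "other"]
    if not relevant:
        return 0.0
    good = sum(1 for label in relevant if label in ("held", "corrected"))
    return good / len(relevant)


turns = [
    Turn(initial_correct=True, argument_faithful=False, final_correct=True),
    Turn(initial_correct=True, argument_faithful=False, final_correct=False),
    Turn(initial_correct=False, argument_faithful=True, final_correct=True),
    Turn(initial_correct=False, argument_faithful=True, final_correct=False),
]
rate = integrity_rate(turns)
```

Under this assumed scoring, a model that is misled as often as it holds firm, and stays stubborn as often as it corrects itself, scores one half.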

Code & Models

Repositories

zhaoy777/afice (PyTorch, official)

Videos

Aligning Large Language Models for Faithful Integrity against Opposing Argument · underline

Taxonomy

Topics: Hate Speech and Cyberbullying Detection

Methods: ALIGN