Anchored Alignment for Self-Explanations Enhancement

Luis Felipe Villa-Arenas; Ata Nizamoglu; Qianli Wang; Sebastian Möller; Vera Schmitt

arXiv:2410.13216·cs.AI·October 18, 2024

Open Access · 3 Reviews

TL;DR

This paper proposes an alignment methodology for large language models to improve their self-explanation capabilities without relying on annotated rationales, using novel techniques like Anchor Preference Pairs and tailored preference optimization.

Contribution

It introduces a new alignment approach with Anchor Preference Pairs and tailored strategies, enhancing explanation quality and accuracy in LLMs without annotated data.

Findings

01

Significant improvement in explanation quality.

02

Maintains accuracy comparable to other fine-tuning methods.

03

Effective categorization of model outputs enhances preference selection.

Abstract

In this work, we introduce a methodology for alignment designed to enhance the ability of large language models (LLMs) to articulate their reasoning (self-explanation) even in the absence of annotated rationale explanations. Our alignment methodology comprises three key components: explanation quality assessment, self-instruction dataset generation, and model alignment. Additionally, we present a novel technique called Alignment with Anchor Preference Pairs, which improves the selection of preference pairs by categorizing model outputs into three groups: consistently correct, consistently incorrect, and variable. By applying tailored strategies to each category, we enhance the effectiveness of Direct Preference Optimization (DPO). Our experimental results demonstrate that this approach significantly improves explanation quality while maintaining accuracy compared to other fine-tuning…
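The three-way grouping described in the abstract can be illustrated with a minimal sketch. It assumes that each prompt has several sampled outputs and that each output has already been marked correct or incorrect against a gold answer. The function names, the label strings, and the all/none/some rule are hypothetical choices for illustration; the abstract does not state the exact criteria, and the tailored strategy applied to each group is not reproduced here.

```python
from typing import Dict, List


def categorize_prompt(is_correct: List[bool]) -> str:
    """Assign one prompt to a group based on its sampled outputs.

    `is_correct` holds one flag per sampled output (True = matches the gold
    answer). The all/none/some rule below is an assumption, not the paper's
    stated definition.
    """
    if not is_correct:
        raise ValueError("need at least one sampled output per prompt")
    if all(is_correct):
        return "consistently_correct"
    if not any(is_correct):
        return "consistently_incorrect"
    return "variable"


def categorize_all(samples: Dict[str, List[bool]]) -> Dict[str, str]:
    """Map each prompt id to its group label."""
    return {prompt_id: categorize_prompt(flags) for prompt_id, flags in samples.items()}


if __name__ == "__main__":
    # Hypothetical correctness flags for three prompts, four samples each.
    demo = {
        "q1": [True, True, True, True],
        "q2": [False, False, False, False],
        "q3": [True, False, True, False],
    }
    for prompt_id, group in categorize_all(demo).items():
        print(prompt_id, group)
```

Once prompts are grouped this way, a different rule for selecting preference pairs could be applied to each group before running DPO, which is the role the abstract assigns to the categorization.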

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The main strength of this paper lies in its introduction of a novel and well-executed modeling of self-explanation alignment. The concept is both sound and straightforward, making it applicable to other QA tasks and LLM models. In terms of clarity, the paper is generally well-written, though certain aspects could benefit from further elaboration.

Weaknesses

One main concern is the effectiveness of the proposed method in answer accuracy and explanation quality. While M_{Anchor} outperforms M_{Rank}, the differences shown in Table 1 are not statistically significant. Are the reported ± values standard deviations or standard errors? Regarding explanation quality, it would strengthen the paper to validate the new evaluation framework by measuring its correlation with human judgments. Even if LLMs perform similarly to human raters, the reliability of a

Reviewer 02 · Rating 6 · Confidence 4

Strengths

**S1**. The paper presents a framework for assessing the quality of self-explanations based on various criteria, including logical coherence, clarity, depth of argumentation, and factual accuracy, using an LLM as the evaluator.

**S2**. The authors provide an analysis of the relationship between alignment methods and self-explanation quality, based on their evaluation framework.

**S3**. They introduce the "Alignment with Anchor Points" method to create high-quality preference pairs by grouping

Weaknesses

**W1**. Although the authors state in Line 69 that their approach differs from faithfulness, they should be more precise when articulating the aim of their approach. The phrase "effectively conveying the model's reasoning" is broad and can be interpreted as faithfulness. A more suitable term might be "improving the plausibility of explanations," as this study aims to replace human-annotated preference pairs with LLM annotations, and the criteria selected resemble those that make explanations pla

Reviewer 03 · Rating 3 · Confidence 5

Strengths

* For employing LLM-as-a-Judge, the authors use 5 evaluation criteria and include an ablation experiment in Figure 1 to compare which one was effective. This is very insightful.
* The methodology is clear and very intuitive.

Weaknesses

* If the main point is exploring how to utilize CoT rationales when they are not available, I think there should also be an upper-bound score, obtained by curating CoT rationales from a stronger teacher model (for instance, Llama-3.1-70B-Instruct, which was used as the evaluator). Without that, it is hard to tell how effective the method is compared to a strong baseline such as distillation.
* There aren't enough ablation experiments to support the design choices in Section 3.3. For example: *

Taxonomy

Topics: Scientific Computing and Data Management