Rationalizing Predictions by Adversarial Information Calibration
Lei Sha, Oana-Maria Camburu, Thomas Lukasiewicz

TL;DR
This paper introduces an adversarial information calibration method that jointly trains a black-box model and a rationale generator, so that the rationales extracted to explain a prediction are more faithful and fluent, especially for natural language tasks.
Contribution
It proposes a novel joint training approach using adversarial calibration and language-model regularization to enhance rationale quality and fidelity in AI explanations.
Findings
Improved rationale extraction in sentiment analysis and hate speech tasks.
More accurate explanations in the legal domain.
Better fluency and faithfulness of extracted rationales.
Abstract
Explaining the predictions of AI models is paramount in safety-critical applications, such as in legal or medical domains. One form of explanation for a prediction is an extractive rationale, i.e., a subset of features of an instance that leads the model to give its prediction on that instance. For example, the subphrase "he stole the mobile phone" can be an extractive rationale for the prediction of "Theft". Previous works on generating extractive rationales usually employ a two-phase model: a selector that selects the most important features (i.e., the rationale) followed by a predictor that makes the prediction based exclusively on the selected features. One disadvantage of these works is that the main signal for learning to select features comes from the comparison of the answers given by the predictor to the ground-truth answers. In this work, we propose to squeeze more…
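The two-phase selector-predictor setup described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' method: the hand-written token scores, the keyword-to-label table, and the names `select_rationale` and `predict` are all assumptions made for this example, standing in for components that would be learned neural models in practice.

```python
from typing import Dict, List

# Hypothetical per-token importance scores; a real selector would learn these.
TOKEN_SCORES: Dict[str, float] = {
    "he": 0.4,
    "stole": 0.9,
    "mobile": 0.5,
    "phone": 0.6,
}

# Hypothetical keyword-to-label table standing in for a trained predictor.
KEYWORD_LABELS: Dict[str, str] = {"stole": "Theft"}


def select_rationale(tokens: List[str], scores: Dict[str, float], k: int) -> List[str]:
    """Selector: keep the k highest-scoring tokens, in their original order."""
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: scores.get(tokens[i], 0.0),
        reverse=True,
    )
    keep = sorted(ranked[:k])  # restore sentence order
    return [tokens[i] for i in keep]


def predict(rationale: List[str], keyword_labels: Dict[str, str], default: str = "Other") -> str:
    """Predictor: sees only the rationale, never the full input."""
    for token in rationale:
        if token in keyword_labels:
            return keyword_labels[token]
    return default


if __name__ == "__main__":
    sentence = "yesterday he stole the mobile phone".split()
    rationale = select_rationale(sentence, TOKEN_SCORES, k=4)
    print(rationale)
    print(predict(rationale, KEYWORD_LABELS))
```

Because the predictor receives only the selected tokens, its output depends on nothing outside the rationale. In this setup the selector's training signal comes from how the predictor's answer compares with the ground-truth label, which is the limitation the abstract points to.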
