Failures Are Fated, But Can Be Faded: Characterizing and Mitigating Unwanted Behaviors in Large-Scale Vision and Language Models

Som Sagar; Aditya Taparia; Ransalu Senanayake

arXiv:2406.07145 · cs.LG · June 17, 2024

TL;DR

This paper introduces a deep reinforcement learning-based post-hoc method to explore, characterize, and mitigate failure modes in large-scale vision and language models, enhancing their reliability and safety.

Contribution

It presents a novel approach using reinforcement learning and limited human feedback to map and reshape failure landscapes in complex models.

Findings

01 Effective in identifying failure modes across vision, language, and multimodal tasks

02 Able to restructure models to avoid undesirable failure behaviors

03 Applicable to pre-trained discriminative and generative models

Abstract

In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values, among others. Therefore, before deploying these models, it is crucial to characterize this failure landscape for engineers to debug and legislative bodies to audit models. Nevertheless, it is infeasible to exhaustively test for all possible combinations of factors that could lead to a model's failure. In this paper, we introduce a post-hoc method that utilizes deep reinforcement learning to explore and construct the landscape of failure modes in pre-trained discriminative and generative models. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes. We empirically show the…
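The idea of exploring a failure landscape with a reward-seeking agent can be illustrated with a toy sketch. This is not the authors' implementation: it swaps deep reinforcement learning for a tabular epsilon-greedy bandit, and the model, the perturbation factors (`rotation_deg`, `noise_level`), and the failure rule are all hypothetical stand-ins chosen for illustration. The agent picks a combination of perturbations, receives a reward of 1 when the stand-in model fails, and gradually builds a value table that approximates where failures occur.

```python
import random


def toy_model_fails(rotation_deg: int, noise_level: int) -> bool:
    """Hypothetical stand-in for a pre-trained model under test.

    Assumption for illustration only: the model fails when the input is
    both strongly rotated and heavily corrupted by noise.
    """
    return rotation_deg >= 60 and noise_level >= 2


def explore_failures(rotations, noise_levels, episodes=2000, epsilon=0.2, seed=0):
    """Epsilon-greedy search over perturbation combinations.

    Returns (value, count): the estimated failure rate and the number of
    visits for every (rotation, noise) combination.
    """
    rng = random.Random(seed)
    actions = [(r, n) for r in rotations for n in noise_levels]
    value = {a: 0.0 for a in actions}  # running mean reward per action
    count = {a: 0 for a in actions}    # visits per action

    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(actions)                    # explore
        else:
            action = max(actions, key=lambda a: value[a])   # exploit
        reward = 1.0 if toy_model_fails(*action) else 0.0   # failure = reward
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]

    return value, count


if __name__ == "__main__":
    value, count = explore_failures([0, 30, 60, 90], [0, 1, 2])
    failure_modes = sorted(a for a, v in value.items() if v > 0.5)
    print("Discovered failure modes (rotation, noise):", failure_modes)
```

In this sketch the value table plays the role of the failure landscape. The mitigation step described in the abstract, which uses limited human feedback to move the model away from discovered failures, is not modeled here.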

Code & Models

Repositories

somsagar07/FailureShiftRL (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Interpreting and Communication in Healthcare