FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering

Chengyue Huang; Brisa Maneechotesuwan; Shivang Chopra; Zsolt Kira

arXiv:2505.21755 · cs.CV · June 24, 2025

TL;DR

This paper introduces FRAMES-VQA, a benchmark for evaluating how robust fine-tuning methods for visual question answering (VQA) remain under multi-modal distribution shifts, measured on in-distribution (ID) data as well as near and far out-of-distribution (OOD) datasets.

Contribution

The paper proposes a benchmark for evaluating the robustness of fine-tuning in multi-modal VQA, categorizes ten existing VQA datasets into ID, near-OOD, and far-OOD groups covering uni-modal, multi-modal, and adversarial shifts, and analyzes distribution shifts and modality importance.

Findings

01 Existing fine-tuning methods vary in robustness across shifts.

02 Distribution shifts can be quantified using Mahalanobis distance.

03 Interactions between uni- and multi-modal shifts influence model performance.
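
The second finding can be illustrated with a small sketch. It is not the authors' code: it assumes each dataset is represented by a matrix of feature embeddings, fits a single Gaussian to the ID training embeddings, and reports the mean Mahalanobis distance of another dataset's embeddings to that Gaussian. The helper names `fit_gaussian` and `mahalanobis_shift`, the 8-dimensional synthetic features, and the "near" and "far" offsets are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_gaussian(train_feats, eps=1e-6):
    """Estimate the mean and inverse covariance of the ID training embeddings."""
    mean = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    # A small ridge keeps the covariance invertible (illustrative choice).
    cov = cov + eps * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_shift(feats, mean, cov_inv):
    """Mean Mahalanobis distance of each row of `feats` to the fitted Gaussian."""
    diff = feats - mean
    sq_dist = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return float(np.sqrt(np.maximum(sq_dist, 0.0)).mean())


# Synthetic stand-ins for embeddings (assumption: real features would come
# from the fine-tuned VQA model, which is not shown here).
rng = np.random.default_rng(0)
id_train = rng.normal(0.0, 1.0, size=(2000, 8))
datasets = {
    "id_test": rng.normal(0.0, 1.0, size=(500, 8)),
    "near_ood": rng.normal(0.5, 1.0, size=(500, 8)),
    "far_ood": rng.normal(3.0, 1.0, size=(500, 8)),
}

mean, cov_inv = fit_gaussian(id_train)
scores = {name: mahalanobis_shift(x, mean, cov_inv) for name, x in datasets.items()}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

With these synthetic inputs, the score grows as the test distribution moves away from the training distribution, which is the property that makes the distance usable as a shift measure. Applying the same computation separately to image, question, and joint embeddings would be one way to compare uni-modal and multi-modal shifts, though the paper's exact procedure is not shown on this page.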

Abstract

Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust…

Code & Models

Repositories

chengyuehuang511/frames-vqa
PyTorch · Official
