Towards Flexible Evaluation for Generative Visual Question Answering
Huishan Ji, Qingyi Si, Zheng Lin, Weiping Wang

TL;DR
This paper introduces a semantics-based evaluation framework for Visual Question Answering (VQA). It proposes a flexible evaluator that handles open-ended responses better and outperforms existing evaluators, improving the assessment of multimodal models.
Contribution
It develops a novel Semantically Flexible VQA Evaluator (SFVE) and a dataset for analyzing VQA evaluators, addressing limitations of exact match metrics.
Findings
SFVE significantly outperforms existing semantic evaluators.
Model-based evaluation is feasible and effective.
The training scheme generalizes across different encoder architectures.
Abstract
Throughout the rapid development of multimodal large language models (MLLMs), a crucial ingredient is a fair and accurate evaluation of their multimodal comprehension abilities. Although Visual Question Answering (VQA) could serve as a developed test field, limitations of VQA evaluation, like the inflexible pattern of Exact Match, have hindered MLLMs from demonstrating their real capability and discouraged rich responses. Therefore, this paper proposes the use of semantics-based evaluators for assessing unconstrained open-ended responses on VQA datasets. As the characteristics of VQA have made such evaluation significantly different from the traditional Semantic Textual Similarity (STS) task, to systematically analyze the behaviour and compare the performance of various evaluators including LLM-based ones, we propose three key properties, i.e., Alignment, Consistency and Generalization, and a…
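The limitation of Exact Match described in the abstract can be illustrated with a small sketch. This is not the paper's SFVE and not a semantic evaluator: `token_f1` is a hypothetical lexical stand-in, and the function names, normalization rules, and example strings are all assumptions made for illustration. It only shows how an all-or-nothing metric gives a rich, correct response zero credit, while even a crude overlap score gives partial credit.

```python
import re
from collections import Counter

ARTICLES = {"a", "an", "the"}


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split into tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in ARTICLES]


def exact_match(response: str, reference: str) -> float:
    """Return 1.0 only if the normalized token sequences are identical."""
    return 1.0 if normalize(response) == normalize(reference) else 0.0


def token_f1(response: str, reference: str) -> float:
    """Token-overlap F1 (hypothetical stand-in; lexical, not semantic)."""
    resp, ref = normalize(response), normalize(reference)
    if not resp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(resp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "frisbee"
    rich_response = "The dog is catching a frisbee"
    print(exact_match(rich_response, reference))  # 0.0: correct but penalized
    print(token_f1(rich_response, reference))     # partial credit
    print(token_f1("couch", "sofa"))              # 0.0: synonyms still missed
```

The last call shows why lexical overlap is not enough either: synonyms share no tokens, which is the kind of case a semantics-based evaluator is meant to handle.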
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Speech and dialogue systems
