LOVA3: Learning to Visual Question Answering, Asking and Assessment
Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou

TL;DR
LOVA3 introduces a novel framework that enhances multimodal large language models by enabling them to answer, ask, and assess questions about images, leading to improved understanding and performance across various benchmarks.
Contribution
The paper presents a new training framework with two tasks, GenQA and EvalQA, and a benchmark, EvalQABench, to develop questioning and assessment skills in MLLMs, which were previously underexplored.
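As a rough illustration of how one annotated image-question-answer record could feed all three skills (answering, asking, assessing), the sketch below builds one training sample per task. The `VQARecord` type, the prompt wording, the Yes/No target, and the string-match correctness check are all assumptions made for illustration; the summary above does not specify the paper's actual templates or data construction.

```python
from dataclasses import dataclass


@dataclass
class VQARecord:
    """Hypothetical container for one annotated image-question-answer triple."""
    image_id: str
    question: str
    answer: str


def to_vqa(rec: VQARecord) -> dict:
    # Answering: the model sees the question and must produce the answer.
    return {"image": rec.image_id, "prompt": rec.question, "target": rec.answer}


def to_genqa(rec: VQARecord) -> dict:
    # Asking: the model must produce both a question and its answer.
    # The instruction text is an assumed placeholder, not the paper's prompt.
    return {
        "image": rec.image_id,
        "prompt": "Ask a question about this image and answer it.",
        "target": f"Question: {rec.question} Answer: {rec.answer}",
    }


def to_evalqa(rec: VQARecord, candidate_answer: str) -> dict:
    # Assessing: the model judges whether a candidate answer is correct.
    # Case-insensitive exact match is a simplifying assumption for the label.
    is_correct = candidate_answer.strip().lower() == rec.answer.strip().lower()
    return {
        "image": rec.image_id,
        "prompt": (
            f"Question: {rec.question} Answer: {candidate_answer} "
            "Is this answer correct?"
        ),
        "target": "Yes" if is_correct else "No",
    }


rec = VQARecord(image_id="img_001", question="What color is the bus?", answer="Red")
samples = [to_vqa(rec), to_genqa(rec), to_evalqa(rec, "Blue")]
```

Mixing the three sample types into one instruction-tuning set is one plausible reading of "training with LOVA3 tasks"; the actual mixing ratios and negative-answer construction are not given here.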
Findings
MLLMs trained with LOVA3 show consistent performance improvements across various benchmarks.
EvalQABench provides a new benchmark for evaluating the assessment ability of MLLMs.
Training with LOVA3 tasks improves multimodal comprehension.
Abstract
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. Inspired by the human learning mechanism, we introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment," designed to equip MLLMs with these additional capabilities. Our approach involves the creation of two supplementary training tasks GenQA and EvalQA, aiming at fostering the skills of asking and assessing questions in the context of images. To develop the questioning ability, we compile a comprehensive set of…
Peer Reviews
Decision: NeurIPS 2024 poster
1. LOVA3 introduces a strategy that extends beyond traditional VQA tasks by incorporating question generation and evaluation.
2. The creation of EvalQABench provides a rigorous way to test and improve MLLMs.
3. The experimental results, examined from multiple perspectives, provide insight into the proposed framework across multiple benchmarks.
1. The paper incorporates additional tasks, GenQA and EvalQA, but these two tasks are already steps in existing visual-language instruction generation pipelines for visual question answering (e.g., SEED-Bench) and visual instruction tuning (e.g., LLaVA-Bench). Those works also used LLMs or MLLMs for dataset generation and validation. It would be better to explain the specific novelty or contribution.
2. The work doesn't provide detailed explanations on how to validate the generated data quality from humans instead
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Focus
