LOVA3: Learning to Visual Question Answering, Asking and Assessment

Henry Hengyuan Zhao; Pan Zhou; Difei Gao; Zechen Bai; Mike Zheng Shou

arXiv:2405.14974·cs.CV·February 21, 2025

TL;DR

LOVA3 introduces a novel framework that enhances multimodal large language models by enabling them to answer, ask, and assess questions about images, leading to improved understanding and performance across various benchmarks.

Contribution

The paper presents a new training framework with two tasks, GenQA and EvalQA, and a benchmark, EvalQABench, to develop questioning and assessment skills in MLLMs, which were previously underexplored.
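As a rough illustration of how a single annotated image-question-answer triplet could feed all three skills (answering, asking, assessing), the sketch below builds one training sample per task. Everything in it is an assumption made for illustration: the `VQATriplet` class, the helper names, the prompt wording, and the Yes/No framing of EvalQA are hypothetical and are not taken from the paper or from the showlab/lova3 repository.

```python
from dataclasses import dataclass


@dataclass
class VQATriplet:
    """One annotated example: an image, a question about it, and the reference answer."""
    image_path: str
    question: str
    answer: str


# Hypothetical prompt templates; the wording used by the authors may differ.
GENQA_PROMPT = "Generate a question about the image and answer it."
EVALQA_PROMPT = (
    "Question: {question}\nCandidate answer: {candidate}\n"
    "Is the candidate answer correct for this image? Reply Yes or No."
)


def as_vqa(t: VQATriplet) -> dict:
    """Answering: the model sees the question and must produce the answer."""
    return {"image": t.image_path, "prompt": t.question, "target": t.answer}


def as_genqa(t: VQATriplet) -> dict:
    """Asking: the model must produce both a question and its answer."""
    target = f"Question: {t.question}\nAnswer: {t.answer}"
    return {"image": t.image_path, "prompt": GENQA_PROMPT, "target": target}


def as_evalqa(t: VQATriplet, candidate: str) -> dict:
    """Assessing: the model judges whether a candidate answer is correct.

    Assumption: correctness is decided here by case-insensitive exact match
    against the reference answer, which is a simplification.
    """
    is_correct = candidate.strip().lower() == t.answer.strip().lower()
    prompt = EVALQA_PROMPT.format(question=t.question, candidate=candidate)
    return {"image": t.image_path, "prompt": prompt, "target": "Yes" if is_correct else "No"}


if __name__ == "__main__":
    triplet = VQATriplet("bus.jpg", "What color is the bus?", "red")
    for sample in (as_vqa(triplet), as_genqa(triplet), as_evalqa(triplet, "blue")):
        print(sample)
```

The point of the sketch is only that the same annotation can be reused under three different prompts, so the asking and assessing tasks can be mixed into ordinary instruction-tuning data alongside standard VQA samples.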

Findings

1. Enhanced MLLMs show consistent performance improvements.
2. Introduction of EvalQABench provides a new evaluation standard.
3. Training with LOVA3 tasks improves multimodal comprehension.

Abstract

Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. By enhancing these capabilities, humans can more effectively utilize data, leading to better comprehension and learning outcomes. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. Inspired by the human learning mechanism, we introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment," designed to equip MLLMs with these additional capabilities. Our approach involves the creation of two supplementary training tasks GenQA and EvalQA, aiming at fostering the skills of asking and assessing questions in the context of images. To develop the questioning ability, we compile a comprehensive set of…

Peer Reviews

Decision·NeurIPS 2024 poster

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. LOVA3 introduces a strategy that extends beyond traditional VQA tasks by incorporating question generation and evaluation.
2. The creation of EvalQABench provides a rigorous way to test and improve MLLMs.
3. The experimental results, examined from multiple perspectives, provide insight into the proposed framework across multiple benchmarks.

Weaknesses

1. The paper incorporates additional tasks, GenQA and EvalQA, but these two tasks are already existing steps in visual-language instruction generation for visual question answering (e.g., SEED-Bench) or visual instruction tuning (e.g., LLaVA-Bench), which also used LLMs or MLLMs for dataset generation and validation. It would be better to explain the specific novelty or contribution.
2. The work doesn't provide detailed explanations on how to validate the generated data quality from humans instead

Code & Models

Repositories

showlab/lova3
PyTorch · Official

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Focus