VQ-VA World: Towards High-Quality Visual Question-Visual Answering

Chenhui Gou; Zilong Chen; Zeyu Wang; Feng Li; Deyao Zhu; Zicheng Duan; Kunchang Li; Chaorui Deng; Hongyi Yuan; Haoqi Fan; Cihang Xie; Jianfei Cai; Hamid Rezatofighi

arXiv:2511.20573·cs.CV·November 26, 2025

TL;DR

This paper introduces VQ-VA World, a data-centric framework for training models to generate images in response to visual questions, significantly improving open-source VQ-VA performance and providing a new benchmark for evaluation.

Contribution

It presents a large-scale, targeted data construction pipeline and a human-curated benchmark, advancing open-source VQ-VA capabilities and evaluation methods.

Findings

01. Training with VQ-VA World data improves LightFusion's score to 53.06 on IntelligentBench.

02. The approach surpasses prior open-source baselines by a large margin.

03. Results narrow the gap with proprietary systems such as NanoBanana and GPT-Image.

Abstract

This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question, an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Deployed at web scale, this pipeline crawls ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along three aspects: world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior…

Code & Models

Datasets

VQVA/BAGEL-World-data
dataset · 767 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling