VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, Cihang Xie, Jianfei Cai, Hamid Rezatofighi

TL;DR
This paper introduces VQ-VA World, a data-centric framework for training models to generate images in response to visual questions, significantly improving open-source VQ-VA performance and providing a new benchmark for evaluation.
Contribution
It presents a large-scale, targeted data construction pipeline and a human-curated benchmark, advancing open-source VQ-VA capabilities and evaluation methods.
Findings
Training with VQ-VA World data improves LightFusion's score to 53.06 on IntelligentBench.
The approach surpasses prior open-source baselines by a large margin.
Results narrow the gap toward proprietary systems like NanoBanana and GPT-Image.
Abstract
This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
