Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling
Hongjian Zou, Yue Ge, Qi Ding, Yixuan Liao, Xiaoxin Chen

TL;DR
This paper argues that increasing knowledge density in training data, rather than task diversity, is key to scaling multimodal large language models effectively.
Contribution
It shows that task-specific supervision such as VQA adds little semantic information beyond what image captions already carry, and that enriching training data with structured knowledge improves model performance.
Findings
VQA signals can be reconstructed from image captions with minimal performance loss.
Knowledge density correlates more strongly with performance than task diversity does.
Structured caption enrichment enhances multimodal model capabilities.
Abstract
Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as Visual Question Answering (VQA) contributes little incremental semantic information beyond image captions: VQA signals can be reconstructed from captions with negligible performance loss. We then demonstrate that increasing knowledge density -- through structured caption enrichment and cross-modal knowledge injection -- leads to consistent performance improvements across multimodal and downstream benchmarks. Across controlled experiments,…
