TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Yangfan He, Kuan Lu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang

TL;DR
TextSquare introduces a large-scale, high-quality instruction tuning dataset for text-centric visual question answering, significantly improving open-source models and surpassing leading proprietary models on multiple benchmarks.
Contribution
The paper presents Square-10M, a novel high-quality instruction dataset generated using a new process, and demonstrates its effectiveness in enhancing model performance beyond existing models.
Findings
TextSquare outperforms previous open-source models and rivals top-tier models in text-centric VQA.
VQA reasoning data significantly improves accuracy and reduces hallucinations.
Model performance improves steadily as instruction data volume scales, with gains following a roughly logarithmic trend.
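The scaling finding can be written compactly. As a sketch only, assuming the logarithmic relationship between data scale and performance that the reviewers describe (the symbols $P$ for benchmark performance, $N$ for the number of instruction samples, and the constants $a$, $b$ are illustrative and not taken from the paper):

```latex
P(N) \approx a + b \log N,
\qquad
P(kN) - P(N) \approx b \log k
```

Under this assumed form, every tenfold increase in data adds about the same increment $b \log 10$, so the gain per additional sample shrinks as the dataset grows.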
Abstract
Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous state-of-the-art open-source text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally,…
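The four Square steps named in the abstract can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: the `MLLM` callable interface, the prompt wording, the `Sample` record, the yes/no filtering rule, and the `toy_mllm` stand-in are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping (image bytes, prompt) to text.
MLLM = Callable[[bytes, str], str]


@dataclass
class Sample:
    question: str
    answer: str
    reasoning: str


def square_pipeline(image: bytes, mllm: MLLM) -> List[Sample]:
    """Generate VQA samples for one image and keep those that pass evaluation."""
    # 1) Self-Questioning: the model proposes questions about the image text.
    raw = mllm(image, "Ask questions about the text in this image, one per line.")
    questions = [q.strip() for q in raw.splitlines() if q.strip()]

    kept: List[Sample] = []
    for q in questions:
        # 2) Answering: the model answers its own question.
        answer = mllm(image, f"Answer the question: {q}").strip()
        # 3) Reasoning: the model explains the answer.
        reasoning = mllm(image, f"Explain why the answer to '{q}' is '{answer}'.").strip()
        # 4) Evaluation: the model judges the pair; rejected pairs are dropped.
        verdict = mllm(image, f"Is '{answer}' a correct answer to '{q}'? Reply yes or no.")
        if verdict.strip().lower().startswith("yes"):
            kept.append(Sample(q, answer, reasoning))
    return kept


def toy_mllm(image: bytes, prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to exercise the loop."""
    if prompt.startswith("Ask"):
        return "What is the title?\nWhat is the price?"
    if prompt.startswith("Answer"):
        return "TextSquare" if "title" in prompt else "unknown"
    if prompt.startswith("Explain"):
        return "The text appears in the image."
    # Evaluation prompt: reject the placeholder answer.
    return "no" if "'unknown'" in prompt else "yes"


samples = square_pipeline(b"", toy_mllm)
```

With the toy model, the question whose answer is the placeholder `unknown` is filtered out at the evaluation step, leaving one sample. In practice the callable would wrap a real MLLM client, and the evaluation step could use stricter checks than a single yes/no prompt.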
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
Square-10M represents a significant scale for text-centric VQA instruction tuning. The data collection strategy provides a clear pipeline for generating high-quality instruction tuning data through self-questioning, answering, reasoning, and evaluation. The paper offers valuable insights into scaling laws for text-centric VQA instruction tuning, showing logarithmic relationships between data scale and model performance.
Limited Technical Novelty: The core methodology primarily combines existing techniques and relies on a closed-source model. The main finding that more high-quality instruction data improves performance is expected. There is a lack of comparison with automatically generated OCR datasets, such as those built from webpage screenshots or PDF parsing libraries. Evaluation benchmark: The author-reported numbers come from the December 2023 Gemini technical report. However, GPT-4V and Gemini Pro have gone through
- The model performance is strong. The paper presents a feasible solution for achieving closed-source GPT-4V level performance on text-centric benchmarks with an open-source model.
- The paper scaled the data to 10M and demonstrated its effectiveness by visualizing the scaling trend.
This is a solid paper if evaluated as an engineering report. However, as an ICLR submission, it falls short in terms of novelty and scientific contributions, and the ablations are insufficient.
- Weak Novelty: The approach is essentially self-instruction and knowledge distillation. The proposed prompting methodology (Square) appears to be a straightforward implementation and lacks ablations to prove the effectiveness of each step.
- Limited Scientific Findings: Although the abstract lists som
1) This work constructs a high-quality dataset, Square-10M, collecting the data entirely from open sources and generating it with an innovative construction pipeline; 2) The dataset improves current open-source models on a variety of benchmarks, with some results comparable to closed-source large multimodal models; 3) The correlation between data size, convergence loss, and model performance for text-centric VQA instruction tuning is demonstrated through thorough experiments.
1) The construction of the dataset relies too heavily on the Gemini model, and the paper does not demonstrate that the construction logic is effective with other models; 2) The construction step of the dataset, the Square strategy, did not strike the reviewers as novel; they view it as a variant similar to Chain of Thought; 3) The dataset is relatively large, and the trade-off between its construction cost and the benefits derived is open to discussion.
* Improving the OCR capabilities of MLLMs is a relevant research direction given the current real-world applications of such models.
* The paper is clear and easy to follow.
* The efficacy of the dataset in finetuning MLLMs for text-centric VQA is thoroughly evaluated on a comprehensive suite of benchmarks.
* The paper includes a comprehensive set of ablations on different subsets of the collected/generated data.
* The novelty of the paper is limited. For instance, previous works have also distilled proprietary MLLMs into large-scale datasets to finetune open-source models (e.g. [1]).
* Proprietary models are used to generate synthetic data to finetune open-source models. This means the performance of finetuned models is upper bounded by that of proprietary models. Cases where TextSquare's performance surpasses that of proprietary models might just be due to better prompting.
* Eliciting a rationale a
Taxonomy
Topics: Online and Blended Learning · Intelligent Tutoring Systems and Adaptive Learning
