TL;DR
This paper introduces Visual Para-Thinker, a novel parallel reasoning framework for multimodal large language models that enhances visual comprehension through divide-and-conquer strategies, demonstrating improved performance on benchmark datasets.
Contribution
It pioneers the application of parallel reasoning strategies to the visual domain, integrating new attention mechanisms and a native multimodal implementation.
Findings
Achieves state-of-the-art results on the V*, CountBench, RefCOCO, and HallusionBench datasets.
Demonstrates that parallel reasoning improves visual comprehension and reasoning diversity.
Validates the effectiveness of the proposed framework through empirical experiments.
Abstract
Existing LLM test-time scaling laws emphasize the emergence of self-reflective behaviors through extended reasoning length. Nevertheless, this vertical scaling strategy often encounters plateaus in exploration as the model becomes locked into a specific thinking pattern. By shifting from depth to parallelism, parallel thinking mitigates the narrowing of exploration. However, the extension of this paradigm to the visual domain remains an open research question. In this paper, we first examine the role of visual partitioning in parallelized reasoning and subsequently propose two distinct strategies. Based on the above, we introduce Visual Para-Thinker, the first parallel reasoning framework for MLLMs. To maintain path independence and promote diversity in reasoning, our approach integrates Pa-Attention alongside LPRoPE. Leveraging the vLLM framework, we have developed a native…
