Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios

Chao Wang; Luning Zhang; Zheng Wang; Yang Zhou

arXiv:2502.19973 · cs.CV · March 11, 2025

TL;DR

This paper evaluates large language models' ability to perform combinatorial reasoning across multiple perceptual inputs in complex scenarios, introducing new benchmarks and methods to improve their reasoning capabilities.

Contribution

It introduces two novel benchmarks for assessing combinatorial reasoning in multi-modal models and proposes approaches that significantly enhance model performance on these tasks.

Findings

01. Current models perform poorly on combinatorial reasoning benchmarks.

02. State-of-the-art models achieve only 33.04% accuracy on CVQA.

03. Proposed methods improve performance by over 22% on CVQA.

Abstract

Combining multiple perceptual inputs and performing combinatorial reasoning in complex scenarios is a sophisticated cognitive function in humans. With advancements in multi-modal large language models, recent benchmarks tend to evaluate visual understanding across multiple images. However, they often overlook the necessity of combinatorial reasoning across multiple pieces of perceptual information. To explore the ability of advanced models to integrate multiple perceptual inputs for combinatorial reasoning in complex scenarios, we introduce two benchmarks: Clue-Visual Question Answering (CVQA), with three task types to assess visual comprehension and synthesis, and Clue of Password-Visual Question Answering (CPVQA), with two task types focused on accurate interpretation and application of visual data. For our benchmarks, we present three plug-and-play approaches: utilizing model input for…
