A Neurosymbolic Agent System for Compositional Visual Reasoning
Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

TL;DR
VLAgent is a neuro-symbolic system that enhances compositional visual reasoning by combining interpretable planning, syntax checking, and stepwise verification, outperforming existing models on multiple benchmarks.
Contribution
The paper introduces VLAgent, a novel neuro-symbolic framework with a structured reasoning pipeline, syntax correction, and verification, advancing compositional visual reasoning capabilities.
Findings
VLAgent outperforms state-of-the-art models on six visual reasoning benchmarks.
The system effectively detects and repairs logic errors in reasoning plans.
Stepwise verification improves reasoning accuracy and robustness.
Abstract
The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal vision-language reasoning capabilities. However, existing vision-language models (VLMs) remain challenged by compositional visual reasoning. This paper presents VLAgent, a neuro-symbolic approach to developing a Vision-Language Agent system for efficient compositional visual reasoning with three novel features. First, VLAgent develops an interpretable visualization-enhanced two-stage neuro-symbolic reasoning system. The first stage is managed by a front-end engine that generates a structured visual reasoning plan (symbolic program script) for each compositional visual reasoning task by utilizing a pre-trained LLM powered with few-shot chain-of-thought in-context learning. The second stage is managed by a high-performance back-end engine. It transforms the planning script…
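The flow described above (a generated plan script, a syntax check with repair, then stepwise execution with verification) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the script grammar, the module names `LOC` and `COUNT`, the registry, and the fuzzy-match repair rule are all hypothetical, and the plan is hard-coded where VLAgent would obtain it from an LLM.

```python
import ast
import difflib
import re


def loc(image, label):
    """Stub detector: one dummy box per matching label in a toy 'image'."""
    return [(i, 0, i + 1, 1) for i, name in enumerate(image) if name == label]


def count(boxes):
    """Count the boxes produced by an earlier step."""
    return len(boxes)


# Hypothetical registry: module name -> (implementation, expected output type).
MODULES = {"LOC": (loc, list), "COUNT": (count, int)}


def parse_step(line):
    """Parse 'VAR = MODULE(key=value, ...)' into (target, module, keyword ASTs)."""
    tree = ast.parse(line.strip())
    node = tree.body[0] if tree.body else None
    if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
        raise SyntaxError(f"expected 'VAR = MODULE(...)': {line!r}")
    target, call = node.targets[0], node.value
    if not (isinstance(target, ast.Name) and isinstance(call.func, ast.Name)) or call.args:
        raise SyntaxError(f"expected keyword-only module call: {line!r}")
    return target.id, call.func.id, {kw.arg: kw.value for kw in call.keywords}


def check_and_repair(plan):
    """Static pass: fix near-miss module names, reject undefined variables."""
    defined, repaired = {"IMAGE"}, []
    for line in plan:
        target, module, kwargs = parse_step(line)
        if module not in MODULES:
            close = difflib.get_close_matches(module, list(MODULES), n=1)
            if not close:
                raise SyntaxError(f"unknown module {module!r}")
            # Rewrite only the call site, not a variable that shares the name.
            pattern = rf"\b{re.escape(module)}(?=\s*\()"
            line = re.sub(pattern, close[0], line, count=1)
        for value in kwargs.values():
            if isinstance(value, ast.Name) and value.id not in defined:
                raise NameError(f"{value.id!r} used before assignment")
        defined.add(target)
        repaired.append(line)
    return repaired


def execute(plan, image):
    """Run steps in order, verifying each output type before continuing."""
    env, trace, out = {"IMAGE": image}, [], None
    for line in plan:
        target, module, kwargs = parse_step(line)
        fn, expected = MODULES[module]
        args = {key: env[val.id] if isinstance(val, ast.Name) else ast.literal_eval(val)
                for key, val in kwargs.items()}
        out = fn(**args)
        if not isinstance(out, expected):
            raise TypeError(f"{module} returned {type(out).__name__}")
        env[target] = out
        trace.append((line, out))
    return out, trace


# The first step misspells LOC; the static pass repairs it before execution.
plan = ["BOX0 = LOCS(image=IMAGE, label='dog')", "ANSWER = COUNT(boxes=BOX0)"]
answer, trace = execute(check_and_repair(plan), ["dog", "cat", "dog"])
print(answer)  # prints 2
```

Keeping the static check separate from execution means a malformed plan is caught before any (potentially expensive) vision module runs, while the per-step type check stops a bad intermediate result from propagating to later steps. The returned `trace` is the kind of step-by-step record that makes such a pipeline inspectable.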
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Detailed evaluation. The paper compares VLAgent with representative image QA approaches in the VLM category and with zero-shot methods on 4 popular ImageQA benchmarks, then evaluates the generalization capability of VLAgent on two recent video benchmarks, and includes a detailed ablation study. The paper conducts a comprehensive evaluation of VLAgent across multiple dimensions. First, VLAgent is compared against representative image QA approaches on four popular ImageQA benchmarks. Subsequently, the model's generalization c
### Clarity & Motivation
- The paper's organization hinders comprehension. The introduction primarily lists limitations and corresponding solutions without adequately explaining the underlying motivation, specifically why the proposed two-stage framework and output verifier are effective in addressing the stated problems. This lack of conceptual linkage makes the argument feel incoherent.
- Similarly, the abstract enumerates contributions but fails to articulate their core motivation.
- Figur
- The paper introduces a well-structured neuro-symbolic framework that clearly separates planning and execution stages. This design improves modularity and interpretability, allowing each stage to be analyzed, debugged, and extended independently.
- The proposed SS-Parser effectively detects and repairs syntax and logic errors in LLM-generated plans, addressing a common failure mode of LLM-based program generation.
- The model achieves strong zero-shot performance across multiple visual reason
- The multi-stage architecture introduces additional computational and implementation complexity. Each step of planning, parsing, repairing, and verifying adds latency and engineering cost, which may limit real-time or large-scale deployment. Additional clarification on the computational cost would enhance the paper's clarity.
- The evaluation is mainly limited to visual QA tasks, which restricts the demonstrated generality of the framework. Broader reasoning domains (e.g., robotics, text–image
The general idea of a neuro-symbolic agent approach is valuable for the community, and the experimental evidence seems to show improvements over baselines.
However, I find it very difficult to assess what exactly the core contribution or claim of this work is. The proposed VLAgent seems to incorporate many different modules, and neither how they work together nor how they are implemented is properly described. This makes it quite hard to assess the significance of the results. E.g., if there are potentially so many different models within VLAgent, is the comparison to the other baselines fair? Perhaps the authors could specify again what the cor
- The authors propose a method for compositional visual reasoning that operates on images as well as videos, which shows improvements over existing methods. The method provides reasoning traces that improve the interpretability of the final output.
- The two-stage pipeline and the introduction of an SS-Parser and verifier to check and refine outputs are interesting and relevant contributions.
- The proposed ensemble pruning is an interesting approach for robustness when dealing with multiple
- My main concern is the presentation of the method. It is very hard to follow the different modules of the method and especially to understand the architecture, as the methodology mainly gives insights into what the modules are supposed to do without going into technical detail on how exactly it is done, at least on a formal level (what the inputs and outputs of the modules are).
- The front-end part is only very briefly explained. There are no additional details on the task dispatcher, task-specific
Taxonomy
Topics: Fuzzy Logic and Control Systems
