MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning
Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda, Peter J. Stuckey, Hamid Rezatofighi

TL;DR
MATA introduces a hierarchical multi-agent system with trainable policies for visual reasoning, improving interpretability and performance on complex queries by dynamically selecting specialized agents.
Contribution
This work presents MATA, a novel trainable hierarchical automaton system that enables dynamic agent collaboration and competition for improved visual reasoning.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Provides transparent execution history through shared memory.
Effectively trains a hyper agent to select optimal sub-agents.
Abstract
Recent vision-language models have strong perceptual ability but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system presented as a hierarchical finite-state automaton for visual reasoning whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding transparent execution history. To supervise the hyper agent's transition policy, we build transition-trajectory trees and…
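The control structure described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`SharedMemory`, `Describer`, `Answerer`), the `FINAL` sentinel, and the hand-written `toy_policy` are all hypothetical stand-ins, and in MATA the top-level transition is chosen by a trained hyper agent rather than by fixed rules. The sketch only shows how agent states, rule-based sub-automata, and a shared memory that doubles as an execution trace could fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedMemory:
    """Blackboard read and written by every agent; also records the trace."""
    query: str
    facts: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


class Agent:
    """One state of the hyper automaton; run() drives a small rule-based sub-automaton."""
    name = "agent"

    def run(self, mem: SharedMemory) -> None:
        state = "start"
        while state != "done":
            mem.history.append(f"{self.name}:{state}")
            state = self.step(state, mem)

    def step(self, state: str, mem: SharedMemory) -> str:
        raise NotImplementedError


class Describer(Agent):  # hypothetical perception-style agent
    name = "describer"

    def step(self, state: str, mem: SharedMemory) -> str:
        if state == "start":
            return "write"
        mem.facts["caption"] = "a red cup on a table"  # stand-in for a vision model call
        return "done"


class Answerer(Agent):  # hypothetical reasoning-style agent
    name = "answerer"

    def step(self, state: str, mem: SharedMemory) -> str:
        if state == "start":
            # Rule-based micro-control: only answer if a caption is available.
            return "write" if "caption" in mem.facts else "done"
        mem.facts["answer"] = "red"  # stand-in for a language model call
        return "done"


def toy_policy(mem: SharedMemory) -> str:
    """Placeholder for the trainable hyper agent: picks the next top-level state."""
    if "answer" in mem.facts:
        return "FINAL"
    return "answerer" if "caption" in mem.facts else "describer"


def run_hyper_automaton(
    mem: SharedMemory,
    agents: Dict[str, Agent],
    policy: Callable[[SharedMemory], str],
    max_steps: int = 8,
) -> SharedMemory:
    for _ in range(max_steps):
        choice = policy(mem)
        mem.history.append(f"hyper->{choice}")
        if choice == "FINAL":
            break
        agents[choice].run(mem)
    return mem


if __name__ == "__main__":
    agents = {"describer": Describer(), "answerer": Answerer()}
    result = run_hyper_automaton(SharedMemory("What color is the cup?"), agents, toy_policy)
    print(result.facts.get("answer"))
    print(result.history)
```

Keeping the policy as a plain callable over the shared memory means the hand-written rule can be swapped for a learned model without touching the agents, which mirrors the split between trainable top-level transitions and rule-based micro-control that the abstract describes.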
Peer Reviews
Decision·ICLR 2026 Poster
- **Clear motivation**: The paper addresses an important limitation in current multi-agent systems: leveraging the power of multiple agents typically requires manual pipelining, which becomes unwieldy as task complexity grows. The proposal to learn a hyper-policy for agent selection is both reasonable and interesting.
- **Principled and extensible design**: MATA’s architecture is well aligned with its motivation. It is technically sound and, importantly, not narrowly restricted to the specific v
### Major
- **Limited applicability**: As noted in the limitations, the use of only three agents is a restricted setting. There is also a lack of detail regarding how these three agents were selected and the rationale behind their design.
- **Unclear attribution of performance gains**: It is not clear whether the learned state transition policy is truly responsible for the observed performance improvements. For example, if all three agents were simply called exhaustively, would performance impr
- **Clear conceptual motivation:** The paper identifies an important limitation of existing VLMs and compositional systems, namely the lack of a learned, flexible orchestration mechanism among reasoning agents, and recasts it elegantly as a finite-state automaton control problem.
- **Novel hierarchical formulation:** Treating each agent as a sub-automaton and learning high-level transitions through a hyper-agent is a conceptually clean, interpretable design that unifies rule-based micro-control wit
- **Incremental algorithmic novelty:** While the integration is elegant, many components (agent orchestration, SFT, trajectory trees) extend known concepts from HYDRA and NAVER. The work’s originality lies more in *system design* than in theoretical innovation.
- **Limited discussion of scalability:** The near-exhaustive transition expansion is tractable for 3 agents but may explode combinatorially as more states are added.
- **Computational cost analysis:** Wall-clock training times and GPU u
- The combination of trainable high-level transitions with rule-based sub-automata is elegant, focusing learning on the ambiguous agent selection problem while preserving reliable execution within agents.
- The paper provides experiments across multiple benchmarks (VQA and visual grounding), demonstrating consistent improvements over the base model.
- The transition trajectory tree expansion provides a principled approach to generating supervision for the hyper agent, though scalability concerns
- The gains appear modest (e.g., 75.2% for the base InternVL2.5 used as the VLM vs. 76.5% for MATA on A-OKVQA) considering the 90K in-domain training examples generated. The paper doesn't isolate whether improvements come from multi-agent collaboration or simply additional task-specific training data.
- Table 5 reveals that removing SFT causes performance to drop below the base InternVL2.5 model, suggesting the architecture itself may be detrimental without training. A crucial missing experiment is training a mon
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Constraint Satisfaction and Optimization
