MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning

Zhixi Cai; Fucai Ke; Kevin Leo; Sukai Huang; Maria Garcia de la Banda; Peter J. Stuckey; Hamid Rezatofighi

arXiv:2601.19204 · cs.AI · January 28, 2026

Open Access · 3 Reviews

TL;DR

MATA introduces a hierarchical multi-agent system with trainable policies for visual reasoning, improving interpretability and performance on complex queries by dynamically selecting specialized agents.

Contribution

This work presents MATA, a novel trainable hierarchical automaton system that enables dynamic agent collaboration and competition for improved visual reasoning.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.
02. Provides transparent execution history through shared memory.
03. Effectively trains a hyper agent to select optimal sub-agents.

Abstract

Recent vision-language models have strong perceptual ability but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system presented as a hierarchical finite-state automaton for visual reasoning whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding transparent execution history. To supervise the hyper agent's transition policy, we build transition-trajectory trees and…
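The control structure described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the class names `SharedMemory` and `Agent`, the function `run_hyper_automaton`, the agent names `perceiver` and `reasoner`, and the `FINAL` state are all hypothetical. In particular, `toy_policy` is a hand-written stand-in for the trainable hyper agent, which in MATA is learned rather than rule-based.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedMemory:
    """Memory that every agent reads and writes; the history doubles as an execution trace."""
    facts: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def write(self, agent: str, key: str, value: str) -> None:
        self.facts[key] = value
        self.history.append(f"{agent}: {key}={value}")


# A sub-automaton step takes the agent name and the shared memory.
Step = Callable[[str, SharedMemory], None]


class Agent:
    """One state of the hyper automaton, running a fixed rule-based sub-automaton."""

    def __init__(self, name: str, steps: List[Step]) -> None:
        self.name = name
        self.steps = steps  # sub-states executed in a fixed order

    def run(self, memory: SharedMemory) -> None:
        for step in self.steps:
            step(self.name, memory)


def run_hyper_automaton(
    agents: Dict[str, Agent],
    policy: Callable[[str, SharedMemory], str],
    memory: SharedMemory,
    start: str,
    max_steps: int = 8,
) -> List[str]:
    """Run agents until the policy returns the hypothetical FINAL state."""
    state, visited = start, []
    for _ in range(max_steps):
        if state == "FINAL":
            break
        visited.append(state)
        agents[state].run(memory)      # rule-based micro-control inside the agent
        state = policy(state, memory)  # top-level transition (learned in MATA)
    return visited


# Hypothetical sub-automaton steps with hard-coded outputs.
def detect(agent: str, mem: SharedMemory) -> None:
    mem.write(agent, "objects", "dog, ball")


def answer(agent: str, mem: SharedMemory) -> None:
    mem.write(agent, "answer", "the dog is chasing a ball")


def toy_policy(state: str, mem: SharedMemory) -> str:
    # Stand-in for the trainable hyper agent: stop once an answer exists.
    return "FINAL" if "answer" in mem.facts else "reasoner"


agents = {
    "perceiver": Agent("perceiver", [detect]),
    "reasoner": Agent("reasoner", [answer]),
}
memory = SharedMemory()
visited = run_hyper_automaton(agents, toy_policy, memory, start="perceiver")
```

After this run, `visited` lists the top-level states in order and `memory.history` holds one entry per write, which is the kind of transparent execution history the abstract refers to.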

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- **Clear motivation**: The paper addresses an important limitation in current multi-agent systems: leveraging the power of multiple agents typically requires manual pipelining, which becomes unwieldy as task complexity grows. The proposal to learn a hyper-policy for agent selection is both reasonable and interesting.
- **Principled and extensible design**: MATA’s architecture is well aligned with its motivation. It is technically sound and, importantly, not narrowly restricted to the specific v

Weaknesses

### Major

- **Limited applicability**: As noted in the limitations, the use of only three agents is a restricted setting. There is also a lack of detail regarding how these three agents were selected and the rationale behind their design.
- **Unclear attribution of performance gains**: It is not clear whether the learned state transition policy is truly responsible for the observed performance improvements. For example, if all three agents were simply called exhaustively, would performance impr

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- **Clear conceptual motivation:** The paper identifies an important limitation of existing VLMs and compositional systems, namely the lack of a learned, flexible orchestration mechanism among reasoning agents, and recasts it elegantly as a finite-state automaton control problem.
- **Novel hierarchical formulation:** Treating each agent as a sub-automaton and learning high-level transitions through a hyper-agent is a conceptually clean, interpretable design that unifies rule-based micro-control wit

Weaknesses

- **Incremental algorithmic novelty:** While the integration is elegant, many components (agent orchestration, SFT, trajectory trees) extend known concepts from HYDRA and NAVER. The work’s originality lies more in *system design* than in theoretical innovation.
- **Limited discussion of scalability:** The near-exhaustive transition expansion is tractable for 3 agents but may explode combinatorially as more states are added.
- **Computational cost analysis:** Wall-clock training times and GPU u
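The scalability concern about near-exhaustive transition expansion can be made concrete with a quick count. The sketch below assumes, purely for illustration, that every one of `num_states` agents is tried at each of `depth` transition steps with no pruning; the paper's actual expansion procedure may differ, and the function name `trajectory_tree_size` is hypothetical.

```python
def trajectory_tree_size(num_states: int, depth: int) -> int:
    """Upper bound on nodes in a tree that expands every state at every level.

    Level 0 is the root; level i holds num_states ** i nodes.
    """
    return sum(num_states ** level for level in range(depth + 1))


small = trajectory_tree_size(3, 4)   # 1 + 3 + 9 + 27 + 81 = 121
large = trajectory_tree_size(10, 4)  # 1 + 10 + 100 + 1000 + 10000 = 11111
```

Under this assumption, moving from 3 to 10 states at the same depth grows the tree by roughly two orders of magnitude, which is the combinatorial growth the reviewer is pointing at.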

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The combination of trainable high-level transitions with rule-based sub-automata is elegant, focusing learning on the ambiguous agent selection problem while preserving reliable execution within agents.
- The paper provides experiments across multiple benchmarks (VQA and visual grounding), demonstrating consistent improvements over the base model.
- The transition trajectory tree expansion provides a principled approach to generating supervision for the hyper agent, though scalability concerns

Weaknesses

- The gains appear modest (e.g., 75.2% for the base internvl25 used as the VLM vs. 76.5% for MATA on AOKVQA) considering the 90K in-domain training examples generated. The paper does not isolate whether improvements come from multi-agent collaboration or simply from additional task-specific training data.
- Table 5 reveals that removing SFT causes performance to drop below the base internvl25 model, suggesting the architecture itself may be detrimental without training. A crucial missing experiment is training a mon

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Constraint Satisfaction and Optimization