Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

Qihua Dong; Ruozhen He; Junwen Chen; Yizhou Wang; Xu Ma; Songyao Jiang; Yun Fu

arXiv:2605.04304 · cs.CV · May 7, 2026

TL;DR

The paper introduces HierVA, a hierarchical visual agent framework that improves chart reasoning by managing visual and textual contexts separately and iteratively constructing a compact working context for multi-step reasoning.

Contribution

HierVA is a hierarchical framework for multi-step chart reasoning. A high-level manager generates plans and keeps a compact working context, specialized workers carry out the reasoning steps, and visual and textual contexts are maintained separately.

Findings

01

HierVA outperforms strong multimodal baselines on the CharXiv reasoning subset.

02

Ablation studies show that the hierarchical architecture and the scoped context improve performance.

03

A zoom-in tool restricts the visual context, which supports better reasoning.
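
To make the zoom-in idea concrete, here is a minimal sketch of restricting visual context to one region of a figure. It is an illustration only, not the paper's tool: the `zoom_in` function, the `Box` convention, and the toy nested-list image are all assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image, row-major (assumption)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def zoom_in(image: Image, box: Box) -> Image:
    """Return only the pixels inside `box`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    # Clamp the requested box so out-of-range coordinates do not fail silently.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


# A 4x6 toy "figure" standing in for two 4x3 subplots side by side.
figure = [[10 * r + c for c in range(6)] for r in range(4)]

# Restrict the visual context to the right-hand subplot only.
right_subplot = zoom_in(figure, (3, 0, 6, 4))
```

In a real pipeline the crop would presumably be applied to a rendered chart image and the cropped region passed to the model in place of the full figure; the listing does not specify how the paper implements this.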

Abstract

Advanced chart question answering requires both precise perception of small visual elements and multi-step reasoning across several subplots. While existing MLLMs are strong at understanding single plots, they often struggle with multi-step reasoning across multiple subplots. We propose HierVA, a hierarchical visual agent framework for chart reasoning that iteratively constructs and updates a working context in a joint image-text space. A high-level manager generates plans and maintains a compact context containing only key information, while specialized workers perform reasoning, gather evidence, and return results. In particular, the agent maintains separate visual and textual contexts, using a zoom-in tool to restrict the visual context. Experiments on the CharXiv reasoning subset demonstrate consistent improvements over strong multimodal baselines, and ablation studies verify that…
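
The manager and worker structure described in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `WorkingContext`, `SubTask`, `run_manager`, `stub_worker`, and the `max_notes` limit are hypothetical names, the plan is fixed up front rather than generated iteratively, and the worker is a stub instead of a multimodal model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); hypothetical convention


@dataclass
class WorkingContext:
    """Separate visual and textual contexts held by the manager."""
    visual: List[Box] = field(default_factory=list)   # regions currently in view
    textual: List[str] = field(default_factory=list)  # compact key findings


@dataclass
class SubTask:
    instruction: str
    region: Box


Worker = Callable[[SubTask], str]


def run_manager(plan: List[SubTask], worker: Worker, max_notes: int = 3) -> WorkingContext:
    """Dispatch each sub-task to a worker and keep only a compact context."""
    ctx = WorkingContext()
    for task in plan:
        ctx.visual = [task.region]              # scope the visual context to one region
        ctx.textual.append(worker(task))        # worker returns its evidence as text
        ctx.textual = ctx.textual[-max_notes:]  # retain only the most recent notes
    return ctx


def stub_worker(task: SubTask) -> str:
    # Stand-in for a model call; it only echoes the task it was given.
    return f"{task.instruction} @ {task.region}"


plan = [
    SubTask("read peak of subplot A", (0, 0, 3, 4)),
    SubTask("read peak of subplot B", (3, 0, 6, 4)),
]
final_ctx = run_manager(plan, stub_worker)
```

Keeping the last `max_notes` entries is just one simple way to bound the textual context; the abstract says only that the manager keeps "key information" and does not describe the selection rule.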
