TL;DR
This paper introduces $A^2R^2$, a novel framework that enhances Img2LaTeX conversion by integrating visual reasoning with attention-guided iterative refinement, significantly improving accuracy on challenging datasets.
Contribution
The paper proposes a new attention-guided refinement framework for Img2LaTeX that enables self-correction and iterative improvement, along with a challenging new dataset for evaluation.
Findings
Significant performance improvements across multiple metrics
Performance gains increase with more inference rounds
Ablation results confirm the contributions of the individual components
Abstract
Img2LaTeX is a practically important task that involves translating mathematical expressions and structured visual content from images into LaTeX code. In recent years, vision-language models (VLMs) have achieved remarkable progress across a range of visual understanding tasks, largely due to their strong generalization capabilities. However, despite initial efforts to apply VLMs to the Img2LaTeX task, their performance remains suboptimal. Empirical evidence shows that VLMs can be challenged by fine-grained visual elements, such as subscripts and superscripts in mathematical expressions, which results in inaccurate LaTeX generation. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework,…
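The iterative refinement described in the abstract can be illustrated with a minimal sketch. The peer reviews on this page describe each cycle as Comparison, Verification, and Refinement calls around an external rendering tool, with the number of rounds capped at two. Everything else here is an assumption made for illustration: the function names (`refine_loop`, `generate`, `render`, `compare`, `verify`, `refine`), their signatures, and the toy stand-ins, in which a string plays the role of the image. This is not the authors' implementation and it does not model attention localization.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class RefinementTrace:
    latex: str          # final LaTeX string
    rounds_used: int    # refinement rounds that actually changed the output
    history: List[str]  # initial generation plus one entry per refinement


def refine_loop(
    image: Any,
    generate: Callable[[Any], str],
    render: Callable[[str], Any],
    compare: Callable[[Any, Any], list],
    verify: Callable[[Any, Any, Any], bool],
    refine: Callable[[Any, str, list], str],
    max_rounds: int = 2,  # the reviews mention a cap of two rounds
) -> RefinementTrace:
    """Hypothetical render-compare-verify-refine loop (illustrative only)."""
    latex = generate(image)
    history = [latex]
    rounds_used = 0
    for _ in range(max_rounds):
        rendered = render(latex)                  # external rendering step
        candidates = compare(image, rendered)     # Comparison: list discrepancies
        confirmed = [d for d in candidates
                     if verify(image, rendered, d)]  # Verification: filter them
        if not confirmed:
            break                                 # nothing left to fix
        latex = refine(image, latex, confirmed)   # Refinement: rewrite the LaTeX
        history.append(latex)
        rounds_used += 1
    return RefinementTrace(latex, rounds_used, history)


# Toy stand-ins: the "image" is just the ground-truth string, so rendering is
# the identity and comparison is a character-level diff. A real system would
# call a VLM and a LaTeX renderer here.
def toy_generate(image: str) -> str:
    return "x_{1}^{2}"  # simulated misread of the subscript


def toy_render(latex: str) -> str:
    return latex


def toy_compare(image: str, rendered: str) -> list:
    return [(i, a) for i, (a, b) in enumerate(zip(image, rendered)) if a != b]


def toy_verify(image: str, rendered: str, diff: tuple) -> bool:
    return True  # accept every reported discrepancy in this toy setting


def toy_refine(image: str, latex: str, diffs: list) -> str:
    chars = list(latex)
    for index, expected in diffs:
        chars[index] = expected
    return "".join(chars)


trace = refine_loop("x_{i}^{2}", toy_generate, toy_render,
                    toy_compare, toy_verify, toy_refine)
no_refinement = refine_loop("x_{i}^{2}", toy_generate, toy_render,
                            toy_compare, toy_verify, toy_refine, max_rounds=0)
print(trace.latex, trace.rounds_used)
```

In this sketch the stages run strictly one after another within a round, which is the property the reviewers point to when they raise wall-clock latency as a concern.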
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The proposed A²R² framework is well-designed and conceptually elegant. The idea of "closing the loop" by having the model render its own output and visually compare it against the input is a powerful form of self-verification.
2. The paper presents compelling evidence for the effectiveness of A²R². The framework consistently and significantly outperforms all baselines across two different model architectures and scales.
3. The introduction of the Img2LaTeX-Hard-1K dataset is a significant…
1. The primary drawback of the A²R² framework is its significant computational cost. Each refinement cycle involves multiple VLM inference calls (Comparison, Verification, Refinement) plus the overhead of an external rendering tool. While the authors cap the rounds at two for a fair comparison with Best-of-N=8 in terms of token count, the sequential nature of the A²R² loop will inevitably lead to much higher wall-clock latency.
2. The paper frames A²R² as a purely inference-time, training-free…
Clear writing and presentation. The paper is well-organized, with high-quality figures and clear explanations that make the method easy to follow. Demonstrated effectiveness. Results on the proposed benchmark show that the approach consistently improves performance. Comprehensive analysis. The experimental section includes detailed ablations and analysis.
Limited novelty. The render–compare–refine pipeline is intuitive and analogous to well-known strategies in text-to-image and code-generation settings. While practical, it offers limited conceptual novelty and lacks deeper algorithmic insight. Without a training component, the method reads as an incremental engineering improvement. High inference-time overhead. Although training-free, the method introduces multiple iterative steps, which could lead to substantial test-time cost. A comparison of
1. The paper focuses on the relatively niche yet technically challenging **Img2LaTeX** problem. This clear and well-scoped objective allows for a deep exploration of visual–symbolic reasoning within a specific and meaningful application domain.
2. The release of **Img2LaTeX-Hard-1K**, a dataset specifically curated for **hard and visually complex mathematical expressions**, provides a **valuable new benchmark** for future research in **formula recognition** and **symbolic reasoning**.
1. **Potentially Misleading Terminology (“Attention-Guided Refinement”)** The term *“Attention-Guided Refinement”* is somewhat misleading, as the proposed framework does **not include any actual attention-layer modeling**. Instead, it relies purely on **semantic-level comparison** between the generated LaTeX and the rendered image. A more precise term would better reflect the underlying mechanism and avoid confusion with attention-based architectures.
2. **Lack of Novel Data Cont…**
