ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
Qi Jia, Xiang Yue, Shanshan Huang, Ziheng Qin, Yizhu Liu, Bill Yuchen Lin, Yang You, Guangtao Zhai

TL;DR
This paper introduces ASCIIEval, a benchmark for evaluating models' ability to perceive visual semantics in ASCII art, revealing strengths and limitations of current LLMs and MLLMs in recognizing and understanding ASCII-based concepts across modalities.
Contribution
We present ASCIIEval, a novel benchmark with extensive samples and analysis, to evaluate and compare models' visual perception of ASCII art in text and image modalities.
Findings
Proprietary models achieve over 70% accuracy on certain ASCII art categories.
Open-source MLLMs show limited generalization and a 20.01% accuracy gap.
Model performance is sensitive to ASCII art length and modality fusion.
Abstract
Perceiving visual semantics embedded within consecutive characters is a crucial yet under-explored capability for both Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs). In this work, we select ASCII art as a representative artifact. It depicts concepts through careful arrangement of characters, which can be formulated in both text and image modalities. We frame the problem as a recognition task, and construct a novel benchmark, ASCIIEval. It covers over 3K samples with an elaborate categorization tree, along with a training set for further enhancement. Encompassing a comprehensive analysis of tens of models through different input modalities, our benchmark demonstrates its multi-faceted diagnostic power. Given textual input, language models show their visual perception ability on ASCII art concepts. Proprietary models achieve over 70% accuracy on certain…
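The recognition-task framing described in the abstract can be sketched as a multiple-choice question posed over the raw character string. The following is a minimal illustration only: the `Sample` record layout, the prompt wording, and the letter-based scoring are assumptions made for this example, not the actual ASCIIEval schema or evaluation harness.

```python
from dataclasses import dataclass

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


@dataclass
class Sample:
    """Hypothetical record layout; the real ASCIIEval schema may differ."""
    ascii_art: str
    choices: list  # candidate concept names
    answer: str    # gold concept, expected to be one of `choices`


def build_prompt(sample: Sample) -> str:
    """Format one sample as a text-only multiple-choice question (assumed wording)."""
    options = "\n".join(
        f"{LETTERS[i]}. {choice}" for i, choice in enumerate(sample.choices)
    )
    return (
        "What does the following ASCII art depict?\n\n"
        f"{sample.ascii_art}\n\n"
        f"{options}\n"
        "Answer with a single letter."
    )


def accuracy(samples: list, predicted_letters: list) -> float:
    """Fraction of samples whose predicted letter selects the gold concept."""
    correct = 0
    for sample, letter in zip(samples, predicted_letters):
        key = letter.strip().upper()[:1]
        # An empty or unrecognized reply counts as wrong.
        index = LETTERS.find(key) if key else -1
        if 0 <= index < len(sample.choices) and sample.choices[index] == sample.answer:
            correct += 1
    return correct / len(samples) if samples else 0.0


# Toy sample, invented for illustration.
cat = Sample(
    ascii_art=" /\\_/\\\n( o.o )\n > ^ <",
    choices=["dog", "cat", "fish", "tree"],
    answer="cat",
)
print(build_prompt(cat))
print(accuracy([cat], ["B"]))
```

In this sketch the model's reply is reduced to its first letter, so malformed or empty replies are simply scored as incorrect.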
Peer Reviews
Decision·ICLR 2026 Poster
The main strengths of the paper are as follows: 1. It introduces a novel cross-modal benchmark covering a still underexplored domain (visual pattern recognition within text). Prior research has heavily focused on reading text in images (OCR) or traditional image understanding, while ASCIIEval utilizes ASCII art as a modality-agnostic bridge between text and vision. 2. The authors created the benchmark with a comprehensive methodology, e.g., a 3-layer category hierarchy of the tasks that enable
The main weaknesses of the paper are as follows: 1. While the ASCII benchmark is interesting in itself, it is still somewhat narrow. From the motivation for creating the benchmark, it is unclear what general insights ASCII art evaluation provides beyond this specific format, for example, how well this translates to broader model capabilities. Stronger correlations with general benchmarks would help clarify relevance. While some notes about the trade-off between the OCR performance and ASCII per
(1) Addresses an underexplored capability: visual perception in text strings; ASCII art is a strong, modality-agnostic testbed. (2) Carefully curated benchmark with taxonomy, safety filtering, human upper bound, and objective multiple-choice evaluation. (3) Broad, current evaluation with clear, actionable insights (OCR vs holistic trade-off; length effects; fusion failure) and simple, effective mitigations (low-res prompting; vision-backbone finetuning; rationale distillation).
(1) Ambiguity and label integrity: Although human filtering was applied, the paper acknowledges remaining ambiguity (<1.67%) and reports a relatively low accuracy (70%) for spot-checks in ASCIITune. More rigorous inter-annotator agreement (IAA), label adjudication protocols, and confusion analyses across similar concepts would strengthen trust in labels and distractors. (2) Potential source bias and reuse: The dataset draws heavily from online galleries with human-made ASCII, which is valuable
1. The paper is well written and easy to follow. 2. The authors do extensive experiments and test a wide range of different LLMs and MLLMs. 3. The problem of ASCII art for LLMs and MLLMs is interesting. 4. The work is solid. It not only constructs a high-quality test set (ASCIIEval) with a multi-layer categorization system but also provides a training set (ASCIITune) for enhancing model capabilities. The exhaustive evaluation of over 50 mainstream models across three modalities makes its conclusions
**1. Limited technical contribution**: While the research question is intriguing, the paper's technical contribution remains relatively modest. The proposed Rationale-Assisted Training approach essentially leverages GPT to construct chain-of-thought data, which can be viewed as a form of capability distillation from a more powerful model. **2. Failure to fully explore model robustness**: The paper only briefly touches upon font sensitivity and character perturbation analysis, which should have been
Taxonomy
Topics: Handwritten Text Recognition Techniques
Methods: Sparse Evolutionary Training
