ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

Kerry Luo; Michael Fu; Joshua Peguero; Husnain Malik; Anvay Patil; Joyce Lin; Megan Van Overborg; Ryan Sarmiento; Kevin Zhu

arXiv:2512.04125 · cs.LG · December 5, 2025

TL;DR

ASCIIBench introduces a new benchmark dataset and evaluation framework for assessing language models' ability to understand and generate ASCII art, revealing current limitations in multimodal representations and emphasizing the need for specialized embedding methods.

Contribution

This paper presents what the authors describe as the first publicly available ASCII art benchmark, along with weights for a CLIP model fine-tuned on ASCII art, for evaluating and analyzing LLMs' spatial reasoning and visual understanding.

Findings

01 Cosine similarity over CLIP embeddings often fails to distinguish ASCII categories.

02 High internal mean similarity correlates with better class discriminability.

03 ASCII art serves as a stress test for multimodal representation quality.
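
The first two findings can be made concrete with a small sketch. It compares the mean cosine similarity inside a class with the mean similarity across two classes. This is an illustration under stated assumptions, not the authors' evaluation code: the 2-D vectors stand in for CLIP embeddings, and the class names, the function names, and the "separation gap" summary are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_intra_class_similarity(vectors):
    """Average cosine similarity over all unordered pairs within one class."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_inter_class_similarity(class_a, class_b):
    """Average cosine similarity over all cross-class pairs."""
    sims = [cosine(u, v) for u in class_a for v in class_b]
    return sum(sims) / len(sims)


# Toy stand-ins for embeddings (hypothetical values, not from the paper).
embeddings = {
    "cat": [[1.0, 0.0], [0.8, 0.6]],
    "tree": [[0.0, 1.0], [0.6, 0.8]],
}

intra_cat = mean_intra_class_similarity(embeddings["cat"])
intra_tree = mean_intra_class_similarity(embeddings["tree"])
inter = mean_inter_class_similarity(embeddings["cat"], embeddings["tree"])

# A small gap means cosine similarity alone separates the two classes poorly.
separation_gap = min(intra_cat, intra_tree) - inter

print(f"intra(cat)={intra_cat:.2f} intra(tree)={intra_tree:.2f} "
      f"inter={inter:.2f} gap={separation_gap:.2f}")
```

In this toy setup, a class whose members are tightly clustered (high internal mean similarity) leaves more room between its intra-class and inter-class scores, which is one plausible reading of why the second finding ties internal similarity to discriminability.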

Abstract

Large language models (LLMs) have demonstrated several emergent behaviors with scale, including reasoning and fluency in long-form text generation. However, they continue to struggle with tasks requiring precise spatial and positional reasoning. ASCII art, a symbolic medium where characters encode structure and form, provides a unique probe of this limitation. We introduce ASCIIBench, a novel benchmark for evaluating both the generation and classification of ASCII-text images. ASCIIBench consists of a filtered dataset of 5,315 class-labeled ASCII images and is, to our knowledge, the first publicly available benchmark of its kind. Alongside the dataset, we release weights for a fine-tuned CLIP model adapted to capture ASCII structure, enabling the evaluation of LLM-generated ASCII art. Our analysis shows that cosine similarity over CLIP embeddings fails to separate most ASCII categories,…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Aesthetic Perception and Analysis