MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

Zhongpu Chen; Yinfeng Liu; Long Shi; Xingyan Chen; Yu Zhao; Fuji Ren

arXiv:2501.15000·cs.CL·August 28, 2025

TL;DR

This paper introduces MDEval, a benchmark dataset and evaluation method for assessing Markdown Awareness in large language models. The method is designed to be interpretable, and its scores correlate well with human judgment.

Contribution

We present MDEval, a benchmark for evaluating Markdown Awareness in LLMs. It pairs a dataset of 20K instances covering 10 subjects in English and Chinese with an interpretability-focused evaluation method that is more accurate than existing evaluation methods.

Findings

01. MDEval achieves a Spearman correlation of 0.791 with human judgment.

02. MDEval's accuracy reaches 84.1%, surpassing existing evaluation methods.

03. Fine-tuning models on MDEval improves their Markdown Awareness to near GPT-4 levels.
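
The Spearman correlation in the first finding measures how closely an automatic score ranks responses in the same order as human raters. As a minimal sketch of that computation (not the paper's implementation), the Python below ranks two score lists, averaging ranks for ties, and takes the Pearson correlation of the ranks. The function names `average_ranks` and `spearman` and all of the scores are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: an automatic score and a human rating per response.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 3, 1, 4]

print(round(spearman(metric_scores, human_ratings), 3))  # prints 0.9
```

In this toy data the two rankings disagree only on which of two responses comes third and which fourth, so rho is 0.9. A value of 1.0 would mean the rankings agree exactly.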

Abstract

Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate the readability from the view of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an…

Code & Models

Repositories

swufe-db-group/mdeval-benchmark (Official)

Taxonomy

Topics: Topic Modeling

Methods: Focus