MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models
Zhongpu Chen, Yinfeng Liu, Long Shi, Xingyan Chen, Yu Zhao, Fuji Ren

TL;DR
This paper introduces MDEval, a benchmark dataset and evaluation method for assessing Markdown Awareness in large language models. The method is designed to be interpretable, and its scores correlate well with human judgment.
Contribution
We present MDEval, a new benchmark with a large dataset and interpretability-focused metrics for evaluating Markdown Awareness in LLMs. Its accuracy surpasses that of existing evaluation methods.
Findings
MDEval achieves a Spearman correlation of 0.791 with human judgment.
MDEval's accuracy reaches 84.1%, surpassing existing evaluation methods.
Fine-tuning models on MDEval improves their Markdown Awareness to near GPT-4 levels.
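To make the correlation figure above concrete, here is a minimal sketch of how a Spearman rank correlation between an automatic metric and human ratings can be computed. It is not the authors' code: the function names (`average_ranks`, `spearman`) and the sample scores are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical data: metric scores and human ratings for five responses.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 3, 1, 4]
print(round(spearman(metric_scores, human_ratings), 3))  # 0.9
```

A value near 1 means the metric orders responses almost the same way human raters do; the reported 0.791 would indicate strong but not perfect agreement in ranking.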
Abstract
Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although myriad metrics exist for evaluating LLMs, they do not assess readability from the perspective of output content structure. To this end, we focus on an overlooked yet important metric, Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, built on a dataset of 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an…
Taxonomy
Topics: Topic Modeling
Methods: Focus
