Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Gang Peng

TL;DR
This paper introduces a framework for evaluating how well large language models preserve user intent across different semantic dimensions, revealing systematic fidelity issues not captured by holistic scores.
Contribution
It proposes a dimension-level intent fidelity evaluation method, applied through structured prompt ablation across three languages, three task domains, and six LLMs, that measures structural recovery and intent fidelity separately for each semantic dimension.
Findings
Holistic scores and dimension-level intent fidelity diverge substantially in LLM outputs: many outputs with a perfect holistic score still show measurable intent deficits.
Human evaluation confirms that intent fidelity scores align better with perceived quality than holistic scores do.
Moderate misalignments are absorbed, but severe inversions harm model performance.
Abstract
Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that…
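The split-zone proportion quoted in the abstract can be illustrated with a minimal sketch. It assumes each output carries a holistic GA score and a per-dimension intent fidelity score; the record layout, the names `ScoredOutput` and `split_zone_rate`, and the `FULL_FIDELITY` cutoff are hypothetical and are not taken from the paper, whose actual scoring scale and deficit criterion may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical cutoff: a dimension scoring below this counts as an intent deficit.
FULL_FIDELITY = 1.0


@dataclass
class ScoredOutput:
    language: str
    ga: Optional[int]                     # holistic alignment score, None if missing
    fidelity: Dict[str, Optional[float]]  # per-dimension intent fidelity scores


def is_complete(o: ScoredOutput) -> bool:
    """True when both the holistic score and every dimension score are present."""
    return (
        o.ga is not None
        and bool(o.fidelity)
        and all(v is not None for v in o.fidelity.values())
    )


def split_zone_rate(outputs: List[ScoredOutput], language: str) -> Optional[float]:
    """Share of complete paired outputs with GA=5 and at least one dimension deficit."""
    paired = [o for o in outputs if o.language == language and is_complete(o)]
    if not paired:
        return None  # no complete paired scores for this language
    split = [
        o for o in paired
        if o.ga == 5 and any(v < FULL_FIDELITY for v in o.fidelity.values())
    ]
    return len(split) / len(paired)
```

The denominator here is every output with complete paired scores in the given language, matching the phrasing of the abstract; outputs with a missing holistic or dimension score are excluded rather than counted as deficits.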
