A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text

Alicia Sagae; Chia-Jung Lee; Sandeep Avula; Brandon Dang; Vanessa Murdock

arXiv:2510.20782·cs.CL·October 24, 2025

TL;DR

This paper introduces a specialized dataset for evaluating responsible AI dimensions in LLMs within a real-world application context, focusing on fairness, quality, veracity, and safety.

Contribution

The paper presents a new dataset for measuring Responsible AI dimensions of LLMs on a product description generation task, where the fairness attributes that matter depend on the application.

Findings

01 Identifies fairness gaps in LLM-generated product descriptions.

02 Provides a dataset for evaluating LLMs on multiple responsible AI dimensions.

03 Offers a framework for targeted LLM evaluation in real-world tasks.

Abstract

Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
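
The parameterization described in the abstract can be pictured as a cross product of fairness attributes, gendered adjectives, and product categories. The sketch below is a minimal illustration only, not the authors' construction code: the placeholder values, the prompt template, the `build_prompts` helper, and the choice to inject the adjective and attribute as extra product features are all assumptions made for this example.

```python
from itertools import product

# Hypothetical placeholder values; the paper's actual lists are not reproduced here.
FAIRNESS_ATTRIBUTES = ["attribute_a", "attribute_b"]
GENDERED_ADJECTIVES = ["adjective_x", "adjective_y"]
PRODUCT_CATEGORIES = {
    "category_1": ["feature one", "feature two"],
    "category_2": ["feature three"],
}

# Assumed prompt wording, chosen only for illustration.
TEMPLATE = "Write a plain-text product description for a product with these features: {features}"


def build_prompts(attributes, adjectives, categories):
    """Cross every attribute, adjective, and category into one labeled prompt."""
    records = []
    for attribute, adjective, (category, features) in product(
        attributes, adjectives, categories.items()
    ):
        # Assumption: the adjective and attribute are appended as extra features.
        feature_text = "; ".join(features + [adjective, attribute])
        records.append(
            {
                "prompt": TEMPLATE.format(features=feature_text),
                # Labels let results be sliced by each parameter later.
                "fairness_attribute": attribute,
                "gendered_adjective": adjective,
                "product_category": category,
            }
        )
    return records


PROMPTS = build_prompts(FAIRNESS_ATTRIBUTES, GENDERED_ADJECTIVES, PRODUCT_CATEGORIES)
```

Because each record keeps its labels, generated descriptions can be grouped by attribute, adjective, or category and the groups compared on quality, veracity, safety, or fairness metrics.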
