STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
Alessio Sordo, Lingxiao Du, Meeka-Hanna Lenisa, Evgeny Bogdanov, Maxim Romanovsky

TL;DR
STELLAR-E is an automated system that generates synthetic datasets for evaluating large language models across multiple domains and languages, reducing reliance on manual data collection.
Contribution
It introduces a fully automated, controllable synthetic data generation framework and an evaluation pipeline that improves scalability and domain adaptability for LLM assessment.
Findings
Synthetic datasets showed a +5.7% difference in LLM-as-a-judge scores compared with existing benchmarks.
The system enables high-quality, scalable, and domain-specific dataset creation with minimal human input.
It provides a faster, automated alternative to manual dataset collection for LLM evaluation.
Abstract
The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, a single-domain focus, and a lack of multilingual support. We present STELLAR-E, a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets. The system is structured in two stages: (1) a synthetic data engine, built by modifying the TGRT Self-Instruct framework, that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and…
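To make the first stage more concrete, here is a minimal sketch of a generic Self-Instruct-style bootstrapping loop: start from a few seed prompts, repeatedly ask a model for a new prompt conditioned on sampled examples, and keep only candidates that are not near-duplicates of what is already in the pool. This is not the authors' implementation, and the excerpt does not say how STELLAR-E modifies the framework. Every name here (`generate_dataset`, `is_novel`, `stub_generate`), the 0.7 similarity threshold, and the use of `difflib` as the similarity measure are assumptions made for illustration; a real system would call an LLM where the stub is.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List


def is_novel(candidate: str, pool: List[str], threshold: float = 0.7) -> bool:
    """Return True if the candidate is not too similar to any pooled item.

    Assumption: difflib's ratio stands in for whatever similarity
    filter a real pipeline would use.
    """
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < threshold
        for existing in pool
    )


def generate_dataset(
    seeds: List[str],
    generate: Callable[[List[str]], str],
    target_size: int,
    examples_per_prompt: int = 2,
    max_attempts: int = 1000,
    rng_seed: int = 0,
) -> List[str]:
    """Grow a pool of prompts from seeds until it reaches target_size.

    `generate` is a hypothetical callable that takes a few example
    prompts and returns one new candidate (an LLM call in practice).
    """
    rng = random.Random(rng_seed)
    pool = list(seeds)
    for _ in range(max_attempts):
        if len(pool) >= target_size:
            break
        # Sample in-context demonstrations from the current pool.
        demos = rng.sample(pool, min(examples_per_prompt, len(pool)))
        candidate = generate(demos).strip()
        # Keep only non-empty candidates that pass the similarity filter.
        if candidate and is_novel(candidate, pool):
            pool.append(candidate)
    return pool


if __name__ == "__main__":
    # Stub generator standing in for an LLM; it cycles through canned
    # outputs, one of which is an exact repeat.
    canned = iter([
        "How do I reset my online banking password?",
        "Which fees apply to international wire transfers?",
        "How do I reset my online banking password?",
        "Can I open a joint account with a minor?",
    ])

    def stub_generate(demos: List[str]) -> str:
        return next(canned, "")

    seeds = ["What is the current interest rate on a fixed mortgage?"]
    for prompt in generate_dataset(seeds, stub_generate, target_size=4, max_attempts=4):
        print(prompt)
```

The `target_size` argument corresponds loosely to the "custom size" control mentioned in the abstract; domain and language controls would presumably enter through the prompt passed to the generator, which this sketch leaves out.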
