SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
Xiaoyuan Li, Yuzhe Wang, Moxin Li, Keqin Bao, Rui Men, Yichang Zhang, Dayiheng Liu, Wenjie Wang, Fuli Feng

TL;DR
SAGE is a framework that uses fine-tuned smaller models to automatically generate and verify robustness variants for LLM knowledge benchmarks, making benchmark augmentation cheaper and easier to scale.
Contribution
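It introduces VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a fine-tuned variant generator. Together they enable scalable, automated robustness augmentation of benchmarks with quality comparable to human annotations.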
Findings
SAGE constructs large-scale robustness benchmarks at lower cost than existing LLM-assisted generate-then-verify pipelines.
Models trained with SAGE generalize to MMLU without specific fine-tuning.
SAGE achieves quality comparable to human-annotated benchmarks.
Abstract
Large Language Models (LLMs) achieve strong performance on standard knowledge evaluation benchmarks, yet recent work shows that their knowledge capabilities remain brittle under question variants that test the same knowledge in different forms. Robustness augmentation of existing knowledge evaluation benchmarks is therefore necessary, but current LLM-assisted generate-then-verify pipelines are costly and difficult to scale due to low-yield variant generation and unreliable variant verification. We propose SAGE (Scalable Automated Generation of Robustness BEnchmarks), a framework for scalable robustness augmentation of knowledge evaluation benchmarks using fine-tuned smaller models. SAGE consists of VariantQual, a rubric-based verifier trained on human-labeled seed data, and VariantGen, a variant generator initialized with supervised fine-tuning and further optimized with reinforcement…
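The generate-then-verify loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `augment`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy functions are simple string-based stand-ins for the fine-tuned VariantGen and VariantQual models. The sketch only shows how a generator proposes variants, a verifier filters them, and the yield (fraction of proposals accepted) is measured.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Variant:
    seed_question: str  # original benchmark question
    text: str           # accepted variant testing the same knowledge


def augment(
    seeds: List[str],
    generate: Callable[[str, int], List[str]],
    verify: Callable[[str, str], bool],
    samples_per_seed: int = 4,
) -> Tuple[List[Variant], float]:
    """Generate candidate variants per seed and keep those the verifier accepts."""
    accepted: List[Variant] = []
    proposed = 0
    for seed in seeds:
        for text in generate(seed, samples_per_seed):
            proposed += 1
            if verify(seed, text):
                accepted.append(Variant(seed, text))
    # Yield: fraction of generated candidates that pass verification.
    yield_rate = len(accepted) / proposed if proposed else 0.0
    return accepted, yield_rate


# Hypothetical stand-in for a generator model (assumption, not VariantGen itself).
def toy_generate(seed: str, n: int) -> List[str]:
    templates = ["Rephrased: {q}", "In other words, {q}", "", "Unrelated question?"]
    return [templates[i % len(templates)].format(q=seed) for i in range(n)]


# Hypothetical stand-in for a rubric-based verifier (assumption, not VariantQual):
# the variant must be non-empty, differ from the seed, and preserve its content.
def toy_verify(seed: str, variant: str) -> bool:
    return bool(variant) and variant != seed and seed in variant


if __name__ == "__main__":
    variants, rate = augment(["What is the capital of France?"], toy_generate, toy_verify)
    for v in variants:
        print(v.text)
    print(f"yield: {rate:.2f}")
```

In a real pipeline both callables would wrap model inference. Keeping them as plain function arguments makes it easy to swap a large LLM for a smaller fine-tuned model and compare yield and cost.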
