Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation

Vincent Koc

arXiv:2505.12058·cs.AI·May 20, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

Tiny QA Benchmark++ offers a fast, multilingual, synthetic dataset generator and smoke-test suite for continuous LLM evaluation, enabling quick safety checks and prompt optimization with minimal cost and latency.

Contribution

It introduces an ultra-lightweight, multilingual smoke-test suite and synthetic data generator that integrates seamlessly into LLM pipelines for rapid, resource-efficient quality assurance.

Findings

01. Runs in seconds with minimal cost

02. Effectively flags prompt-template errors and tokenizer drift

03. Supports multiple languages and easy integration

Abstract

Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test-style safety-net dataset that runs in seconds with minimal cost. Born out of the tight feedback-loop demands of building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow, TQB++ couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator PyPI package built on provider-agnostic LiteLLM. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so teams can drop…
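To make the "unit-test-style" idea concrete, here is a minimal sketch of how a tiny QA pack might gate a pipeline. It is not the TQB++ package's own API: the three-item pack, the field names `question` and `answer`, the `stub_model` function, and the 0.9 threshold are all assumptions chosen for illustration, and exact-match scoring after light normalization is one plausible scoring choice rather than the one confirmed by the paper.

```python
import re
import string
from typing import Callable, Dict, List

# Hypothetical tiny pack. The field names "question" and "answer" are
# assumptions for this sketch, not the confirmed TQB++ schema.
TINY_PACK: List[Dict[str, str]] = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
    {"question": "What colour do you get by mixing blue and yellow?", "answer": "green"},
]


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(pack: List[Dict[str, str]], ask: Callable[[str], str]) -> float:
    """Fraction of items whose normalized model answer equals the gold answer."""
    hits = sum(
        normalize(ask(item["question"])) == normalize(item["answer"])
        for item in pack
    )
    return hits / len(pack)


def smoke_test(pack: List[Dict[str, str]], ask: Callable[[str], str], threshold: float = 0.9) -> bool:
    """Return True when accuracy meets the (assumed) pass threshold."""
    return exact_match_accuracy(pack, ask) >= threshold


def stub_model(question: str) -> str:
    """Stand-in for a real LLM call; returns canned answers with noisy formatting."""
    canned = {
        "What is the capital of France?": "Paris.",
        "How many days are in a week?": " 7 ",
        "What colour do you get by mixing blue and yellow?": "Green",
    }
    return canned.get(question, "")


def broken_model(question: str) -> str:
    """Simulates a broken prompt template: the model never sees the question."""
    return ""
```

In a CI job, `smoke_test(TINY_PACK, stub_model)` would be wrapped in an assertion so that a sudden drop in accuracy, such as the one `broken_model` produces, fails the build within seconds. The ASCII-only punctuation stripping would likely need adapting for the non-English packs.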

Code & Models

Repositories

vincentkoc/tiny_qa_benchmark_pp
Official

Datasets

vincentkoc/tiny_qa_benchmark_pp
dataset · 290 dl

Taxonomy

Topics: Natural Language Processing Techniques · Adversarial Robustness in Machine Learning · Security and Verification in Computing

Methods: Sparse Evolutionary Training