BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task
Tosin Adewumi, Martin Karlsson, Lama Alkhaled, Marcus Liwicki

TL;DR
This paper introduces BatteryPass-12K, the first synthetic dataset for digital battery passport conformance classification, evaluates 22 language models on the task, and analyzes their performance and vulnerabilities.
Contribution
It presents the first public benchmark dataset for DBP conformance, evaluates diverse language models, and provides insights into model performance and robustness in this domain.
Findings
Thinking models perform best: GPT-5.4 reaches an average F1 of 0.98 on the validation set and 0.71 on the test set.
Few-shot examples significantly improve model performance.
Prompt-injection attacks reduce model accuracy.
Abstract
We introduce the novel task of digital battery passport (DBP) conformance classification and the first public benchmark for it: BatteryPass-12K, created synthetically from real pilot samples. The work is motivated by the EU's battery regulation on DBPs, which comes into effect soon, and by the absence of any public dataset. We evaluated 22 language models (LMs) in zero-shot inference, spanning small LMs (SLMs), mixture-of-experts models (MoEs), and dense LLMs. Further analysis, together with additional evaluations of few-shot inference and prompt-injection attacks, shows that (1) thinking models perform best, with GPT-5.4 scoring an average F1 of 0.98 (0.03) on the validation set and 0.71 (0.22) on the test set (95% confidence interval in parentheses), (2) few-shot examples improve performance significantly, (3) generally capable frontier models find the task challenging, (4) merely scaling model…
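The abstract reports scores as an average F1 followed by a 95% confidence interval in parentheses. How the authors compute that interval is not stated in this excerpt, so the following is only a minimal sketch of one common way to produce such a pair: binary F1 per run, then the mean across runs with a normal-approximation half-width. The function names, the label encoding (1 = conformant), and the use of 1.96 as the critical value are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (assumed here: 1 = conformant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # With no positive labels and no positive predictions, return 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


def mean_with_ci95(scores):
    """Mean of per-run scores and the half-width of a 95% interval.

    Assumption: normal approximation (z = 1.96) over independent runs;
    the paper may use a different procedure.
    """
    m = mean(scores)
    if len(scores) < 2:
        return m, 0.0
    half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return m, half_width


# Hypothetical labels for one run: tp=2, fp=1, fn=1, so F1 = 4 / 6.
run_f1 = f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])

# Hypothetical per-run F1 values, reported in the "mean (half-width)" style.
avg, half = mean_with_ci95([0.6, 0.8])
print(f"{avg:.2f} ({half:.2f})")
```

Under this reading, a wide interval such as the 0.22 reported on the test set would indicate large variation between runs, while the 0.03 on the validation set would indicate stable results.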
