DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant
Lev Sorokin, Ivan Vasilev, Samuele Pasini

TL;DR
This paper reports on the first edition of an LLM testing competition, in which four tools were benchmarked on their ability to find user inputs for which an LLM-based car manual assistant fails to mention warnings contained in the manual.
Contribution
It introduces a benchmarking competition for testing LLM-based automotive assistants and compares the competing tools on how effectively they detect failures.
Findings
Tools varied in effectiveness at exposing failures.
Diversity of failure-revealing tests was a key evaluation metric.
The competition provided insights into LLM robustness in automotive contexts.
Abstract
This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.
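The two evaluation criteria named in the abstract (effectiveness in exposing failures, and diversity of the failure-revealing tests) can be illustrated with a minimal sketch. This is not the competition's actual scoring code: the substring-based oracle, the failure-rate measure, and the Jaccard-based diversity measure are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
from itertools import combinations


def is_failure(response: str, required_warnings: list[str]) -> bool:
    """Hypothetical oracle: a test fails if the response omits any
    warning that the manual attaches to the topic (substring match)."""
    text = response.lower()
    return any(w.lower() not in text for w in required_warnings)


def failure_rate(outcomes: list[bool]) -> float:
    """Assumed effectiveness measure: fraction of executed tests that failed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance between two inputs (0 = same tokens)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def diversity(inputs: list[str]) -> float:
    """Assumed diversity measure: mean pairwise Jaccard distance
    over the failure-revealing inputs."""
    pairs = list(combinations(inputs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Under these assumptions, a tool that finds many failures through near-identical inputs would score high on failure rate but low on diversity, which is why reporting both measures is informative.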
