CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend

Haoming Meng

arXiv:2604.23455·cs.SE·May 4, 2026

TL;DR

CUJBench is a benchmark for cross-modal failure diagnosis that combines browser-visible symptoms with backend observability signals. Its results show that current models struggle to synthesize evidence across modalities.

Contribution

This work introduces CUJBench, to the author's knowledge the first benchmark for cross-modal failure diagnosis, together with an LLM-assisted annotation pipeline and an evaluation of six frontier models.

Findings

01 Models achieve only 19.7% accuracy, far below the 52% ceiling.

02 Browser-only agents outperform full-toolset agents in aggregate.

03 Cross-modal synthesis is identified as the main bottleneck in failure diagnosis.

Abstract

Automated failure diagnosis requires correlating browser-visible symptoms with backend observability signals, yet existing benchmarks do not evaluate this cross-modal reasoning task. Constructing one is non-trivial: multi-modal failure scenarios are costly to annotate, and live-environment capture introduces stochasticity that makes cross-run agent comparison unreliable. We present CUJBench, to our knowledge, the first benchmark to combine browser-visible failure evidence with backend observability in a diagnostic framing. CUJBench addresses annotation cost through an LLM-assisted generation pipeline with a multi-agent review loop and a three-layer annotation scheme, producing 87 labeled scenarios across five fault families, and ensures reproducibility by packaging each failure as a deterministic multi-modal snapshot with a fixed tool interface. Evaluating six frontier models under…
