TL;DR
ObjexMT introduces a benchmark for evaluating large language models' ability to extract objectives and calibrate confidence in multi-turn conversations, revealing significant challenges and variability across models and datasets.
Contribution
This work presents ObjexMT, a novel benchmark for assessing LLMs' objective extraction and metacognitive calibration in complex multi-turn dialogues, highlighting current limitations.
Findings
Kimi-k2 achieves the highest objective-extraction accuracy (0.612).
Claude-sonnet-4 yields the best calibration and selective risk.
Performance varies significantly across datasets, with accuracy ranging from 16% to 82%.
Abstract
LLM-as-a-Judge (LLMaaJ) enables scalable evaluation, yet we lack a decisive test of a judge's qualification: can it recover the hidden objective of a conversation and know when that inference is reliable? Large language models degrade with irrelevant or lengthy context, and multi-turn jailbreaks can scatter goals across turns. We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by semantic similarity to gold objectives, then thresholded once on 300 calibration items (τ* = 0.66; F1 = 0.891). Metacognition is assessed with expected calibration error, Brier score, Wrong@High-Confidence (0.80 / 0.90 / 0.95), and risk-coverage curves. Across six models (gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2,…
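The metacognition metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 10-bin equal-width ECE, the `>=` comparisons, and the choice to report Wrong@High-Confidence as a raw count are all assumptions made for illustration. The default threshold of 0.66 is the value quoted in the peer reviews.

```python
from typing import List, Sequence, Tuple


def label_correct(similarities: Sequence[float], tau: float = 0.66) -> List[int]:
    """Binarize similarity scores: 1 if the score reaches tau, else 0 (assumed rule)."""
    return [1 if s >= tau else 0 for s in similarities]


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width confidence bins (the bin count is an assumption)."""
    n = len(conf)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def wrong_at_high_confidence(
    conf: Sequence[float], correct: Sequence[int], threshold: float
) -> int:
    """Count wrong answers stated with confidence >= threshold (count form assumed)."""
    return sum(1 for c, y in zip(conf, correct) if c >= threshold and y == 0)


def risk_coverage_curve(
    conf: Sequence[float], correct: Sequence[int]
) -> List[Tuple[float, float]]:
    """(coverage, risk) points when answering only the k most confident items."""
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    points: List[Tuple[float, float]] = []
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += 1 - correct[i]
        points.append((k / len(conf), errors / k))
    return points


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    sims = [0.9, 0.5, 0.7, 0.3]
    conf = [0.95, 0.92, 0.65, 0.25]
    correct = label_correct(sims)
    print("Brier:", brier_score(conf, correct))
    print("ECE:", expected_calibration_error(conf, correct))
    for t in (0.80, 0.90, 0.95):
        print(f"Wrong@{t:.2f}:", wrong_at_high_confidence(conf, correct, t))
    print("Risk-coverage:", risk_coverage_curve(conf, correct))
```

A judge that is accurate but overconfident would show a low error rate at full coverage yet a nonzero Wrong@High-Confidence count, which is why these metrics are reported alongside accuracy.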
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper examines the important topic of using LLMs as judges in safety evaluations.
- ObjexMT addresses a gap in existing jailbreaking benchmarks by extracting a single sentence describing the conversation's objective along with a confidence score.
- Some measures used for the confidence score are intuitive, such as Wrong@High-Confidence.
- The data sets have different levels of obfuscation.
- The semantic similarity portion was not fully discussed. Do the experts label semantic similarity and gold objectives independently?
- The wide range of accuracy values was not discussed in detail. What are possible explanations? Would ablation studies help clarify this issue?
- Since the paper does not present any theoretical results, I am not sure how to interpret "claude-sonnet-4 yields the best selective risk and calibration," except that in these experiments it was the "winner." What ma
1. Addresses a critical and timely problem. As LLM-as-a-Judge systems become increasingly deployed in production environments, evaluating their reliability for latent objective inference is essential for AI safety applications.
2. Novel dual-evaluation paradigm. Jointly measuring extraction accuracy and confidence calibration is methodologically innovative and practically valuable. The framework recognizes that opaque judges must signal their own trustworthiness.
3. Comprehensive calibration a
1. Threshold optimization lacks proper validation. Selecting τ*=0.66 from 101 candidates on the same 300 samples used to report F1=0.891 constitutes a multiple-comparison problem without correction. This risks overfitting; the true generalization performance on held-out data is unknown. The paper needs train-validation splits or cross-validation.
2. Core premise lacks direct empirical validation. The paper claims harmfulness detection differs from intent extraction but only cites prior work wit
1. ObjexMT tackles a critical challenge in AI safety by formalizing latent objective extraction and confidence calibration.
2. It evaluates six widely used LLMs across diverse datasets, offering valuable insights into model performance and calibration across varying conditions.
1. Although the paper highlights high-confidence errors, it lacks a detailed taxonomy of failure cases.
2. The writing is difficult to follow, which may hinder understanding.
3. The experiments explore limited methods of self-reporting confidence, making the results less convincing.
4. The evaluation is restricted to six large commercial LLMs, excluding smaller open-source models and safety-tuned variants, which limits the generalizability of the findings.
5. The benchmark's single-sentence
