ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Hyunjun Kim; Junwoo Ha; Sangyoon Yu; Haon Park

arXiv:2508.16889·cs.CL·October 10, 2025

TL;DR

ObjexMT introduces a benchmark for evaluating large language models' ability to extract objectives and calibrate confidence in multi-turn conversations, revealing significant challenges and variability across models and datasets.

Contribution

This work presents ObjexMT, a novel benchmark for assessing LLMs' objective extraction and metacognitive calibration in complex multi-turn dialogues, highlighting current limitations.

Findings

01

Kimi-k2 achieves the highest objective-extraction accuracy (0.612).

02

Claude-sonnet-4 offers the best calibration metrics.

03

Performance varies significantly across datasets, with accuracy ranging from 16% to 82%.

Abstract

LLM-as-a-Judge (LLMaaJ) enables scalable evaluation, yet we lack a decisive test of a judge's qualification: can it recover the hidden objective of a conversation and know when that inference is reliable? Large language models degrade with irrelevant or lengthy context, and multi-turn jailbreaks can scatter goals across turns. We present ObjexMT, a benchmark for objective extraction and metacognition. Given a multi-turn transcript, a model must output a one-sentence base objective and a self-reported confidence. Accuracy is scored by semantic similarity to gold objectives, then thresholded once on 300 calibration items ($\tau^{\star} = 0.66$; $F_{1}@\tau^{\star} = 0.891$). Metacognition is assessed with expected calibration error, Brier score, Wrong@High-Confidence (0.80 / 0.90 / 0.95), and risk-coverage curves. Across six models (gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2,…
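The scoring pipeline named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: all function names are hypothetical, the equal-width 10-bin ECE is an assumed convention, and Wrong@High-Confidence is assumed here to mean the share of wrong answers among items whose confidence is at or above the given level (the paper may define it differently). Only the threshold value 0.66 is taken from the abstract.

```python
from typing import List, Sequence, Tuple


def is_correct(similarity: float, tau: float = 0.66) -> bool:
    """Binarize a semantic-similarity score; 0.66 is the threshold from the abstract."""
    return similarity >= tau


def brier_score(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - float(y)) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE (assumed convention): weighted mean of |accuracy - confidence|."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += len(b) / total * abs(accuracy - avg_conf)
    return ece


def wrong_at_high_confidence(
    conf: Sequence[float], correct: Sequence[bool], level: float
) -> float:
    """Assumed definition: fraction of wrong items among those with confidence >= level."""
    high = [y for c, y in zip(conf, correct) if c >= level]
    if not high:
        return 0.0
    return sum(1 for y in high if not y) / len(high)


def risk_coverage(
    conf: Sequence[float], correct: Sequence[bool]
) -> List[Tuple[float, float]]:
    """(coverage, risk) points obtained by keeping the k most confident items."""
    ranked = sorted(zip(conf, correct), key=lambda p: p[0], reverse=True)
    points, errors = [], 0
    for k, (_, y) in enumerate(ranked, start=1):
        errors += 0 if y else 1
        points.append((k / len(ranked), errors / k))
    return points


# Toy data (made up for illustration, not from the benchmark).
similarities = [0.9, 0.7, 0.5, 0.3]
confidences = [0.95, 0.85, 0.9, 0.4]
labels = [is_correct(s) for s in similarities]
print(brier_score(confidences, labels))
print(expected_calibration_error(confidences, labels))
print([wrong_at_high_confidence(confidences, labels, t) for t in (0.80, 0.90, 0.95)])
print(risk_coverage(confidences, labels))
```

In this toy run the third item is wrong despite a stated confidence of 0.9, which is the kind of high-confidence error that the Wrong@High-Confidence metric is meant to surface.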

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

- The paper examines the important topic of using LLMs as judges in safety evaluations.
- ObjexMT addresses a gap in existing jailbreaking benchmarks by extracting a single sentence describing the conversation's objective along with a confidence score.
- Some measures used for the confidence score are intuitive, such as Wrong@High-Confidence.
- The datasets have different levels of obfuscation.

Weaknesses

- The semantic similarity portion was not fully discussed. Do the experts label semantic similarity and gold objectives independently?
- The wide range of accuracy values was not discussed in detail. What are possible explanations? Would ablation studies help clarify this issue?
- Since the paper does not present any theoretical results, I am not sure how to interpret "claude-sonnet-4 yields the best selective risk and calibration," except that in these experiments it was the "winner." What ma

Reviewer 02·Rating 4·Confidence 4

Strengths

1. Addresses a critical and timely problem. As LLM-as-a-Judge systems become increasingly deployed in production environments, evaluating their reliability for latent objective inference is essential for AI safety applications.
2. Novel dual-evaluation paradigm. Jointly measuring extraction accuracy and confidence calibration is methodologically innovative and practically valuable. The framework recognizes that opaque judges must signal their own trustworthiness.
3. Comprehensive calibration a

Weaknesses

1. Threshold optimization lacks proper validation. Selecting τ*=0.66 from 101 candidates on the same 300 samples used to report F1=0.891 constitutes a multiple-comparison problem without correction. This risks overfitting; the true generalization performance on held-out data is unknown. The paper needs train-validation splits or cross-validation.
2. Core premise lacks direct empirical validation. The paper claims harmfulness detection differs from intent extraction but only cites prior work wit
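The held-out validation that Reviewer 02 asks for in the first weakness could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's procedure: the 101 candidates are assumed to be the grid 0.00, 0.01, ..., 1.00, the function names are hypothetical, the fold assignment is a simple interleaved split, and the scores and labels are toy values rather than the 300 calibration items.

```python
from typing import List, Sequence, Tuple


def f1_at(scores: Sequence[float], labels: Sequence[bool], tau: float) -> float:
    """F1 of the rule 'predict match when score >= tau' against binary labels."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(
    scores: Sequence[float], labels: Sequence[bool], candidates: Sequence[float]
) -> float:
    """Candidate with the highest F1; ties resolve to the earliest candidate."""
    return max(candidates, key=lambda t: f1_at(scores, labels, t))


def cross_validated_f1(
    scores: Sequence[float], labels: Sequence[bool], k: int = 5
) -> List[Tuple[float, float]]:
    """For each fold: pick tau on the other folds, then report F1 on the held-out fold."""
    candidates = [i / 100 for i in range(101)]  # assumed grid of 101 thresholds
    n = len(scores)
    results = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # interleaved split, for illustration only
        train_s = [s for i, s in enumerate(scores) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        test_s = [s for i, s in enumerate(scores) if i in held_out]
        test_y = [y for i, y in enumerate(labels) if i in held_out]
        tau = best_threshold(train_s, train_y, candidates)
        results.append((tau, f1_at(test_s, test_y, tau)))
    return results


# Toy similarity scores and human match labels (made up, not benchmark data).
toy_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
toy_labels = [s > 0.5 for s in toy_scores]
for tau, held_out_f1 in cross_validated_f1(toy_scores, toy_labels, k=5):
    print(f"tau={tau:.2f}  held-out F1={held_out_f1:.3f}")
```

Comparing the per-fold held-out F1 values with the F1 obtained when the threshold is tuned and scored on the same items gives a rough sense of how optimistic the in-sample figure may be.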

Reviewer 03·Rating 4·Confidence 3

Strengths

1. ObjexMT tackles a critical challenge in AI safety by formalizing latent objective extraction and confidence calibration.
2. It evaluates six widely used LLMs across diverse datasets, offering valuable insights into model performance and calibration across varying conditions.

Weaknesses

1. Although the paper highlights high-confidence errors, it lacks a detailed taxonomy of failure cases.
2. The writing is difficult to follow, which may hinder understanding.
3. The experiments explore limited methods of self-reporting confidence, making the results less convincing.
4. The evaluation is restricted to six large commercial LLMs, excluding smaller open-source models and safety-tuned variants, which limits the generalizability of the findings.
5. The benchmark's single-sentence
