How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
Hamidreza Saghir

TL;DR
This paper introduces a two-pathway framework for understanding how language models detect out-of-distribution inputs. It distinguishes content-based signals from processing-based signals and shows that existing confidence scores are confounded by sequence length.
Contribution
It proposes a novel two-pathway framework that separates embedding content from processing trajectory, improving OOD detection in LLMs while controlling for sequence-length confounds.
Findings
Embedding methods excel on vocabulary-distinctive OOD tasks.
Trajectory features effectively detect covert-intent inputs.
Attention circuits are more engaged in adversarial tasks.
Abstract
Recent white-box OOD detection methods for LLMs -- including CED, RAUQ, and WildGuard confidence scores -- appear effective, but we show they are structurally confounded by sequence length (|r| >= 0.61) and collapse to near-chance under length-matched evaluation. Even raw attention entropy (mean H(alpha) across heads and layers), a natural baseline we include for completeness, shows the same confound. The confound stems from attention's Theta(log T) dependence on input length. To identify genuine OOD signals after deconfounding, we propose a two-pathway framework: embeddings capture what text is about (effective for topic shifts), while the processing trajectory -- hidden-state evolution across layers -- captures how the model processes input. The relative power of each pathway varies along a vocabulary-transparency spectrum: embedding methods excel on vocabulary-distinctive OOD, while…
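The length confound described in the abstract can be illustrated with a small toy simulation. The sketch below is not the paper's code or data: it assumes random Gaussian attention logits and a hypothetical detector score that depends only on input length, and the helper names (`attention_entropy`, `auroc`, `length_matched`) are made up for illustration. It shows two things under those assumptions: the entropy of a softmax attention row tracks log T, and a purely length-driven score looks like a strong OOD detector until the in-distribution and OOD sets are matched on length.

```python
import numpy as np

def attention_entropy(logits):
    """Shannon entropy (nats) of softmax(logits) along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def auroc(pos, neg):
    """P(random positive score > random negative score); ties count half."""
    pos = np.asarray(pos, dtype=float)[:, None]
    neg = np.asarray(neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def length_matched(len_a, len_b, rng):
    """Return index arrays into a and b whose length histograms are identical."""
    idx_a, idx_b = [], []
    for T in np.intersect1d(len_a, len_b):
        a = np.flatnonzero(len_a == T)
        b = np.flatnonzero(len_b == T)
        k = min(len(a), len(b))  # keep the same count per length in both sets
        idx_a.extend(rng.choice(a, size=k, replace=False))
        idx_b.extend(rng.choice(b, size=k, replace=False))
    return np.array(idx_a), np.array(idx_b)

rng = np.random.default_rng(0)

# Part 1: mean entropy of random attention rows as a function of length T.
lengths = np.array([8, 16, 32, 64, 128, 256, 512])
entropies = np.array(
    [attention_entropy(rng.normal(size=(256, T))).mean() for T in lengths]
)
r_log_length = np.corrcoef(np.log(lengths), entropies)[0, 1]

# Part 2: a hypothetical score that carries length information only.
# Assumption: OOD inputs happen to be longer than in-distribution inputs.
id_len = rng.integers(10, 61, size=2000)
ood_len = rng.integers(40, 121, size=2000)
id_score = np.log(id_len) + 0.1 * rng.normal(size=id_len.shape)
ood_score = np.log(ood_len) + 0.1 * rng.normal(size=ood_len.shape)

raw_auroc = auroc(ood_score, id_score)
i, j = length_matched(id_len, ood_len, rng)
matched_auroc = auroc(ood_score[j], id_score[i])

print(f"corr(entropy, log T) = {r_log_length:.3f}")
print(f"raw AUROC = {raw_auroc:.3f}, length-matched AUROC = {matched_auroc:.3f}")
```

Because the toy score contains no OOD information beyond length, the length-matched AUROC should sit close to chance, while the raw AUROC is driven entirely by the difference in length distributions. Exact-length matching is only one possible design; with real data, binning lengths or regressing length out of the score are alternatives.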
