How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles

Chenchen Kuai; Jiwan Jiang; Zihao Zhu; Hao Wang; Keshu Wu; Zihao Li; Yunlong Zhang; Chenxi Liu; Zhengzhong Tu; Zhiwen Fan; Yang Zhou

arXiv:2604.07650·cs.AI·April 10, 2026

TL;DR

This paper introduces a statistical framework to audit behavioral dependencies among large language models, revealing widespread entanglement that impacts ensemble verification and proposing a reweighting method to mitigate bias.

Contribution

The paper develops a multi-resolution statistical approach to quantify and analyze behavioral entanglement in black-box LLMs, together with practical reweighting techniques to improve ensemble reliability.

Findings

01 Widespread behavioral entanglement identified among 18 LLMs.

02 CIG metric correlates significantly with decreased judge precision.

03 Reweighting based on independence inference improves verification accuracy by up to 4.5%.

Abstract

The rapid growth of the large language model (LLM) ecosystem raises a critical question: are seemingly diverse models truly independent? Shared pretraining data, distillation, and alignment pipelines can induce hidden behavioral dependencies (latent entanglement) that undermine multi-model systems such as LLM-as-a-judge pipelines and ensemble verification, which implicitly assume independent signals. In practice, this entanglement manifests as correlated reasoning patterns and synchronized failures, where apparent agreement reflects shared error modes rather than independent validation. To address this, we develop a statistical framework for auditing behavioral entanglement among black-box LLMs. Our approach introduces a multi-resolution hierarchy that characterizes the joint failure manifold through two information-theoretic metrics: (i) a Difficulty-Weighted Behavioral Entanglement Index, which…
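To make the independence assumption concrete, the sketch below compares the exact accuracy of a majority vote over independent verifiers with a toy "shared error mode" mixture. It is an illustration only and is not the paper's metric or reweighting method: the function names, the mixture model, and the parameter `rho` (the probability that all verifiers return one shared verdict) are assumptions introduced here.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Exact accuracy of a majority vote over n independent verifiers.

    Each verifier is correct with probability p; n is assumed odd.
    """
    need = n // 2 + 1  # votes required for a correct majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def entangled_majority_accuracy(p: float, n: int, rho: float) -> float:
    """Toy mixture model (an assumption, not the paper's definition).

    With probability rho every verifier copies one shared verdict, so the
    ensemble is only as accurate as a single verifier; otherwise the
    verifiers vote independently.
    """
    return rho * p + (1 - rho) * majority_accuracy(p, n)


if __name__ == "__main__":
    p, n = 0.8, 3
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho:.1f}  accuracy={entangled_majority_accuracy(p, n, rho):.3f}")
```

Under this toy model, the benefit of voting shrinks linearly as `rho` grows, and at `rho = 1.0` three verifiers are no better than one. That is the sense in which agreement among entangled models need not count as independent validation.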
