Towards Effective Theory of LLMs: A Representation Learning Approach
Muhammed Ustaomeroglu, Guannan Qu

TL;DR
This paper introduces RET, a framework that learns high-level macrovariables from LLM hidden states, enabling better interpretability, prediction, and control of model behavior.
Contribution
RET provides a self-supervised method, based on a BYOL/JEPA-style objective, for extracting macrovariables from LLM hidden-state trajectories. These macrovariables aid interpretability and serve as causal handles for steering generations.
Findings
RET yields temporally consistent 'mental-state' trajectories.
Macrovariables capture high-level semantic structures.
RET supports early prediction of behavioral outcomes.
Abstract
We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental-state" trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET:…
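As a rough illustration of what a BYOL/JEPA-style objective over hidden-state trajectories can look like, the sketch below encodes each hidden state with an online encoder, predicts the target-encoder embedding of the next state, and moves the target encoder by an exponential moving average. The abstract does not specify the architecture or loss, so everything here is an assumption: the linear maps `W_online`, `W_pred`, `W_target`, the next-step pairing, the cosine-based loss, and the decay `tau` are all hypothetical and not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_style_loss(hidden, W_online, W_pred, W_target):
    """Mean BYOL-style loss over consecutive pairs of a (T, d) trajectory.

    Assumption: the macrovariable at step t is a linear map of the hidden state,
    and the predictor tries to match the target embedding at step t + 1.
    """
    z_online = hidden[:-1] @ W_online               # (T-1, k) online embeddings
    pred = l2_normalize(z_online @ W_pred)          # (T-1, k) predictions
    target = l2_normalize(hidden[1:] @ W_target)    # (T-1, k), treated as constant
    # 2 - 2 * cosine similarity, which lies in [0, 4]
    return float(np.mean(2.0 - 2.0 * np.sum(pred * target, axis=-1)))

def ema_update(W_target, W_online, tau=0.99):
    """Exponential-moving-average update of the target encoder."""
    return tau * W_target + (1.0 - tau) * W_online

# Toy data standing in for LLM hidden states (hypothetical sizes).
rng = np.random.default_rng(0)
T, d, k = 16, 32, 4
hidden = rng.normal(size=(T, d))
W_online = rng.normal(size=(d, k)) * 0.1
W_target = W_online.copy()
W_pred = np.eye(k)

loss = byol_style_loss(hidden, W_online, W_pred, W_target)
W_target_new = ema_update(W_target, W_online)
```

In a real training loop the online encoder and predictor would be updated by gradient descent on this loss, with no gradient flowing through the target branch; the sketch only shows the forward computation and the target update.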
