
TL;DR
This paper explores how regulators must contain models of the systems they control, using algorithmic information theory to formalize this principle.
Contribution
The paper introduces a new framework using algorithmic complexity to prove that effective regulators reduce output complexity and contain models of the world.
Findings
A regulator reduces algorithmic complexity of the system output compared to an unregulated baseline.
Larger complexity gaps favor world-regulator pairs with high mutual algorithmic information.
The framework applies to individual sequences and complements the Internal Model Principle.
Abstract
The regulator theorem states that, under certain conditions, any optimal controller must embody a model of the system it regulates, grounding the idea that controllers embed, explicitly or implicitly, internal models of the controlled. This principle underpins neuroscience and predictive brain theories like the Free-Energy Principle or Kolmogorov/Algorithmic Agent theory. However, the theorem is only proven in limited settings. Here, we treat the deterministic, closed, coupled world-regulator system (W,R) as a single self-delimiting program p via a constant-size wrapper that produces the world output string x fed to the regulator. We analyze regulation from the viewpoint of the algorithmic complexity of the output, K(x) (regulation as compression). We define R to be a good algorithmic regulator if it reduces the algorithmic complexity of the readout relative to a null (unregulated)…
Click any figure to enlarge with its caption.
Figure 1
Figure 2Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsComputability, Logic, AI Algorithms · Formal Methods in Verification · Embodied and Extended Cognition
1. Introduction
In the Kolmogorov Theory (KT) of consciousness, an algorithmic agent is a system that maintains (tele)homeostasis (persistence of self or kind) by learning and running succinct generative models of its world coupled to an objective function and a action planner [1,2,3]. Closely related, Active Inference (AIF) models biological agents as minimizing variational free energy under a generative model [4,5]. These frameworks suggest that “agents with world-modeling engines, objective functions, and planners” are natural minimal models of homeostasis (goal-conditioned setpoint control). But for the kinds of homeostatic systems we actually encounter in nature (cells, organisms, engineered servos), how can we tell—operationally—whether they are algorithmic agents in this sense?
Our results are explicitly distribution-free: we do not assume probability densities, and the relevant “model content” can be inferred from a single realized world–regulator instance via algorithmic mutual information. Closest in spirit are algorithmic statistics and MDL (two-part codes/individual-object sufficient statistics), which formalize “model-from-one-object” without an ensemble [6,7,8,9]. In control, “single-trajectory” data-driven methods can identify LTI dynamics from one sufficiently rich trajectory, but only under strong structural assumptions (e.g., LTI, controllability, persistency of excitation) [10]. To our knowledge, there is no analogous distribution-free result that derives a general necessity claim of internal-model content for arbitrary regulators from single-episode data without such structural assumptions; highlighting this gap makes the present AIT results all the more informative.
The classical cybernetics statement that “every good regulator of a system must be a model of that system” originates with Conant and Ashby’s 1970 paper (the Good Regulator Theorem, GRT) [11]. While influential, the GRT has been criticized for the looseness of its definitions of “model” and “goodness”, and for a proof that does not clearly deliver the headline claim [12]. In modern control theory, the rigorous statement that fills a similar conceptual niche is the Internal Model Principle (IMP): under appropriate hypotheses, perfect regulation or disturbance rejection for a given signal class requires that the controller embed a dynamical copy of the signal generator [13,14,15]. The IMP is precise (and falsifiable) within its scope, and is now a standard backbone for robust control; see [16] for a contemporary review across control, bioengineering, and neuroscience. However, the classical IMP is a linear result: for finite-dimensional LTI plants ( linear, time-invariant meaning the system matrices do not change with time) and exogenous signals generated by a finite-dimensional LTI exosystem, robust asymptotic tracking/disturbance rejection requires that the controller embed a copy of the exosystem dynamics [14]. For nonlinear systems, the appropriate generalization is the nonlinear output-regulation framework: if the regulator equations admit smooth solutions and the plant’s zero dynamics on the regulated manifold are (locally) stable, together with suitable immersion/detectability assumptions, then one can construct dynamic output-feedback regulators that embed a (possibly adaptive) internal model and achieve local or semiglobal robust regulation [17,18,19]. However, absent these structural hypotheses, a complete nonlinear analogue of the IMP with the same necessity/robustness guarantees as in the LTI case is not generally available. Table 1 provides a comparison of the different regulator theorem statements, which can be compared with the one presented here.
In this paper, we recast the modeling requirement in a setting independent of linearity, probability, exact regulation or specific signal classes, by using algorithmic information theory (AIT). We model a world W and a regulator R as deterministic causal Turing machines that interact over interface tapes. We denote the world output by (over some temporal horizon of length N). Our main technical claim is that regulation in the algorithmic sense, i.e., simplicity, forces algorithmic dependence between W and R.
1.1. Definition of Model
A model in the present context is a program capable of compressing (or generating) data. Similarly, “the regulator contains a model of the world” is interpreted in an algorithmic-information sense: the regulator R carries nontrivial information about W, quantified by positive mutual algorithmic information (up to the standard slack). Equivalently, knowing R makes the shortest description of W strictly shorter, . This notion does not require R to embed a dynamical copy of W; rather, it formalizes “model content” as mutual algorithmic information.
We formalize this with the following definition:
Definition 1(Algorithmic “internal model”). Given a fixed horizon N (implicitly conditioned), we say that R contains an internal model of W in the algorithmic sense if (up to ), equivalently . The magnitude of quantifies the amount of computable structure in W that R carries.
Here, “computable structure” means reusable regularities that admit short descriptions (rules, symmetries, constraints, mechanism parameters). Saying that Rshares computable structure with W means that knowing R makes W cheaper to describe: , i.e., . This does not require that the regulator embed a dynamical copy of the exosystem (as in classical IMP); it only requires that the regulator carry algorithmic bits informative about the world’s generative mechanism (e.g., algorithmic elements, time constants, setpoints, disturbance classes, invariances).
The definition grounded on mutual algorithmic information M is further motivated by the following: (i) Machine invariance: M is invariant up to under changes of universal machine. (ii) Distribution-free: M is defined for individual objects (programs), not probabilistic models. (iii) Operational meaning: is precisely the codelength reduction in describing W when R is known, aligning with MDL/Occam reasoning via the Coding Theorem [9,20,21].
This is the appropriate lens for our contrastive results, and it complements the Internal Model Principle, where “model” means a dynamical replica of the Exosystem (a part of the World in our framework) under stated structural hypotheses [14,17,18,19]. Conceptually, our AIT result is complementary to the IMP: whereas the IMP states what structural content must be present in a controller to achieve perfect regulation for a given signal class [13,14,15], our results quantify how much algorithmic information the regulator must carry about the world whenever it succeeds in making the measured outcome compressible.
1.2. Regulation as Compression
We score regulation by how compressible a task-weighted error stream is. Let be the weighted error and the T-sample string. Fix a prefix-free lossless code (e.g., a universal compressor) and define the per-sample codelength . A regulator R is better on horizon T when it makes smaller than a null baseline ⌀, i.e., when the contrastive gap is positive. This choice is natural: for stationary ergodic data, normalized universal codelengths converge almost surely (i.e., with probability one) to the Shannon entropy rate , and (under standard computability assumptions) almost surely; thus the Kolmogorov-based criterion reduces to the Shannon criterion when those stochastic assumptions hold—while remaining meaningful outside them [9,22,23].
To see the connection between regulation and compression in more detail, let denote the Kolmogorov structure function [24]. Regulation amounts to moving down this curve: as the regulator invests model bits (larger ), more regularity in is captured and the residual randomness drops, approaching 0 at perfect regulation. The notion of robability emerges along this path. Replacing set-models by probabilistic models turns the two-part description into the standard MDL form
where the second term is the ideal codelength under (Shannon coding: ) [8,9,23]. If the regulator must hedge over multiple models, the mixture/Bayes code with prior uses and assigns
a valid prefix code whose regret relative to the best single model is bounded by ; with (Solomonoff/Occam), the penalty matches the model description length [7,8,9,20,25]. Thus, *probabilistic/Bayesian regulation is the coding-optimal way to descend *, aligning with the multi-model argument in [26].
Finally, we can treat the regulator input as an error signal quantized at fixed sensor resolution; the per-sample codelength of under a universal compressor converges to the entropy rate for stationary sources. For Gaussian processes,
where is the input power spectral density [22,27] of the error signal x, so attenuating in-band sensitivity (reducing where it matters) reduces codelength [27]. In the scalar white-Gaussian case with variance , , so smaller-amplitude fluctuations (smaller ) mean lower entropy and better compressibility. In short, “compressible error” matches the classical view: good regulation removes variability/uncertainty in the task band and accords with the IMP [14,28].
In the next sections, we first provide an overview of the AIT setting and of the results, followed by the analysis of the single episode scenario. The next section provides a formal definition of the algorithmic regulator and the corresponding theorem.
2. Setting
Unless stated otherwise, U is the standard three–tape universal prefix Turing machine: a read-only input tape holding a self-delimiting program p, a work tape (private scratch memory), and a write-only output tape. When we write we mean that, upon halting, the contents of the output tape equal x; the work tape is never part of the scored output. The domain of halting programs is prefix-free, so Kraft–McMillan applies and the universal a priori semimeasure is well defined. By the invariance theorem, replacing U by any other universal prefix machine (single- or multi-tape) changes all complexities only by an additive ; all Coding-Theorem statements we use depend only on prefix-freeness and therefore remain valid up to these constants (see, e.g., [9]). Our use of prefix-free (self-delimiting) programs is a standard coding convention in AIT: it ensures that program lengths satisfy Kraft’s inequality and lets the coding theorem link description length and universal probability cleanly. Any finite description can be made self-delimiting with only overhead for length n, so this is not a substantive restriction on real systems; it is a technical requirement of the description language (see Appendix A.3) [9].
The (prefix) Kolmogorov complexity of x is the length of its shortest description,
Intuitively, is the best achievable compressed size of x on U. If , then x has a short generative regularity; if , x is (algorithmically) random. By the invariance theorem, K is machine-independent up to an additive constant [9]. A fundamental limitation is that is not computable: no algorithm can output for all x [9,29]. However, algorithms for upper bounds of exist, as we discuss below.
Given auxiliary data y on a read-only auxiliary tape, the conditional complexity
is the shortest description of x given y. It operationalizes how much new information is needed to reconstruct x once y is known (e.g., “world given regulator,” or “output given model”).
The mutual algorithmic information (up to the usual slack) is
measures the algorithmically shared structure between x and y: how many bits we save when describing one with the help of the other. In our setting, “the regulator contains a model of the world” means (information-theoretic dependence); this is a necessary information-theoretic analogue of “model” containment that complements—rather than replaces—the classical IMP notion of a dynamical replica.
Intuitively, strings produced by shorter programs are more likely. Solomonoff–Levin’s universal a priori semimeasure and the Coding Theorem link probability and description length,
providing a universal Occam calculus over individual strings [9,20,21,25].
In what follows, a finite temporal horizon N is fixed throughout; unless stated otherwise, we implicitly condition on N (e.g., write for ). All constants depend only on the choice of U (and the fixed constant-overhead wrapper that decodes and simulates their coupling to print the readout), never on particular strings; see Appendix A.2.
The Coupled World-Regulator System
We work with 3-tape Turing machines W and R (see Figure 1 and Appendix A.2). We identify each machine with its minimal self–delimiting program ( , ) [9]. A horizon is fixed and all complexities are conditioned on N unless otherwise stated. W and R interact causally for N steps, producing a deterministic readout . The dynamical equations are
The performance of the regulator is evaluated from the complexity of the output, . Intuitively, a good regulator produces outputs of lower complexity than the unregulated case. Since is computable from ,
To disentangle the role of R from the coarse event “ is small,” we fix a null regulator ⌀ (where R’s output is set to zero). We compare the events
with . Event rules out worlds that produce a simple output without regulation; the intersection isolates R’s contribution.
For notational simplicity, the on–case and off–case readouts are also expressed as
For a fixed time horizon, we write for the full output produced by W when coupled to R.
In the next sections, we provide our main results regarding mutual information between the world and regulator, and implications for inferring agent-like behavior in the regulator.
3. Probabilistic Regulator Theorems
3.1. Posterior Form, Given the Observed x
Lemma 1(Program posterior given x). With prefix prior and deterministic likelihood ,
Consequently, by (5),
Proof. For any finite string x,
(recall Equation (1)). The Coding Theorem gives machine-dependent constants with
Now, briefly, Bayes’ rule yields ; apply (5). In more detail, place the prefix prior on programs p and use the deterministic likelihood . Then, the evidence is and the posterior is
Then, for any p with ,
The relation between and holds only up to an additive term in K, which becomes a multiplicative constant on . This ambiguity is unavoidable and depends on the choice of universal prefix machine U; absorb exactly this machine-dependent slack. □
Now, in our setting the world W and regulator R are programs that interact for N steps, producing the on-case readout . A fixed, constant-overhead wrapper decodes a shortest description of and simulates the coupling to print x (decode + simulate); if denotes this canonical code, then
for constants .
Now we can use the definition of mutual algorithmic information (up to the usual slack) to write
and derive our first result:
Theorem 1.
3.2. The Good Algorithmic Regulator and Posterior with Contrast
For our second result, we first define the Good Algorithmic Regulator (GAR).
Definition 2(Good Algorithmic Regulator, contrastive). Given the on/off complexities and gap
we say that R is a good algorithmic regulator of gap for W at horizon N if .
Lemma 2(OFF run lower-bounds the world). There exists such that
Proof. Given , the wrapper simulates the OFF dynamics and prints with overhead. □
With this definition we can now state and prove our main theorem.
Theorem 2(Probabilistic regulator theorem). Let and be observed and let . Then, there exists such that
Equivalently, every bit by which falls short of Δ costs a factor in posterior support.
Proof. (i) Posterior via wrapper. From Equation (6), (ii) Decompose . We use the exact mutual information , hence
(iii) Insert OFF bound (where b enters). By Lemma 2, , so
(iv) Exponentiate and absorb constants. Exponentiating and using gives for a constant C absorbing and the wrapper Coding-Theorem constants. □
Clarifications:
(i) Where does b appear? Only via Lemma 2, which says the OFF run lower-bounds . We never need to compute b explicitly. (ii) Why can we drop ? A slightly sharper bound is . Since , dropping keeps the focus on the two interpretable scalars M and without changing the exponential scaling. (iii) Architecture-agnostic. The proof only uses the computable wrapper . Whether R is open- or closed-loop does not affect the posterior algebra. (iv) The posterior on the left of Theorem 2 is conditioned on the on-case observation x only. The off-case run is used solely to supply a numeric lower bound , which implies by simulation. Formally, we phrase the result as a bound on , where is the side-event “ ”.
As a consequence of Theorem 2, one can bound individual posterior masses by . This implies an exponential tail: . In other words, is concentrated within of its maximum . I.e., there exists (machine/wrapper dependent only) such that for all integers ,
How to read (and use) Theorem 2:
- What we measure: compute the on/off complexities and (in practice: fixed MDL code lengths); their difference is the compressibility advantage.
- What the bound says: for any explanation of the observed x, the universal posterior weight is penalized as unless the pair shares structure: larger compensates the penalty.
- Practical rule of thumb: sustained large across tasks makes low exponentially unlikely. If off-case b is already small, will be small—choose a diagnostic readout so the null is not trivially simple.
3.3. Inferring the Objective Function and Planner (As-If Agent)
We next provide a simple theorem regarding the role of complexity as an objective function.
Theorem 3(On/Off evidence equals unconditioned complexity gap). Under the universal a priori semimeasure,
Equivalently, writing the on/off gap as we have Hence, on the realized pair , maximizing the likelihood of “ON over OFF” is equivalent (up to a constant factor) to minimizing or, equivalently, maximizing the gap Δ*.*
Proof. By the Coding Theorem there exist machine-dependent constants such that for any string z. Apply this to x and , take base-2 logs, and subtract:
so □
This statement compares two different strings (the realized ON and OFF outputs) and aligns with the contrastive quantities used elsewhere. The log universal Bayes factor for “ON vs. OFF” is seen to equal the complexity gap Thus, on each episode, a regulator behaves as if it were maximizing the scalar equivalently minimizing .
Thus, given a regulator R that persistently reduces the readout’s complexity relative to a null baseline ⌀ (the GAR setting of Definition 2), we can justify—on purely observational grounds—that R behaves as if it were minimizing a scalar objective. The objective should be canonical (not post hoc) and usable across episodes/tasks.
4. Discussion
Classical control theory—especially in the LTI case—provides powerful constructive synthesis methods yielding transparent regulator architectures under explicit model classes. Our results are different in scope: they give a distribution-free, single-instance necessity/diagnostic statement. If regulation induces a nontrivial contrastive compressibility gap, then the regulator R must carry algorithmic information about the world W.
However, this theoretical necessity does not by itself provide a controller design algorithm, and its practical application relies on computable surrogates for Kolmogorov complexity, such as Lempel-Ziv compression [22] or neural autoencoders [30,31]. Furthermore, to detect agency in a biological or social system, or in digital life systems—such as Conway’s Game of Life [32] or continuous cellular automata like Lenia [33,34]—one must first tentatively define a “membrane” (Markov blanket [5,35,36]) that separates the putative agent R from its environment W. By inspecting the input-output stream across various candidate boundaries, we can identify agents as those subsystems that maximize the compressibility gap . In this sense, while probabilistic and model-class-based approaches remain indispensable for constructive designs and performance guarantees, our AIT framework acts as an “outer layer” diagnostic that characterizes when “having a model” (in the information-theoretic sense) is unavoidable.
We summarize now our results:
- First regulator result: posterior form, given the observed x (Theorem 1).
By Solomonoff induction and the Coding Theorem [20,21,25,37,38], we showed that
Thus shorter joint generators are exponentially preferred; every extra bit in halves the posterior weight. Decomposing
shows that, at fixed marginals , the posterior is exponentially tilted in the algorithmic mutual information : each extra bit of multiplies posterior odds by .
Second regulator result: posterior with contrast (Theorem 2).
Without contrast, the story is pure Occam: (9) anchors the posterior near with a geometric excess-length tail; for fixed , this yields a high-probability lower bound on roughly . With contrast, if turning the regulator on yields while the off case has with , then any explaining obeys
so low mutual information is exponentially disfavored as the gap grows. In both regimes, the operational slogan holds: see a simple string ( small), suspect a simple generator ( small), and at fixed marginals, this means *suspect larger *.
The intuition behind these results is that seeing a simple string suggests its generation by a simple program. Formally, for the coupled hypothesis (wrapped as a single self-delimiting program), observing yields the Solomonoff posterior by the Coding Theorem [20,21,25,37,38]. Every extra bit of joint description halves posterior weight. This is the quantitative Occam tilt that operationalizes the slogan above.
The posterior mass of joint programs longer than decays geometrically:
Hence, the typical joint length is near . If and are externally constrained (e.g., by design or prior knowledge), this tail translates directly into a lower posterior bound on of the form with posterior confidence .
Our results are most informative when the observed readout is simple. If is large, the posterior constraints on joint complexity and on mutual information are inherently weak. From the geometric tail, for any there exists such that, with posterior probability at least ,
At fixed marginals and this yields
Hence, if is large (comparable to ), the lower bound on may be trivial (near 0 up to logs). Intuitively, a complex output does not force shared structure. It is compatible with a complex joint generator even when W and R share little algorithmic information.
On the other hand, the strength of the conclusion depends on the gap :
Thus even if is not very small, a large off/on gap still enforces a large posterior . In other words, contrast rescues identifiability of shared structure: the evidence scales exponentially in .
In the same universal calculus, regulation carries a canonical scalar interpretation: runtime behavior is as if minimizing (i.e., maximizing the on/off gap ), and design-time comparison across explanations favors larger via the GAR posterior tilt. This supplies an MDL/Occam objective grounded in the coding theorem (not an ad hoc utility) and complements the IMP’s structural requirements.
We note that a low alone does not prove high ; it concentrates posterior mass on short joint generators P. High follows (i) when and are fixed/known, or (ii) when contrast pins high via the off case. Without such constraints, short P could also arise from individually simple W and R.
Third regulator result: as-if Objective-function minimization (Theorem 3).
On the realized , the conditional Coding Theorem gives . Thus, the runtime scalar to minimize is . Together with the above, this implies that the regulator is acting (as-if) like an algorithmic agent (with a model of the world, objective function and planner).
Theorem 3 is a representation statement—not a mechanism: R need not compute K, but persistent large is exactly what maximizes universal evidence for “ON”, and it simultaneously makes low exponentially unlikely. For a mechanistic objective beyond the Minimum Description Length (MDL) evidence, three constructive routes are standard and complementary. First, in Linear Time-Invariant (LTI) plants the Internal Model Principle makes a structural claim—perfect robust regulation for a specified signal class requires embedding a dynamical copy of the exosystem in the controller—and optimal stabilizing designs arise from explicit quadratic/convex costs (e.g., the Linear Quadratic Regulator, LQR); in the nonlinear case, output-regulation theory yields constructive regulators under solvable regulator equations together with immersion/detectability and (local) zero-dynamics stability [13,14,15,17,18,19,39]. Second, in inverse optimal control and inverse reinforcement learning (IRL), trajectories that satisfy Karush–Kuhn–Tucker (KKT) regularity allow identification of a cost J (up to equivalences) whose minimizers reproduce the behavior; in discrete settings, IRL recovers reward functions consistent with observed policies [40,41,42]. Third, in revealed-preference analysis, if cross-episode choices satisfy the Generalized Axiom of Revealed Preference (GARP), Afriat and Varian guarantee the existence of a strictly increasing, concave utility that rationalizes the data, while Debreu’s representation and the Savage/Karni–Schmeidler frameworks provide (state-dependent) expected-utility forms under their axioms [43,44,45,46,47].
Planner/policy representation (as-if agent).
Any deterministic causal regulator R induces a computable policy mapping the coupled history (past interface I/O up to time t) to the next actuator symbol. This is simply the operational semantics of R viewed as a function of histories.
The coding-theorem Bayes-factor identity (Theorem 3) supplies a canonical scalar such that, on the realized episode, the sequence of actions produced by is as if chosen to maximize J subject to the world dynamics. Together with the algorithmic “internal model” conclusion (i.e., ), this yields the standard agent triad:
Interpretation. This is a representation statement, not a claim that R explicitly solves an optimization problem or contains a modular planner. The existence of is tautological for any deterministic R; the “as-if” objective follows from the universal evidence identity above. Across tasks/episodes, if the induced choices satisfy standard consistency axioms (e.g., GARP), classical revealed-preference theorems guarantee the existence of a (monotone, concave) utility that rationalizes the behavior [43,44]; and in dynamical settings, inverse optimal control/inverse RL constructs a cost for which the observed policy is (near-)optimal [40,41]. Thus, given (i) algorithmic model content and (ii) the canonical scalar J from the coding-theorem calculus, interpreting the regulator as carrying a policy/planner is both natural and technically justified.
4.1. Why AIT Is Needed
Our results are single-episode and distribution-free: they make statements about an individual realized readout x and about the pair as concrete programs, without positing a stochastic source. Classical (Shannon) information theory quantifies expected code lengths and mutual information with respect to a specified probability law; entropy and mutual information are undefined without a distribution, and asymptotic statements (AEP/typical sets) further require ergodicity/mixing assumptions [23]. In our setting, there is no given probabilistic model over worlds, regulators, or outputs—indeed, the point is to infer model content from a single realized x.
AIT supplies exactly the missing calculus. First, it provides a canonical, machine-invariant complexity for individual strings, , and a universal a priori semi measure (Solomonoff– Levin), connected by the Coding Theorem: [20,21,25]. This yields a universal Occam posterior over programs, from which (i) the geometric excess-length tail and (ii) our contrastive tilt bounds follow. No analogue exists in Shannon’s framework without positing an external prior over programs; there is no “canonical” or in Shannon theory.
Second, AIT lets us formalize “the regulator contains a model of the world” as algorithmic dependence, i.e., positive mutual algorithmic information (equivalently ), a notion defined for individual objects and invariant up to [9]. By contrast, Shannon’s requires a joint distribution over , which is neither given nor natural here.
Third, our key inequalities explicitly use and prefix complexity: the posterior tilt , the OFF-run lower bound on by simulation, and the contrastive penalty all rely on the Coding Theorem and Kraft–McMillan properties of prefix programs—again, objects absent from Shannon’s ensemble-level calculus.
Finally, while one can approximate with MDL/codelengths in practice, MDL’s justification itself rests on the AIT view that shorter descriptions are better and on the coding-theorem linkage between description length and (universal) probability [8]. In short: AIT provides the universal prior (m), object-level complexities (K), and mutual algorithmic information (M) needed to turn the informal slogan “see a simple string, suspect a simple generator” into posterior and contrastive theorems—none of which can be stated in Shannon’s framework without ad hoc model classes and priors.
4.1.1. Relation to the Internal Model Principle (IMP)
In the IMP, the closed loop is : an autonomous exosystem E (no inputs and no explicit time dependence, e.g., ), a controller C (the regulator), and a plant P. The regulated error is , where the reference r and disturbances are generated by E and y is measured from P [13,14]. In our notation, we group the World as and take the Regulator as (see Figure 2 and Table 2 for the comparison of the two frameworks in the case of a thermostat).
The assumptions in the IMP theorems are: (i) Classical necessity is sharpest for finite-dimensional LTI plants (linear, time-invariant) with exogenous signals generated by a finite-dimensional, neutrally stable LTI E; stabilizability/detectability and robustness (one fixed C works for a plant neighborhood) are standard [13,14]. (ii) The structural conclusion is internal-model necessity: perfect robust regulation for the specified signal class requires that C embed a dynamical copy of E (e.g., integrators for steps, oscillators for sinusoids); in MIMO, a p-copy is needed. (iii) Nonlinear generalizations (output regulation) require solvability of the regulator equations, suitable immersion/detectability, and (local) stability of the zero dynamics; guarantees are typically local/semiglobal, and necessity is not universal [17,18,19]. (iv) Infinite-dimensional/distributed settings and periodic signals may require infinite-dimensional internal models; technicalities arise with unbounded I/O operators [16].
In the AIT formulation (here), we assume: (i) Architecture-agnostic: no required split into E vs. P, and no specified place where R enters the causal path; we only assume a computable wrapper mapping for a fixed horizon N. (ii) Deterministic, closed coupling of world and regulator (no stochastic noise sources into W); statements are distribution-free and about the realized sequence. (iii) “Model” means algorithmic dependence: (equivalently ), not a literal dynamical replica. (iv) The main necessity is probabilistic: a positive on/off complexity gap exponentially tilts the universal posterior against explanations with small ; no linearity, smoothness, or regulator-equation conditions are imposed. See Section 2, Section 3, Section 4 and Section 5 and Appendix A of this work.
IMP yields a structural necessity (internal model in C of E) under explicit dynamical hypotheses; the AIT formulation yields an information-theoretic necessity (positive favored by the data) without assuming linearity, an split, or a particular causal insertion point for R. The two are complementary: IMP is the backbone for constructive regulation in structured classes; the AIT view covers unstructured architectures and single episodes with a universal Occam calculus [13,14,15,16,17,18].
The home thermostat.
As an example, consider a home thermostat as a regulator/controller. Let P be the living room + heater dynamics (thermal capacitance, heat loss, delays) and E the exogenous processes (setpoint schedule, outdoor weather/solar, occupancy). The Internal Model Principle (IMP) states that exact output regulation for a specified signal class is possible only if the controller embeds a copy of the exosystem E that generates those signals (e.g., an integrator for steps, an oscillator for a fixed sinusoid); plant knowledge is used for stabilization/shaping, but the IMP necessity targets E itself [14,17]. In our AIT view, a regulator R is “good” when it makes the realized readout more compressible than a null baseline; a sustained compressibility gap implies that R shares computable structure with the whole world :
A simple on/off thermostat with a deadband tuned to the room time constant typically yields a bounded limit cycle (not zero steady-state error); under IMP, it lacks the needed internal model of constants (no embedded integrator), hence it does not achieve exact regulation of the “constant” class [14,28]. Nevertheless, in the AIT sense, it still qualifies as a regulator: its policy encodes a very compressed model spanning P (heating raises T, room inertia) and weak regularities in E (quasi-constant setpoint, slowly varying weather), giving [11]. PI/PID or predictive thermostats remedy the IMP shortfall by embedding the appropriate internal model (and often explicit models of P and aspects of E) [28].
The AIT regulator framework (as well as the original GRT) is therefore more general than IMP: the regulator must carry a model of the world , where P is the plant (house/HVAC thermodynamics) and E the exogenous processes (setpoint schedule, weather/solar/occupancy), and IMP is recovered as a special case when the performance target is exact output regulation over a specified signal class. Under IMP, a controller qualifies for exact regulation only if it embeds a dynamical copy of the exosystem that generates the reference/disturbances (e.g., an integrator for steps, oscillators for sinusoids)—no model of P is required beyond stabilizability/detectability [14]; nonlinear output regulation extends this under additional immersion/detectability and regulator-equation solvability assumptions [17].
Our statements are thus complementary and distinct: in AIT, we work in a distribution-free, program-level setting and make no linearity or smoothness assumptions. We remain agnostic about what the regulator needs to model and do not demand exact regulation. We do not assert the existence of a dynamical replica inside R. Instead, we show that sustained contrastive compressibility ( ) tilts the universal posterior toward pairs with larger mutual algorithmic information , i.e., R carries algorithmic structure about W. Thus, “the regulator contains a model” is made precise as (information-theoretic dependence), not as an embedded exosystem. The IMP supplies structural necessity for perfect regulation within specified signal classes; our AIT results supply information-theoretic necessity for observed compressibility advantages, beyond linearity or probabilistic assumptions [15].
4.1.2. Practical Estimation of K and the Gap Δ
Our theorems are stated in terms of prefix Kolmogorov complexity, which is not computable. In practice, one can fix a reference prefix code C and estimate upper bounds,
with the same compressor C used across all conditions. Persistent across tasks is cumulative evidence that explanations with low are exponentially unlikely; maximizing is the natural scalar objective, the regulator appears to optimize on the observed data.
Some standard choices for providing upper bounds to Kolmogorov complexity are Lempel-Ziv compressors (LZ77/LZ78/LZW). LZ-type compressors are universal in a weak sense for stationary ergodic sources and are widely available. Implementations (gzip, lz4, etc.) are practical proxies for [22,48]. If both ON and OFF strings are available and a scale-free sanity check of contrast is needed, we can compute
where is the chosen code length and is concatenation [49,50]. NCD is heuristic but can reveal whether x is “closer” to trivial baselines than y.
The Block Decomposition Method (BDM) estimates K by tiling a string (or array) into small blocks whose complexities are looked up from Coding-Theorem-Method (CTM) tables (exhaustive output frequency statistics of small machines), plus a logarithmic penalty for multiplicities, where are distinct blocks and their multiplicities (see [51,52]). This is sensitive to small-scale algorithmic regularities beyond LZ’s parse statistics; it works on 1D/2D data (but depends on the chosen CTM table—size and machine model—and it suffers from boundary/tiling effects and additive constants that can be sizable for short N).
Finally, alternatives include learned compressors based on neural networks. Autoencoder/ variational–autoencoder codecs optimize a rate–distortion (thus MDL) objective, with an explicit codelength view via ELBO and practical lossless coding through bits-back [8,30,53,54,55]. In images and video, end-to-end trained autoencoders, hyperpriors, and autoregressive priors are now standard [31,56,57]. More recently, diffusion models have emerged as a powerful paradigm for high-fidelity perceptual compression, outperforming GANs and VAEs in realism at low bitrates [58,59]. Transformer-based compressors are also rapidly improving—both for images via hybrid Transformer–CNN codecs [60,61] and for general lossless compression using language-model predictors [62,63]. For a comprehensive benchmark of neural lossless compressors, see [64]. See also cross-modal results reported with large models [65] and neural codecs for audio, which now leverage foundation model representations [66,67]. From an MDL perspective, these models implement universal codes whose lengths upper-bound the negative log-likelihood under the learned generative model.
To improve discrimination, we can (i) use paired ON/OFF measurements on the same horizon N; report and its sampling variability across repeats/seeds; (ii) include trivial controls (e.g., all-zero regulator and randomized regulator) to sanity-check that responds in the expected direction; (iii) for finite N, complement point estimates with nonparametric tests (paired permutations on across episodes); (iv) when outputs are multivariate/real-valued, discretize with a fixed, reported quantization and alphabet before compression.
5. Conclusions
We developed a contrastive, algorithmic formulation of regulation: a regulator R is good for a world W at horizon N when it yields a compressible readout that is strictly more compressible than under a null baseline ⌀. This places the GRT claim (“good regulators are models”) on an AIT footing.
If switching a regulator on makes a system’s measured output much simpler to describe (i.e., more compressible) than when the regulator is off, then the regulator is very likely to carry non-trivial information about the world it controls—in the precise Algorithmic Information Theory sense of positive mutual algorithmic information between world and regulator. The strength of this evidence grows exponentially with the compressibility gap: large makes explanations with little shared structure vanishingly likely. Practically, this turns the old cybernetics slogan “every good regulator is a model of the system” into a quantitative, testable claim that does not assume linearity, stochastic models, or specific architectures. On each run, the theorem also singles out a canonical scalar objective: the regulator behaves as if it were minimizing the description length of the realized readout (equivalently, maximizing ).
Probabilistically, if W and R are independently sampled minimal programs (no mutual information), then low readout complexity—and especially the contrastive event “low under R, high under ⌀”—is exponentially unlikely in and . Thus, sustained compressibility relative to baseline is strong evidence that R shares non-trivial algorithmic structure with W ( ). This is the AIT face of the Good Regulator idea and complements the Internal Model Principle’s structural necessity results for classical regulation: the IMP identifies structural necessities for perfect/robust regulation in classical settings, whereas our AIT view applies beyond linearity and probability and turns regulation into a statement about description length. This bridge clarifies in what limited (yet precise) sense the cybernetics aphorism “good regulators must model” can be made rigorous [11,12]: successful regulation implies positive mutual algorithmic information between world and regulator.
The result supplies: (i) a distribution-free, single-episode diagnostic for “does the controller contain a model?”, (ii) a complement to the IMP (which requires embedding a copy of the signal generator under more restrictive and structured assumptions), and (iii) a simple experimental recipe—fix a lossless compressor, quantize the readout, compute two code lengths (ON vs. OFF), and use their difference as evidence of model content in the controller.
Finally, the coding-theorem view identifies a canonical scalar and implicates a planner: runtime minimization of (equivalently, maximization of ).
All together, these results provide the grounds to justify that if a system is seen to regulate another in the algorithmic sense (reducing the complexity of an output of the regulated system compared to no regulation), we can reasonably infer it is likely that the regulator uses a model of the regulated system and an associated scalar objective function.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Ruffini G. An Algorithmic Information Theory of Consciousness Neurosci. Conscious.20172017 nix 01910.1093/nc/nix 01930042851 PMC 6007168 · doi ↗ · pubmed ↗
- 2Ruffini G. Lopez-Sola E. AIT Foundations of Structured Experience J. Artif. Intell. Conscious.2022915319110.1142/s 2705078522500047 · doi ↗
- 3Ruffini G. Castaldo F. Vohryzek J. Structured Dynamics in the Algorithmic Agent Entropy 2025279010.3390/e 2701009039851710 PMC 11765005 · doi ↗ · pubmed ↗
- 4Friston K. A Free Energy Principle for Biological Systems Entropy 2012142100212110.3390/e 1411210023204829 PMC 3510653 · doi ↗ · pubmed ↗
- 5Parr T. Active Inference: The Free Energy Principle in Mind, Brain, and Behavior The MIT Press Cambridge, MA, USA 2022
- 6Gács P. Tromp J. Vitányi P.M.B. Algorithmic Statistics IEEE Trans. Inf. Theory 2001472443246310.1109/18.945257 · doi ↗
- 7Barron A.R. Rissanen J. Yu B. The Minimum Description Length Principle in Coding and Modeling IEEE Trans. Inf. Theory 1998442743276010.1109/18.720554 · doi ↗
- 8Grünwald P.D. The Minimum Description Length Principle MIT Press Cambridge, MA, USA 200710.7551/mitpress/4643.001.0001 · doi ↗
