
TL;DR
This paper defines and explores a novel concept of entropy modulo a prime p, extending classical entropy ideas into modular arithmetic and connecting it with polylogarithm identities.
Contribution
It introduces a unique definition of entropy in modular arithmetic, characterizes it via a functional equation, and links it to polylogarithm-related identities.
Findings
Entropy mod p is uniquely characterized by a functional equation.
Connections established between real entropy residues and entropy mod p.
Entropy mod p can be expressed as a polynomial satisfying specific identities.
Abstract
Building on work of Kontsevich, we introduce a definition of the entropy of a finite probability distribution in which the "probabilities" are integers modulo a prime p. The entropy, too, is an integer mod p. Entropy mod p is shown to be uniquely characterized by a functional equation identical to the one that characterizes ordinary Shannon entropy. We also establish a sense in which certain real entropies have residues mod p, connecting the concepts of entropy over R and over Z/pZ. Finally, entropy mod p is expressed as a polynomial which is shown to satisfy several identities, linking into work of Cathelineau, Elbaz-Vincent and Gangl on polylogarithms.
Entropy modulo a prime
Tom Leinster School of Mathematics, University of Edinburgh, Scotland; email [email protected]. MSC 2010: 94A17 (primary), 11A07, 11A99, 11T06, 13N15.
Contents
- 1 Introduction
- 2 Logarithms and derivations
- 3 The definition of entropy
- 4 The chain rule
- 5 Unique characterization of entropy
- 6 Information loss
- 7 The residue mod $p$ of real entropy
- 8 Entropy as a polynomial
- 9 Polynomial identities satisfied by entropy
1 Introduction
The concept of entropy is applied in almost every branch of science. Less widely appreciated, however, is that from a purely algebraic perspective, entropy has a very simple nature. Indeed, Shannon entropy is characterized nearly uniquely by a single equation, expressing a recursivity property. The purpose of this work is to introduce a parallel notion of entropy for probability distributions whose ‘probabilities’ are not real numbers, but integers modulo a prime $p$. The entropy of such a distribution is also an integer mod $p$.
We will see that despite the (current) lack of scientific application, this ‘entropy’ is fully deserving of the name. Indeed, it is characterized by a recursivity equation formally identical to the one that characterizes classical real entropy. It is also directly related to real entropy, via a notion of residue informally suggested by Kontsevich [14].
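Among the many types of entropy, the most basic is the Shannon entropy of a finite probability distribution $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$, defined as
$$H_{\mathbb{R}}(\boldsymbol{\pi}) = -\sum_{i \colon \pi_i \neq 0} \pi_i \log \pi_i. \tag{1}$$
It is this that we will imitate in the mod $p$ setting.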
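The aforementioned recursivity property concerns the entropy of the composite of two processes, in which the nature of the second process depends on the outcome of the first. Specifically, let $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$ be a finite probability distribution, and let $\boldsymbol{\gamma}^1, \ldots, \boldsymbol{\gamma}^n$ be further distributions, writing $\boldsymbol{\gamma}^i = (\gamma^i_1, \ldots, \gamma^i_{k_i})$. Their composite is
$$\boldsymbol{\pi} \circ (\boldsymbol{\gamma}^1, \ldots, \boldsymbol{\gamma}^n) = \bigl(\pi_1\gamma^1_1, \ldots, \pi_1\gamma^1_{k_1},\ \ldots,\ \pi_n\gamma^n_1, \ldots, \pi_n\gamma^n_{k_n}\bigr), \tag{2}$$
a probability distribution on $k_1 + \cdots + k_n$ elements. (Formally, this composition endows the sequence of simplices with the structure of an operad.) The chain rule (or recursivity or grouping law) for Shannon entropy is that
$$H_{\mathbb{R}}\bigl(\boldsymbol{\pi} \circ (\boldsymbol{\gamma}^1, \ldots, \boldsymbol{\gamma}^n)\bigr) = H_{\mathbb{R}}(\boldsymbol{\pi}) + \sum_{i=1}^n \pi_i H_{\mathbb{R}}(\boldsymbol{\gamma}^i). \tag{3}$$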
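The chain rule can be understood in terms of information. Suppose we toss a fair coin and then, depending on the outcome, either roll a fair die or draw fairly from a pack of 52 cards. There are $6 + 52 = 58$ possible final outcomes, and their probabilities are given by the composite distribution
$$\bigl(\tfrac{1}{2}, \tfrac{1}{2}\bigr) \circ \Bigl(\bigl(\tfrac{1}{6}, \ldots, \tfrac{1}{6}\bigr), \bigl(\tfrac{1}{52}, \ldots, \tfrac{1}{52}\bigr)\Bigr) = \bigl(\underbrace{\tfrac{1}{12}, \ldots, \tfrac{1}{12}}_{6}, \underbrace{\tfrac{1}{104}, \ldots, \tfrac{1}{104}}_{52}\bigr). \tag{4}$$
Now, the entropy of a distribution $\boldsymbol{\pi}$ measures the amount of information gained by learning the outcome of an observation drawn from $\boldsymbol{\pi}$ (measured in bits, if logarithms are taken to base $2$). In our example, knowing the outcome of the composite process tells us with certainty the outcome of the initial coin toss, plus with probability $1/2$ the outcome of a die roll and with probability $1/2$ the outcome of a card draw. Thus, the entropy of the composite distribution should be equal to
$$H_{\mathbb{R}}\bigl(\tfrac{1}{2}, \tfrac{1}{2}\bigr) + \tfrac{1}{2} H_{\mathbb{R}}\bigl(\tfrac{1}{6}, \ldots, \tfrac{1}{6}\bigr) + \tfrac{1}{2} H_{\mathbb{R}}\bigl(\tfrac{1}{52}, \ldots, \tfrac{1}{52}\bigr). \tag{5}$$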
This is indeed true, and is an instance of the chain rule.
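The coin, die and card example can be checked numerically. The following is a minimal sketch, not part of the original paper; the helper names `shannon_entropy` and `compose` are chosen here for illustration only.

```python
import math

def shannon_entropy(dist, base=2.0):
    """Shannon entropy of a finite distribution; zero entries are skipped."""
    return -sum(x * math.log(x, base) for x in dist if x > 0)

def compose(pi, gammas):
    """Composite pi o (gamma^1, ..., gamma^n), returned as a flat list."""
    return [pi_i * g for pi_i, gamma in zip(pi, gammas) for g in gamma]

coin = [1 / 2, 1 / 2]
die = [1 / 6] * 6
cards = [1 / 52] * 52

composite = compose(coin, [die, cards])

# Left-hand side of the chain rule: entropy of the composite distribution.
lhs = shannon_entropy(composite)
# Right-hand side: entropy of the coin plus the weighted entropies.
rhs = shannon_entropy(coin) + sum(
    pi_i * shannon_entropy(gamma) for pi_i, gamma in zip(coin, [die, cards])
)
print(len(composite), lhs, rhs)
```

The two printed entropies should agree up to floating-point error, and the composite has 58 entries.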
A classical theorem essentially due to Faddeev [11] states that up to a constant factor, Shannon entropy is the only continuous function assigning a real number to each finite probability distribution in such a way that the chain rule holds. In this sense, the chain rule is the characteristic property of entropy.
Our first task will be to formulate the right definition of entropy mod $p$. An immediate obstacle is that there is no logarithm function mod $p$, at least in the most obvious sense. Nevertheless, the classical Fermat quotient turns out to provide an acceptable substitute (Section 2). Closely related to the real logarithm is the nonlinear derivation $\partial_{\mathbb{R}}(x) = -x \log x$, and its mod $p$ analogue is $\partial_p(a) = (a - a^p)/p$ (a $p$-derivation, in the language of Buium [4]).
The entropy of a mod $p$ probability distribution $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$, with $\pi_i \in \mathbb{Z}/p\mathbb{Z}$, is then defined as
$$H_p(\boldsymbol{\pi}) = \frac{1}{p}\Bigl(1 - \sum_{i=1}^n a_i^p\Bigr) \in \mathbb{Z}/p\mathbb{Z}, \tag{6}$$
where $a_i$ is an integer representing $\pi_i$ (Section 3). The definition is independent of the choice of representatives $a_i$. This entropy satisfies a chain rule formally identical to that satisfied by real entropy (Section 4). We prove in Section 5 that up to a constant factor, $H_p$ is the one and only function satisfying the chain rule. This is the main justification for the definition.
Classical Shannon entropy quantifies the information associated with a probability space, but one can also seek to quantify the information lost by a map between probability spaces, seen as a deterministic process. For example, if one chooses uniformly at random a binary number with ten digits, then discards the last three, the discarding process loses three bits.
There is a formal definition of information loss; it includes the definition of entropy as a special case, and it has been uniquely characterized in work of Baez, Fritz and Leinster [2]. The advantage of working with information loss rather than entropy is that the characterizing equations look exactly like the linearity and homomorphism conditions that occur throughout algebra, in contrast to the chain rule. In Section 6, we show that an analogous characterization theorem holds mod $p$.
We then make precise an idea of Kontsevich linking entropy over $\mathbb{R}$ with entropy over $\mathbb{Z}/p\mathbb{Z}$. Consider a distribution $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$ whose probabilities are rational numbers. On the one hand, we can take its real entropy $H_{\mathbb{R}}(\boldsymbol{\pi})$. On the other, whenever $p$ is a prime not dividing the denominator of any $\pi_i$, we can view $\boldsymbol{\pi}$ as a probability distribution mod $p$ and therefore take its entropy $H_p(\boldsymbol{\pi})$ mod $p$. Kontsevich suggested viewing $H_p(\boldsymbol{\pi}) \in \mathbb{Z}/p\mathbb{Z}$ as the ‘residue’ of $H_{\mathbb{R}}(\boldsymbol{\pi}) \in \mathbb{R}$, and Section 7 establishes that this construction has the basic properties that one would expect from the name.
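Finally, we analyse $H_p$ not as a function but as a polynomial (Sections 8 and 9). We show that
$$H_p(\pi_1, \ldots, \pi_n) = -\sum_{\substack{0 \leq r_1, \ldots, r_n < p \\ r_1 + \cdots + r_n = p}} \frac{\pi_1^{r_1} \cdots \pi_n^{r_n}}{r_1! \cdots r_n!} \tag{7}$$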
(which formally is equal to ). We prove several identities in this polynomial. In the case of distributions on two elements, we find that
$$H_p(\pi, 1 - \pi) = \sum_{r=1}^{p-1} \frac{\pi^r}{r} \tag{8}$$
for $p \neq 2$, and we discuss some properties that this polynomial possesses.
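As a sanity check on the polynomial form, the sketch below compares two computations of the entropy of a mod $p$ distribution: the defining expression $\frac{1}{p}\bigl(1 - \sum a_i^p\bigr)$ with integer representatives $a_i$, and the sum $-\sum \pi_1^{r_1} \cdots \pi_n^{r_n} / (r_1! \cdots r_n!)$ over all $0 \leq r_i < p$ with $r_1 + \cdots + r_n = p$. The function names are illustrative, and the brute-force enumeration is only intended for small $p$ and $n$.

```python
from itertools import product
from math import factorial

def entropy_mod_p(pi, p):
    """(1 - sum a_i^p)/p reduced mod p, with representatives a_i in {0,...,p-1}."""
    reps = [x % p for x in pi]
    if sum(reps) % p != 1:
        raise ValueError("entries must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

def entropy_polynomial(pi, p):
    """Minus the sum of prod pi_i^{r_i} / r_i! over 0 <= r_i < p with sum p, mod p."""
    acc = 0
    for r in product(range(p), repeat=len(pi)):
        if sum(r) != p:
            continue
        term = 1
        for x, r_i in zip(pi, r):
            # r_i < p, so r_i! is invertible mod p.
            term = term * pow(x, r_i, p) * pow(factorial(r_i), -1, p) % p
        acc = (acc + term) % p
    return (-acc) % p

print(entropy_mod_p([2, 4, 0], 5), entropy_polynomial([2, 4, 0], 5))
```

The modular inverse `pow(x, -1, p)` requires Python 3.8 or later.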
The present work should be regarded as a beginning rather than an end. In information theory, Shannon entropy is just the simplest of a family of fundamental concepts including relative entropy, conditional entropy, and mutual information. It is natural to seek their mod $p$ analogues, and to prove analogous theorems; however, this is not attempted here.
Related work
This work builds on a two-and-a-half page note of Kontsevich [14]. In it, Kontsevich did just enough to show that a reasonable definition of entropy mod $p$ must exist, but without actually giving the definition except for probability distributions on two elements. He also briefly suggested viewing the entropy mod $p$ of a distribution with rational probabilities as the ‘residue’ of its real entropy. The relationship between his note and the present work is further clarified at the start of Section 7 and the end of Section 9.
Kontsevich’s note appears to have been motivated by questions about polylogarithms. (The polynomial (8) is a truncation of the power series of $-\log(1 - x)$, and one can consider more generally a truncation of the $m$th polylogarithm.) That line of enquiry has been pursued by Elbaz-Vincent and Gangl [9, 10]. As recounted in the introduction to [9], some of Kontsevich’s results had already appeared in papers of Cathelineau [5, 6]. The connection between this part of algebra and information theory was noticed at least as far back as 1996 ([6], p. 1327). In the present work, however, polylogarithms play no part and entropy takes centre stage.
A fully-fledged theory of information cohomology has been introduced by Baudot and Bennequin [3] and extended in several directions by Vigneaux [17]; it concerns topos invariants of categories of random variables. A basic result is that Shannon entropy is the only nontrivial cohomology class in degree $1$ (for a suitable choice of coefficients). The characterization below of entropy mod $p$ can also be understood in terms of degree $1$ information cohomology, over $\mathbb{Z}/p\mathbb{Z}$.
Unlike much previous work on characterizations of entropies, we are able to do without symmetry axioms. (Nor is symmetry used in information cohomology, as noted after Theorem 1 in [3].) For example, Faddeev’s theorem on real entropy [11] characterized it as the unique continuous quantity satisfying the chain rule and invariant under permutation of its arguments. However, a careful reading of the proof shows that the symmetry assumption can be dropped. The axiomatization of entropy via the so-called fundamental equation of information theory also uses a symmetry assumption. While symmetry appears to be essential to that approach (Remark 9.6), we will not need it.
The chain rule (3) is often stated in the case , , or occasionally in the different case . In the presence of the symmetry axiom, either of these cases implies the general case, by induction. For example, Faddeev used the first case, whose asymmetry forced him to add the symmetry axiom; but that can be avoided by assuming the chain rule in its general form.
The operation $\partial_p$ mentioned above is basic in the theory of $p$-derivations (as in Buium [4]), which are themselves closely related to Frobenius lifts and the Adams operations on $K$-theory (as in Joyal [13]).
One can speculate about extending the theory of entropy to fields other than $\mathbb{R}$ and $\mathbb{Z}/p\mathbb{Z}$, and in particular to the $p$-adic numbers. (The $p$-adic entropy of Deninger [7] is of a different nature.) Again, there may be a connection with the information cohomology of Baudot, Bennequin and Vigneaux, which takes place over an arbitrary field.
Convention
Throughout, $p$ denotes a prime number, possibly $2$.
Acknowledgements
I thank James Borger, Herbert Gangl and Todd Trimble for enlightening conversations.
2 Logarithms and derivations
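Real entropy is a kind of higher logarithm, in the senses that it has the multiplication-to-addition property
$$H_{\mathbb{R}}(\boldsymbol{\pi} \otimes \boldsymbol{\gamma}) = H_{\mathbb{R}}(\boldsymbol{\pi}) + H_{\mathbb{R}}(\boldsymbol{\gamma}) \tag{9}$$
(in notation defined at the end of Section 4), and that when restricted to uniform distributions, it is the logarithm function itself:
$$H_{\mathbb{R}}\bigl(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\bigr) = \log n. \tag{10}$$
To find the right definition of entropy mod $p$, we therefore begin by considering mod $p$ notions of logarithm.
Lagrange’s theorem immediately implies that there is no logarithm mod $p$, in that the only homomorphism from the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^\times$ to the additive group $\mathbb{Z}/p\mathbb{Z}$ is trivial. However, there is a substitute. For an integer $n$ not divisible by $p$, the Fermat quotient of $n$ mod $p$ is the integer
$$q_p(n) = \frac{n^{p-1} - 1}{p}. \tag{11}$$
We usually regard $q_p(n)$ as an element of $\mathbb{Z}/p\mathbb{Z}$. Eisenstein [8] observed: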
Lemma 2.1**.**
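The map $q_p \colon \{n \in \mathbb{Z} : p \nmid n\} \to \mathbb{Z}/p\mathbb{Z}$ has the following properties:
- i. $q_p(mn) = q_p(m) + q_p(n)$ for all $m, n \in \mathbb{Z}$ not divisible by $p$, and $q_p(1) = 0$;
- ii. $q_p(n + rp) = q_p(n) - r/n$ for all $n, r \in \mathbb{Z}$ with $n$ not divisible by $p$;
- iii. $q_p(n + p^2) = q_p(n)$ for all $n \in \mathbb{Z}$ not divisible by $p$.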
Proof**.**
Elementary calculations using Fermat’s little theorem.
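The three standard properties of the Fermat quotient $q_p(n) = (n^{p-1} - 1)/p$ (additivity on products, the shift rule $q_p(n + rp) = q_p(n) - r/n$, and $p^2$-periodicity) can be confirmed by brute force for small primes. The sketch below is illustrative only; `fermat_quotient` and `check_fermat_quotient` are names introduced here.

```python
def fermat_quotient(n, p):
    """q_p(n) = (n^(p-1) - 1)/p reduced mod p; n must not be divisible by p."""
    if n % p == 0:
        raise ValueError("n must not be divisible by p")
    return ((n ** (p - 1) - 1) // p) % p

def check_fermat_quotient(p, bound=40):
    """Check the three properties for all units n, m in the range 1..bound-1."""
    units = [n for n in range(1, bound) if n % p]
    for m in units:
        for n in units:
            # Logarithmic property: q_p(mn) = q_p(m) + q_p(n).
            assert fermat_quotient(m * n, p) == (
                fermat_quotient(m, p) + fermat_quotient(n, p)
            ) % p
    for n in units:
        inv_n = pow(n, -1, p)
        for r in range(-3, 4):
            # Shift rule: q_p(n + rp) = q_p(n) - r/n.
            assert fermat_quotient(n + r * p, p) == (
                fermat_quotient(n, p) - r * inv_n
            ) % p
        # Periodicity in n with period p^2.
        assert fermat_quotient(n + p * p, p) == fermat_quotient(n, p)
    return True

print(all(check_fermat_quotient(p) for p in (2, 3, 5, 7)))
```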
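The lemma implies that $q_p$ defines a group homomorphism
$$q_p \colon (\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}. \tag{12}$$
It is surjective, since by the lemma again, it has a section $r \mapsto 1 - rp$.
The Fermat quotient is the closest approximation to a logarithm mod $p$, in the sense that although there is no nontrivial group homomorphism $(\mathbb{Z}/p\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$, it is a homomorphism $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$. It is essentially unique as such: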
Proposition 2.2**.**
*Every group homomorphism $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$ is a scalar multiple of the Fermat quotient.*
Proof**.**
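This follows from the standard fact that the group $(\mathbb{Z}/p^2\mathbb{Z})^\times$ is cyclic (Theorem 10.6 of Apostol [1], for instance), together with the observation that $q_p$ is nontrivial (being surjective). Indeed, let $e$ be a generator of $(\mathbb{Z}/p^2\mathbb{Z})^\times$; then $q_p(e) \neq 0$, and given a homomorphism $\phi \colon (\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$, we have $\phi = c q_p$ where $c = \phi(e)/q_p(e)$.
Our characterization theorem for entropy mod $p$ will use the following characterization of the Fermat quotient.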
Proposition 2.3**.**
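Let $f \colon \{n \in \mathbb{N} : p \nmid n\} \to \mathbb{Z}/p\mathbb{Z}$ be a function. The following are equivalent:
- i. $f(mn) = f(m) + f(n)$ and $f(n + p^2) = f(n)$ for all $m, n \in \mathbb{N}$ not divisible by $p$;
- ii. $f = c q_p$ for some $c \in \mathbb{Z}/p\mathbb{Z}$.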
Proof**.**
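Since $q_p$ satisfies the conditions in (i), so does any constant multiple. Hence (ii) implies (i). The converse follows from Proposition 2.2, since any function satisfying (i) induces a group homomorphism $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$.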
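The entropy of a real probability distribution $\boldsymbol{\pi}$ is
$$H_{\mathbb{R}}(\boldsymbol{\pi}) = \sum_{i=1}^n \partial_{\mathbb{R}}(\pi_i), \tag{13}$$
where
$$\partial_{\mathbb{R}}(x) = \begin{cases} -x \log x & \text{if } x > 0, \\ 0 & \text{if } x = 0. \end{cases} \tag{14}$$
The operator $\partial_{\mathbb{R}}$ is a nonlinear derivation, in the sense that
$$\partial_{\mathbb{R}}(xy) = x\,\partial_{\mathbb{R}}(y) + \partial_{\mathbb{R}}(x)\,y. \tag{15}$$
In particular, $\partial_{\mathbb{R}}(1) = 0$. The entropy of $\boldsymbol{\pi}$ therefore measures the failure of the nonlinear operator $\partial_{\mathbb{R}}$ to preserve the sum $\sum \pi_i$:
$$H_{\mathbb{R}}(\boldsymbol{\pi}) = \sum_{i=1}^n \partial_{\mathbb{R}}(\pi_i) - \partial_{\mathbb{R}}\Bigl(\sum_{i=1}^n \pi_i\Bigr). \tag{16}$$
We will define entropy mod $p$ in such a way that the analogue of this equation holds.
The mod $p$ analogue of $\partial_{\mathbb{R}}$ is the function $\partial_p \colon \mathbb{Z} \to \mathbb{Z}$ defined by
$$\partial_p(n) = \frac{n - n^p}{p}. \tag{17}$$
We usually abbreviate $\partial_p$ to $\partial$, and treat $\partial(n)$ as an integer mod $p$. Evidently the element $\partial(n)$ of $\mathbb{Z}/p\mathbb{Z}$ depends only on the residue class of $n$ mod $p^2$, so we can also view $\partial$ as a function $\mathbb{Z}/p^2\mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$.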
Lemma 2.4**.**
*$\partial(mn) = m\,\partial(n) + \partial(m)\,n$ for all $m, n \in \mathbb{Z}$, and $\partial(1) = 0$.*
Proof**.**
This is an elementary consequence of Fermat’s little theorem.
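The derivation property of $\partial_p(n) = (n - n^p)/p$ holds as a congruence mod $p$, which can be seen concretely by computing the difference of the two sides of the Leibniz rule as an exact integer. The sketch below is a small check under that reading; the function names are illustrative.

```python
def partial_p(n, p):
    """The p-derivation (n - n^p)/p, an exact integer by Fermat's little theorem."""
    return (n - n ** p) // p

def leibniz_defect(m, n, p):
    """partial_p(mn) minus (m partial_p(n) + partial_p(m) n), as an integer."""
    return partial_p(m * n, p) - (m * partial_p(n, p) + partial_p(m, p) * n)

# The defect is generally nonzero as an integer but vanishes mod p.
print(leibniz_defect(2, 3, 5), leibniz_defect(2, 3, 5) % 5)
```

Working through the algebra, the defect equals $-p\,\partial_p(m)\,\partial_p(n)$, which is why it is divisible by $p$.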
3 The definition of entropy
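For $n \geq 1$, write
$$\Pi_n = \bigl\{\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n) \in (\mathbb{Z}/p\mathbb{Z})^n : \pi_1 + \cdots + \pi_n = 1\bigr\}. \tag{18}$$
An element of $\Pi_n$ will be called a probability distribution mod $p$, or simply a distribution. We will define the entropy of any such distribution.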
A standard elementary lemma will be repeatedly useful:
Lemma 3.1**.**
*Let $a, b \in \mathbb{Z}$. If $a \equiv b \pmod{p}$ then $a^p \equiv b^p \pmod{p^2}$.*
Proof**.**
Write $b = a + rp$ and expand $b^p$ using the binomial theorem.
The observations at the end of Section 2 suggest defining entropy mod $p$ by the analogue of equation (16), replacing $\partial_{\mathbb{R}}$ by $\partial_p$. In principle this is impossible, as $\partial_p(n)$ is only well-defined on congruence classes of $n$ mod $p^2$, not mod $p$. Thus, for $\pi_i \in \mathbb{Z}/p\mathbb{Z}$, the term $\partial_p(\pi_i)$ is not well-defined. Nevertheless, the strategy can be implemented:
Lemma 3.2**.**
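For all $n \geq 1$ and $a_1, \ldots, a_n \in \mathbb{Z}$ such that $\sum a_i \equiv 1 \pmod{p}$,
$$\sum_{i=1}^n \partial_p(a_i) - \partial_p\Bigl(\sum_{i=1}^n a_i\Bigr) \equiv \frac{1}{p}\Bigl(1 - \sum_{i=1}^n a_i^p\Bigr) \pmod{p}. \tag{19}$$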
Proof**.**
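The right-hand side is an integer, since $\sum a_i^p \equiv \sum a_i \equiv 1 \pmod{p}$. The lemma is equivalent to the congruence
$$\sum_{i=1}^n (a_i - a_i^p) - \Bigl(\sum_{i=1}^n a_i - \Bigl(\sum_{i=1}^n a_i\Bigr)^p\Bigr) \equiv 1 - \sum_{i=1}^n a_i^p \pmod{p^2}. \tag{20}$$
Cancelling, this reduces to
$$\Bigl(\sum_{i=1}^n a_i\Bigr)^p \equiv 1 \pmod{p^2}. \tag{21}$$
But $\sum a_i \equiv 1 \pmod{p}$, so $\bigl(\sum a_i\bigr)^p \equiv 1 \pmod{p^2}$ by Lemma 3.1.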
Definition 3.3**.**
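Let $n \geq 1$ and $\boldsymbol{\pi} \in \Pi_n$. The entropy of $\boldsymbol{\pi}$ is
$$H_p(\boldsymbol{\pi}) = \frac{1}{p}\Bigl(1 - \sum_{i=1}^n a_i^p\Bigr) \in \mathbb{Z}/p\mathbb{Z}, \tag{22}$$
where $a_i \in \mathbb{Z}$ represents $\pi_i \in \mathbb{Z}/p\mathbb{Z}$. We often abbreviate $H_p$ to $H$.
Lemma 3.1 guarantees that the definition is independent of the choice of representatives $a_1, \ldots, a_n$, and Lemma 3.2 gives
$$H(\boldsymbol{\pi}) = \sum_{i=1}^n \partial_p(a_i) - \partial_p\Bigl(\sum_{i=1}^n a_i\Bigr) \tag{23}$$
as in the real case (equation (16)). But in contrast to the real case, the term $\partial_p\bigl(\sum a_i\bigr)$ is not always zero, and if it were omitted then the right-hand side would no longer be independent of the choice of integers $a_i$.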
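A small numerical illustration may help here. The sketch below computes the entropy of the distribution $(2, 4, 0)$ mod $5$ from two different choices of integer representatives, both via $\frac{1}{p}\bigl(1 - \sum a_i^p\bigr)$ and via $\sum \partial_p(a_i) - \partial_p\bigl(\sum a_i\bigr)$, and also evaluates $\sum \partial_p(a_i)$ alone to show that dropping the correction term loses independence of the representatives. The function names and the chosen example are assumptions made for illustration.

```python
def partial_p(n, p):
    """The p-derivation (n - n^p)/p, an exact integer by Fermat's little theorem."""
    return (n - n ** p) // p

def entropy_from_reps(reps, p):
    """(1 - sum a_i^p)/p reduced mod p, for integers a_i summing to 1 mod p."""
    if sum(reps) % p != 1:
        raise ValueError("representatives must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

def entropy_via_derivation(reps, p):
    """Sum of partial_p(a_i) minus partial_p(sum a_i), reduced mod p."""
    return (sum(partial_p(a, p) for a in reps) - partial_p(sum(reps), p)) % p

def naive_sum(reps, p):
    """The same expression with the term partial_p(sum a_i) omitted."""
    return sum(partial_p(a, p) for a in reps) % p

p = 5
reps_a = [2, 4, 0]    # one choice of representatives of (2, 4, 0)
reps_b = [7, -1, 5]   # another choice, congruent entry by entry mod 5

print(entropy_from_reps(reps_a, p), entropy_from_reps(reps_b, p))            # 4 4
print(entropy_via_derivation(reps_a, p), entropy_via_derivation(reps_b, p))  # 4 4
print(naive_sum(reps_a, p), naive_sum(reps_b, p))                            # 0 1
```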
Example 3.4**.**
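Let $n \geq 1$ with $p \nmid n$. Then there is a uniform distribution
$$u_n = \bigl(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\bigr) \in \Pi_n. \tag{24}$$
Choose $a \in \mathbb{Z}$ representing $1/n \in \mathbb{Z}/p\mathbb{Z}$. By equation (23) and then the derivation property of $\partial$,
$$H(u_n) = n\,\partial(a) - \partial(na) = -a\,\partial(n). \tag{25}$$
But $\partial(n) = -n q_p(n)$, so $H(u_n) = an\,q_p(n) = q_p(n)$. This result over $\mathbb{Z}/p\mathbb{Z}$ is analogous to the formula $H_{\mathbb{R}}(u_n) = \log n$ for the real entropy of a uniform distribution.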
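The statement that the entropy of the uniform distribution on $n$ elements equals the Fermat quotient $q_p(n)$ can be tested directly. This is a minimal sketch, with helper names introduced here for illustration.

```python
def fermat_quotient(n, p):
    """q_p(n) = (n^(p-1) - 1)/p reduced mod p; n must not be divisible by p."""
    if n % p == 0:
        raise ValueError("n must not be divisible by p")
    return ((n ** (p - 1) - 1) // p) % p

def entropy_mod_p(pi, p):
    """(1 - sum a_i^p)/p reduced mod p, with representatives a_i in {0,...,p-1}."""
    reps = [x % p for x in pi]
    if sum(reps) % p != 1:
        raise ValueError("entries must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

def uniform(n, p):
    """The uniform distribution (1/n, ..., 1/n) mod p, for p not dividing n."""
    inverse_n = pow(n, -1, p)
    return [inverse_n] * n

for n in (2, 3, 4, 6):
    print(n, entropy_mod_p(uniform(n, 5), 5), fermat_quotient(n, 5))
```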
Example 3.5**.**
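Let $p = 2$. For $\boldsymbol{\pi} \in \Pi_n$, write $\operatorname{supp}(\boldsymbol{\pi}) = \{i : \pi_i = 1\}$, which has odd cardinality since $\sum \pi_i = 1$. Directly from the definition of entropy, $H(\boldsymbol{\pi}) \in \mathbb{Z}/2\mathbb{Z}$ is given by
$$H(\boldsymbol{\pi}) = \frac{1}{2}\bigl(|\operatorname{supp}(\boldsymbol{\pi})| - 1\bigr) = \begin{cases} 0 & \text{if } |\operatorname{supp}(\boldsymbol{\pi})| \equiv 1 \pmod{4}, \\ 1 & \text{if } |\operatorname{supp}(\boldsymbol{\pi})| \equiv 3 \pmod{4}. \end{cases} \tag{26}$$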
In preparation for the next example, we record a standard lemma:
Lemma 3.6**.**
*$\binom{p-1}{s} \equiv (-1)^s \pmod{p}$ for all $s \in \{0, \ldots, p-1\}$.*
Proof**.**
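$\binom{p-1}{s} = \dfrac{(p-1)(p-2)\cdots(p-s)}{s!} \equiv \dfrac{(-1)(-2)\cdots(-s)}{s!} = (-1)^s \pmod{p}$.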
Example 3.7**.**
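We compute the entropy of a distribution $(\pi, 1 - \pi)$ on two elements. Choose $a \in \mathbb{Z}$ representing $\pi \in \mathbb{Z}/p\mathbb{Z}$. Directly from the definition of entropy, and assuming that $p \neq 2$,
$$H(\pi, 1 - \pi) = \frac{1}{p}\bigl(1 - a^p - (1 - a)^p\bigr) = \sum_{r=1}^{p-1} (-1)^{r+1} \frac{1}{p}\binom{p}{r} a^r. \tag{27}$$
But $\frac{1}{p}\binom{p}{r} = \frac{1}{r}\binom{p-1}{r-1}$, so by Lemma 3.6, the coefficient of $a^r$ in the sum is $1/r$. Hence
$$H(\pi, 1 - \pi) = \sum_{r=1}^{p-1} \frac{\pi^r}{r}. \tag{28}$$
The function on the right-hand side was the starting point of Kontsevich’s note [14], and we return to it in Section 9. In the case $p = 2$, we have $H(\pi, 1 - \pi) = 0$ for both values of $\pi$.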
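The two-element formula $H(\pi, 1 - \pi) = \sum_{r=1}^{p-1} \pi^r / r$ for odd $p$ can be compared with the definition by exhausting all $\pi \in \mathbb{Z}/p\mathbb{Z}$. The sketch below does this; the names `two_point_entropy` and `truncated_log_sum` are illustrative.

```python
def entropy_mod_p(pi, p):
    """(1 - sum a_i^p)/p reduced mod p, with representatives a_i in {0,...,p-1}."""
    reps = [x % p for x in pi]
    if sum(reps) % p != 1:
        raise ValueError("entries must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

def two_point_entropy(x, p):
    """Entropy of the distribution (x, 1 - x) mod p, from the definition."""
    return entropy_mod_p([x % p, (1 - x) % p], p)

def truncated_log_sum(x, p):
    """Sum of x^r / r for r = 1, ..., p-1, computed in Z/pZ."""
    return sum(pow(x, r, p) * pow(r, -1, p) for r in range(1, p)) % p

for x in range(7):
    print(x, two_point_entropy(x, 7), truncated_log_sum(x, 7))
```

For $p = 2$ the two functions differ at $\pi = 1$, which is consistent with the assumption $p \neq 2$ in the derivation.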
Example 3.8**.**
Appending zero probabilities to a distribution does not change its entropy:
$$H(\pi_1, \ldots, \pi_n, 0, \ldots, 0) = H(\pi_1, \ldots, \pi_n). \tag{29}$$
A subtlety of distributions mod , absent in the standard real setting, is that nonzero ‘probabilities’ can sum to zero. But in general, when ,
[TABLE]
For example, when , and , Example 3.4 gives
[TABLE]
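To make the subtlety concrete, the sketch below takes $p = 3$ and compares the one-point distribution $(1)$ with two extensions of it: one obtained by appending zeros, and one obtained by appending $p$ copies of $1$, which are nonzero but sum to zero mod $p$. This particular example is chosen here for illustration and is not necessarily the one used in the paper.

```python
def entropy_mod_p(pi, p):
    """(1 - sum a_i^p)/p reduced mod p, with representatives a_i in {0,...,p-1}."""
    reps = [x % p for x in pi]
    if sum(reps) % p != 1:
        raise ValueError("entries must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

p = 3
base = [1]
padded_with_zeros = [1, 0, 0]
padded_with_ones = [1] + [1] * p   # the p extra entries sum to 0 mod p

print(entropy_mod_p(base, p))               # 0
print(entropy_mod_p(padded_with_zeros, p))  # 0
print(entropy_mod_p(padded_with_ones, p))   # 2, that is, -1 mod 3
```

The last distribution is the uniform distribution on $p + 1$ elements, so its entropy is the Fermat quotient $q_p(p + 1) = -1$.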
4 The chain rule
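Here we formulate the mod $p$ version of the chain rule for entropy, which will later be shown to characterize entropy uniquely up to a constant.
In the Introduction, it was noted that real probability distributions can be composed in a way that corresponds to performing two random processes in sequence. The same formula (2) defines a composition of probability distributions mod $p$, where now
$$\boldsymbol{\pi} \in \Pi_n, \quad \boldsymbol{\gamma}^1 \in \Pi_{k_1},\ \ldots,\ \boldsymbol{\gamma}^n \in \Pi_{k_n}, \quad \boldsymbol{\pi} \circ (\boldsymbol{\gamma}^1, \ldots, \boldsymbol{\gamma}^n) \in \Pi_{k_1 + \cdots + k_n}. \tag{32}$$
And entropy mod $p$ satisfies the same chain rule for composition: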
Proposition 4.1** (Chain rule).**
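We have
$$H\bigl(\boldsymbol{\pi} \circ (\boldsymbol{\gamma}^1, \ldots, \boldsymbol{\gamma}^n)\bigr) = H(\boldsymbol{\pi}) + \sum_{i=1}^n \pi_i H(\boldsymbol{\gamma}^i) \tag{33}$$
*for all $n, k_1, \ldots, k_n \geq 1$, all $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n) \in \Pi_n$, and all $\boldsymbol{\gamma}^i \in \Pi_{k_i}$.*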
Proof**.**
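Write $\boldsymbol{\gamma}^i = \bigl(\gamma^i_1, \ldots, \gamma^i_{k_i}\bigr)$. Choose $a_i \in \mathbb{Z}$ representing $\pi_i$ and $b^i_j \in \mathbb{Z}$ representing $\gamma^i_j$, for each $i$ and $j$. Write $b^i = b^i_1 + \cdots + b^i_{k_i}$.
We evaluate in turn the three terms in (33). First, by Lemma 3.2 and the derivation property of $\partial$ (Lemma 2.4),
$$H\bigl(\boldsymbol{\pi} \circ (\boldsymbol{\gamma}^1, \ldots, \boldsymbol{\gamma}^n)\bigr) = \sum_{i,j} \partial(a_i b^i_j) - \partial\Bigl(\sum_{i,j} a_i b^i_j\Bigr) = \sum_i a_i \sum_j \partial(b^i_j) + \sum_i \partial(a_i)\, b^i - \partial\Bigl(\sum_i a_i b^i\Bigr).$$
Second, since $b^i \equiv 1 \pmod{p}$, we have $a_i b^i \equiv a_i \pmod{p}$, so $a_i b^i$ represents $\pi_i$. Hence
$$H(\boldsymbol{\pi}) = \sum_i \partial(a_i b^i) - \partial\Bigl(\sum_i a_i b^i\Bigr) = \sum_i a_i\, \partial(b^i) + \sum_i \partial(a_i)\, b^i - \partial\Bigl(\sum_i a_i b^i\Bigr).$$
Third,
$$\sum_i \pi_i H(\boldsymbol{\gamma}^i) = \sum_i a_i \Bigl(\sum_j \partial(b^i_j) - \partial(b^i)\Bigr).$$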
The result follows.
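The chain rule mod $p$ lends itself to an exhaustive check for small parameters. The sketch below enumerates every $\boldsymbol{\pi} \in \Pi_2$ and every pair $(\boldsymbol{\gamma}^1, \boldsymbol{\gamma}^2) \in \Pi_2 \times \Pi_3$ and compares both sides; the helper names and the choice of sizes are assumptions made for illustration.

```python
from itertools import product

def entropy_mod_p(pi, p):
    """(1 - sum a_i^p)/p reduced mod p, with representatives a_i in {0,...,p-1}."""
    reps = [x % p for x in pi]
    if sum(reps) % p != 1:
        raise ValueError("entries must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

def distributions(n, p):
    """All elements of Pi_n: n-tuples over Z/pZ whose entries sum to 1."""
    for head in product(range(p), repeat=n - 1):
        yield head + ((1 - sum(head)) % p,)

def compose(pi, gammas, p):
    """Composite pi o (gamma^1, ..., gamma^n), computed entrywise mod p."""
    return tuple(pi_i * g % p for pi_i, gamma in zip(pi, gammas) for g in gamma)

def chain_rule_holds(pi, gammas, p):
    """Compare H(pi o gammas) with H(pi) + sum pi_i H(gamma^i) in Z/pZ."""
    lhs = entropy_mod_p(compose(pi, gammas, p), p)
    rhs = entropy_mod_p(pi, p) + sum(
        pi_i * entropy_mod_p(gamma, p) for pi_i, gamma in zip(pi, gammas)
    )
    return lhs == rhs % p

p = 5
print(all(
    chain_rule_holds(pi, (g1, g2), p)
    for pi in distributions(2, p)
    for g1 in distributions(2, p)
    for g2 in distributions(3, p)
))
```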
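A special case of composition is the tensor product of distributions, defined for $\boldsymbol{\pi} \in \Pi_n$ and $\boldsymbol{\gamma} \in \Pi_m$ by
$$\boldsymbol{\pi} \otimes \boldsymbol{\gamma} = \boldsymbol{\pi} \circ (\boldsymbol{\gamma}, \ldots, \boldsymbol{\gamma}) = (\pi_1\gamma_1, \ldots, \pi_1\gamma_m,\ \ldots,\ \pi_n\gamma_1, \ldots, \pi_n\gamma_m) \in \Pi_{nm}.$$
In the analogous case of real distributions, $\boldsymbol{\pi} \otimes \boldsymbol{\gamma}$ is the joint distribution of two independent random variables with distributions $\boldsymbol{\pi}$ and $\boldsymbol{\gamma}$.
The chain rule immediately implies a logarithmic property of entropy mod $p$: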
Corollary 4.2**.**
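*$H(\boldsymbol{\pi} \otimes \boldsymbol{\gamma}) = H(\boldsymbol{\pi}) + H(\boldsymbol{\gamma})$ for all $\boldsymbol{\pi} \in \Pi_n$ and $\boldsymbol{\gamma} \in \Pi_m$.*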
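The logarithmic property $H(\boldsymbol{\pi} \otimes \boldsymbol{\gamma}) = H(\boldsymbol{\pi}) + H(\boldsymbol{\gamma})$ can be illustrated in the same brute-force style. In the sketch below, `tensor` forms all pairwise products mod $p$; the names are introduced here for illustration.

```python
from itertools import product

def entropy_mod_p(pi, p):
    """(1 - sum a_i^p)/p reduced mod p, with representatives a_i in {0,...,p-1}."""
    reps = [x % p for x in pi]
    if sum(reps) % p != 1:
        raise ValueError("entries must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

def distributions(n, p):
    """All n-tuples over Z/pZ whose entries sum to 1."""
    for head in product(range(p), repeat=n - 1):
        yield head + ((1 - sum(head)) % p,)

def tensor(pi, gamma, p):
    """Tensor product: all products pi_i * gamma_j mod p, in row-major order."""
    return [x * y % p for x in pi for y in gamma]

pi, gamma, p = (2, 4), (3, 1, 2), 5
print(entropy_mod_p(tensor(pi, gamma, p), p),
      (entropy_mod_p(pi, p) + entropy_mod_p(gamma, p)) % p)
```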
5 Unique characterization of entropy
Our main theorem is that up to a constant factor, entropy mod $p$ is the only quantity satisfying the chain rule.
Theorem 5.1**.**
Let $\bigl(I \colon \Pi_n \to \mathbb{Z}/p\mathbb{Z}\bigr)_{n \geq 1}$ be a sequence of functions. The following are equivalent:
- i. $I$ satisfies the chain rule (that is, satisfies the conclusion of Proposition 4.1 with $I$ in place of $H$);
- ii. $I = cH$ for some $c \in \mathbb{Z}/p\mathbb{Z}$.
Since $H$ satisfies the chain rule, so does any constant multiple. Hence (ii) implies (i). We now begin the proof of the converse.
For the rest of the proof, let $\bigl(I \colon \Pi_n \to \mathbb{Z}/p\mathbb{Z}\bigr)_{n \geq 1}$ be a sequence of functions satisfying the chain rule. Recall that $u_n$ denotes the uniform distribution $(1/n, \ldots, 1/n) \in \Pi_n$, for $n \geq 1$ not divisible by $p$.
Lemma 5.2**.**
- i. $I(u_{mn}) = I(u_m) + I(u_n)$ for all $m, n \in \mathbb{N}$ not divisible by $p$;
- ii. $I(u_1) = 0$.
Proof**.**
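By the chain rule, $I$ has the logarithmic property
$$I(\boldsymbol{\pi} \otimes \boldsymbol{\gamma}) = I(\boldsymbol{\pi}) + I(\boldsymbol{\gamma})$$
for all $\boldsymbol{\pi} \in \Pi_m$ and $\boldsymbol{\gamma} \in \Pi_n$. In particular, for all $m, n \in \mathbb{N}$ not divisible by $p$,
$$I(u_{mn}) = I(u_m \otimes u_n) = I(u_m) + I(u_n),$$
proving (i). For (ii), take $m = n = 1$ in (i).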
Lemma 5.3**.**
*$I(1, 0) = 0 = I(0, 1)$.*
Proof**.**
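We compute $I(1, 0, 0)$ in two ways. On the one hand, by the chain rule,
$$I(1, 0, 0) = I\bigl((1, 0) \circ ((1, 0), u_1)\bigr) = I(1, 0) + 1 \cdot I(1, 0) + 0 \cdot I(u_1) = 2I(1, 0).$$
On the other, by the chain rule and the fact that $I(u_1) = 0$,
$$I(1, 0, 0) = I\bigl((1, 0) \circ (u_1, (1, 0))\bigr) = I(1, 0) + 1 \cdot I(u_1) + 0 \cdot I(1, 0) = I(1, 0).$$
Hence $I(1, 0) = 0$. The proof that $I(0, 1) = 0$ is similar.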
Lemma 5.4**.**
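For all $\boldsymbol{\pi} \in \Pi_n$ and $0 \leq i \leq n$,
$$I(\pi_1, \ldots, \pi_i, 0, \pi_{i+1}, \ldots, \pi_n) = I(\boldsymbol{\pi}).$$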
Proof**.**
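First suppose that $i \geq 1$. Then
$$(\pi_1, \ldots, \pi_i, 0, \pi_{i+1}, \ldots, \pi_n) = \boldsymbol{\pi} \circ \bigl(u_1, \ldots, u_1, (1, 0), u_1, \ldots, u_1\bigr),$$
where $(1, 0)$ is the $i$th argument. Applying $I$ to both sides, then using the chain rule and $I(u_1) = I(1, 0) = 0$, gives the result. The case $i = 0$ is proved similarly, using $I(0, 1) = 0$.
We will prove the characterization theorem by analysing $I(u_n)$ as $n$ varies. The chain rule will allow us to deduce the value of $I(\boldsymbol{\pi})$ for more general distributions $\boldsymbol{\pi}$, thanks to the following lemma.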
Lemma 5.5**.**
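Let $\boldsymbol{\pi} \in \Pi_n$ with $\pi_i \neq 0$ for all $i$. For each $i$, let $a_i \geq 1$ be an integer representing $\pi_i$, and write $a = a_1 + \cdots + a_n$. Then
$$I(\boldsymbol{\pi}) = I(u_a) - \sum_{i=1}^n \pi_i I(u_{a_i}).$$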
Proof**.**
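First note that none of $a, a_1, \ldots, a_n$ is a multiple of $p$, so $u_a$ and $u_{a_i}$ are well-defined. We have
$$\boldsymbol{\pi} \circ (u_{a_1}, \ldots, u_{a_n}) = u_a,$$
since each entry $\pi_i / a_i$ on the left is equal to $1$ in $\mathbb{Z}/p\mathbb{Z}$, as is $1/a$. Applying $I$ to both sides and using the chain rule gives the result.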
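For the entropy $H$ itself, whose value on the uniform distribution $u_n$ is the Fermat quotient $q_p(n)$, the lemma reads $H(\boldsymbol{\pi}) = q_p(a) - \sum a_i q_p(a_i)$, where the $a_i \geq 1$ are integers representing the nonzero entries $\pi_i$ and $a = \sum a_i$. The sketch below checks this instance by enumeration; it is an illustration with hypothetical function names, not a step of the proof.

```python
from itertools import product

def fermat_quotient(n, p):
    """q_p(n) = (n^(p-1) - 1)/p reduced mod p; n must not be divisible by p."""
    if n % p == 0:
        raise ValueError("n must not be divisible by p")
    return ((n ** (p - 1) - 1) // p) % p

def entropy_mod_p(pi, p):
    """(1 - sum a_i^p)/p reduced mod p, with representatives a_i in {0,...,p-1}."""
    reps = [x % p for x in pi]
    if sum(reps) % p != 1:
        raise ValueError("entries must sum to 1 mod p")
    return ((1 - sum(a ** p for a in reps)) // p) % p

def entropy_via_uniforms(reps, p):
    """q_p(a) - sum a_i q_p(a_i) for positive representatives a_i with sum a."""
    a = sum(reps)
    return (
        fermat_quotient(a, p) - sum(a_i * fermat_quotient(a_i, p) for a_i in reps)
    ) % p

def admissible_reps(n, p):
    """Tuples of positive integers below 2p, none divisible by p, summing to 1 mod p."""
    for reps in product(range(1, 2 * p), repeat=n):
        if all(r % p for r in reps) and sum(reps) % p == 1:
            yield reps

print(all(
    entropy_via_uniforms(reps, 5) == entropy_mod_p(reps, 5)
    for reps in admissible_reps(3, 5)
))
```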
We come now to the most delicate part of the argument. Since $H(u_n) = q_p(n)$, and since $q_p(n)$ is $p^2$-periodic in $n$, if $I$ is to be a constant multiple of $H$ then $I(u_n)$ must also be $p^2$-periodic in $n$. We show this directly.
Lemma 5.6**.**
*$I(u_{n + p^2}) = I(u_n)$ for all natural numbers $n$ not divisible by $p$.*
Proof**.**
First we prove the existence of a constant $c \in \mathbb{Z}/p\mathbb{Z}$ such that for all $n \in \mathbb{N}$ not divisible by $p$,
$$I(u_{n+p}) = I(u_n) - \frac{c}{n}. \tag{49}$$
(Compare Lemma 2.1(ii).) An equivalent statement is that is independent of . Since for any and we can choose some with , it is enough to show that whenever with and ,
[TABLE]
To prove this, consider the distribution
[TABLE]
By Lemma 5.5 and the fact that ,
[TABLE]
But also
[TABLE]
so by the same argument,
[TABLE]
Comparing the two expressions for gives equation (50), thus proving the initial claim.
By induction on equation (49),
$$I(u_{n + rp}) = I(u_n) - \frac{cr}{n} \tag{55}$$
for all $n, r \in \mathbb{N}$ with $n$ not divisible by $p$. The result follows by putting $r = p$.
We can now prove the characterization theorem for entropy modulo .
Proof of Theorem 5.1.
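Define $f \colon \{n \in \mathbb{N} : p \nmid n\} \to \mathbb{Z}/p\mathbb{Z}$ by $f(n) = I(u_n)$. Lemma 5.2, Lemma 5.6 and Proposition 2.3 together imply that $f = c q_p$ for some $c \in \mathbb{Z}/p\mathbb{Z}$. By Example 3.4, an equivalent statement is that $I(u_n) = cH(u_n)$ for all $n$ not divisible by $p$.
Since both $I$ and $cH$ satisfy the chain rule, Lemma 5.5 applies to both; and since $I$ and $cH$ are equal on uniform distributions, they are also equal on all distributions $\boldsymbol{\pi}$ such that $\pi_i \neq 0$ for all $i$. Finally, applying Lemma 5.4 to both $I$ and $cH$, we deduce by induction that $I(\boldsymbol{\pi}) = cH(\boldsymbol{\pi})$ for all $\boldsymbol{\pi} \in \Pi_n$.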
A variant of the characterization theorem will be useful. The distributions considered so far can be viewed as probability measures (mod $p$) on sets of the form $\{1, \ldots, n\}$, but it will be convenient to generalize to arbitrary finite sets.
Thus, given a finite set $X$, write $\Pi_X$ for the set of families $\boldsymbol{\pi} = (\pi_x)_{x \in X}$ of elements of $\mathbb{Z}/p\mathbb{Z}$ such that $\sum_{x \in X} \pi_x = 1$. A finite probability space mod $p$ is a finite set $X$ together with an element $\boldsymbol{\pi} \in \Pi_X$.
As in the real case, we can take convex combinations of probability spaces. Given a finite probability space and a further family of finite probability spaces, all mod , we obtain a new probability space
[TABLE]
mod . Here is the disjoint union of the sets , and gives probability to an element .
The operation of taking convex combinations is simply composition of distributions, in different notation. Indeed, if and then the set is naturally identified with , and under this identification, corresponds to the composite distribution .
The entropy of is, of course, defined as
[TABLE]
where represents for each . It is isomorphism-invariant: whenever and are finite probability spaces mod and there is some bijection satisfying for all , then . The chain rule for entropy mod , translated into the notation of convex combinations, states that
[TABLE]
for all finite probability spaces and mod .
Corollary 5.7**.**
Let be a function assigning an element of to each finite probability space mod . The following are equivalent:
- i.
* is isomorphism-invariant and satisfies the chain rule (58) (with in place of );* 2. ii.
* for some .*
Proof**.**
That (ii) implies (i) follows from the observations above. Conversely, take a function satisfying (i). Restricting to finite sets of the form defines a sequence of functions satisfying the chain rule. By Theorem 5.1, there is some constant such that for all and . Now take any finite probability space . We have
[TABLE]
for some and , and by isomorphism-invariance of both and ,
[TABLE]
proving (ii).
Remark 5.8**.**
This corollary is slightly weaker than our main characterization result, Theorem 5.1. Indeed, if is an isomorphism-invariant function on the class of finite probability spaces mod then in particular, permuting the arguments of a measure does not change the value that gives it. Thus, the corollary also follows from a weaker version of Theorem 5.1 in which the putative entropy function is also assumed to be symmetric in its arguments. But Theorem 5.1 shows that the symmetry assumption is, in fact, unnecessary.
6 Information loss
Grothendieck came along and said, ‘No, the Riemann–Roch theorem is not a theorem about varieties, it’s a theorem about morphisms between varieties.’ —Nicholas Katz (quoted in [12], p. 1046).
The entropy of a probability space is a special case of a more general concept, the information loss of a map between probability spaces. This point is most easily explained through the real case, as follows.
Given a real probability distribution on a finite set, the entropy of is the amount of information gained by learning the result of an observation drawn from . For example, if then the entropy (to base ) is , reflecting the fact that results of draws from cannot be communicated in fewer than bits each.
In the same spirit, one can ask how much information is lost by a deterministic process. Consider, for instance, the process of forgetting the suit of a card drawn fairly from a standard $52$-card pack. Since the four suits are distributed uniformly, $2$ bits of information are lost. An alternative viewpoint is that the information loss is the amount of information at the start of the process minus the amount at the end, which is
$$H_{\mathbb{R}}\bigl(\tfrac{1}{52}, \ldots, \tfrac{1}{52}\bigr) - H_{\mathbb{R}}\bigl(\tfrac{1}{13}, \ldots, \tfrac{1}{13}\bigr) = \log 52 - \log 13 = \log 4. \tag{61}$$
If we take logarithms to base $2$ then the information loss is, again, $2$ bits. Hence the two viewpoints give the same result.
Generally, given a measure-preserving map between finite probability spaces, we can quantify the information lost by in either of two equivalent ways. We can condition on the outcome , taking for each the amount of information lost by collapsing the fibre :
[TABLE]
(The argument of is the distribution restricted to and normalized to sum to .) Alternatively, we can subtract the amount of information at the end of the process from the amount at the start:
[TABLE]
The two expressions (62) and (63) are equal, as we will show in the analogous mod case.
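In the real case, the agreement of the two viewpoints can be seen numerically. The sketch below represents a finite probability space as a dictionary from outcomes to probabilities and a deterministic process as a Python function on outcomes. It computes the information loss once by conditioning on each fibre (weighting the entropy of the normalized fibre by its total probability) and once as the entropy at the start minus the entropy at the end. The representation and the function names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def shannon_entropy(dist):
    """Shannon entropy in bits of an iterable of probabilities; zeros are skipped."""
    return -sum(x * math.log2(x) for x in dist if x > 0)

def pushforward(sigma, f):
    """Distribution of f(y) when y is drawn from sigma."""
    out = defaultdict(float)
    for y, prob in sigma.items():
        out[f(y)] += prob
    return dict(out)

def loss_by_difference(sigma, f):
    """Entropy at the start of the process minus entropy at the end."""
    return shannon_entropy(sigma.values()) - shannon_entropy(pushforward(sigma, f).values())

def loss_by_conditioning(sigma, f):
    """Sum over outcomes x of P(x) times the entropy of the normalized fibre over x."""
    total = 0.0
    for x, prob_x in pushforward(sigma, f).items():
        if prob_x == 0:
            continue
        fibre = [prob / prob_x for y, prob in sigma.items() if f(y) == x]
        total += prob_x * shannon_entropy(fibre)
    return total

# Forgetting the suit of a card drawn fairly from a 52-card pack.
pack = {(rank, suit): 1 / 52 for rank in range(13) for suit in range(4)}
forget_suit = lambda card: card[0]

print(loss_by_difference(pack, forget_suit), loss_by_conditioning(pack, forget_suit))
```

Both printed values should be 2 bits up to floating-point error.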
Entropy is the special case of information loss where one discards all the information. That is, the entropy of a probability distribution on a set is the information loss of the unique map from to the one-point space. In this sense, the concept of information loss subsumes the concept of entropy.
The description so far is of information loss over , which was analysed and characterized in Baez, Fritz and Leinster [2]. (In particular, equation (5) of [2] describes the relationship between information loss and conditional entropy.) We now show that a strictly analogous characterization theorem holds over , even in the absence of an information-theoretic interpretation.
Definition 6.1**.**
Let and be finite probability spaces mod . A measure-preserving map is a function such that
[TABLE]
for all .
Finite probability spaces mod and their measure-preserving maps form a category . The construction of convex combinations is functorial, in the following sense: given a finite probability space mod and a family of maps
[TABLE]
in , we have the map
[TABLE]
in that maps to . (Although the function does not depend on and would usually be written as just , it will be convenient to use this more informative notation.)
Entropy is an invariant of the objects of , and information loss is an invariant of the maps in :
Definition 6.2**.**
Let be a measure-preserving map between finite probability spaces mod . The information loss of is
[TABLE]
Lemma 6.3**.**
Let be a measure-preserving map between finite probability spaces mod . Then
[TABLE]
Proof**.**
Since entropy is unaffected by adjoining elements of probability $0$, we may assume that for each . Write for the probability distribution on the set . The probability space mod $p$ is isomorphic to
[TABLE]
and the chain rule (58) then gives
[TABLE]
Information loss has some intuitively reasonable properties. First, an invertible process loses no information: whenever is an isomorphism in . This follows from the isomorphism-invariance of entropy.
Second, the information loss of two processes performed in series is the sum of the information lost by each individually:
[TABLE]
for any maps
[TABLE]
in . This is immediate from the definition.
Third, the information loss of a convex combination of two processes performed in parallel is the corresponding convex combination of their individual information losses. That is, given and maps
[TABLE]
in , we have
[TABLE]
Indeed, using the chain rule (58) and writing ,
[TABLE]
and equation (73) follows.
These three properties of information loss mod are enough to characterize it completely, up to a constant factor.
Theorem 6.4**.**
Let be a function assigning an element of to each measure-preserving map between finite probability spaces mod . The following are equivalent:
- i. $K$ has these three properties:
  - (a) $K(f) = 0$ for all isomorphisms $f$;
  - (b) $K(g \circ f) = K(g) + K(f)$ for all composable pairs (72) of measure-preserving maps;
  - (c) $K\bigl(\lambda f \sqcup (1 - \lambda) f'\bigr) = \lambda K(f) + (1 - \lambda) K(f')$ for all measure-preserving maps $f$ and $f'$ and all $\lambda \in \mathbb{Z}/p\mathbb{Z}$;
- ii. $K = cL$ for some $c \in \mathbb{Z}/p\mathbb{Z}$, where $L$ denotes information loss.
Remark 6.5**.**
Like any group, can be regarded as a one-object category, and conditions (a) and (b) then imply that is a functor .
Proof of Theorem 6.4.
We have already shown that information loss satisfies the three conditions of (i), and it follows that (ii) implies (i).
For the converse, suppose that satisfies (i). Given a finite probability space , write for the unique measure-preserving map
[TABLE]
and define . For any measure-preserving map , the triangle
[TABLE]
commutes, so by condition (b),
[TABLE]
So in order to prove the theorem, it suffices to show that for some constant . And for this, it is enough to prove that satisfies the hypotheses of Corollary 5.7.
First, is isomorphism-invariant, since if is an isomorphism then , so by (82).
Second, satisfies the chain rule (58); that is,
[TABLE]
for all finite probability spaces and mod . To see this, write
[TABLE]
for the function defined by whenever . Then defines a measure-preserving map
[TABLE]
We now evaluate in two ways. On the one hand, by equation (82),
[TABLE]
On the other,
[TABLE]
so by condition (c) and induction,
[TABLE]
Comparing the two expressions for gives the chain rule (equation (58)) for , as claimed.
Corollary 5.7 therefore applies, giving for some . It follows from equation (82) that .
Theorem 6.4 has two striking features. First, the main equations that characterize information loss,
[TABLE]
are entirely linear. Despite the fact that information loss subsumes entropy, the equations are simpler in form than the characterizing equation for entropy, the chain rule.
A second striking feature of Theorem 6.4 is that the axioms on the hypothetical information loss function force to depend only on the domain and codomain of . This is an instance of a general categorical fact: for a functor from a category with a terminal object to a groupoid, whenever and are maps in with the same domain and the same codomain.
7 The residue mod $p$ of real entropy
At the end of the note [14] in which he initiated the subject of entropy modulo a prime, Kontsevich wrote:
Conclusion: If we have a random variable $\xi$ which takes finitely many values with all probabilities in $\mathbb{Q}$ then we can define not only the transcendental number $H(\xi)$ but also its ‘residues modulo $p$’ for almost all primes $p$!
Formally, given and a prime , write for the set of finite probability distributions where each is a rational number expressible as a fraction with denominator not divisible by . Then each represents an element of , and the suggestion is to view as the residue mod of the real number .
Although the quotation above was the sum total of what Kontsevich wrote on the matter, his suggestion can be developed. First, different distributions can have the same entropy over ; for instance,
[TABLE]
There is, therefore, a question of consistency: Kontsevich’s proposal only makes sense if
[TABLE]
for all and . Second, the word ‘residue’ suggests additivity: that the residue of a sum should be the sum of the residues.
We will show that both these properties are indeed satisfied: there is a well-defined, addition-preserving map
[TABLE]
Lemma 7.1**.**
Let and let be integers. Then
[TABLE]
*where the first equality is in , the second is in , and we set . *
Proof**.**
Since and , it is enough to prove the result in the case where each of the integers and is strictly positive. We may then write with and , and similarly . We adopt the convention that by default, the index ranges over and the index over .
Assume that . We have
[TABLE]
with , and similarly for . It follows that
[TABLE]
We consider each of these equations in turn.
First, since , the Fermat quotient $q_p\bigl(\prod A_i^{a_i}\bigr)$ is well-defined, and the logarithmic property of $q_p$ (Lemma 2.1(i)) gives
[TABLE]
Consider the right-hand side as an element of . When , the -summand vanishes. When , the -summand is . Hence
[TABLE]
in . A similar result holds for , so equation (97) gives
[TABLE]
Second,
[TABLE]
so . Now
[TABLE]
and if then . A similar result holds for , so equation (98) gives
[TABLE]
in .
Finally, for each such that , we have and so in . Hence
[TABLE]
both sides being $0$. Summing equations (101), (104) and (105) gives the result.
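The logarithmic property of the Fermat quotient used in this proof can be checked directly. This is a small sketch, assuming the usual definition $q_p(a) = (a^{p-1} - 1)/p$ taken mod $p$ for integers $a$ not divisible by $p$; the function name `fermat_quotient` is hypothetical.

```python
def fermat_quotient(a, p):
    """q_p(a) = (a^(p-1) - 1) / p, reduced mod p, for an integer a not divisible by p."""
    if a % p == 0:
        raise ValueError("a must not be divisible by p")
    return ((pow(a, p - 1) - 1) // p) % p

def is_logarithmic(a, b, p):
    """Check q_p(ab) = q_p(a) + q_p(b) in Z/pZ."""
    return fermat_quotient(a * b, p) == (fermat_quotient(a, p) + fermat_quotient(b, p)) % p

p = 7
pairs = [(a, b) for a in range(1, 30) for b in range(1, 30) if a % p and b % p]
print(all(is_logarithmic(a, b, p) for a, b in pairs))
```

The property holds because $(ab)^{p-1} - 1 = (a^{p-1} - 1)(b^{p-1} - 1) + (a^{p-1} - 1) + (b^{p-1} - 1)$, and the first term on the right is divisible by $p^2$.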
We deduce that the real entropy of a rational distribution determines its entropy modulo $p$:
Theorem 7.2**.**
Let , and . Then
[TABLE]
Proof**.**
We can write
[TABLE]
where , and are nonnegative integers with and
[TABLE]
By multiplying all of these integers by a constant, we may assume that .
We have
[TABLE]
with the convention that . Multiplying both sides by then raising to the power of gives
[TABLE]
By the analogous equation for and the assumption that , it follows that
[TABLE]
By Lemma 7.1, then,
[TABLE]
in . Moreover, . Hence
[TABLE]
But , so represents the element of , so by Lemma 3.2, the left-hand side of this equation is . Similarly, the right-hand side is . Hence .
It follows that Kontsevich’s residue classes of real entropies are well-defined. That is, writing
[TABLE]
there is a unique map of sets
[TABLE]
such that for all and .
Proposition 7.3**.**
The set is closed under addition, and the residue map
[TABLE]
*preserves addition. *
Proof**.**
Let and . We must show that and
[TABLE]
We will use the tensor product of real probability distributions, which is defined by the same formula as for distributions over (Section 4). Evidently , and it is an instance of the chain rule that
[TABLE]
Hence , and
[TABLE]
where the third equality is by Corollary 4.2.
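The additivity used in this proof can be illustrated on a concrete tensor product. The sketch below again assumes the formula $H(\pi) = \frac{1}{p}\bigl(1 - \sum a_i^p\bigr)$ from Definition 3.3; the names `entropy_mod_p` and `tensor` and the sample distributions are illustrative only.

```python
from fractions import Fraction
from itertools import product

def entropy_mod_p(probs, p):
    """Entropy mod p of a rational distribution (denominators must be prime to p)."""
    reps = [(q.numerator * pow(q.denominator, -1, p)) % p for q in probs]
    total = 1 - sum(a ** p for a in reps)
    assert total % p == 0, "probabilities must sum to 1"
    return (total // p) % p

def tensor(pi, gamma):
    """Tensor product: all pairwise products pi_i * gamma_j."""
    return [x * y for x, y in product(pi, gamma)]

pi = [Fraction(1, 2), Fraction(1, 2)]
gamma = [Fraction(1, 4), Fraction(3, 4)]
for p in (3, 5, 7):
    lhs = entropy_mod_p(tensor(pi, gamma), p)
    rhs = (entropy_mod_p(pi, p) + entropy_mod_p(gamma, p)) % p
    print(p, lhs, rhs)
```

For each prime the two printed values should coincide, matching the fact that the entropy of a tensor product is the sum of the entropies, both over the reals and mod $p$.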
Remark 7.4**.**
The set appears to have no very simple description. Evidently it is an additive submonoid of the -linear subspace of with basis . One can show that for each prime and that . However, some elements of do contain components of .
8 Entropy as a polynomial
There is an alternative approach to entropy modulo a prime. Previously, to define the entropy of a distribution mod $p$, we had to step outside $\mathbb{Z}/p\mathbb{Z}$ to make arbitrary choices of integers representing the ‘probabilities’, then show that the definition was independent of those choices (Definition 3.3). We now show how to define $H(\pi)$ directly as a function of $\pi_1, \ldots, \pi_n$. That function is a polynomial, by the following classical fact:
Lemma 8.1**.**
Let be a finite field with elements, let , and let be a function. Then there is a unique polynomial of the form
[TABLE]
() such that
[TABLE]
*for all . *
Proof**.**
Write for the set of polynomials of the form (121). Write for the function induced by a polynomial in variables. Then defines a map
[TABLE]
We have to prove that is bijective. Both domain and codomain have elements, so it suffices to prove that is surjective.
First define a polynomial by
[TABLE]
For ,
[TABLE]
Now, given a function , define a polynomial by
[TABLE]
Then and .
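The surjectivity argument is constructive, and can be sketched in code for a prime field. This example assumes the standard construction: an indicator polynomial $\delta(x) = \prod_i (1 - x_i^{p-1})$ of the origin, and the interpolant $\sum_a F(a)\,\delta(x - a)$. It specializes the lemma to $\mathbb{Z}/p\mathbb{Z}$ and evaluates the interpolant pointwise rather than expanding its coefficients; the names `delta` and `interpolate` are hypothetical.

```python
from itertools import product

def delta(x, p):
    """Value in Z/pZ of prod_i (1 - x_i^(p-1)): 1 at the origin, 0 elsewhere."""
    out = 1
    for xi in x:
        out = out * (1 - pow(xi, p - 1, p)) % p
    return out

def interpolate(F, n, p):
    """Return a function evaluating sum_a F(a) * delta(x - a) on (Z/pZ)^n."""
    points = list(product(range(p), repeat=n))

    def f(x):
        shifted = lambda a: [(xi - ai) % p for xi, ai in zip(x, a)]
        return sum(F(a) * delta(shifted(a), p) for a in points) % p

    return f

# An arbitrary function (Z/5Z)^2 -> Z/5Z that is not given by an obvious polynomial.
p, n = 5, 2
F = lambda a: ((a[0] * a[0] + 3 * a[1] + 1) // 2) % p
f = interpolate(F, n, p)
print(all(f(x) == F(x) for x in product(range(p), repeat=n)))
```

Each summand has degree $p - 1$ in every variable, which is why the interpolant lies in the restricted class of polynomials considered in the lemma.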
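In particular, taking the field to be $\mathbb{Z}/p\mathbb{Z}$, entropy modulo $p$ can be expressed as a polynomial of degree less than $p$ in each variable. For each $n \geq 1$, define the polynomial $h(x_1, \ldots, x_n)$ over $\mathbb{Z}/p\mathbb{Z}$ by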
[TABLE]
Proposition 8.2**.**
For all and ,
[TABLE]
Proof**.**
Let . We will show that whenever are integers representing , then
[TABLE]
is an integer representing . The result will follow, since if then , so by Lemma 3.1.
We have to prove that
[TABLE]
Since is invertible in , an equivalent statement is that
[TABLE]
The right-hand side of (131) is $\bigl(\sum a_i\bigr)^p - \sum a_i^p$, so equation (131) reduces to
[TABLE]
And since $(p-1)! \equiv -1 \pmod{p}$ and $\sum a_i^p \equiv \sum a_i \equiv \bigl(\sum a_i\bigr)^p \pmod{p}$, this is true.
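Proposition 8.2 can be confirmed by brute force for small primes. The sketch below assumes two formulas that are not reproduced in this extract: the definition $H(\pi) = \frac{1}{p}\bigl(1 - \sum a_i^p\bigr)$ of Definition 3.3, and the entropy polynomial $-\sum x_1^{r_1} \cdots x_n^{r_n} / (r_1! \cdots r_n!)$, summed over $0 \leq r_i < p$ with $r_1 + \cdots + r_n = p$. The function names are hypothetical.

```python
from itertools import product
from math import factorial

def entropy_from_reps(reps, p):
    """H = (1 - sum a_i^p) / p mod p, for integers a_i whose sum is 1 mod p."""
    total = 1 - sum(a ** p for a in reps)
    assert total % p == 0, "entries must sum to 1 mod p"
    return (total // p) % p

def entropy_polynomial(x, p):
    """Evaluate -sum prod_i x_i^r_i / r_i! over 0 <= r_i < p with sum r_i = p, in Z/pZ."""
    acc = 0
    for r in product(range(p), repeat=len(x)):
        if sum(r) != p:
            continue
        term = 1
        for xi, ri in zip(x, r):
            # r_i < p, so r_i! is invertible mod p.
            term = term * pow(xi, ri, p) * pow(factorial(ri), -1, p) % p
        acc += term
    return (-acc) % p

def agree_everywhere(p, n):
    """Compare the two definitions on every distribution mod p on n elements."""
    return all(
        entropy_from_reps(x, p) == entropy_polynomial(x, p)
        for x in product(range(p), repeat=n)
        if sum(x) % p == 1
    )

print([(p, n, agree_everywhere(p, n)) for p in (2, 3, 5) for n in (1, 2, 3)])
```

The check is exhaustive only for the listed small cases; it is a sanity check on the statement, not a substitute for the proof.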
Remark 8.3**.**
The entropy polynomial is homogeneous of degree $p$, so the induced function on $(\mathbb{Z}/p\mathbb{Z})^n$ is a degree $p$ homogeneous extension of the entropy function $H\colon \Pi_n \to \mathbb{Z}/p\mathbb{Z}$.
Everything that we have done for can also be done for . Equation (129) expresses in terms of integers representing its arguments. As in Lemma 3.2, can equivalently be expressed as $\sum \partial(a_i) - \partial\bigl(\sum a_i\bigr)$. The definition and characterization of information loss can be extended to finite measure spaces mod $p$ (sets equipped with an element of ), and the convexity condition (73) is then replaced by linearity conditions: and . An analogous characterization theorem over $\mathbb{R}$ was already proved as Corollary 4 of [2].
9 Polynomial identities satisfied by entropy
We now establish further polynomial identities in , stronger than the functional equations previously proved for . The first is closely related to the chain rule, as we shall see.
Theorem 9.1**.**
Let . Then satisfies the following identity of polynomials in commuting variables over :
[TABLE]
Proof**.**
The left-hand side is equal to
[TABLE]
where the inner sum is over all such that
[TABLE]
Split the outer sum into two parts, the first consisting of the summands in which none of is equal to , and the second consisting of the summands in which one is equal to and the others are zero. Then the polynomial (133) is equal to , where
[TABLE]
We have
[TABLE]
and
[TABLE]
The result follows.
Corollary 9.2** (Polynomial chain rule).**
Let . Then satisfies the following identity of polynomials in commuting variables , over :
[TABLE]
Proof**.**
This follows from Theorem 9.1 on substituting for , using the fact that is homogeneous of degree $p$.
The original chain rule for entropy mod $p$ (Proposition 4.1) follows: given and as in that proposition, substitute and .
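The chain rule at the level of functions can be tested exhaustively on small cases. This sketch assumes the form $H\bigl(\gamma \circ (\pi^1, \ldots, \pi^n)\bigr) = H(\gamma) + \sum_i \gamma_i H(\pi^i)$ for Proposition 4.1, where the composite lists the products $\gamma_i \pi^i_j$, together with the formula of Definition 3.3 for $H$; the names `compose` and `chain_rule_holds` are hypothetical.

```python
from itertools import product

def entropy_mod_p(reps, p):
    """H = (1 - sum a_i^p) / p mod p, for integers a_i whose sum is 1 mod p."""
    total = 1 - sum(a ** p for a in reps)
    assert total % p == 0, "entries must sum to 1 mod p"
    return (total // p) % p

def compose(gamma, pis, p):
    """Composite distribution: the entries gamma_i * pi^i_j, listed in order."""
    return [(g * q) % p for g, pi in zip(gamma, pis) for q in pi]

def chain_rule_holds(gamma, pis, p):
    lhs = entropy_mod_p(compose(gamma, pis, p), p)
    rhs = entropy_mod_p(gamma, p) + sum(g * entropy_mod_p(pi, p) for g, pi in zip(gamma, pis))
    return lhs == rhs % p

p = 5
two_point = [(a, (1 - a) % p) for a in range(p)]  # every distribution mod 5 on two elements
print(all(chain_rule_holds(g, [s, t], p) for g, s, t in product(two_point, repeat=3)))
```

Only composites of two-element distributions are tested here; larger cases can be covered by changing `two_point` to a list of longer tuples summing to 1 mod $p$.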
The entropy polynomial in one variable is $0$, by definition. But the entropy polynomial in two variables is nontrivial and satisfies a cocycle condition:
Corollary 9.3**.**
The two-variable entropy polynomial satisfies the polynomial identity
[TABLE]
Similar results appear in Cathelineau [5] (pp. 58–59), Kontsevich [14], and Elbaz-Vincent and Gangl [10] (Section 2.3).
Proof**.**
Theorem 9.1 with and gives
[TABLE]
and similarly,
[TABLE]
The result follows.
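The cocycle condition can be checked numerically on the induced functions. The sketch assumes the condition takes the standard 2-cocycle form $h(x, y) + h(x + y, z) = h(x, y + z) + h(y, z)$, and that the two-variable entropy polynomial is $h(x, y) = -\sum_{r=1}^{p-1} x^r y^{p-r} / (r!\,(p-r)!)$; both are assumptions, since the displayed formulas are not reproduced here. Evaluating at all points of $(\mathbb{Z}/p\mathbb{Z})^3$ tests the identity of functions only, which is weaker than the polynomial identity.

```python
from itertools import product
from math import factorial

def h2(x, y, p):
    """Two-variable entropy polynomial evaluated in Z/pZ (assumed form, see lead-in)."""
    acc = 0
    for r in range(1, p):
        coeff = pow(factorial(r) * factorial(p - r), -1, p)
        acc += pow(x, r, p) * pow(y, p - r, p) * coeff
    return (-acc) % p

def cocycle_holds(x, y, z, p):
    lhs = h2(x, y, p) + h2((x + y) % p, z, p)
    rhs = h2(x, (y + z) % p, p) + h2(y, z, p)
    return (lhs - rhs) % p == 0

for p in (2, 3, 5, 7):
    print(p, all(cocycle_holds(x, y, z, p) for x, y, z in product(range(p), repeat=3)))
```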
We are especially interested in the case where the arguments of the entropy function sum to $1$. Under that restriction, the entropy polynomial reduces to a simple expression:
Proposition 9.4**.**
If , there is an identity of polynomials
[TABLE]
and if , there is an identity of polynomials
[TABLE]
Proof**.**
The case $p = 2$ is trivial; suppose otherwise. In Example 3.7, we proved the equality of functions
[TABLE]
(). We now have to prove that this is a polynomial identity. By Lemma 8.1, it suffices to show that the polynomial
[TABLE]
has degree strictly less than $p$. Since it plainly has degree at most $p$, we only need to show that the coefficient of $x^p$ vanishes.
The coefficient of in is
[TABLE]
For ,
[TABLE]
in , using first the fact that and then Lemma 3.6. Hence the coefficient of in is . But defines a permutation of , so the sum is equal to , which is $0$ since $p$ is odd.
Following Elbaz-Vincent and Gangl [9], we write
[TABLE]
(Elbaz-Vincent and Gangl assumed that .) Despite the lack of formal resemblance, is the mod $p$ analogue of the real function
[TABLE]
Since is evidently a symmetric polynomial,
[TABLE]
in . The polynomial also satisfies a more complicated identity whose significance will be explained shortly. Following Kontsevich [14], Elbaz-Vincent and Gangl proved:
Proposition 9.5** (Elbaz-Vincent and Gangl).**
There is a polynomial identity
[TABLE]
Both sides of this equation are indeed polynomials, as . Elbaz-Vincent and Gangl proved it using differential equations (Proposition 5.9(2) of [9]), but it also follows easily from the cocycle identity for :
Proof**.**
Since is homogeneous of degree ,
[TABLE]
The identity to be proved is, therefore, equivalent to
[TABLE]
Since is symmetric, this in turn is equivalent to
[TABLE]
which is an instance of the cocycle identity of Corollary 9.3.
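The identity can also be tested pointwise for odd primes. This sketch assumes that the polynomial in question is the truncated logarithm $\sum_{r=1}^{p-1} x^r / r$ (written $£_1$ by Elbaz-Vincent and Gangl) and that the identity reads $£_1(x) + (1-x)^p\, £_1\bigl(\tfrac{y}{1-x}\bigr) = £_1(y) + (1-y)^p\, £_1\bigl(\tfrac{x}{1-y}\bigr)$; both are assumptions, as the displays are not reproduced here. The name `pounds1` is hypothetical, and only points with $x \neq 1 \neq y$ are evaluated so that the divisions make sense.

```python
def pounds1(x, p):
    """Value in Z/pZ of sum_{r=1}^{p-1} x^r / r, for an odd prime p (assumed form)."""
    return sum(pow(x, r, p) * pow(r, -1, p) for r in range(1, p)) % p

def identity_holds(x, y, p):
    """Pointwise check of the assumed identity, for x != 1 and y != 1 in Z/pZ."""
    u, v = (1 - x) % p, (1 - y) % p
    lhs = pounds1(x, p) + pow(u, p, p) * pounds1(y * pow(u, -1, p) % p, p)
    rhs = pounds1(y, p) + pow(v, p, p) * pounds1(x * pow(v, -1, p) % p, p)
    return (lhs - rhs) % p == 0

for p in (3, 5, 7, 11):
    points = [t for t in range(p) if t != 1]
    print(p, all(identity_holds(x, y, p) for x in points for y in points))
```

Because $(1-x)^p$ and $1-x$ agree as functions on $\mathbb{Z}/p\mathbb{Z}$, this pointwise test cannot distinguish the exponent $p$ from the exponent $1$; it checks the induced functions only.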
Proposition 9.5 can be understood as follows. Any finite probability distribution can be expressed as an iterated composite of distributions on two elements. Hence, using the chain rule, the entropy of any distribution can be computed in terms of entropies of distributions on two elements. In this sense, the sequence of functions $\bigl(H\colon \Pi_n \to \mathbb{Z}/p\mathbb{Z}\bigr)_{n \geq 1}$ reduces to the single function , which is effectively a function in one variable:
[TABLE]
A similar reduction can be performed over $\mathbb{R}$.
On the other hand, an arbitrary function cannot generally be extended to a sequence of functions satisfying the chain rule (nor, similarly, in the real case). Indeed, by expressing a distribution as a composite in two different ways, we obtain an equation that must satisfy if such an extension is to exist. Assuming the symmetry property , that equation is
[TABLE]
(); compare Proposition 9.5.
Equation (160) is sometimes called the ‘fundamental equation of information theory’. Thus, Proposition 9.5 is a polynomial version mod $p$ of the fundamental equation. Over $\mathbb{R}$, it has been studied since at least 1958 [16]. Assuming that is symmetric, the fundamental equation is the only obstacle to the extension problem, in the sense that if satisfies (160) then the extension can be performed.
In the real case, the function (151) is a solution of the fundamental equation. Up to a scalar multiple, it is the only measurable solution of the fundamental equation satisfying . It can be deduced that up to a constant factor, Shannon entropy for finite real probability distributions is characterized uniquely by measurability, symmetry and the chain rule (Lee [15]).
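The real case can be illustrated in floating point. The sketch assumes that the function (151) is the binary Shannon entropy $-x \log x - (1-x)\log(1-x)$ and that the fundamental equation has its usual form $F(x) + (1-x)\,F\bigl(\tfrac{y}{1-x}\bigr) = F(y) + (1-y)\,F\bigl(\tfrac{x}{1-y}\bigr)$ for $x, y \in [0, 1)$ with $x + y \leq 1$; the names `binary_entropy` and `fundamental_gap` are hypothetical.

```python
import math

def binary_entropy(x):
    """-x log x - (1 - x) log(1 - x), with the convention 0 log 0 = 0."""
    def term(t):
        return 0.0 if t == 0 else -t * math.log(t)
    return term(x) + term(1 - x)

def fundamental_gap(F, x, y):
    """Left-hand side minus right-hand side of the fundamental equation."""
    lhs = F(x) + (1 - x) * F(y / (1 - x))
    rhs = F(y) + (1 - y) * F(x / (1 - y))
    return lhs - rhs

# Grid of points with x + y < 1, kept away from the boundary to avoid rounding issues.
grid = [(i / 10, j / 10) for i in range(10) for j in range(10) if i + j <= 9]
print(max(abs(fundamental_gap(binary_entropy, x, y)) for x, y in grid))
```

The printed maximum gap should be at the level of floating-point rounding error. Both sides equal the entropy of the three-point distribution $(x, y, 1 - x - y)$ minus a common term, which is the chain rule applied in two different ways.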
In the mod $p$ case, we know that the function is symmetric and satisfies the fundamental equation. Since any such function can be extended to a sequence of functions satisfying the chain rule, it follows from Theorem 5.1 that up to a constant factor, is the unique symmetric solution of the fundamental equation.
Remark 9.6**.**
In his seminal note [14], Kontsevich unified the real and mod $p$ cases with a homological argument, using a cocycle identity equivalent to that in Corollary 9.3. In doing so, he established that is the correct formula for the entropy mod $p$ of a distribution mod $p$ on two elements (assuming, as he did, that $p \neq 2$). Although he gave no definition of the entropy of a probability distribution mod $p$ on an arbitrary finite number of elements, his arguments showed that a unique reasonable such definition must exist.
The present work develops the framework hinted at in [14], and provides the further definition and characterization of information loss mod $p$. It also makes two improvements to [14].
The first is the streamlined inclusion of the case $p = 2$. The second is the dropping of all symmetry requirements. In axiomatic approaches to entropy based on the fundamental equation of information theory (160), such as those of Lee [15] and Kontsevich, the symmetry axiom is essential. Indeed, is also a solution of (160), and similarly, the polynomial identity of Proposition 9.5 is also satisfied by in place of . The symmetry axiom is used to rule out these and other undesired solutions. This is why Lee’s characterization of real entropy needed the assumption that it is symmetric in its arguments. In contrast, symmetry is needed nowhere in the approach that we have taken.
References
- [1] T. M. Apostol. Introduction to Analytic Number Theory. Undergraduate Texts in Mathematics. Springer, 1976.
- [2] J. Baez, T. Fritz, and T. Leinster. A characterization of entropy in terms of information loss. Entropy, 13:1945–1957, 2011.
- [3] P. Baudot and D. Bennequin. The homological nature of entropy. Entropy, 17:3253–3318, 2015.
- [4] A. Buium. Differential characters of abelian varieties over $p$-adic fields. Inventiones Mathematicae, 122:309–340, 1995.
- [5] J.-L. Cathelineau. Sur l’homologie de $\mathrm{SL}_2$ à coefficients dans l’action adjointe. Mathematica Scandinavica, 63:51–86, 1988.
- [6] J.-L. Cathelineau. Remarques sur les différentielles des polylogarithmes uniformes. Annales de l’Institut Fourier, 46:1327–1347, 1996.
- [7] C. Deninger. $p$-adic entropy and a $p$-adic Fuglede–Kadison determinant. In Y. Tschinkel and Y. Zarhin, editors, Algebra, Arithmetic, and Geometry, volume 269 of Progress in Mathematics, pages 423–442. Birkhäuser, Boston, 2009.
- [8] G. Eisenstein. Neue Gattung zahlentheoretischen Funktionen, die von zwei Elementen abhängen und durch gewisse lineare Funktional-Gleichungen definirt werden. Bericht über die zur Bekanntmachung geeigneten Verhandlungen der Königlich Preussischen Akademie der Wissenschaften zu Berlin, pages 36–42, 1850.
