Faster truncated integer multiplication
David Harvey

TL;DR
This paper introduces faster algorithms for computing specific parts of integer products, reducing computation time to about 75% of full multiplication, under certain algorithmic assumptions.
Contribution
It presents novel algorithms that efficiently compute either the low or high bits of integer products, improving over traditional methods for large integers.
Findings
Algorithms achieve approximately 75% of the time of full multiplication.
Applicable when integer multiplication relies on cyclic convolution computations.
Significantly speeds up partial product computations for large integers.
Abstract
We present new algorithms for computing the low bits or the high bits of the product of two -bit integers. We show that these problems may be solved in asymptotically 75% of the time required to compute the full -bit product, assuming that the underlying integer multiplication algorithm relies on computing cyclic convolutions of real sequences.
| low product | high product | full product | GMP | |||
|---|---|---|---|---|---|---|
| 1 000 000 | 2.90ms | (0.94) | 2.93ms | (0.95) | 3.07ms | 2.68ms |
| 2 154 434 | 7.11ms | (1.02) | 7.27ms | (1.05) | 6.95ms | 6.93ms |
| 4 641 588 | 14.7ms | (0.95) | 15.2ms | (0.99) | 15.4ms | 16.6ms |
| 10 000 000 | 36.2ms | (0.92) | 37.8ms | (0.96) | 39.5ms | 39.1ms |
| 21 544 346 | 88.6ms | (0.89) | 92.6ms | (0.93) | 99.2ms | 99.4ms |
| 46 415 888 | 204ms | (0.90) | 210ms | (0.93) | 227ms | 237ms |
| 100 000 000 | 504ms | (0.84) | 514ms | (0.86) | 598ms | 553ms |
| 215 443 469 | 1.25s | (0.91) | 1.28s | (0.93) | 1.37s | 1.35s |
| 464 158 883 | 2.76s | (0.91) | 2.81s | (0.93) | 3.03s | 3.05s |
| 1 000 000 000 | 6.08s | (0.86) | 6.19s | (0.88) | 7.05s | 6.93s |
| 2 154 434 690 | 13.9s | (0.86) | 14.2s | (0.88) | 16.1s | 17.0s |
| 4 641 588 833 | 33.6s | (1.01) | 34.6s | (1.04) | 33.4s | 38.1s |
| 10 000 000 000 | 109s | (1.37) | 110s | (1.38) | 79.8s | 81.6s |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCoding theory and cryptography · Numerical Methods and Algorithms · Cryptography and Residue Arithmetic
Faster truncated integer multiplication
David Harvey
School of Mathematics and Statistics, University of New South Wales, Sydney NSW 2052, Australia
Abstract.
We present new algorithms for computing the low bits or the high bits of the product of two -bit integers. We show that these problems may be solved in asymptotically of the time required to compute the full -bit product, assuming that the underlying integer multiplication algorithm relies on computing cyclic convolutions of sequences of real numbers.
1. Introduction
Let and let and be integers in the interval . We write for the cost of computing the full product of and , which is just the usual -bit product . Unless otherwise specified, by ‘cost’ we mean the number of bit operations, under a model such as the multitape Turing machine [14].
In this paper we are interested in two types of truncated product. The low product of and is the unique integer in the interval such that , or in other words, the low bits of . We denote the cost of computing the low product by .
The high product of and consists of the high bits of , except that we allow a small error in the lowest bit. More precisely, the high product is defined to be any integer in the range such that . Thus there are at most two possible values for the high product, and an algorithm that computes it is permitted to return either one. We denote the cost of computing the high product by .
There are many applications of truncated products in computer arithmetic. The most obvious example is high-precision arithmetic on real numbers: to compute an -bit approximation to the product of two real numbers with -bit mantissae, we may scale by an appropriate power of two to convert the inputs to -bit integers, and then compute the high product of those integers. Further examples include Barrett’s [1] and Montgomery’s [12] algorithms for modular arithmetic.
It is natural to ask whether a truncated product can be computed more quickly than a full product. This is indeed the case for small : in the classical quadratic-time regime, we can compute a truncated product in about half the time of a full product, because essentially only half of the bit-by-bit products contribute to the desired output.
However, as grows, and more sophisticated multiplication algorithms are deployed, these savings begin to dissipate. Consider for instance Karatsuba’s algorithm, which has complexity for . Mulders showed [13] that Karatsuba’s algorithm may be adapted to obtain bounds for and around . However, it is not known how to reach in this regime.
For much larger values of , the most efficient integer multiplication algorithms known are based on FFTs (fast Fourier transforms). Currently, the asymptotically fastest such algorithm has complexity [10], and it is widely believed that this bound is optimal up to a constant factor.
It has long been thought that the best way to compute a truncated product using FFT-based algorithms is to simply compute the full product and then discard the unwanted part of the output. One might be able to save bit operations compared to the full product by skipping computations that do not contribute to the desired half of the output, but no bounds of the type or have been proved for any constant .
For some closely related problems, one can actually prove that it is not possible to do better than computing the full product. For example, in a suitable algebraic model, the multiplicative complexity of any algorithm that computes the low coefficients of the product of two polynomials of degree less than is at least [4, Thm. 17.14], which is the same as the multiplicative complexity of the full product. By analogy, one might expect the same sort of lower bound to apply to truncated integer multiplication.
In this paper we show that this belief is mistaken: we present algorithms that compute high and low products of integers in asymptotically of the time required for a full product. The new algorithms require that the underlying integer multiplication is carried out via a cyclic convolution of sequences of real numbers. This includes any real convolution algorithm based on FFTs, and in particular the multiplication algorithm of [10].
Unfortunately, because the new methods rely heavily on the archimedean property of , we do not yet know how to obtain this 25% reduction in complexity for arbitrary integer multiplication algorithms. In particular, we are currently unable to establish analogous results for integer multiplication algorithms based on FFTs over other rings, such as finite fields [15].
Although we focus on time complexity in this paper, the new techniques also have implications for space complexity. For example, to multiply two floating-point numbers with -bit mantissae using standard FFT methods, the transform of each multiplicand occupies roughly bits of storage. This is true regardless of the type of FFT algorithm used; it holds for FFTs over finite fields just as well as for real or complex FFTs. Using the new methods, the storage required drops to roughly bits. This improvement in space complexity may be significant in applications where storage is the bottleneck, such as extremely high-precision evaluation of numerical constants such as . Furthermore, if the computation is I/O-bound, then this may lead directly to a corresponding improvement in computation time. We will not explore this issue further in this paper.
The remainder of the paper is structured as follows. In Section 2 we state our main results precisely, after giving some preliminary definitions. Section 3 presents the new algorithm for the low product, including the proof of correctness and complexity analysis. Section 4 does the same for the high product. Section 5 gives some performance data for an implementation of the new algorithms.
Historical note. An earlier version of this paper pointed out that the new methods for truncated multiplication may be used to design an integer multiplication algorithm having complexity with . At the time, this was the best known asymptotic bound for . This result was subsequently superseded by [8] (), [9] (), and then [10] ().
2. Setup and statement of results
2.1. Fixed point arithmetic and real convolutions
We write for . To simplify analysis of numerical error, all algorithms are assumed to work with the following fixed-point representation for real numbers. (See [11, §3] for a more detailed treatment.) Let be a precision parameter. We write for the set of real numbers of the form where is an integer in the interval . Thus models the unit interval , and elements of are represented using bits of storage. For , we write for the set of real numbers of the form where . An element of is represented simply by its mantissa in ; the exponent is always known from context, and is not explicitly stored.
We will frequently work with quotient rings of the form where is some fixed monic polynomial of positive degree, such as . If and , we write for the coefficients of with respect to the standard monomial basis; that is, . For such we define a norm
[TABLE]
We write for the set of polynomials whose coefficients lie in ; this is a slight abuse of notation, as is not really a ring. Algorithms always represent such a polynomial by its coefficient vector .
We assume that we have available a subroutine Convolution with the following properties. It takes as input two parameters and , and polynomials
[TABLE]
Let ; more explicitly,
[TABLE]
Then Convolution is required to output a polynomial
[TABLE]
such that
[TABLE]
In other words, Convolution computes a -bit approximation to the cyclic convolution of two real input sequences of length .
We write for the bit complexity of Convolution. We treat this routine as a black box; its precise implementation is not important for our purposes. A typical implementation would execute a real-to-complex FFT for each input sequence, multiply the Fourier coefficients pointwise, and then compute an inverse complex-to-real transform to recover the result. Internally, it should work to precision slightly higher than to control rounding errors during intermediate computations. (For an explicit error bound, see for example [3, Theorem 3.6].) The routine may completely ignore the exponent parameter .
2.2. The full product
For completeness, we recall the well-known algorithm that uses Convolution to compute the full product of two -bit integers (Algorithm 2.1 below). It depends on two parameters: a chunk size , and a transform length , where . The idea is to cut the integers into chunks of bits, thereby reducing the integer multiplication problem to the problem of multiplying two polynomials in modulo .
We will not discuss in this paper the question of optimising the choice of and . The optimal choice of will involve some balance between making as close to as possible, but also ensuring that is sufficiently smooth (has only small prime factors) so that FFTs of length are as efficient as possible. (An alternative approach is to use “truncated FFTs” [18], which eliminates the need to choose a smooth transform length. However, this makes no difference asymptotically. Despite the overlapping terminology, it is not clear whether the new truncated multiplication algorithms can be adapted to the case of truncated FFTs. This is an interesting question for future research.)
Theorem 2.1** (Full product).**
Let , and let and be -bit integers. Let and be integers such that . Then Algorithm 2.1 correctly computes the full product of and . Assuming that , its complexity is
[TABLE]
Proof.
The condition ensures that the decompositions of and into and in line 2 are legal. Let
[TABLE]
and
[TABLE]
Note that \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}} and \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}} are the images of and in , and by construction and . Since has degree at most , it is determined by its remainder modulo . Line 3 computes an approximation \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{W}} to this remainder with
[TABLE]
for each . The function in line 4 rounds its argument to the nearest integer, with ties broken in either direction as convenient. Since , we deduce that \operatorname{round}(\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W_{i}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W_{i}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{W_{i}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{W_{i}}})=w_{i} for each ; hence line 4 returns .
The main term in the complexity bound arises from the Convolution call in line 3. The secondary term consists of the splitting step in line 2, which costs bit operations per coefficient, and the overlap-add procedure in line 4, which requires bit operations per coefficient. ∎
2.3. Statement of results
The main results of this paper are the following analogues of Theorem 2.1 for the low product and high product. These are proved in Section 3 and Section 4 respectively.
Theorem 2.2** (Low product).**
Let , and let and be -bit integers. Let and be integers such that . Then Algorithm 3.1 (see §3) correctly computes the low product of and . Assuming that , its complexity is
[TABLE]
Theorem 2.3** (High product).**
Let , and let and be -bit integers. Let and be integers such that . Then Algorithm 4.1 (see §4) correctly computes the high product of and . Assuming that , its complexity is
[TABLE]
Comparing these complexity bounds to Theorem 2.1 (the full product), we observe that the convolution length has dropped from to , but the working precision has increased from roughly to roughly . To understand the implications for the overall complexity, we need to make further assumptions on the behaviour of . We consider two scenarios.
Scenario #1: asymptotic behaviour as . Assume that the transform length is restricted to suitably smooth values, such as the ultrasmooth numbers defined by Bernstein [2], so that asymptotically 100% of the FFT work is performed using radix-2 transforms. Assume also that the working precision is exponentially smaller than , but somewhat larger than , say . Under these assumptions it is reasonable to expect that the complexity of the underlying real convolution is quasi-linear with respect to the the total bit size . In particular, for any absolute constants and , where is an integer, we should have
[TABLE]
This is the case for all FFT-based convolution algorithms known to the author.
Now, given some large , assume that we choose and such that and , as is done for example in [11, §6]. Then
[TABLE]
so according to Theorem 2.1 we have
[TABLE]
On the other hand, for the low product, Theorem 2.2 yields
[TABLE]
We conclude that
[TABLE]
justifying our assertion that asymptotically the new low product algorithm saves 25% of the total work compared to the full product. Similar remarks apply to the high product.
Scenario #2: fixed word size. Now let us consider the situation faced by a programmer working on a modern microprocessor with hardware support for a fixed word size, such as the 53-bit double-precision floating point type provided by the IEEE 754 standard. In this setting, the Convolution subroutine takes as input two vectors of coefficients represented by this data type, and computes their cyclic convolution using some sort of FFT, taking direct advantage of the hardware arithmetic. We assume that is chosen as large as possible so that the FFTs can be performed in this way; for example, under IEEE 754 we would require that for the low product, where is an allowance for numerical error. Obviously in this scenario it does not make sense to allow , and it also does not quite make sense to measure complexity by the number of “bit operations”. Instead, should be restricted to lie in some finite (possibly quite large) range, and a more natural measure of complexity is the number of word operations (ignoring issues such as locality and parallelism).
We claim that it is still reasonable to expect a reduction in complexity close to 25%. To see this, consider a full product computation for a given , with splitting parameters and . Let and be the splitting parameters for the corresponding truncated product, for the same value of . We should choose around to ensure that we still take maximum advantage of the available floating-point type. Then we should choose around to compensate for the smaller chunks. Now observe that (for large ) the bulk of the work for the full product consists of FFTs of length , but for the truncated products the FFT length is reduced to around . Since the FFTs run in quasilinear time (i.e., word operations), we expect to see roughly 25% savings.
In practice the expected 25% speedup will be tempered somewhat by the additional linear-time work inherent in the truncated product algorithms, such as the evaluation of and in Algorithm 3.1. The situation is also complicated by the fact that we are constrained to choose smooth transform lengths. Section 5 gives some timings, showing the speedup actually achieved by an implementation.
3. The low product
The aim of this section is to prove Theorem 2.2. Throughout the section we fix integers
[TABLE]
as in the statement of the theorem.
3.1. The cancellation trick
The key to the new low product algorithm is the following simple observation.
Proposition 3.1**.**
Let with , say
[TABLE]
Let be the remainder on dividing by
[TABLE]
with . Then and
[TABLE]
Proof.
Write
[TABLE]
Since , we have
[TABLE]
The polynomial on the right hand side has degree at most , so we deduce that
[TABLE]
This shows that , and the result follows on substituting . ∎
Later we will apply Proposition 3.1 to a polynomial analogous to the encountered earlier in the proof of Theorem 2.1. The proposition shows that after reducing modulo and making the substitution , the term in causes the unwanted high-order coefficients of to disappear; see Figure 1. An alternative point of view is that polynomial multiplication modulo corresponds roughly to integer multiplication modulo
[TABLE]
To make use of Proposition 3.1 to compute a low product, we must compute exactly. Note that the coefficients of lie in rather than . Consequently, to compute , we must increase the working precision by bits compared to the precision used in the full product algorithm. This is why the precision parameter in Theorem 2.2 (and Theorem 2.3) is rather than .
3.2. The roots of
In this section we study the complex roots of the special polynomial introduced in Proposition 3.1. For , let denote the open disc .
Lemma 3.2**.**
The roots of lie in , and they are all simple.
Proof.
If is a root of and , then (3.1) implies that
[TABLE]
which is impossible.
Any multiple root of would have to satisfy
[TABLE]
and hence
[TABLE]
This implies that , contradicting . ∎
Now consider the function
[TABLE]
where means the branch that maps to .
Lemma 3.3**.**
The function maps roots of to roots of .
Proof.
If is a root of , then
[TABLE]
In fact, always sends a root of to the root of nearest to it, but we will not prove this. Figure 2 illustrates the situation for and , showing that the roots of are very close to those of . (For the roots are already too close together to distinguish at this scale.)
For any , the binomial theorem implies that is represented on by the series
[TABLE]
where
[TABLE]
In particular, the first few terms of are given by
[TABLE]
We will need to construct an explicit functional inverse for , in order to map the roots of back to the corresponding roots of . Let be the formal power series inverse of , i.e., so that
[TABLE]
The coefficients of , and of its powers, are given as follows.
Lemma 3.4**.**
For any we have (formally)
[TABLE]
where and
[TABLE]
In particular, the first few terms of are
[TABLE]
Proof.
By the Lagrange inversion formula [17, Thm. 5.4.2], for any we have
[TABLE]
Taking , for any , yields
[TABLE]
Remark 3.5*.*
It is also possible to write down an explicit formula for when , but the above argument fails because is zero when . To handle the case one needs a slightly stronger form of the Lagrange inversion formula; see for example [6, Thm. 2.1.1]. In this paper we only need the case .
The next result gives some simple bounds for the coefficients and .
Lemma 3.6**.**
For all and we have
[TABLE]
Proof.
The bounds are trivial for , so assume that . For we have
[TABLE]
For , observe that
[TABLE]
where . We have and
[TABLE]
since and (see (3.1)); hence . Thus
[TABLE]
Corollary 3.7**.**
The series for and converge on , and
[TABLE]
Proof.
We already know that converges on , and the convergence of on follows from Lemma 3.6. If , then
[TABLE]
where the last inequality follows from (3.1). This shows that maps into . A similar argument shows that maps into . Since both and map into the disc of convergence of the other, and since they are inverses formally, they must be inverse functions in the sense of (3.2). ∎
Corollary 3.8**.**
The functions and induce mutually inverse bijections between the roots of and the roots of .
Proof.
By Lemma 3.2, the polynomial has distinct roots in . Lemma 3.3 shows that maps to roots of , and the images must be distinct because Corollary 3.7 implies that is injective on . Since has exactly roots, every root of must be the image of some , and then must map this root back to . ∎
3.3. Ring isomorphisms
The aim of this section is construct a pair of mutually inverse ring isomorphisms
[TABLE]
In the main low product algorithm, the role of these maps will be to convert the problem of multiplying two polynomials modulo into an ordinary cyclic convolution.
The idea of the construction is that for , we want to define to be the composition , regarded as a polynomial modulo , and similarly for . However, some care is required in interpreting the expression , as is not a polynomial, but rather a power series. To make this definition precise, we proceed as follows.
For each define linear maps
[TABLE]
by the formulas
[TABLE]
These maps satisfy the following norm bounds. (Recall that the norm on polynomials is defined as in (2.1).)
Lemma 3.9**.**
For any and ,
[TABLE]
Proof.
By definition, where
[TABLE]
For any we have , because multiplication by simply permutes the coefficients cyclically. Applying this observation to repeatedly, and recalling Lemma 3.6, we find that
[TABLE]
Lemma 3.10**.**
For any and ,
[TABLE]
Proof.
The argument is similar to Lemma 3.9, the main difference being that multiplication by modulo is slightly more complicated than a cyclic permutation. Let . Since , we have
[TABLE]
so .
Now, since where , using Lemma 3.6 we obtain
[TABLE]
We may now define the maps and in (3.3) by setting
[TABLE]
Lemma 3.9 and Lemma 3.10 guarantee that these series converge coefficientwise, so and are well-defined, and they are clearly linear maps. Moreover, we immediately obtain the following estimates concerning the partial sums of the series.
Lemma 3.11**.**
For any and any integer we have
[TABLE]
Proof.
For the first claim, observe that
[TABLE]
by Lemma 3.9 and (3.1). The second estimate is proved in a similar way. ∎
Lemma 3.12**.**
For any and any integer we have
[TABLE]
Proof.
Follows from Lemma 3.10, similarly to the proof of Lemma 3.11. ∎
Now we can establish that and behave like the desired compositions and .
Lemma 3.13**.**
Let , and let be a root of . Then
[TABLE]
Remark 3.14*.*
The expression is well-defined since is a root of (see Corollary 3.8).
Proof.
By the definition of , and since is a root of , we have
[TABLE]
Thus
[TABLE]
Lemma 3.15**.**
Let , and let be a root of . Then
[TABLE]
Proof.
Similar to the proof of Lemma 3.13. ∎
Corollary 3.16**.**
The maps and are mutually inverse ring isomorphisms between and .
Proof.
We have already pointed out that and are linear; to show that they are ring homomorphisms we must show that they also respect multiplication. Lemma 3.13 implies that for any and any root of , we have
[TABLE]
Since a polynomial in is determined by its values at the roots of , this shows that , and hence that is a ring homomorphism. A similar argument using Lemma 3.15 shows that is a ring homomorphism.
To show that and are inverses, let and let be a root of . Corollary 3.8 implies that
[TABLE]
Since this holds for all roots of , we see that . A similar argument shows that for all . ∎
Finally, we have the following two results concerning the complexity of approximating and .
Proposition 3.17** (Approximating ).**
Given as input , we may compute such that
[TABLE]
in bit operations, assuming that .
Note that the output coefficients can indeed be represented in thanks to the bound (Lemma 3.11 with ). A similar remark applies to Proposition 3.18 below (via Lemma 3.12).
Proof.
Let ; the hypothesis implies that . According to Lemma 3.11,
[TABLE]
and
[TABLE]
To compute the desired such that , it suffices to ensure that satisfies
[TABLE]
This may be accomplished by simply evaluating the sum directly from the definition, with a sufficiently high working precision.
In more detail, we first calculate the coefficients , for each and . Each one requires operations in , using the usual formula for the binomial coefficients. Next we compute the coefficients of the polynomials
[TABLE]
and so on, up to . This costs altogether operations in . Taking the sum of these polynomials costs another operations in . To ensure that (3.4) holds, it suffices to perform all of these operations with a working precision of significant bits. The details of this error analysis are routine and are omitted. Each such addition, multiplication or division in costs bit operations, leading to the claimed complexity bound. ∎
Proposition 3.18** (Approximating ).**
Given as input , we may compute such that
[TABLE]
in bit operations, assuming that .
Proof.
Taking , the proof proceeds along similar lines to that of Proposition 3.17, replacing the use of Lemma 3.11 by Lemma 3.12. The main difference is that the reductions modulo lead to slightly more complicated formulas. For example, we have
[TABLE]
The terms with the minus signs are those arising from the term in . Overall, there are no more than of these additional terms compared to the proof of Proposition 3.17. ∎
Remark 3.19*.*
In the estimates given above, such as Lemma 3.6 and Lemma 3.9, we have opted for shorter proofs rather than the sharpest possible bounds. With more effort, one could prove tighter bounds; this might save a few bits in the main algorithm, but does not affect the asymptotic conclusions of the paper. Similar remarks apply to the high product algorithm in Section 4.
3.4. The main algorithm
We are now in a position to state Algorithm 3.1 and prove the main theorem concerning the computation of the low product.
Proof of Theorem 2.2.
As in the proof of Theorem 2.1, let
[TABLE]
so that and , and let
[TABLE]
The polynomials \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}} and \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}} in line 2 are just the images of and in . Our goal is to compute , the remainder on dividing by , as in Proposition 3.1. By definition this is equal to \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}}.
Line 3 computes approximations \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{U}} and \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{V}} to \alpha^{*}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}} and \alpha^{*}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}}. Line 4 computes \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{W}}, an approximation to \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{U}}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{V}} (the cyclic convolution of \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{U}} and \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{V}}). Observe that
[TABLE]
In this calculation we have used the fact that is a ring homomorphism (Corollary 3.16), and that for any . By Lemma 3.11 (with ) we have
[TABLE]
so
[TABLE]
Line 5 computes \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{W}}, an approximation to \beta^{*}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{W}}. Since and are inverses (Corollary 3.16), Lemma 3.12 implies that
[TABLE]
where the last inequality follows from our choice of in line 1.
On the other hand, we know from Proposition 3.1 that has integer coefficients, so we deduce that \operatorname{round}(2^{b}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W_{i}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W_{i}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{W_{i}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{W_{i}}})=2^{b}L_{i} for each . Therefore the sum in line 6 is equal to ; by Proposition 3.1 this is equal to
[TABLE]
This congruence also holds modulo as .
The main term in the complexity bound (2.2) arises from the Convolution call in line 4. The splitting and overlap-add steps in lines 2 and 6 contribute bit operations, as (by hypothesis), and the invocations of Proposition 3.17 and Proposition 3.18 in lines 3 and 5 contribute another bit operations. ∎
4. The high product
The discussion for the high product runs along similar lines to the low product, with one additional technical complication. The polynomial that naturally replaces in the cancellation trick (see Proposition 4.2) has roots near the roots of , just like , but it also has a real root near . Some extra work is needed to handle this additional root.
Remark 4.1*.*
The asymmetry between the high and low products is somewhat mysterious. Perhaps it is related to the fact that in integer arithmetic, carries always propagate towards the most significant bits. The author has so far been unable to find a way of avoiding the annoying additional root.
Throughout this section we continue to assume that (3.1) holds, i.e., that and .
4.1. The cancellation trick
We begin with a suitable analogue of Proposition 3.1. To motivate our strategy, recall that the cancellation trick for the low product relied on the fact that
[TABLE]
Working modulo has the effect of shifting the high-order coefficients downwards by coefficients, while at the same time multiplying them by , so that they will later cancel out when we evaluate at . For the high product, we want to instead multiply the low-order coefficients by (to make them later cancel out), and simultaneously shift the high-order coefficients downward by coefficients. We can accomplish these goals together by multiplying by modulo a polynomial with the property that
[TABLE]
More precisely, we have the following result.
Proposition 4.2**.**
Let with , say
[TABLE]
Let be the remainder on dividing by
[TABLE]
with . Then and
[TABLE]
Proof.
Write
[TABLE]
Multiplying by , and using the congruence , we obtain
[TABLE]
The polynomial on the right hand side has degree at most , so we deduce that
[TABLE]
This shows that , and the result follows on substituting . ∎
4.2. The roots of
The next result isolates the auxiliary real root of .
Lemma 4.3**.**
The polynomial has a unique real root in the interval
[TABLE]
In particular,
[TABLE]
Remark 4.4*.*
It turns out that is very close to , i.e., the midpoint of the interval (4.1). In fact, one can develop a series expansion for , whose first few terms are given by
[TABLE]
but we will not prove this.
Proof.
It is convenient to make the transformation
[TABLE]
Our goal is to show that has a unique real root in the interval .
Clearly . We claim that . To see this, observe that
[TABLE]
From (3.1) we have and then , so indeed . By the intermediate value theorem, has at least one root in .
To prove that there is exactly one root, we will show that throughout the interval. We have , so it suffices to show that , i.e., that . This is clear as .
Finally, (4.2) follows immediately from (4.1), after taking into account (3.1). ∎
We note for future use the identity
[TABLE]
which follows immediately from the fact that .
Next consider the polynomial
[TABLE]
The coefficients of are given explicitly as follows.
Lemma 4.5**.**
We have
[TABLE]
Proof.
First observe that
[TABLE]
From (4.3) we have and hence
[TABLE]
Lemma 4.6**.**
The roots of lie in , and they are all simple.
Proof.
If is a root of and , then
[TABLE]
so by (4.2) and (3.1) we obtain
[TABLE]
If had a multiple root, say , then would also be a multiple root of . This would imply that
[TABLE]
which in turn forces (since clearly ). This contradicts the previous paragraph, as does not lie in . ∎
Lemma 4.3 and Lemma 4.6 together imply that has distinct roots, namely, the roots of , and the auxiliary root . Figure 3 illustrates the case , .
Now consider the function
[TABLE]
This is the same as the definition of in Section 3.2, except that the exponent has been replaced by . The roots of lie well within the domain of definition of . The auxiliary root is also inside the domain, but lies very close to the boundary.
Lemma 4.7**.**
The function maps roots of to roots of .
Proof.
If is a root of , then
[TABLE]
Of course, cannot yield a bijection between the roots of and those of , as has too many roots. In a moment we will see that we do get a bijection if we restrict to the roots of . (It turns out that , so maps precisely two roots of to . We omit the easy proof.)
For any , the function is represented on by the series
[TABLE]
where
[TABLE]
Again, is identical to , except that has the opposite sign. The first few terms in the expansion of are
[TABLE]
Let be the formal series inverse of .
Lemma 4.8**.**
For any we have (formally)
[TABLE]
where and
[TABLE]
In particular, the first few terms of are
[TABLE]
Proof.
Same as the proof of Lemma 3.4, with replaced by everywhere. ∎
Lemma 4.9**.**
For all and we have
[TABLE]
Proof.
The bound for follows by the same argument used for in the proof of Lemma 3.6. For , observe that for we have
[TABLE]
Stirling’s formula implies that , so since we obtain
[TABLE]
Remark 4.10*.*
The constant in the above bound for ensures that the statement is correct for all and , but asymptotically speaking it is not really necessary. In fact one can prove that for any , there exist and such that for all , whenever and .
Corollary 4.11**.**
The series for and converge on and respectively, and
[TABLE]
Proof.
Same as the proof of Corollary 3.7, first using Lemma 4.9 to show that maps into and that maps into . ∎
Corollary 4.12**.**
The functions and induce mutually inverse bijections between the roots of and the roots of .
Proof.
Similar to the proof of Corollary 3.8. ∎
4.3. Ring isomorphisms
In this section we will first construct maps
[TABLE]
analogous to the maps and defined in Section 3.3. Note that these maps do not yet take into account the auxiliary root .
For each define linear maps
[TABLE]
by the formulas
[TABLE]
As in Section 3.3 we have the following norm bounds.
Lemma 4.13**.**
For any and ,
[TABLE]
Proof.
Similar to the proof of Lemma 3.9, using Lemma 4.9 to bound the series coefficients. ∎
Lemma 4.14**.**
For any and ,
[TABLE]
Proof.
As in the proof of Lemma 3.10, we must first work out the effect of multiplication by modulo . Let . The formula (4.4) implies that
[TABLE]
We have and for (due to (4.2) and (3.1)), so we find that .
The rest of the argument is the same as the proof of Lemma 3.10, noting that for , and using Lemma 4.9. ∎
We now define and by setting
[TABLE]
The next five statements are proved along the same lines as the corresponding results in Section 3.3, i.e., from Lemma 3.11 up to Corollary 3.16.
Lemma 4.15**.**
For any and any integer , we have
[TABLE]
Lemma 4.16**.**
For any and any integer , we have
[TABLE]
Lemma 4.17**.**
Let , and let be a root of . Then
[TABLE]
Lemma 4.18**.**
Let , and let be a root of . Then
[TABLE]
Corollary 4.19**.**
The maps and are mutually inverse ring isomorphisms between and .
Now we bring back into the picture. We will define maps
[TABLE]
in terms of and , as follows.
First, for , we define
[TABLE]
The map is a linear isomorphism, thanks to the Chinese remainder theorem applied to the relatively prime moduli and . However, is not quite a ring isomorphism, i.e., is not multiplicative, due to the scaling factor . Note that the second component of may be written more explicitly as follows: if , then
[TABLE]
In the other direction, for
[TABLE]
we define to be the unique polynomial such that
[TABLE]
Again, is a linear isomorphism, but not a ring isomorphism.
In fact, and are not even inverse to each other. Instead, they have been cooked up to satisfy the following relation.
Lemma 4.20**.**
For any we have
[TABLE]
(The reason for including the factor is to prepare for the use of Proposition 4.2 in the main high product algorithm.)
Proof.
It is enough to check the equality modulo and modulo . It holds modulo according to Corollary 4.19:
[TABLE]
It holds modulo thanks to (4.3):
[TABLE]
Define a norm on by taking
[TABLE]
Then and satisfy the following norm bounds.
Lemma 4.21**.**
For any we have
[TABLE]
Proof.
Let . Using (4.4), we find that the reduction of modulo is given by
[TABLE]
so by (4.2) we obtain
[TABLE]
Lemma 4.15 (with ) then yields
[TABLE]
We also have
[TABLE]
Together these inequalities show that . ∎
Lemma 4.22**.**
For any we have
[TABLE]
Proof.
We may write down an explicit formula for as follows. Let
[TABLE]
let \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{H}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{H}}\in\mathbf{R}[X] be the unique polynomial of degree less than such that \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{H}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{H}}\equiv H\pmod{C(X)}, and let
[TABLE]
Then we claim that
[TABLE]
To prove (4.7), it suffices to verify that it holds modulo and modulo . For this is clear as by the definition of . It holds modulo because by (4.3) and the definition of we have
[TABLE]
Our goal is now to estimate the size of the coefficients of the polynomial
[TABLE]
Note that has degree at most , so its coefficients are exactly the same as those of .
Let us first estimate . Write , so that also \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{H}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{H}}(X)=H_{0}+\cdots+H_{N-1}X^{N-1}. We have
[TABLE]
By Lemma 4.16 (with ) we have . Using (4.2) and (3.1) we obtain
[TABLE]
[TABLE]
Therefore
[TABLE]
Since and (again from (4.2) and (3.1)), we find that
[TABLE]
so
[TABLE]
Now we may estimate the size of the coefficients of . The coefficients of (1-2^{-b}X)\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{H}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{H}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{H}}(X) are bounded in absolute value by , and from (4.4) we see that the coefficients of are bounded in absolute value by . Therefore, using again (3.1) and (4.2) we find that the coefficients of are bounded in absolute value by
[TABLE]
Next, we exhibit efficient algorithms for approximating and .
Proposition 4.23** (Approximating ).**
Given as input , we may compute
[TABLE]
such that
[TABLE]
in bit operations, assuming that .
Proof.
We first remark that since , we may precompute to a precision of significant bits using only operations in , by using Newton’s method to numerically solve the equation , starting with the initial approximation . (See also Remark 4.25 below.)
Now, given as input as above, we first compute an approximation to using the formula (4.5). The hypothesis , together with the rapid decay of the coefficients of , implies that this may be done using operations in . We may then compute the desired approximation to using the same method as in the proof of Proposition 3.17, at a cost of operations in . Finally, we may easily compute the desired approximation to using another operations in . ∎
Proposition 4.24** (Approximating ).**
Given as input and , we may compute
[TABLE]
such that
[TABLE]
in bit operations, assuming that .
Proof.
The algorithm amounts to evaluating the explicit formula (4.7). We first approximate using the same method as in the proof of Proposition 3.18, at a cost of operations in . (This requires more operations than the corresponding algorithm for , because the reductions modulo involve a few more terms than those modulo .) We then approximate at a cost of operations, and evaluate (4.7) in another operations. ∎
Remark 4.25*.*
In practice we always have , and this assumption allows several simplifications to be made to the algorithms in Proposition 4.23 and Proposition 4.24. First, the trivial approximation is already correct to significant bits, so Newton’s method is not required. In addition, we have , so instead of the complicated formula (4.6) for , we may simply use the approximation .
Finally we may state the main high product algorithm, and prove the main theorem concerning its correctness and complexity.
Proof of Theorem 2.3.
Line 2 decomposes and into chunks of bits. The splitting boundaries are different to those used for the full product and low product: here consists of the most significant bits of , then the next lower bits, and so on. The hypothesis ensures that this splitting is possible.
As in the proof of Theorem 2.1, let
[TABLE]
so that
[TABLE]
Let
[TABLE]
The polynomials \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}} and \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}} in line 2 are just the images of and in . Our goal is to compute , the remainder on dividing by , as in Proposition 4.2. By definition this is equal to (1-2^{-b}X)\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}}\pmod{B(X)}.
Line 3 computes approximations (\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{U}},\theta_{U}) and (\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{V}},\theta_{V}) to \gamma^{\dagger}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}} and \gamma^{\dagger}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}}. Line 4 computes \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{W}}, an approximation to \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{U}}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{V}}, and , an approximation to . The latter involves just a single real multiplication. Let us write \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}}^{\prime} and \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}}^{\prime} for the images of \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}} and \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}} in . A similar calculation to that used in the proof of Theorem 2.2 shows that
[TABLE]
and that
[TABLE]
These two inequalities may be expressed more briefly in combination by writing
[TABLE]
Line 5 computes an approximation \mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{W}} to \delta^{\dagger}(\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.8125pt}{\wtildesym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-4.06876pt}{\wtildesym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.90625pt}{\wtildesym}}}}{W}},\theta_{W}). Using Lemma 4.20 and Lemma 4.22 we find that
[TABLE]
As noted earlier, the coefficients of are exactly those of (1-2^{-b}X)\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{U}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{U}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{U}}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{V}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{V}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{V}}. We know from Proposition 4.2 that has integer coefficients, so we deduce that \operatorname{round}(2^{b}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\textstyle\text{\smash{\raisebox{-5.38193pt}{\wbarsym}}}}{W}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.76735pt}{\wbarsym}}}}{W}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.69096pt}{\wbarsym}}}}{W}}_{i})=2^{b}H_{i} for each . Thus the value computed in line 6 satisfies
[TABLE]
Applying Proposition 4.2, we obtain
[TABLE]
We have for , since each is a sum of terms of the form , so
[TABLE]
The hypothesis then yields
[TABLE]
Let be the value returned in line 7. Since , the inequality (4.8) implies that , and hence that . Moreover, since , we conclude that
[TABLE]
as desired. The running time analysis is essentially the same as in the proof of Theorem 2.2. ∎
5. Implementation and performance
We wrote an implementation of the new truncated product algorithms in the C programming language, together with a comparable implementation of the full product, to examine to what extent the predicted 25% reduction in complexity can be realised in practice. The source code is available from the author’s web page under a free software license.
The timings reported in Table 1 were run on a single core of an otherwise idle 2.5GHz Intel Xeon Gold 6248 (Cascade Lake microarchitecture), running Rocky Linux 8.8 (kernel version 4.18.0). We compiled our program using GCC 12.2.0 with the optimisation flags -O3 -mavx2 -mavx512f -ffast-math. In the critical inner loops, our code uses GCC’s vector extensions to take advantage of the AVX2 instruction set available on the target platform.
For the real convolutions, our code relies on the one-dimensional real-to-complex and complex-to-real transforms provided by the FFTW library (version 3.3.10) [5]. We configured FFTW using the --enable-avx2 --enable-avx512 flags, and used FFTW’s “wisdom” facility with the FFTW_MEASURE option to find efficient transform sequences for all relevant transform lengths.
Our implementation differs from the theoretical presentation in Section 3 and Section 4 in several respects:
- •
Instead of fixed point arithmetic, we use double-precision floating point (the double data type in C). In particular, this applies to the routines that compute , , and , and also the FFTs and pointwise multiplications. (The splitting and overlap-add steps are handled using integer arithmetic.) We make no attempt to prove any bounds for round-off error. This is impossible anyway in the context of our program, as FFTW does not offer any error guarantees.
- •
In the splitting step we allow signed coefficients. For example, we write where the coefficients of are integers lying in the balanced interval . This leads to less coefficient growth in the product : instead of these coefficients having roughly bits, for uniformly random inputs they tend to have around bits, due to cancellation between the positive and negative terms. Of course, an adversary could easily choose inputs for which every and is close to , in which case the product coefficients will have close to bits. In this case our program will certainly produce incorrect output, unless we decrease to compensate.
Table 1 shows timings for our low product, high product, and full product routines for various choices of (with parameters as indicated in Table 2), as well as timings for the full product computed by the mpz_mul function from the GMP multiple-precision arithmetic library (version 6.2.1) [7]. The reported timings are averages taken over numerous tests; for each entry in the table, the measurements are quite stable, with standard deviation around 1–2% of the average time.
Comparing the performance of our code against GMP is not quite fair, because in principle GMP performs a provably correct computation, whereas the output of our program is not provably correct (as explained above). Nevertheless, the timings demonstrate that our code is competitive with the highly optimised multiplication routines in GMP.
For each shown in Table 1, we chose the parameters as follows. We ran a large number of tests to determine the maximum possible for which the program consistently produces the correct output for uniformly random inputs and . The parameter refers to the number of terms used in the approximation of the ring isomorphisms such as ; it has the same meaning as in the proof of Proposition 3.17. Again, we chose by empirical testing, taking the smallest value that led to consistently correct output. Regarding the choice of , we examined several possible candidates, namely those of the form where and , and chose the candidate that led to the fastest timings. (In principle the choice of could also affect correctness, due to different FFT algorithms being used for different , but in practice we found it did not make a difference.) The resulting values of , and are shown in Table 2; these are the values that were used to produce the timings in Table 1.
The final column in Table 2 gives the ratio between the transform length used for the truncated products (the column labelled ) and the corresponding transform length for the full product (the column labelled ). As discussed in §2.3 (scenario #2), we expect this ratio to be close to . The observed values are reasonably close to , but there is some variation due to the sparsity of available transform lengths. We also predicted in §2.3 that the ratio of the corresponding values of should be about ; this is borne out clearly in Table 2.
Finally, let us discuss the main quantity of interest, namely, the ratio of the running times between the truncated products and the full product. These ratios are shown in parentheses in Table 1. In an ideal world these ratios would be close to . Unfortunately, the values shown in the table fall somewhat short of this goal. As increases, the performance appears to pass through three distinct phases:
- •
In the first few rows of the table, the ratio is close to . For these small values of , the savings from the shorter transform lengths in the truncated products are outweighed by the additional cost of evaluating the ring isomorphisms.
- •
There is a wide range of intermediate values of where the ratios for the low product lie roughly between 0.85 and 0.9, i.e., we see a speedup of 10–15% compared to the full product. The high product lags behind by about 2–3%.
- •
In the last two rows of the table, there is a serious degradation in performance. This appears to be mainly due to problems with FFTW’s handling of large transforms of composite length.
For example, for the low product in the last line, the FFTs of size account for roughly 87s of the total 109s, whereas for the corresponding full product, the FFTs of length account for about 63s out of the total 80s. One would normally expect the ratio of these FFT times to be about rather than . We did not explore the underlying reasons for this discrepancy, but we speculate that it is caused by suboptimal locality in the algorithms that FFTW uses to decompose a large composite-length FFT into smaller transforms.
In summary, our implementation shows that it is possible to achieve a nontrivial speedup for the computation of truncated products, over a wide range of values of . The author is hopeful that the running times may be improved further by more careful optimisation work, especially in the subroutines for computing the ring isomorphisms. It also seems likely that the performance for large values of may be improved by suitable tweaking of the underlying FFT algorithms.
Acknowledgments
The author thanks Joris van der Hoeven for his comments on a draft of this paper. The performance tests were carried out on the Katana computing cluster at UNSW [16]; many thanks to Martin Thompson for technical support. The author was supported by the Australian Research Council, grants DP150101689 and FT160100219.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] P. Barrett, Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor , Advances in cryptology—CRYPTO ’86 (Santa Barbara, Calif., 1986), Lecture Notes in Comput. Sci., vol. 263, Springer, Berlin, 1987, pp. 311–323. MR 907099 (88i:94015)
- 2[2] D. Bernstein, Removing redundancy in high-precision Newton iteration , unpublished, available at http://cr.yp.to/papers.html#fastnewton , 2004.
- 3[3] R. P. Brent and P. Zimmermann, Modern computer arithmetic , Cambridge Monographs on Applied and Computational Mathematics, vol. 18, Cambridge University Press, Cambridge, 2011. MR 2760886
- 4[4] P. Bürgisser, M. Clausen, and M. A. Shokrollahi, Algebraic complexity theory , Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 315, Springer-Verlag, Berlin, 1997, With the collaboration of Thomas Lickteig. MR 1440179 (99c:68002)
- 5[5] M. Frigo, A Fast Fourier Transform Compiler , Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation (New York, NY, USA), PLDI ’99, ACM, 1999, pp. 169–180.
- 6[6] I. M. Gessel, Lagrange inversion , J. Combin. Theory Ser. A 144 (2016), 212–249. MR 3534068
- 7[7] T. Granlund, The GNU Multiple Precision Arithmetic Library (Version 6.2.1) , http://gmplib.org/ .
- 8[8] D. Harvey and J. van der Hoeven, Faster integer and polynomial multiplication using cyclotomic coefficient rings , https://arxiv.org/abs/1712.03693 , 2017.
