TL;DR
This paper explores the practical implementation of efficient univariate polynomial matrix algorithms, demonstrating improved performance in computing bivariate resultants, with implications for error-correcting codes and polynomial systems.
Contribution
It provides the first practical implementation of several fundamental polynomial matrix operations and applies these to enhance bivariate resultant computations.
Findings
Improved performance over existing methods for large parameters
Successful implementation of key polynomial matrix operations
Enhanced algorithms for bivariate resultants
Abstract
Complexity bounds for many problems on matrices with univariate polynomial entries have been improved in the last few years. Still, for most related algorithms, efficient implementations are not available, which leaves open the question of the practical impact of these algorithms, e.g. on applications such as decoding some error-correcting codes and solving polynomial systems or structured linear systems. In this paper, we discuss implementation aspects for most fundamental operations: multiplication, truncated inversion, approximants, interpolants, kernels, linear system solving, determinant, and basis reduction. We focus on prime fields with a word-size modulus, relying on Shoup's C++ library NTL. Combining these new tools to implement variants of Villard's algorithm for the resultant of generic bivariate polynomials (ISSAC 2018), we get better performance than the state of the art…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Implementations of efficient univariate polynomial matrix algorithms and
application to bivariate resultants
Seung Gyu Hyun
University of WaterlooWaterloo, ONCanada
,
Vincent Neiger
Univ. Limoges, CNRS, XLIM, UMR 7252F-87000 LimogesFrance
and
Éric Schost
University of WaterlooWaterloo, ONCanada
(2019)
Abstract.
Complexity bounds for many problems on matrices with univariate polynomial entries have been improved in the last few years. Still, for most related algorithms, efficient implementations are not available, which leaves open the question of the practical impact of these algorithms, e.g. on applications such as decoding some error-correcting codes and solving polynomial systems or structured linear systems. In this paper, we discuss implementation aspects for most fundamental operations: multiplication, truncated inversion, approximants, interpolants, kernels, linear system solving, determinant, and basis reduction. We focus on prime fields with a word-size modulus, relying on Shoup’s C++ library NTL. Combining these new tools to implement variants of Villard’s algorithm for the resultant of generic bivariate polynomials (ISSAC 2018), we get better performance than the state of the art for large parameters.
Polynomial matrices, algorithms, implementation, resultant.
††copyright: rightsretained††journalyear: 2019††conference: International Symposium on Symbolic and AlgebraicComputation; July 15–18, 2019; Beijing, China††booktitle: International Symposium on Symbolic and Algebraic Computation(ISSAC ’19), July 15–18, 2019, Beijing, China††price: 15.00††doi: 10.1145/3326229.3326272††isbn: 978-1-4503-6084-5/19/07
1. Introduction
Hereafter, is a field and is the algebra of univariate polynomials over . Recent years have witnessed a host of activity on fast algorithms for polynomial matrices and their applications:
- •
Minimal approximant bases (Giorgi et al., 2003; Zhou and Labahn, 2012) were used to compute kernel bases (Zhou et al., 2012), giving the first efficient deterministic algorithm for linear system solving over .
- •
Basis reduction (Giorgi et al., 2003; Gupta et al., 2012) played a key role in accelerating the decoding of one-point Hermitian codes (Nielsen and Beelen, 2015) and in designing deterministic determinant and Hermite form algorithms (Labahn et al., 2017).
- •
Progress on minimal interpolant bases (Jeannerod et al., 2017, 2016) led to the best known complexity bound for list-decoding Reed-Solomon codes and folded Reed-Solomon codes (Jeannerod et al., 2017, Sec. 2.4 to 2.7).
- •
Coppersmith’s block Wiedemann algorithm and its extensions (Coppersmith, 1994; Kaltofen, 1995; Villard, 1997) were used in a variety of contexts, from integer factorization (Team, 2017) to polynomial system solving (Villard, 2018; Hyun et al., 2017b).
At the core of these improvements, one also finds techniques such as high-order lifting (Storjohann, 2003) and partial linearization (Storjohann, 2006),(Gupta et al., 2012, Sec. 6).
For many of these operations, no implementation of the latest algorithms is available and no experimental evidence has been given regarding their practical behavior. Our goal is to partly remedy this issue, by providing and discussing implementations for a core of fundamental algorithms such as multiplication, approximant and interpolant bases, etc., upon which one may implement higher level algorithms. As an illustration, we describe the performance of slightly modified versions of Villard’s recent breakthroughs on bivariate resultant and characteristic polynomial computation (Villard, 2018).
Our implementation is based on Shoup’s Number Theory Library (NTL) (Shoup, 2018), and is dedicated to polynomial matrix arithmetic over for a word-size prime . Particular attention was paid to performance issues, so that our library compares favorably with previous work for those operations where comparisons were possible. Our code is available at https://github.com/vneiger/pml.
Overview.
Polynomial matrix algorithms rely on efficient arithmetic in and for matrices over ; in Section 2, we review some related algorithms and their NTL implementations. Then, we describe our implementation of a key building block: multiplication.
Section 3 presents the next major part of our work, concerning algorithms for approximant bases (Beckermann and Labahn, 1994; Giorgi et al., 2003; Zhou and Labahn, 2012; Jeannerod et al., 2018) and interpolant bases (Van Barel and Bultheel, 1992; Beckermann and Labahn, 2000; Jeannerod et al., 2016, 2017). We focus on a version of interpolants which is less general than in these references but allows for a more efficient algorithm. In particular, we show that with this version, both interpolant and approximant bases can be used interchangeably in several contexts, with interpolants sometimes achieving better performance than approximants. In Section 4, we discuss algorithms for minimal kernel bases, linear system solving, determinant, and basis reduction. Finally, using these tools, we study the practical behavior of the bivariate resultant algorithm of (Villard, 2018) (Section 5).
Below, cost bounds are given in an algebraic complexity model, counting all operations in the base field at unit cost. While standard, this point of view fails to describe parts of the implementation (CRT-based algorithms, such as the 3-primes FFT, cannot be described in such a manner), but we believe that this is a minor issue.
Implementation choices.
NTL is a C++ library for polynomial and matrix arithmetic over rings such as , , etc., and is often seen as a reference point for fast implementations in such contexts. Other libraries for these operations include for example FLINT (Hart et al., 2015) as well as FFLAS-FFPACK and LinBox (The FFLAS-FFPACK Group, 2019; The LinBox Group, 2018). Currently, our implementation relies solely on NTL; this choice was based on comparisons of performance for the functionalities we need.
In our implementation, the base field is a prime finite field ; we rely on NTL’s lzz_p class. At the time of writing, on standard x86_64 platforms, NTL v11.3.1 uses unsigned long’s as its primary data type for lzz_p, supporting moduli up to 60 bits long.
For such fields, one can directly compare running times and cost bounds, since in the literature most polynomial matrix algorithms are analyzed in the algebraic complexity model. Besides, computations over are at the core of a general approach consisting in solving problems over or by means of reduction modulo sufficiently many primes, which are chosen so as to satisfy several, partly conflicting, objectives. We may want them to support Fourier transforms of high orders. Linear algebra modulo each prime should be fast, so we may wish them to be small enough to support vectorized matrix arithmetic with SIMD instructions. On the other hand, using larger primes allows one to use fewer of them, and reduces the likelihood of unlucky choices in randomized algorithms.
As a result, while all NTL lzz_p moduli are supported, our implementation puts an emphasis on three families: small FFT primes that support AVX-based matrix multiplication (such primes have at most 23 bits); arbitrary size FFT primes (at most 60 bits); arbitrary moduli (at most 60 bits). Very small fields such as or are supported, but we did not make any specific optimization for them.
Experiments.
All runtimes below are in seconds and were measured on an Intel Core i7-4790 CPU with 32GB RAM, using the version 11.3.1 of NTL. Unless specified otherwise, timings are obtained modulo a random 60 bit prime. Runtimes were measured on a single thread; currently, most parts of our code do not explicitly exploit multi-threading. Tables below only show a few selected timings, with the best time(s) in bold; for more timings, see https://github.com/vneiger/pml/tree/master/benchmarks.
2. Basic polynomial and matrix arithmetic
We review basic algorithms for polynomials and matrices, and related complexity results that hold over an abstract field , and we describe how we implemented these operations. Hereafter, for , is the set of elements of of degree less than .
2.1. Polynomial multiplication.
Multiplication in and Fast Fourier Transform (FFT) are cornerstones of most algorithms in this paper. Let be a function such that polynomials of degree at most in can be multiplied in operations in . If supports FFT, we can take , and otherwise, (Gathen and Gerhard, 2013, Chapter 8); as in this reference, we assume that is increasing. A useful variant of multiplication is the middle product (Hanrot et al., 2004; Bostan et al., 2003): for integers and , and in and in , returns the slice of the product with coefficients of degrees ; a common case is with . The direct approach computes the whole product and extracts the slice. Yet, the transposition principle (Kaminski et al., 1988) yields a more efficient approach, saving a constant factor (roughly a factor when , if FFT multiplication is used).
Polynomial matrix algorithms frequently use fast evaluation and interpolation at multiple points. In general, subproduct tree techniques (Gathen and Gerhard, 2013, Chapter 10) allow one to do evaluation and interpolation of polynomials in at points in operations. For special sets of points, one can do better: if we know in of order at least , then evaluation and interpolation at the geometric progression can both be done in time (Bostan and Schost, 2005).
In NTL, multiplication in uses either naive, Karatsuba, or FFT techniques, depending on and on the degree (NTL provides FFT primes with roots of unity of order , and supports arbitrary user-chosen FFT primes). FFT multiplication uses the TFT algorithm of (Harvey, 2009) and Harvey’s improvements on arithmetic mod (Harvey, 2014). For primes that do not support Fourier transforms, multiplication is done by means of either 3-primes FFT techniques (Gathen and Gerhard, 2013, Chapter 8) or Schönhage and Strassen’s algorithm. We implemented middle products for naive, Karatsuba and FFT multiplication, closely following (Hanrot et al., 2004; Bostan et al., 2003), as well as evaluation/interpolation algorithms for general sets of points and for geometric progressions.
2.2. Matrix multiplication.
Let be such that matrices over any ring can be multiplied by a bilinear algorithm doing ring operations. The naive algorithm does exactly multiplications. First improvements due to Winograd and Waksman (Winograd, 1968; Waksman, 1970) reduced the number of operations to if is a unit. Strassen’s and Winograd’s recursive algorithms (Strassen, 1969; Winograd, 1971) have ; the best known bound is (Coppersmith and Winograd, 1990; Le Gall, 2014). Note that, using blocking, rectangular matrices of sizes and can be multiplied in ring operations. NTL implements its own arithmetic for matrices over and chooses one of several implementations depending on the bitsize of , the matrix dimensions, the available processor instructions, etc.
2.3. Polynomial matrix multiplication.
In what follows, we write for a function such that two matrices of degree at most can be multiplied in operations in ; we make the assumption that is increasing for all .
From the definitions above we obtain , which is in . Using evaluation/interpolation at or at roots of unity, one obtains the following bounds on :
- •
if an element in of order more than is known (Bostan and Schost, 2005, Thm. 2.4).
- •
if supports FFT in degree .
We also mention a polynomial analogue of an integer matrix multiplication algorithm from (Doliskani et al., 2018) which uses evaluation/interpolation, done plainly via multiplication by (inverse) Vandermonde matrices. Then, the corresponding part of the cost (e.g. for geometric progressions) is replaced by the cost of multiplying matrices over in sizes roughly by ; this is in if . For moderate values of , where is not in the FFT regime, this allows us to leverage fast matrix multiplication over .
We implemented and compared various algorithms for matrix multiplication over . For matrices of degree less than 5, we use dedicated routines based on Karatsuba’s and Montgomery’s formulas (Montgomery, 2005); for matrices of small size (up to 10, depending on ), we use Waksman’s algorithm. For other inputs, most of our efforts were spent on variants of the evaluation/interpolation scheme.
For FFT primes, we use evaluation/interpolation at roots of unity. For general primes, we use either evaluation/interpolation at geometric progressions (if such points exist in ), or our adaptation of the algorithm of (Doliskani et al., 2018), or 3-primes multiplication (as for polynomials, we lift the product from to , where it is done modulo up to 3 FFT primes). No single variant outperformed or underperformed all others for all sizes and degrees, so thresholds were experimentally determined to switch between these options, with different values for small (less than 23 bits) and for large primes.
Middle product versions of these algorithms were implemented, and are used in approximant basis algorithms (Section 3.1) and Newton iteration (Section 4.3). Multiplier classes are available: they do precomputations on a matrix to accelerate repeated multiplications by ; they are used in Dixon’s algorithm (Section 4.3).
The table below shows timings for our multiplication and LinBox’ one, for random matrices of degree and two choices of prime . The global comparison showed running times that are either similar or in favor of our implementation.
20 bit FFT prime 60 bit prime
ours Linbox ratio ours Linbox ratio
8 131072 1.1198 1.5930 0.70 3.577 13.59 0.26
32 4096 0.4283 0.5092 0.84 2.000 5.330 0.38
128 1024 1.7292 2.1126 0.82 15.73 23.13 0.68
512 128 4.3533 4.3837 0.99 41.57 50.62 0.82
3. Approximant bases and interpolant bases
These bases are matrix generalizations of Padé approximation and play an important role in many higher-level algorithms. For in and non-constant in , they are bases of the -module of all in such that . Specifically, approximant bases are for and interpolant bases for for distinct points in . (Here, we do not consider more general cases from the literature, for example with several moduli , one for each column of .)
Since is free of rank , such a basis is represented row-wise by a nonsingular in . The algorithms below return in -ordered weak Popov form (also known as -quasi Popov form (Beckermann et al., 1999)), for a given shift in . Shifts allow us to set degree constraints on the sought basis , and they inherently occur in a general approach for finding bases of solutions to equations (approximants, interpolants, kernels, etc.). Approximant basis algorithms often require to be in -reduced form (Van Barel and Bultheel, 1992); although the -ordered weak Popov form is stronger, obtaining it involves minor changes in these algorithms, without impact on performance according to our experiments. Recent literature shows that this stronger form reveals valuable information for further computations with (Jeannerod et al., 2016; Jeannerod et al., 2018), in particular for finding -Popov bases (Beckermann et al., 1999).
From the shift , the -degree of is defined as , which extends to matrices: is the list of -degrees of the rows of . Then, the -pivot of is its rightmost entry such that , and a nonsingular matrix is in -ordered weak Popov form if the -pivots of its rows are located on the diagonal.
To simplify cost bounds below, we make use of the function .
3.1. Approximant bases.
For in and in , an approximant basis for is a nonsingular matrix whose rows form a basis of . We implemented minor variants of the algorithms M-Basis (iterative, via matrix multiplication) and PM-Basis (divide and conquer, via polynomial matrix multiplication) from (Giorgi et al., 2003). The lowest-level function (M-Basis-1 with the signature in Algorithm 1), handles order in time ; here, working modulo , the matrix is over . Our implementation follows (Jeannerod et al., 2018, Algo. 1), which returns an -Popov basis, using only an additional row permutation compared to the algorithm in (Giorgi et al., 2003).
This form of the output of M-Basis-1 suffices to ensure that M-Basis and PM-Basis return bases in -ordered weak Popov form. Our implementation of M-Basis follows the original design (Giorgi et al., 2003) with iterations, each computing the residual and updating via multiplication by a basis obtained by M-Basis-1 on . We also follow (Giorgi et al., 2003) for PM-Basis, using a threshold such that M-Basis is called if . Building PM-Basis directly upon M-Basis-1, i.e. choosing , has the same cost bound but is slower in practice.
These algorithms use and , respectively (Giorgi et al., 2003). Some implementation details are discussed in Section 3.2. The next table compares timings for LinBox’ and our implementations of PM-Basis for a 20 bit FFT prime (LinBox’ implementation is not optimized for large primes and general primes).
[TABLE]
We also implemented (Jeannerod et al., 2018, Algo. 3) which returns -Popov bases and is about twice slower than PM-Basis; making this overhead negligible for some usual cases is future work. For completeness, we handle general approximants (with one modulus per column of ) by an iterative approach from (Van Barel and Bultheel, 1992; Beckermann and Labahn, 1994); faster algorithms are more complex (Jeannerod et al., 2016, 2017; Jeannerod et al., 2018) and use partial linearization techniques.
These techniques from (Storjohann, 2006; Zhou and Labahn, 2012) yield cost bounds in , which is a speedup compared to PM-Basis. Implementing them is work in progress. experimental code, which focuses for simplicity on and “generic” inputs for which the degrees in can be predicted, revealed significant speedups:
[TABLE]
Approximant bases are often applied to solve block-Hankel systems (Labahn et al., 1990). In two specific settings, we have compared this approach to the one which uses structured matrix algorithms; we are not aware of previous comparisons of this kind. We obtain the following running times, using the NTL-based solver from (Hyun et al., 2017a).
[TABLE]
Setting 1: we call PM-Basis on at order with shift , where is an matrix of degree , and we solve a system with Hankel blocks of size (the structured solver returns a random solution to the system). Our experiments show a clear advantage for approximant algorithms. The asymptotic costs being similar, the effects at play here are constant factor differences: approximant basis algorithms seem to be somewhat simpler and to better leverage the main building blocks (matrix arithmetic over and univariate polynomial arithmetic).
Setting 2 is the vector rational reconstruction problem. We call PM-Basis on at order with shift , where is a vector of degree , and we solve a block system with Hankel blocks of size . The cost bounds are and , respectively. Approximants are faster up to dimension about 15, which is explained by the arguments in the previous paragraph. For larger dimensions, as predicted by the cost estimates, the block-Hankel solver is more efficient.
3.2. Interpolant bases.
For matrices in and pairwise distinct points in , consider
[TABLE]
An interpolant basis for is a matrix whose rows form a basis of the -module . Note that coincides with , for in and .
This definition is a specialization of those in (Beckermann and Labahn, 2000; Jeannerod et al., 2017), which consider sets of points, one for each of the columns of : here, these sets are all equal. This more restrictive problem allows us to give faster algorithms than those in these references, by direct adaptations of the approximant basis algorithms presented above. Besides, Sections 4.1 and 4.2 will show that interpolant bases can often play the same role as approximant bases in applications.
These adaptations are described in Algorithms 4 and 5, where stands for the sublist . In the next proposition, we assume that is in (instead, one may add an extra term in the cost).
Proposition 3.0.
Algorithm 5* is correct. For input evaluation points in geometric progression, it costs if and otherwise. For general evaluation points, an extra cost is incurred.*
(Correctness follows from Items (i) and (iii) of (Jeannerod et al., 2018, Lem. 2.4); the cost analysis is standard for such divide and conquer algorithms.)
Our current code uses the threshold in the divide and conquer PM-Basis and PM-IntBasis: beyond this point, they are faster than the iterative M-Basis and M-IntBasis. Unlike in most other functions, where elements of are represented as matrices of polynomials (Mat<Vec<zz_p>> in NTL), in M-Basis and M-IntBasis we see them as polynomials with matrix coefficients (Vec<Mat<zz_p>>). Indeed, since these algorithms involve only matrix arithmetic over (recall that ), this turns out to be more cache-friendly and faster.
We implemented two variants for approximant bases: either the residual is computed from and at each iteration, or we initialize a list of residuals with a copy of and we update the whole list at each iteration using . The second variant improves over the first when , with significant savings when is close to . For interpolant bases, this does not lead to any gain.
Timings are showed in the next table, for Algorithms M-Basis (M), M-IntBasis (M-I), PM-Basis (PM), PM-IntBasis for general points (PM-I) and for geometric points (PM-Ig). For approximants, we take a random input in of degree ; for interpolants, we take random matrices in . We focus on the common case , which arises for example in kernel algorithms (Section 4.2) and in fraction reconstruction, itself used in basis reduction (Section 4.5) and in the resultant algorithm of (Villard, 2018) (Section 5).
[TABLE]
Concerning iterative algorithms, we observe that interpolants are slightly faster than approximants, which is explained by the cost of computing the residual : it uses one Horner evaluation of and one matrix product for interpolants, whereas for approximants it uses about matrix products at iteration .
As for the divide and conquer algorithms, interpolant bases with general points are slower, in some cases significantly, than the other two algorithms: although the complexity analysis predicted a disadvantage, we believe that our implementation of multipoint evaluation at general points could be improved to reduce this gap. For the other two algorithms, the comparison is less clear. There could be many factors at play here, but the main differences lie in the base case (Step 1) which calls the iterative algorithm, and in the computation of residuals (Step 3) which uses either middle products or geometric evaluation. It seems that FFT-based polynomial multiplication performs slightly better than geometric evaluation for small matrices and slightly worse for large matrices.
4. Higher-level algorithms
In this section we consider kernel bases, system solving, determinants, and basis reduction; we discuss algorithms which rely on multiplication, through approximant/interpolant bases and lifting techniques. For many of these algorithms, we are not aware of previous implementations or experimental comparisons.
4.1. A note on matrix fraction reconstruction.
Given in , a left fraction description of is a pair of polynomial matrices in such that . It is minimal if and have unimodular left matrix GCD and is in reduced form (right fraction descriptions are defined similarly). Besides, is said to be strictly proper if the numerator of each of its entries has degree less than the corresponding denominator.
Such a description of is often computed from the power series expansion of at sufficient precision, using an approximant basis. Yet, for resultant computations in Section 5.2, we would like to use an interpolant basis to obtain this description from sufficiently many values of . We now state the validity of this approach; this is a matrix version of rational function reconstruction (Gathen and Gerhard, 2013, Chap. 5.7).
Proposition 4.0.
*Let be in be strictly proper and suppose admits left and right fraction descriptions of degrees at most , for some . For in of degree at least and such that all denominators in are invertible modulo , define the matrix Then, if is a -ordered weak Popov basis of , the first rows of form a matrix such that is a minimal left fraction description of , with in -ordered weak Popov form. *
The proof given in (Giorgi et al., 2003, Lem 3.7) for the specific extends to any modulus ; using an ordered weak Popov form (rather than a reduced form) allows us both to know a priori that the first rows are those of degree at most , and to use degree instead of (since is ensured by this form).
In particular, if for pairwise distinct points , the interpolant basis algorithms in Section 3.2 give a minimal left fraction description of from .
4.2. Kernel basis.
We implemented two kernel basis algorithms: the first one, based on Lemma 4.0, finds the kernel basis from a single approximant basis at sufficiently large order; the second one, from (Zhou et al., 2012), uses several approximant bases at small order and combines recursively obtained kernel bases via multiplication. With a minor modification and no performance impact, the latter returns an -ordered weak Popov basis. In both cases, we designed variants which rely on interpolant bases instead of approximant bases.
Lemma 4.0.
Let be in of degree , let be in , and let in be an upper bound on the -degree of any -reduced left kernel basis of ; for example, . Let be in of degree at least , and in be an -reduced basis of . Then, the submatrix of formed by its rows of -degree less than is an -reduced left kernel basis for .
For a proof, see https://hal.archives-ouvertes.fr/hal-01995873v1/document.
In particular, one may find via PM-IntBasis at points or via PM-Basis at order ; for , this costs . The approximant-based direct approach is folklore (Zhou et al., 2012, Sec. 2.3), yet explicit statements in the literature focus on shifts linked to the degrees in , with better bounds (see (Zhou et al., 2012, Lem. 3.3),(Neiger et al., 2018, Lem. 4.3)).
The algorithm of (Zhou et al., 2012) is more efficient, at least when the entries of are close to the corresponding row degrees of ; for a uniform shift, it costs operations. We obtained significant practical improvements over the plain implementation of (Zhou et al., 2012, Algo. 1) thanks to the following observation: if , for a vast majority of input , the approximant basis at Step 2 of (Zhou et al., 2012, Algo. 1), computed at order more than , contains the sought kernel basis. Furthermore, this can be easily tested by checking well-chosen degrees, and then the algorithm can exit early, avoiding the further recursive calls. We took advantage of this via the following modifications: we use order rather than (see (Zhou et al., 2012, Rmk. 3.5) for a discussion on this point), and when we directly reduce the number of columns via the divide and conquer scheme in (Zhou et al., 2012, Thm. 3.15).
The use of approximants here follows the idea in Lemma 4.0: row vectors of small degree which are in for a large degree must be in the kernel of . Thus, one can directly replace approximant bases with interpolant bases in (Zhou et al., 2012, Algo. 1), up to modifying Step 8 accordingly (dividing by the appropriate polynomial ).
Timings for both approaches are showed in the next table, for a random of degree . Except for the last row, the shift is uniform and, as expected, (Zhou et al., 2012, Algo. 1) is faster than the direct approach; the differences between interpolant and approximant variants follow those observed in Section 3. The last row corresponds to a shift yielding the kernel basis in Hermite form and shows, as expected, that the direct approach is faster for shifts that are far from uniform.
We note that (Zhou et al., 2012, Algo. 1) may use partial linearization if it computes matrix products with unbalanced degrees or approximant bases with . We have not yet implemented this part of the algorithm, which may lead to slowdowns for some rare inputs.
[TABLE]
4.3. Linear system solving.
For systems , with in , in and in , we implemented two families of algorithms. The first one uses lifting techniques, assuming is square, nonsingular, with invertible; the algorithm returns a pair in such that and has minimal degree. The second one uses a kernel basis and works for any input ; under the assumptions above, it has a similar output.
Lifting techniques.
Under the above assumptions, our lifting algorithm is standard: if and have degree at most , we first compute the truncated inverse by matrix Newton iteration (Schulz, 1933). Then, we use Dixon’s algorithm (Dixon, 1982) to compute ; it consists of roughly steps, each involving a matrix-vector product using either or . Then, vector rational reconstruction is applied to recover from . The cost of this algorithm is for the truncated inverse of and for Dixon’s algorithm; overall this is in .
To reduce the exponent in , Storjohann introduced the high-order lifting algorithm (Storjohann, 2003). The core of this algorithm is the computation of slices of the power series expansion of , where the coefficients of are the coefficients of degree in . These matrices are computed recursively, each step involving 4 matrix products; the other steps of the algorithm, that use these to compute , are cheaper, so the runtime is .
Using kernel bases.
For this second approach, let be any matrix in and be in . The algorithm simply computes , a right kernel basis of the augmented matrix . The matrix generates, via -linear combinations of its columns, all solutions to .
In particular, if is empty (i.e. , which requires ), or if the last row of is zero, then the system has no solution. Furthermore, if is square and nonsingular, has a single column , where and , with of minimal degree (otherwise, would not be a basis).
In this context, the fastest known kernel algorithm is (Zhou et al., 2012, Algo. 1). To exploit it best, we choose the input shift , where and is the tuple of column degrees of (zero columns of are discarded while computing ).
Implementation.
We implemented the approaches described above: lifting with Dixon’s algorithm, high-order lifting, and via kernel. The table below shows timings for randomly chosen matrix and vector , both of degree . In this case the lifting algorithms apply (with high probability). On such inputs, Dixon’s algorithm usually does best. High-order lifting, although theoretically faster, is outperformed, mainly because it performs matrix products (we will however see that this algorithm still plays an important role for basis reduction). The kernel based approach is moderately slower than Dixon’s algorithm, but has the advantage of working without any assumption on .
[TABLE]
4.4. Determinant.
We implemented four algorithms, taking as input a square matrix .
The most basic one uses expansion by minors, which turns out to be the fastest option up to dimension about 6.
The second one assumes that we have an element in of order at least , and uses evaluation/interpolation at the geometric progression ; this costs operations in . For dimensions between 7 and about 20, it is often the fastest variant, sometimes competing with the third.
The third one consists in solving a linear system with random right-hand side of degree ; this yields a solution with the sought determinant up to a constant (Pan, 1988). For dimensions exceeding 20, this is the fastest method (assuming that is invertible).
The last one, from (Labahn et al., 2017), is based on triangularization and runs in . Our implementation currently only supports the generic case where the Hermite form of has diagonal , or in other words, all so-called row bases computed in that algorithm are the identity. This allows us to circumvent the temporary lack of a row basis implementation, while still being able to observe meaningful timings. Indeed, we believe that timings for a complete implementation called on such “generic” input will be similar to the timings presented here in many cases of interest, where one can easily detect whether the algorithm has made a wrong prediction about the row basis being identity (for example if the input matrix is reduced, which is the case if it is the denominator of a minimal fraction description). This recursive determinant algorithm calls expansion by minors as a base case for small dimensions; for larger dimensions, it is generally slightly slower than the third method.
The timings below are for a random matrix of degree .
[TABLE]
4.5. Basis reduction.
Our implementation of the algorithm of (Giorgi et al., 2003) takes as input a nonsingular matrix of degree such that is invertible, and returns a reduced form of . The algorithm first computes a slice of consecutive coefficients of degree about in the power series expansion of , then uses PM-Basis to reconstruct a fraction description , and then returns . A Las Vegas randomized version is mentioned in (Giorgi et al., 2003), to remove the assumption on : we will implement it for large enough , but for smaller fields this requires to work in an extension of , which is currently beyond the scope of our work.
In our experiments, to create the input , we started from a random matrix of degree (which is reduced with high probability), and we left-multiplied it by a lower unit triangular matrix and then by an upper one, both chosen at random of degree . The following table shows timings for both steps, with the first step either based on Newton iteration or on high-order lifting; the displayed total time is when using the faster of the two. We conclude that for reduction, as opposed to the above observations for system solving, it is crucial to rely on high-order lifting. Indeed, it improves over Newton iteration already for dimension 8, and the gap becomes quite significant when the dimension grows.
[TABLE]
5. Applications to bivariate resultants
We conclude this paper with algorithms originating from Villard’s recent breakthrough on computing the determinant of structured polynomial matrices (Villard, 2018). Fix a field and consider the two following questions: computing the resultant of two polynomials in with respect to , and computing the characteristic polynomial of an element in , for some in .
The second problem is a particular case of the former, since the characteristic polynomial of modulo is the resultant of and with respect to , up to a nonzero constant. Let be an upper bound on the degree in of the polynomials we consider, and be a bound on their degree in (so in the second problem, ). Villard proved that for generic inputs, both problems can be solved in operations in . For the first problem, the best previous bound is , obtained either by evaluation/interpolation techniques or Reischert’s algorithm (Reischert, 1997). For the second problem, the previous record was , where is the exponent of matrix multiplication in size , with (Le Gall and Urrutia, 2018). Note that these bounds apply to all inputs.
We show how the work we presented above allows us to put Villard’s ideas to practice, and outperform the previous state of the art for large input sizes. This is however not straightforward: in both cases, this required modifications of Villard’s original designs (for the second case, using an algorithm from (Neiger et al., 2019)).
5.1. Overview of the approach.
In (Villard, 2018), Villard designed the following algorithm to find the determinant of a matrix over .
The parameter is chosen so as to minimize the theoretical cost. The correctness of the algorithm follows from the next properties, which do not hold for an arbitrary nonsingular : the matrix is strictly proper and admits a left fraction description such that , for and in of degree at most (see Section 4.1 for definitions). In (Villard, 2018), they are proved to hold for generic instances of the problems discussed here.
Once sufficiently many terms of the expansion of have been obtained in Step 1, the denominator is recovered by an approximant basis algorithm and its determinant is computed by a general algorithm in Steps 2 and 3, which cost .
While the algorithm applies to any nonsingular matrix satisfying the properties above, in general it does not improve over previously known methods (see Section 4.4). Indeed, the fastest known algorithm for obtaining costs operations via high-order lifting (see for example (Gupta et al., 2012, Thm. 1)).
However, sometimes has some structure which helps to speed up the first step. Villard pointed out that when is the Sylvester matrix of two bivariate polynomials, then is a Toeplitz-like matrix which can be described succinctly as ; here, , (resp. , ) are lower (resp. upper) triangular Toeplitz matrices with entries in . Hence, we start by computing the first columns of and of as well as the first rows of and of , all of them modulo ; then, is directly obtained via the above formula for , using operations. Computing these rows and columns is done by solving systems with matrices and and very simple right-hand sides (Villard, 2018, Prop. 5.1), with power series coefficients, in time .
Altogether, taking minimizes the cost, yielding the runtime . In the case of bivariate resultants described above, the Sylvester matrix of and has size , hence the cost bound .
5.2. Resultant of generic bivariate polynomials.
We implemented the algorithm described in the previous section to compute the resultant of generic in ; first experiments showed that obtaining , , , was a bottleneck. These vectors have power series entries and are solutions of linear systems whose matrix is the Sylvester matrix of and or its transpose: they were obtained via Hensel lifting techniques, following (Gathen and Gerhard, 2013, Ch. 15.4).
To get better performance, we designed a minor variant of Villard’s algorithm: instead of computing the power series expansion of modulo , where , we compute values of at points. We choose these points in geometric progression and use the interpolant basis algorithm of Section 3.2 to recover and , as detailed in Section 4.1. The value of at is computed following the same approach as above, but over instead of . In particular, our implementation directly relies on NTL’s extended GCD algorithm over to compute the vectors , , , .
The next table compares our implementation to the direct approach via evaluation/interpolation; note that the latter approach, while straightforward conceptually, is the state of the art.
[TABLE]
[TABLE]
For these running times, input polynomials were chosen at random with partial degree both in and in ; such polynomials have total degree , and their resultant has degree . The largest examples have quite significant sizes, but such degrees are not unheard-of in applications, as for instance in the genus-2 point counting algorithms of (Gaudry and Harley, 2000; Gaudry and Schost, 2004, 2012; Abélard et al., 2018). Overall, with , we observe a crossover point around . Besides, in a close match with the analysis above, the parameter was set to since this gave us the best runtimes. As an example, for , the cost of each individual steps were 65s for computing structured inversions, and 40s for obtaining and its determinant, which is a good balance.
5.3. Characteristic polynomial.
We consider the computation of the characteristic polynomial of an element in , for some monic in of degree . The algorithm we implemented, and which we sketch below, is from (Neiger et al., 2019) and assumes that and are generic.
As explained previously, this problem is a particular case of a bivariate resultant, but we rely on another point of view that allows for a better asymptotic cost. Indeed, the characteristic polynomial of modulo is by definition the characteristic polynomial of the matrix of multiplication by modulo . In other words, it is the determinant of the degree-1 matrix .
The genericity assumption ensures that is invertible, hence the power series expansion of is . Here, we use the top-left quadrant of ; it has entries , where
[TABLE]
for and for all .
A direct implementation of this idea does not improve on the runtime given in Section 5.1, since it computes for all and therefore costs . It turns out that baby-steps giant-steps techniques allow one to compute for and in operations in . Taking minimizes the overall cost, resulting in the runtime .
The following table compares our implementation to NTL’s built-in characteristic polynomial algorithm, with random inputs and . For such inputs, NTL uses Shoup’s algorithm for power projection (Shoup, 1994), which runs in time .
[TABLE]
Acknowledgements.
Hyun was supported by a Mitacs Globalink award; Neiger was supported by CNRS/INS2I’s program for young researchers; Schost was supported by an NSERC Discovery Grant.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Abélard et al . (2018) S. Abélard, P. Gaudry, and P.-J. Spaenlehauer. 2018. Counting points on genus-3 hyperelliptic curves with explicit real multiplication. In ANTS-XIII .
- 3Beckermann and Labahn (1994) B. Beckermann and G. Labahn. 1994. A Uniform Approach for the Fast Computation of Matrix-Type Padé Approximants. SIAM J. Matrix Anal. Appl. 15, 3 (1994), 804–823.
- 4Beckermann and Labahn (2000) B. Beckermann and G. Labahn. 2000. Fraction-free computation of matrix rational interpolant and matrix gcds. SIAM J. Matrix Anal. Appl. 22, 1 (2000), 114–144.
- 5Beckermann et al . (1999) B. Beckermann, G. Labahn, and G. Villard. 1999. Shifted Normal Forms of Polynomial Matrices. In ISSAC’99 . ACM, 189–196.
- 6Bostan et al . (2003) A. Bostan, G. Lecerf, and É. Schost. 2003. Tellegen’s Principle Into Practice. In ISSAC’03 . ACM, 37–44.
- 7Bostan and Schost (2005) A. Bostan and É. Schost. 2005. Polynomial evaluation and interpolation on special sets of points. J. Complexity 21, 4 (2005), 420–446.
- 8Coppersmith (1994) D. Coppersmith. 1994. Solving homogeneous linear equations over GF ( 2 ) 2 (2) via block Wiedemann algorithm. Math. Comp. 62, 205 (1994), 333–350.
