Winograd Convolution for DNNs: Beyond linear polynomials

Barbara Barabasz; David Gregg

arXiv:1905.05233·cs.LG·June 26, 2019


TL;DR

This paper explores a broader set of Winograd algorithms for DNNs, demonstrating significant improvements in floating point accuracy and efficiency across various formats, surpassing traditional methods.

Contribution

It introduces and evaluates a wider range of Winograd algorithms for DNNs, showing they can enhance accuracy and reduce computations compared to existing approaches.

Findings

1. Up to 6.5 times better accuracy in fp16 for image recognition.

2. Fewer innermost loop multiplications in bf16 without accuracy loss.

3. Significant accuracy and efficiency improvements over traditional Winograd algorithms.


Tables

Table 1. Number of multiplications per single output point in the $2$-dimensional Winograd convolution algorithm for kernel $3\times 3$ and outputs $2\times 2$, $4\times 4$ and $6\times 6$, for each number of polynomials of the first and second degree used in CRT. The Toom-Cook algorithm (orange in the original table) is the column with all polynomials of the first degree.

Output size	$2\times 2$	$2\times 2$	$2\times 2$	$4\times 4$	$4\times 4$	$4\times 4$	$4\times 4$	$6\times 6$	$6\times 6$	$6\times 6$	$6\times 6$	$6\times 6$
No. of $m_{i}(a)$ of degree $1$	4	2	0	6	4	2	0	8	6	4	2	0
No. of $m_{i}(a)$ of degree $2$	0	1	2	0	1	2	3	0	1	2	3	4
Ratio	4	6.25	9	2.25	3.06	4	5.06	1.78	2.25	2.78	3.36	4

Table 2. Percentage of image recognition for the Toom-Cook convolution algorithm for kernels of size $3\times 3$ and outputs $4\times 4$, $6\times 6$ and $8\times 8$ in $fp32$, $fp16$ and $bf16$.

method	dir	T-C($4\times 4$)	T-C($6\times 6$)	T-C($8\times 8$)
ratio		$2.25$	$1.78$	$1.56$
$fp32$	70	70	70	70
$fp16$	70	10	0.05	0.05
$bf16$	70	70	70	68

Table 3. Percentage of image recognition for the Winograd convolution algorithm with one polynomial of the second degree, $a^{2}+1$, for kernels of size $3\times 3$ and outputs $6\times 6$, $8\times 8$, $10\times 10$ and $12\times 12$ in $fp32$, $fp16$ and $bf16$.

method	dir	W($6\times 6$)	W($8\times 8$)	W($10\times 10$)	W($12\times 12$)
ratio		$2.25$	$1.89$	$1.69$	$1.56$
$fp32$	70	70	70	70	70
$fp16$	70	65	0.1	0.05	0.05
$bf16$	70	70	70	70	62

Equations

$s(a)=\sum_{i}s_{i}(a)N_{i}(a)M_{i}(a)\;mod\;M(a)$

$A^{T}(GHG^{T}\odot B^{T}XB)A$


Full text

Winograd Convolution for DNNs: Beyond linear polynomials

Barbara Barabasz and David Gregg

School of Computer Science and Statistics, Trinity College Dublin, Dublin 2, Ireland


Abstract.

Winograd convolution is widely used in deep neural networks (DNNs). Existing work for DNNs considers only the subset of Winograd algorithms that are equivalent to Toom-Cook convolution. We investigate a wider range of Winograd algorithms for DNNs and show that these additional algorithms can significantly improve floating point (FP) accuracy in many cases. We present results for three FP formats: $fp32$, $fp16$ and $bf16$ (a truncated form of $fp32$), using 2000 inputs from the ImageNet dataset. We found that in $fp16$ this approach gives us up to $6.5$ times better image recognition accuracy in one important case while maintaining the same number of elementwise multiplication operations in the innermost loop. In $bf16$ the convolution can be computed using $5\%$ fewer innermost loop multiplications than with currently used Winograd algorithms while keeping the accuracy of image recognition the same as for the direct convolution method.

Key words and phrases:

DNN, convolution, Winograd convolution, accuracy, floating point

1. Motivation

In DNNs, and especially in Convolutional Neural Networks (CNNs), a huge amount of time is spent computing convolution. The simple direct algorithm has a complexity of $O(m^{2})$ . In contrast, fast convolution algorithms, such as FFT, Toom-Cook, and Winograd convolution require fewer operations.

The family of Winograd algorithms is based on (1) transforming tiles of the input and kernel into a domain of polynomials modulo a chosen set of polynomials (the Winograd domain), where (2) convolution becomes an elementwise multiplication (Hadamard product) with complexity $O(m)$, and (3) transforming the result back to the original domain. Winograd’s method is not a convolution algorithm itself; instead it generates fast convolution algorithms that operate on fixed-size tiles of input, kernel and output.

Winograd’s method can be used to generate a wide variety of convolution algorithms with different trade-offs. It requires a set of polynomials as input to generate the convolution algorithm. These polynomials can be linear ( $degree=1$ ) or superlinear ( $degree>1$ ).

If only linear polynomials are used as inputs, Winograd’s method becomes much simpler, and the resulting algorithms are guaranteed to need only the theoretically minimum number of operations for the elementwise multiplication. The set of Winograd algorithms generated using only linear polynomials is also equivalent to the set of algorithms that can be generated using the Toom-Cook method. Toom-Cook is much simpler than the Winograd method, and as a result it is used to generate the algorithms used in many implementations of “Winograd” convolution.

The selected tile size is critical to the performance of Winograd convolution. A larger tile size increases the number of elementwise multiplication operations needed for that tile, but also computes more results per tile. Taking account of extra operations needed at the boundary of each tile, larger tiles reduce the number of elementwise multiplication operations per computed output point. However, the floating point error also grows exponentially with the tile size [1], so existing implementations of Winograd for DNNs typically use a small tile size.

In this paper we investigate the effect of higher-order polynomials on the accuracy of Winograd convolution for DNNs. Our experiments show that using order-2 polynomials can dramatically reduce the measured floating point error as compared to linear polynomials. However, higher-order polynomials also increase the required number of multiplications in the Hadamard product. This paper addresses the question: Is there a benefit in using the Winograd method with superlinear polynomials for DNNs, as compared to the simpler Toom-Cook method?

We make the following contributions:

• We demonstrate how the Winograd algorithm with higher-order polynomials can be adapted to DNN convolution.

• We present experimental results for one- and two-dimensional Winograd convolution, with kernels of size $3$ (1D) or $3\times 3$ (2D), and find that higher-order polynomials can significantly reduce the FP error.

• We show how using higher-order polynomials offers trade-offs between elementwise multiplications and FP accuracy similar to those obtained by adjusting the block size.

• We experimentally identify cases where using higher-order polynomials can improve recognition without increasing elementwise multiplications when using half precision ( $fp16$ ), and where we can improve the performance while keeping the accuracy of image recognition in bfloat precision ( $bf16$ ).

2. Toom-Cook versus Winograd algorithm

2.1. Winograd algorithm definition

Convolution can be expressed as polynomial multiplication. Mapping the elements of the kernel vector $h$ and the input vector $x$ to the coefficients of polynomials $h(a)$ and $x(a)$ respectively, the elements of the output vector $s$ (the convolution of $h$ and $x$ ) are equal to the coefficients of the polynomial $s(a)=h(a)x(a)$ . The Winograd family of convolution algorithms is based on the Chinese Remainder Theorem (CRT) for polynomials [2]. It states that for a polynomial $M(a)$ in the ring of polynomials over a field $\mathbb{F}$ , with $M(a)=m_{1}(a)\ldots m_{\ell}(a)$ where the $m_{i}(a)$ are irreducible and pairwise coprime, there exists a unique $s(a)$ with $deg(s(a))<deg(M(a))$ that solves the system of congruences $s(a)=s_{i}(a)\;mod\;m_{i}(a)$ , and

$s(a)=\sum_{i}s_{i}(a)N_{i}(a)M_{i}(a)\;mod\;M(a)$ (1)

where $N_{i}(a)M_{i}(a)+n_{i}(a)m_{i}(a)=1$ and $M_{i}(a)=M(a)/m_{i}(a)$ .

To compute the result of the convolution, that is the coefficients of the product of polynomials $h(a)$ and $x(a)$ , we put $s_{i}(a)=h_{i}(a)x_{i}(a)\;mod\;m_{i}(a)$ , $h_{i}(a)=h(a)\;mod\;m_{i}(a)$ and $x_{i}(a)=x(a)\;mod\;m_{i}(a)$ . An operation modulo $m_{i}(a)$ amounts to finding the remainder from division by $m_{i}(a)$ ; so if we assume that all polynomials $m_{i}(a)$ are of the first degree, then the results in modulo $m_{i}(a)$ arithmetic are all constant polynomials (scalars): $h_{i}(a)=h(a)\;mod\;m_{i}(a)=r_{h}$ , $x_{i}(a)=x(a)\;mod\;m_{i}(a)=r_{x}$ . We can then perform the computation of $s_{i}(a)=h_{i}(a)x_{i}(a)\;mod\;m_{i}(a)$ for each $i=1,\ldots,\ell$ as a single multiplication: $s_{i}(a)=r_{h}r_{x}$ . These operations for all $i=1,\ldots,\ell$ are represented by the Hadamard product of two vectors consisting of the elements $h_{1}(a),\ldots,h_{\ell}(a)$ and $x_{1}(a),\ldots,x_{\ell}(a)$ (see Figure 1).
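As a rough illustration of the linear case described above, the sketch below multiplies two polynomials using one scalar multiplication per point. It is a minimal example under stated assumptions, not the construction used in the paper: it relies on plain Lagrange interpolation at distinct finite points rather than the modified algorithm with the pseudo-point $\infty$ , it uses exact rational arithmetic, and the names `poly_eval` and `toom_cook_multiply` are hypothetical.

```python
from fractions import Fraction


def poly_eval(coeffs, point):
    """Evaluate c0 + c1*a + ... at `point` (Horner's rule).

    The value equals the remainder of the polynomial modulo (a - point).
    """
    result = Fraction(0)
    for c in reversed(coeffs):
        result = result * point + c
    return result


def toom_cook_multiply(h, x, points):
    """Return the coefficients of h(a)*x(a), lowest degree first."""
    n = len(h) + len(x) - 1  # number of coefficients of s(a)
    assert len(points) == n and len(set(points)) == n

    # Residues modulo linear polynomials are scalars, so each s_i costs
    # exactly one multiplication (the Hadamard product step).
    s_vals = [poly_eval(h, p) * poly_eval(x, p) for p in points]

    # Lagrange interpolation back to the coefficients of s(a).
    s = [Fraction(0)] * n
    for i, p_i in enumerate(points):
        basis = [Fraction(1)]  # coefficients of prod_{j != i} (a - p_j)
        denom = Fraction(1)
        for j, p_j in enumerate(points):
            if j == i:
                continue
            shifted = [Fraction(0)] * (len(basis) + 1)
            for k, b in enumerate(basis):
                shifted[k] -= p_j * b   # contribution of -p_j
                shifted[k + 1] += b     # contribution of a
            basis = shifted
            denom *= (p_i - p_j)
        for k in range(n):
            s[k] += s_vals[i] * basis[k] / denom
    return s
```

For $h(a)=1+2a+3a^{2}$ and $x(a)=4+5a$ with points $0$, $1$, $-1$, $2$, the function returns the coefficients $4, 13, 22, 15$ and performs four scalar multiplications in the Hadamard step.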

The two-dimensional Winograd convolution algorithm commonly used in DNNs [5] applies the Matrix Exchange Theorem [3] and expresses the computation in the following form:

$A^{T}(GHG^{T}\odot B^{T}XB)A$ (2)

where the matrices $H$ and $X$ represent the kernel and input values.
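To make the matrix form $A^{T}(GHG^{T}\odot B^{T}XB)A$ concrete, the following sketch evaluates it for the smallest common case, $F(2\times 2,3\times 3)$ , and compares it with a direct computation. The transform matrices are the widely used ones from Lavin and Gray [5]; the function names are our own, and the "convolution" is the DNN-style sliding dot product without kernel flipping, which is an assumption about the convention.

```python
import numpy as np

# Transform matrices for F(2x2, 3x3) as given by Lavin and Gray [5].
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float64)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float64)


def winograd_2x2_3x3(X, H):
    """Compute a 2x2 output tile from a 4x4 input tile X and 3x3 kernel H."""
    U = G @ H @ G.T        # kernel in the Winograd domain (4x4)
    V = B_T @ X @ B_T.T    # input tile in the Winograd domain (4x4)
    return A_T @ (U * V) @ A_T.T  # 16 elementwise multiplications


def direct_2x2_3x3(X, H):
    """Reference: sliding dot product, 36 multiplications for 4 outputs."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(X[i:i + 3, j:j + 3] * H)
    return out
```

The Hadamard product `U * V` uses 16 multiplications for 4 output points, which is the ratio of 4 multiplications per output point quoted for $F_{T-C}(2\times 2,3\times 3)$ .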

In this paper we use the modified version of the Winograd algorithm with polynomial $M(a)$ of degree equal to $deg(s(a))$ and pseudo-point $\infty$ . The exact algorithm to compute matrix elements and a more detailed theoretical description of the method can be found in [3] [9] [1].

To the best of our knowledge, all Winograd algorithms used in DNNs require that all $m_{i}(a)$ are of the first degree (i.e. linear). All such algorithms derived using Winograd’s method with linear polynomials can also be found using the Toom-Cook method ([10, 4]). Toom-Cook was analyzed and applied to signal processing problems by S. Winograd in the 1980s. Winograd also proved that Toom-Cook guarantees that the generated convolution algorithm uses the theoretically minimum possible number of elementwise multiplications needed to compute a convolution of size $n_{o}\times n_{o}$ with a kernel of size $n_{h}\times n_{h}$ . We denote these algorithms as $F(n_{o}\times n_{o},n_{h}\times n_{h})$ [12].

If we use a polynomial $m_{i}(a)$ of degree $d>1$ , then the results of $h_{i}(a)=h(a)\;mod\;m_{i}(a)$ and $x_{i}(a)=x(a)\;mod\;m_{i}(a)$ are polynomials, not scalars (see Figure 2). Thus to compute $s_{i}(a)$ we need to multiply two polynomials $h_{i}(a)$ and $x_{i}(a)$ rather than use a simple scalar multiplication. However, to solve this subproblem (i.e. computing the coefficients of the product of the two polynomials $h_{i}(a)$ and $x_{i}(a)$ ) we can apply any suitable algorithm, including the Toom-Cook algorithm $F_{T-C}(d\times d,d\times d)$ . All polynomials $m_{i}(a)$ used in the Winograd algorithm have to be pairwise coprime, and similarly all polynomials used in the Toom-Cook algorithm to solve the subproblem also need to be pairwise coprime. But polynomials in the two different groups do not need to be coprime with each other. This means that we can use the same polynomials of the first degree (points) in both algorithms. In Figure 2 we can have $q_{i}=p_{j}$ . Some points offer superior floating point accuracy, such as $0$ , $-1$ and $1$ (polynomials $a$ , $a+1$ and $a-1$ ) [1].
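The difference between the linear and the degree-2 case can be seen in a small sketch. Below, the residues of $h(a)$ and $x(a)$ modulo a monic degree-2 polynomial are degree-1 polynomials, and their product is computed with three multiplications using the points $0$ , $1$ and $\infty$ (a Toom-Cook step for two-coefficient polynomials). This is an illustrative sketch with hypothetical helper names, restricted to monic $m(a)$ of degree 2; it is not the matrix-based construction described in the paper.

```python
def poly_mul(p, q):
    """Schoolbook product of two polynomials (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def poly_rem(p, m):
    """Remainder of p(a) divided by a monic m(a) (lowest degree first)."""
    assert m[-1] == 1, "this sketch assumes a monic modulus"
    d = len(m) - 1
    r = list(p)
    for k in range(len(r) - 1, d - 1, -1):  # eliminate a^k, highest first
        lead = r[k]
        for j in range(d + 1):
            r[k - d + j] -= lead * m[j]
    r = r[:d]
    return r + [0] * (d - len(r))


def mul_deg1_three_mults(u, v):
    """(u0 + u1*a)(v0 + v1*a) using 3 multiplications instead of 4."""
    at_zero = u[0] * v[0]                    # point 0
    at_one = (u[0] + u[1]) * (v[0] + v[1])   # point 1
    at_inf = u[1] * v[1]                     # pseudo-point infinity
    return [at_zero, at_one - at_zero - at_inf, at_inf]


def residue_product(h, x, m):
    """s_i(a) = h_i(a) * x_i(a) mod m(a) for a monic degree-2 m(a)."""
    h_i = poly_rem(h, m)  # two coefficients, not a scalar
    x_i = poly_rem(x, m)
    return poly_rem(mul_deg1_three_mults(h_i, x_i), m)
```

With $m(a)=a^{2}+1$ , $h(a)=1+2a+3a^{2}$ and $x(a)=4+5a+6a^{2}+7a^{3}$ , `residue_product([1, 2, 3], [4, 5, 6, 7], [1, 0, 1])` gives `[8, 0]`, the same as reducing the full product $h(a)x(a)$ modulo $m(a)$ . The three multiplications correspond to the $2d-1$ points needed for $d=2$ .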

The approach with polynomials $m_{i}(a)$ of degree $d>1$ requires two steps of transformation (see Figure 2). First, we transform the input/kernel into the “Polynomials’ Winograd domain”, that is, into polynomials of degree greater than zero. We then transform both of those polynomials into scalars in the “Winograd domain”. To perform the second transformation we use the Toom-Cook algorithm. Similarly, after computing the Hadamard product we first transform the result into the “Polynomials’ Winograd domain” and then into the original domain. Each of these transforms can be represented by a matrix which is multiplied by the input/kernel/output to compute the transformation. We can merge the matrices for the two stages into a single transformation, which gives three matrices $G^{W}$ , $B^{W}$ and $A^{W}$ applied to the kernel, the input and the result of the Hadamard product respectively. For clarity, we denote the matrices constructed for the Toom-Cook algorithm as $G^{(T-C)}$ , $A^{(T-C)}$ and $B^{(T-C)}$ .

2.2. Constructing the Transform Matrices

2.2.1. Matrices $G^{W}$ and $A^{W}$

We use the function $\mathrm{vec}(m(a))$ to map the polynomial $m(a)=m_{1}+m_{2}a+\ldots+m_{n}a^{n-1}$ to the vector $\mathrm{vec}(m(a))=\left[\begin{matrix}m_{1}&\cdots&m_{n}\end{matrix}\right]^{T}$ . We use $R_{m(a)}[p(a)]$ to denote the remainder from polynomial division of $p(a)$ by $m(a)$ . The rows of the matrices $G^{W}$ and $A^{W}$ that correspond to the transformation with polynomials of the first degree are identical to those in the Toom-Cook algorithm. (Note that we use matrices that are not scaled by the factors $N_{i}$ .) To construct the submatrices that correspond to the transformation with a polynomial $m_{i}(a)$ of degree $d$ higher than one, we have to compose the matrix $G$ with $G^{\prime}$ , where $G^{\prime}$ represents the transformation to the “Polynomials’ Winograd domain” and the matrix $G$ stands for the transformation to the “Winograd domain” and is equal to the matrix $G^{(T-C)}$ of the appropriate size ( $F_{T-C}(d\times d,d\times d)$ ). Analogously, $A^{W}=A^{(T-C)}A^{\prime}$ , where $A^{(T-C)}$ is generated by the Toom-Cook algorithm $F_{T-C}(d\times d,d\times d)$ and $A^{\prime}$ stands for the transformation into the “Polynomials’ Winograd domain” with a polynomial $m_{i}(a)$ of degree higher than $1$ . The last rows $A_{\ell}$ and $G_{\ell}$ represent the pseudo-point $\infty$ needed to construct the modified version of the algorithm ([3], [1]). Below we present an example of the construction of the matrices $A^{W}$ and $G^{W}$ for a kernel of size $3\times 3$ and an output of size $2\times 2$ , choosing the polynomials $m_{1}(a)=a$ and $m_{2}(a)=a^{2}+ba+c$ . To solve the subproblem $F_{T-C}(2\times 2,2\times 2)$ we use the Toom-Cook algorithm with points $0$ , $1$ .

$G_{2}=G^{(T-C)}G^{\prime}=\left[\begin{matrix}-1&0\\ 1&1\\ 0&1\end{matrix}\right]\left[\begin{matrix}1&0&-c\\ 0&1&-b\end{matrix}\right]=\left[\begin{matrix}-1&0&c\\ 1&1&-b-c\\ 0&1&-b\end{matrix}\right]$

$A_{2}=A^{(T-C)}A^{\prime}=\left[\begin{matrix}1&0\\ 1&1\\ 0&1\end{matrix}\right]\left[\begin{matrix}1&0&-c&bc\\ 0&1&-b&b^{2}-c\end{matrix}\right]=\left[\begin{matrix}1&0&-c&bc\\ 1&1&-b-c&bc+b^{2}-c\\ 0&1&-b&b^{2}-c\end{matrix}\right]$

$G^{W}=\left[\begin{matrix}&G_{1}&\\ &G_{2}&\\ 0&0&1\end{matrix}\right]=\left[\begin{matrix}1&0&0\\ -1&0&c\\ 1&1&-b-c\\ 0&1&-b\\ 0&0&1\end{matrix}\right]$

$A^{W}=\left[\begin{matrix}&&A_{1}&\\ &&A_{2}&\\ 0&0&0&1\end{matrix}\right]=\left[\begin{matrix}1&0&0&0\\ 1&0&-c&bc\\ 1&1&-b-c&bc+b^{2}-c\\ 0&1&-b&b^{2}-c\\ 0&0&0&1\end{matrix}\right]$
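The example above can be checked numerically. In the sketch below, the columns of $G^{\prime}$ and $A^{\prime}$ are obtained as $\mathrm{vec}(R_{m_{2}(a)}[a^{k}])$ , the remainders of the monomials $a^{k}$ modulo $m_{2}(a)=a^{2}+ba+c$ , and are then composed with the Toom-Cook matrices printed above. The helper names are hypothetical and the code covers only this worked example, not the general algorithm.

```python
import numpy as np


def monomial_remainders(m, n_cols):
    """Matrix whose k-th column is vec(R_m[a^k]) for a monic m(a).

    `m` holds the coefficients of m(a), lowest degree first.
    """
    d = len(m) - 1
    cols = []
    for k in range(n_cols):
        r = [0.0] * (k + 1)
        r[k] = 1.0  # the monomial a^k
        for t in range(k, d - 1, -1):  # reduce a^t for t >= d
            lead = r[t]
            for j in range(d + 1):
                r[t - d + j] -= lead * m[j]
        cols.append((r + [0.0] * d)[:d])  # keep the d low-order coefficients
    return np.array(cols).T


def build_GW_AW(b, c):
    """G^W and A^W for m_1(a) = a and m_2(a) = a^2 + b*a + c."""
    m2 = [c, b, 1.0]
    G_tc = np.array([[-1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # G^(T-C) above
    A_tc = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # A^(T-C) above
    G2 = G_tc @ monomial_remainders(m2, 3)  # G' is 2x3
    A2 = A_tc @ monomial_remainders(m2, 4)  # A' is 2x4
    G_W = np.vstack([[1.0, 0.0, 0.0], G2, [0.0, 0.0, 1.0]])
    A_W = np.vstack([[1.0, 0.0, 0.0, 0.0], A2, [0.0, 0.0, 0.0, 1.0]])
    return G_W, A_W
```

For any numeric $b$ and $c$ the result agrees entry by entry with the symbolic matrices $G^{W}$ and $A^{W}$ shown above; choosing $b=0$ , $c=1$ gives the matrices for $a^{2}+1$ .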

The exact algorithms to compute matrices $G^{W}$ and $B^{W}$ are presented in algorithm (1).

2.2.2. Matrix $B^{W}$

First we construct an auxiliary matrix $C$ made up of blocks $C_{i}$ for $i=1,\cdots,\ell$ , where $\ell$ is the number of polynomials $m_{i}(a)$ . The matrix $C$ represents the transformation from the “Polynomials’ Winograd domain” into the “Winograd domain”. The rows that stand for the transformation with polynomials $m_{i}(a)$ of the first degree are equal to the identity matrix. The blocks that stand for the transformation with a polynomial $m_{i}(a)$ of degree greater than $1$ represent the transformation with the matrix $B^{(T-C)}$ , generated for the subproblem $F_{T-C}(d\times d,d\times d)$ . A second matrix $E$ contains the remaining operations, that is the remainder modulo $M(a)$ of the product of the polynomial $M_{i}(a)$ and the polynomial $N_{i}(a)$ obtained from the extended Euclidean algorithm (see formula 1). The additional zeros in the rows of matrix $E$ and the column with the coefficients of the polynomial $M_{i}(a)$ implement the modified version of the Winograd algorithm.

We present an example of constructing the matrix $B^{W}$ for kernels of size $3\times 3$ and outputs of size $2\times 2$ , choosing the polynomials $m_{1}(a)=a$ and $m_{2}(a)=a^{2}+ba+c$ (as in the previous subsection). The matrix $B^{(T-C)}$ is generated by the Toom-Cook algorithm $F_{T-C}(2\times 2,2\times 2)$ with points $0,1$ .

$B^{(T-C)}=\left[\begin{matrix}-1&0&0\\ 1&1&-1\\ 0&0&1\end{matrix}\right]\qquad C_{2}=\left[\begin{matrix}-1&0&-c\\ 1&1&-b-1\end{matrix}\right]\qquad C=\left[\begin{matrix}1&0&0&0\\ 0&-1&0&-c\\ 0&1&1&-b-1\end{matrix}\right]$

Next, we construct the blocks of matrix $E$ . The polynomials obtained from the extended Euclidean algorithm [2] are: $N_{1}=1$ , $N_{2}=-a$ .

$E_{1}=\left[\begin{matrix}1\\ b\\ c\end{matrix}\right]\qquad E_{2}=\left[\begin{matrix}0&0\\ 0&c\\ -1&b\end{matrix}\right]\qquad E=\left[\begin{matrix}1&0&0\\ b&0&c\\ c&-1&b\\ 0&0&0\end{matrix}\right]$

$EC=\left[\begin{matrix}1&0&0\\ b&0&c\\ c&-1&b\\ 0&0&0\end{matrix}\right]\left[\begin{matrix}1&0&0&0\\ 0&-1&0&-c\\ 0&1&1&-b-1\end{matrix}\right]=\left[\begin{matrix}1&0&0&0\\ b&c&c&-c(b+1)\\ c&b+1&b&c-b(b+1)\\ 0&0&0&0\end{matrix}\right]$

$B^{W}=\left[\begin{matrix}1&0&0&0&0\\ b&c&c&-c(b+1)&c\\ c&b+1&b&c-b(b+1)&b\\ 0&0&0&0&1\end{matrix}\right]$
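The product $EC$ and the final matrix $B^{W}$ can be reproduced with a few lines of arithmetic. The sketch below simply copies the matrices $E$ and $C$ as printed in this example and appends the last column as it appears in $B^{W}$ (which matches the coefficients of $a(a^{2}+ba+c)=a^{3}+ba^{2}+ca$ ); it does not derive $E$ or $C$ from the extended Euclidean algorithm, and the function name `build_BW` is hypothetical.

```python
import numpy as np


def build_BW(b, c):
    """Assemble B^W for m_1(a) = a and m_2(a) = a^2 + b*a + c."""
    # C: "Polynomials' Winograd domain" -> "Winograd domain" (as printed).
    C = np.array([[1, 0, 0, 0],
                  [0, -1, 0, -c],
                  [0, 1, 1, -b - 1]], dtype=float)
    # E: remaining CRT operations, with the extra zero row (as printed).
    E = np.array([[1, 0, 0],
                  [b, 0, c],
                  [c, -1, b],
                  [0, 0, 0]], dtype=float)
    EC = E @ C
    # Last column of B^W as printed: coefficients of a^3 + b*a^2 + c*a,
    # lowest degree first.
    last_col = np.array([[0], [c], [b], [1]], dtype=float)
    return np.hstack([EC, last_col])
```

Calling `build_BW(0.0, 1.0)` gives the numeric matrix for the polynomial $a^{2}+1$ that is used in the experiments.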

2.3. Optimality of Winograd algorithm

Toom-Cook algorithms for $2$-dimensional convolution have an optimal number of multiplications $n=(n_{h}+n_{o}-1)^{2}$ for fixed $n_{h}$ and $n_{o}$ . When computing convolution in DNNs, we break the input into pieces whose size equals the input tile of the algorithm. This results in an overlap of input tiles at the boundaries. The exact number of overlapping input values for the whole input depends on the kernel and input/output sizes (see the description in [5]). We express the performance of the algorithm as the $ratio$ of the number of multiplications per single output point. Thus, the Toom-Cook algorithm $F_{T-C}(2\times 2,3\times 3)$ requires $16$ multiplications to compute $4$ output points, so the $ratio$ is equal to $4$ . For the algorithm $F_{T-C}(4\times 4,3\times 3)$ , the $ratio=2.25$ . For Toom-Cook convolution with a fixed kernel size the $ratio$ decreases with the tile size: the bigger the input/output tile, the fewer elementwise multiplications are needed per output point. The elementwise multiplication dominates the execution time of DNN convolution, so reductions in these multiply operations translate to reduced execution time. Unfortunately, as the input/output size increases, the floating point error of the computation grows exponentially [1].

When we apply the Toom-Cook algorithm $F_{T-C}(n_{o}\times n_{o},n_{h}\times n_{h})$ the $ratio$ is equal to $(n_{h}+n_{o}-1)^{2}/n_{o}^{2}$ . In the Winograd method, as we can see from the matrix construction, introducing polynomials $m_{i}(a)$ of degree greater than $1$ results in larger matrices, which means a larger number of multiplications. Every Toom-Cook algorithm $F_{T-C}(d\times d,d\times d)$ used to solve a subproblem in the Winograd algorithm requires $2d-1$ polynomials of the first degree. The more higher-degree polynomials we use, and the higher their degree, the more multiplications per output point are required. To compute $F(2\times 2,3\times 3)$ we can use:

• $4$ polynomials of the first degree with $ratio=16/4=4$ (Toom-Cook algorithm)

• $2$ polynomials of the first degree and $1$ of the second degree with $ratio=(2+3)^{2}/4=6.25$

• $1$ polynomial of the first degree and $1$ of the third degree with $ratio=(1+5)^{2}/4=9$

• $2$ polynomials of the second degree with $ratio=(2*3)^{2}/4=9$

• $1$ polynomial of the fourth degree with $ratio=7^{2}/4=12.25$

Note that in the above example using a polynomial $m_{i}(a)$ of the $4$th degree does not change the input (mapped to a polynomial of the $3$rd degree) or the kernel (mapped to a polynomial of the $2$nd degree) during the transformations, so this case only introduces additional multiplications into the convolution computation. Analogously, using a polynomial $m_{i}(a)$ of the $3$rd degree does not change the kernel. The Winograd method for a fixed kernel and output size allows us to construct algorithms with different $ratio$s, while the Toom-Cook method has a constant $ratio$ for given $n_{h}$ and $n_{o}$ . Thus, for a fixed kernel size, we can construct sets of Winograd matrices with the same $ratio$ but a different output/input size. For example, for $F_{T-C}(4\times 4,3\times 3)$ the $ratio=36/16=2.25$ , and for $F_{W}(6\times 6,3\times 3)$ with $6$ polynomials of the first degree and one polynomial of the second degree we have the same $ratio=81/36=2.25$ (see Table 1). Given these choices with the same computational $ratio$ , we can investigate the floating point error of such algorithms and use the more accurate one.
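The $ratio$ arithmetic in this section can be reproduced with a short helper. The sketch below assumes, as in the list above, that a polynomial of degree $d$ contributes $2d-1$ multiplications per dimension and that the degrees sum to $n_{o}+n_{h}-1$ (with the pseudo-point $\infty$ counted as one polynomial of the first degree); the function name is hypothetical.

```python
def multiplications_per_output(n_o, n_h, degrees):
    """Ratio of elementwise multiplications per output point in 2D.

    `degrees` lists the degree of every polynomial m_i(a), counting the
    pseudo-point infinity as a polynomial of the first degree.
    """
    assert sum(degrees) == n_o + n_h - 1, "degrees must cover the input tile"
    per_dimension = sum(2 * d - 1 for d in degrees)
    return per_dimension ** 2 / n_o ** 2


def table_row(n_o, n_h=3):
    """Ratios for 0, 1, 2, ... polynomials of the second degree."""
    tile = n_o + n_h - 1
    row = []
    for n_deg2 in range(tile // 2 + 1):
        degrees = [1] * (tile - 2 * n_deg2) + [2] * n_deg2
        row.append(round(multiplications_per_output(n_o, n_h, degrees), 2))
    return row
```

For output sizes $2$ , $4$ and $6$ , `table_row` returns `[4.0, 6.25, 9.0]`, `[2.25, 3.06, 4.0, 5.06]` and `[1.78, 2.25, 2.78, 3.36, 4.0]`, the values in the Ratio row of Table 1.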

3. Tests Results

3.1. Random data

We tested the accuracy of the Winograd convolution algorithm for kernels of size $3$ (1D) and $3\times 3$ (2D). We studied a range of output tile sizes from $2$ to $8$ (1D) and from $2\times 2$ to $8\times 8$ (2D). We ran our initial experiments over $5000$ loops where kernel and input values were chosen randomly from the range $(-1,1)$ with a normal distribution. We computed the Euclidean error of Winograd convolution performed in $fp32$ and compared it with the direct convolution in $fp64$ .
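The kind of measurement described above can be sketched for the smallest 1D case. The code below runs the Toom-Cook algorithm $F(2,3)$ in $fp32$ with the matrices from Lavin and Gray [5] and compares it with a direct computation in $fp64$ . It is only an illustration of the procedure: the uniform sampling in $(-1,1)$ , the seed, and the function names are our assumptions, and it does not reproduce the paper's tile sizes or polynomial choices.

```python
import numpy as np

# 1D transform matrices for F(2, 3) from Lavin and Gray [5], stored in fp32.
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float32)


def winograd_f23_fp32(x, h):
    """Two outputs from a 4-element input tile and 3-element kernel, in fp32."""
    x32 = x.astype(np.float32)
    h32 = h.astype(np.float32)
    return A_T @ ((G @ h32) * (B_T @ x32))


def direct_fp64(x, h):
    """Reference sliding dot product in double precision."""
    return np.array([np.dot(x[i:i + 3], h) for i in range(2)], dtype=np.float64)


def mean_euclidean_error(trials=5000, seed=0):
    """Average Euclidean distance between the fp32 and fp64 results."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0, size=4)  # assumption: uniform sampling
        h = rng.uniform(-1.0, 1.0, size=3)
        diff = winograd_f23_fp32(x, h).astype(np.float64) - direct_fp64(x, h)
        total += np.linalg.norm(diff)
    return total / trials
```

Swapping in the transform matrices of a larger tile, or of an algorithm with a polynomial of the second degree, would give the corresponding error for that configuration under the same procedure.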

We investigated the Winograd convolution algorithm with the most promising configurations of polynomials of the first and second degree (as we use kernels of size $3$ or $3\times 3$ ). The best results for each computational $ratio$ and polynomial degree configuration are presented in Figure 3. We construct the first degree polynomials using known good root points: $0$ , $-1$ , $1$ , $-1/2$ , $2$ , $1/2$ , $-2$ , $-1/4$ , $4$ [1]. As second degree polynomials, we considered those with coefficients equal to $0$ , $-1$ and $1$ that are coprime with the polynomials of the first degree. That is: $a^{2}+1$ , $a^{2}+a+1$ and $a^{2}-a+1$ . To solve the subproblem for the polynomial of degree greater than $1$ , we use the Toom-Cook convolution algorithm $F_{T-C}(2\times 2,2\times 2)$ with root points $0$ , $-1$ and $\infty$ . In our tests, we noticed that in some cases (up to a $ratio$ of around $1.9$ for $1D$ and $3.5$ for $2D$ ) the Winograd algorithm with one polynomial of the second degree gives a smaller floating point error than Toom-Cook (see Figure 3). When we use only one polynomial of the second degree, we found that $a^{2}+1$ works best, as it has only two nonzero coefficients rather than three.

3.2. Experiments with real data ImageNet on VGG16

We next ran experiments for the vgg16 CNN [6] (using TensorFlow Slim), which has thirteen 2D convolution layers with kernel size $3\times 3$ . As inputs we use $2000$ images from the ImageNet validation set. The computations were done in $fp32$ . We also simulated $fp16$ and $bf16$ by performing the operations in single precision and casting the results to the lower precision.
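One possible way to simulate the lower-precision formats in this manner is sketched below with NumPy. The cast to $fp16$ uses NumPy's own half-precision type; the $bf16$ cast is modelled by truncating the low 16 bits of the $fp32$ representation, following the description of $bf16$ as a truncated form of $fp32$ . Truncation rather than rounding is our assumption, the function names are hypothetical, and this is not the authors' TensorFlow code.

```python
import numpy as np


def to_fp16(x):
    """Round fp32 values to fp16 and return them stored as fp32."""
    x32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    with np.errstate(over="ignore"):  # values beyond the fp16 range become inf
        return x32.astype(np.float16).astype(np.float32)


def to_bf16(x):
    """Truncate fp32 values to bf16 (keep sign, 8 exponent and 7 mantissa bits)."""
    x32 = np.atleast_1d(np.asarray(x, dtype=np.float32))
    bits = x32.view(np.uint32)
    truncated = bits & np.uint32(0xFFFF0000)  # drop the low 16 mantissa bits
    return truncated.view(np.float32)


def simulated_product(a, b, cast):
    """Multiply in fp32, then cast the result to the simulated format."""
    a32 = np.atleast_1d(np.asarray(a, dtype=np.float32))
    b32 = np.atleast_1d(np.asarray(b, dtype=np.float32))
    return cast(a32 * b32)
```

The two formats fail in different ways: `to_fp16(70000.0)` overflows to infinity because the value is outside the $fp16$ range, while `to_bf16(70000.0)` stays finite but loses precision, and `to_bf16(1 + 2**-10)` collapses to `1.0` whereas `to_fp16` keeps that value exactly.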

We tested the Toom-Cook algorithms with outputs $4\times 4$ , $6\times 6$ and $8\times 8$ . This means the $ratio$ of multiplications per single output point equals $2.25$ , $1.78$ and $1.56$ respectively (see Table 1). For comparison we chose the Winograd algorithm with one polynomial of the second degree for even output sizes, from $6\times 6$ up to $12\times 12$ . The $ratio$s of multiplications per single output point are equal to $2.25$ , $1.89$ , $1.69$ and $1.56$ respectively. In our initial tests on random data, we found that the second degree polynomial $a^{2}+1$ works best. The polynomials of the first degree for $F_{W}(n_{o}\times n_{o},3\times 3)$ were constructed with the root points used for $F_{T-C}((n_{o}-2)\times(n_{o}-2),3\times 3)$ . For given output and kernel sizes we can construct Winograd algorithms with different computational $ratio$s. In our tests, we used Winograd algorithms with only one polynomial of degree 2. We could achieve better accuracy by using more degree-2 polynomials, but this would come at the cost of a worse computational $ratio$ . We focus on the cases where the image recognition accuracy decreases: $ratio$ equal to $2.25$ in $fp16$ and $ratio$ between $1.78$ and $1.56$ in $bf16$ . We do not present all possible results, e.g. for $F_{W}(12\times 12,3\times 3)$ with 2 polynomials of the second degree ( $ratio=1.78$ ), but our results are indicative.

We looked at the percentage of image recognition (top-1) for the vgg16 network with Winograd convolution layers in comparison to the same network with direct convolution using the same floating point precision. In Tables 2 and 3 we present the percentage accuracy of image recognition for different FP precisions. For the output sizes we consider, we do not see any changes using $fp32$ . In $fp16$ , all investigated Toom-Cook algorithms failed. In $bf16$ the percentage of image recognition is the same as for direct convolution for the Toom-Cook algorithm with output $6\times 6$ ( $ratio$ equal to $1.78$ ), but for output size $8\times 8$ the accuracy decreases.

With $fp16$ , we see that using Winograd convolution instead of Toom-Cook with the same performance $ratio$ (equal to $2.25$ ) increases the recognition accuracy from $10\%$ to $65\%$ . The main problem we face with $fp16$ is that it cannot store the same range of values as $fp32$ . Using the same good root points (like $0$ , $-1$ and $1$ ) more than once results in smaller intermediate values and a lower likelihood of overflow.

Using $bf16$ , the decrease in image recognition appears for bigger input sizes than in $fp16$ . The $bf16$ format allows us to represent nearly the same range of values as single precision. However, the lower number of mantissa bits results in a less accurate representation of values and a larger floating point error from operations. In our tests we can observe the impact of this for the network with the Toom-Cook convolution algorithm with output of size $8\times 8$ . We have not found a configuration of polynomials that gives an image recognition accuracy better than $68\%$ with the $ratio$ equal to $1.56$ . We did construct a Winograd algorithm with an image recognition accuracy of $70\%$ (the same accuracy we get using the direct convolution algorithm) with $ratio=1.69$ .

4. Related work

In the DNN research literature the term “Winograd convolution algorithm” is used for both Winograd and Toom-Cook convolution, and in practice the Toom-Cook algorithm is used to generate the convolution matrices. The general Winograd algorithm described in this paper has received little attention in the literature. We can find a description of the approach in Winograd [12], but not for the multi-channel, multiple-kernel convolution used for DNNs. A simple example of how to construct the matrices is presented in [3]. A more general and detailed description can be found in [9]. Selesnick and Burrus [8] considered cyclic convolution methods using cyclotomic polynomials in their theoretical work.

Meng and Brothers [7] apply the idea of using the complex points $i$ and $-i$ (the root points of the polynomial $a^{2}+1$ ) to quantized networks. We present a general definition of the method and report floating point accuracy for several different versions of the algorithm.

Some work has been done on improving the FP accuracy of Winograd (Toom-Cook) convolution for DNNs. Vincent et al. [11] present results for one set of matrices, showing that scaling the matrices $G$ and $A^{T}$ gives more accurate results. Scaling improves the conditioning of the matrices used, but this is not necessarily equivalent to decreasing the floating point error of the computation, particularly for the small matrices used in DNNs.

There are also several methods that increase the accuracy of the dot product computations in the matrix transformations, such as more accurate summation algorithms, Strassen matrix multiplication [13], etc. However, they require more operations for the transforms, for sorting elements, or for compensated summation, and/or make the implementation more complicated. In contrast, our approach does not require additional operations for the transformations. All of those methods for improving FP accuracy could also be used together with the presented method to reduce the floating point error even more. These include pairwise summation over channels, the Huffman-based summation method and the mixed precision computations proposed in [1].

5. Conclusions

This paper asks the question: Is there a benefit in using the Winograd method with superlinear polynomials for DNNs, as compared to the simpler Toom-Cook method (which is equivalent to Winograd with linear polynomials)? We describe the construction of the Winograd transformation matrices in the general case. We show that the main benefit of using superlinear polynomials is that the same good root points can be used multiple times, which improves FP accuracy. The Toom-Cook method allows a trade-off of elementwise multiplications against FP accuracy by varying the tile size. The presented Winograd method offers a larger space of trade-offs between computation and accuracy using higher-order polynomials. Thus, it allows us to find attractive trade-offs that are not available using Toom-Cook.

We find that in $bf16$ precision we can construct an algorithm that maintains the same accuracy of image recognition as Toom-Cook but has a better $ratio$ of elementwise multiplications per single output point than Toom-Cook. In $fp16$ precision we can obtain better accuracy using the Winograd convolution algorithm with one polynomial of the second degree, as compared to Toom-Cook (for the case of kernel $3\times 3$ , output $4\times 4$ ), with the same $ratio$ of elementwise multiplications per output point. The presented Winograd convolution algorithm does not require additional operations in the transformation to/from the “Winograd domain”, and although the Winograd method itself is complex, the generated convolution algorithm does not require a more advanced implementation.

Acknowledgements

This work was supported by Science Foundation Ireland grant 12/IA/1381. We also extend our thanks to Andrew Mundy from Arm Research for his contribution.

Bibliography

[1] Barabasz, B., Anderson, A., Soodhalter, K.M., Gregg, D.: Error analysis and improving the accuracy of Winograd convolution for DNNs. CoRR abs/1803.10986 (2018), http://arxiv.org/abs/1803.10986
[2] Biggs, N.L.: Discrete Mathematics. Oxford University Press, New York, NY, USA, 2nd edn. (2002)
[3] Blahut, R.E.: Fast Algorithms for Signal Processing. Cambridge University Press, New York, NY, USA (2010)
[4] Cook, S.A.: On the Minimum Computation Time of Functions. Ph.D. thesis, Harvard University, Cambridge, Mass. (1966)
[5] Lavin, A., Gray, S.: Fast algorithms for convolutional neural networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4013–4021. IEEE, Las Vegas, Nevada (2016)
[6] Liu, S., Deng, W.: Very deep convolutional neural network based image classification using small training sample size. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR). pp. 730–734 (Nov 2015). https://doi.org/10.1109/ACPR.2015.7486599
[7] Meng, L., Brothers, J.: Efficient Winograd convolution via integer arithmetic. CoRR abs/1901.01965 (2019)
[8] Selesnick, I.W., Burrus, C.S.: Extending Winograd’s small convolution algorithm to longer lengths. In: 1994 IEEE International Symposium on Circuits and Systems, ISCAS 1994, London, England, UK, May 30 - June 2, 1994. pp. 449–452 (1994)
