Fast Flow Reconstruction via Robust Invertible nxn Convolution

Thanh-Dat Truong; Khoa Luu; Chi Nhan Duong; Ngan Le; Minh-Triet Tran

arXiv:1905.10170 · cs.CV · August 9, 2022

Open Access

TL;DR

This paper introduces an invertible n x n convolution that enhances flow-based generative models by increasing flexibility, reducing parameters, and improving performance on multiple datasets.

Contribution

It proposes a novel invertible n x n convolution that overcomes the limitations of 1 x 1 convolutions in flow models, with fewer parameters and better results.

Findings

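01 Improved bits per dimension over RealNVP, Glow and Emerging Convolution on ImageNet 32 x 32 and 64 x 64; competitive but not better on CIFAR-10; realistic high-resolution samples on CelebA-HQ.

02 Fewer parameters than standard n x n convolutions.

03 Enhanced flexibility over invertible 1 x 1 convolutions.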

Abstract

Flow-based generative models have recently become one of the most efficient approaches to model data generation. They are constructed with a sequence of invertible and tractable transformations. Glow first introduced a simple type of generative flow using an invertible $1 \times 1$ convolution. However, the $1 \times 1$ convolution suffers from limited flexibility compared to standard convolutions. In this paper, we propose a novel invertible $n \times n$ convolution approach that overcomes the limitations of the invertible $1 \times 1$ convolution. In addition, our proposed network is not only tractable and invertible but also uses fewer parameters than standard convolutions. Experiments on the CIFAR-10, ImageNet and Celeb-HQ datasets show that our invertible $n \times n$ convolution helps to improve the performance of generative models significantly.

Equations

$$p_{\mathcal{X}}(\mathbf{x})=p_{\mathcal{Z}}(\mathbf{z})\left|\operatorname{det}\left(\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}}\right)\right| \tag{1}$$

$$\begin{aligned}\mathcal{L}(\mathcal{X})&=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\log p_{\mathcal{X}}(\mathbf{x})\\ &=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z})+\log\left|\operatorname{det}\left(\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}}\right)\right|\right]\end{aligned} \tag{2}$$

$$\mathbf{z}\sim p_{\mathcal{Z}}(\mathbf{z}),\qquad\mathbf{x}=f^{-1}(\mathbf{z}) \tag{3}$$

$$\begin{aligned}\mathcal{L}(\mathcal{X})&=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\log p_{\mathcal{X}}(\mathbf{x})\\ &=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z})+\sum_{k=1}^{K}\log\left|\operatorname{det}\left(\frac{\partial\mathbf{h}_{k}}{\partial\mathbf{h}_{k-1}}\right)\right|\right]\end{aligned} \tag{4}$$

$$\begin{aligned}\mathbf{Y}&=\mathbf{W}\star\mathbf{X}=\Big[\mathbf{W}_{:,:,1}\;\mathbf{W}_{:,:,2}\;\cdots\;\mathbf{W}_{:,:,K}\Big]\times\begin{bmatrix}\mathbf{X}^{1}_{:,:,:}\\ \mathbf{X}^{2}_{:,:,:}\\ \vdots\\ \mathbf{X}^{K}_{:,:,:}\end{bmatrix}\\ &=\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathbf{X}^{k}_{:,:,:}=\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}_{k}(\mathbf{X})\end{aligned} \tag{5}$$

$$\begin{aligned}\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}_{k}(\mathbf{X})&=\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}(\mathbf{X})\\ &=\left(\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\right)\times\mathcal{S}(\mathbf{X})\end{aligned} \tag{6}$$

$$\mathcal{S}(\mathbf{X}_{c,i,j})=\alpha_{c}\mathbf{X}_{c,i,j}+\beta_{c} \tag{7}$$

$$\mathbf{X}_{c,i,j}=\frac{\mathcal{S}(\mathbf{X}_{c,i,j})-\beta_{c}}{\alpha_{c}} \tag{8}$$

$$\mathbf{J}=\frac{\partial\mathcal{S}(\mathbf{X})}{\partial\mathbf{X}}=\begin{bmatrix}\frac{\partial\mathcal{S}(\mathbf{X}_{1,1,1})}{\partial\mathbf{X}_{1,1,1}}&0&\cdots&0\\ 0&\frac{\partial\mathcal{S}(\mathbf{X}_{1,1,2})}{\partial\mathbf{X}_{1,1,2}}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{\partial\mathcal{S}(\mathbf{X}_{C,H,W})}{\partial\mathbf{X}_{C,H,W}}\end{bmatrix}=\begin{bmatrix}\alpha_{1}&0&\cdots&0\\ 0&\alpha_{1}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\alpha_{C}\end{bmatrix} \tag{9}$$

$$\begin{aligned}\operatorname{det}\left(\frac{\partial\mathcal{S}(\mathbf{X})}{\partial\mathbf{X}}\right)&=\prod_{c=1}^{C}\alpha_{c}^{H\times W}\\ \log\left|\operatorname{det}\left(\frac{\partial\mathcal{S}(\mathbf{X})}{\partial\mathbf{X}}\right)\right|&=H\times W\times\sum_{c=1}^{C}\log|\alpha_{c}|\end{aligned} \tag{10}$$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computational Physics and Python Applications · Music and Audio Processing

Methods: Invertible 1x1 Convolution · Activation Normalization · Affine Coupling · Normalizing Flows · GLOW · 1x1 Convolution · Convolution

Full text

Keywords: flow-based generative model; invertible $n\times n$ convolution; invertible and tractable transformations

Received: 31 May 2021; Accepted: 6 July 2021

Authors: Thanh-Dat Truong 1,∗, Chi Nhan Duong 2, Minh-Triet Tran 3, Ngan Le 1 and Khoa Luu 1

Correspondence: [email protected]; Tel.: +1-479-404-0772

1 Introduction

Supervised deep learning models have recently achieved numerous breakthrough results in various applications, for example, image classification He et al. (2016); Huang et al. (2017); Sun et al. (2018), object detection Liu et al. (2016); Redmon et al. (2016); Lin et al. (2017), face recognition Le_JPR2015 ; Xu_IJCB2011 ; Xu_TIP2015 ; Luu_IJCB2011 ; Luu_BTAS2009 ; Duong_ICASSP2011 ; Luu_FG2011 ; Chen_FG2011 , image segmentation Chen et al. (2017); Long et al. (2015) and generative modeling Luu_CAI2011 ; Mirza and Osindero (2014); Karras et al. (2017); Duong_Long ; Luu_ROBUST2008 ; Duong_TNVP . However, these methods usually require a huge amount of annotated data, which is highly expensive to collect. To reduce the need for large-scale annotation, generative models have become a feasible solution. The main objective of generative models is to learn the hidden dependencies that exist in realistic data so that they can extract meaningful features and variable interactions to synthesize new realistic samples without human supervision or labeling. Generative models can be used in numerous applications such as anomaly detection Mattia et al. (2019), image inpainting Yu et al. (2018), data generation Karras et al. (2017, 2019), super-resolution Ledig et al. (2017), face synthesis Duong_TNVP ; Duong_automatic ; Duong_2018 , and so forth. However, learning generative models is an extremely challenging process because the data are high-dimensional.

Two types of generative models have been extensively deployed in recent years: likelihood-based methods Kingma and Dhariwal (2018); Kingma et al. (2016); Dinh et al. (2015, 2017) and Generative Adversarial Networks (GANs) Goodfellow et al. (2014). Likelihood-based methods have three main categories: autoregressive models Kingma et al. (2016), variational autoencoders (VAEs) Kingma and Welling (2014), and flow-based models Kingma and Dhariwal (2018); Dinh et al. (2017, 2015). The flow-based generative model is constructed using a sequence of invertible and tractable transformations. The model explicitly learns the data distribution, and therefore the loss function is simply the negative log-likelihood.

The flow-based model was first introduced in Dinh et al. (2015) and later extended in RealNVP Dinh et al. (2017). These methods introduced an affine coupling layer that is invertible and has a tractable Jacobian determinant. By design, a coupling layer transforms only a subset of the data at each stage while the rest is kept fixed. Therefore, coupling layers may be limited in flexibility. To overcome this limitation, coupling layers are alternated with less complex transformations that operate on all dimensions of the data. In RealNVP Dinh et al. (2017), the authors use a fixed channel permutation with fixed checkerboard and channel-wise masks. Kingma and Dhariwal (2018) simplify the architecture by replacing the reverse permutation operation on the channel ordering with invertible $1\times 1$ convolutions.

However, the $1\times 1$ convolutions are not flexible enough in these scenarios. Computing the inverse of a standard $n\times n$ convolution is extremely hard and usually incurs a high computational cost. Prior approaches design invertible $n\times n$ convolutions using emerging convolutions Hoogeboom et al. (2019), periodic convolutions Hoogeboom et al. (2019), autoregressive flows Papamakarios et al. (2017) or stochastic approximation Behrmann et al. (2019); Kim et al. (2021); Chen et al. (2019). In this paper, we propose an approach that generalizes the invertible $1\times 1$ convolution to a more general $n\times n$ convolution. Firstly, we reformulate the standard convolution layer by shifting the inputs instead of the kernels. Then, we propose an invertible shift function with a tractable Jacobian determinant. Through experiments on the CIFAR-10 Krizhevsky (2009), ImageNet Deng et al. (2009) and Celeb-HQ Liu et al. (2015) datasets, we show that our proposals are effective and efficient for high-dimensional data. Figure 1 illustrates the advantages of our approach with high-resolution synthesized images.

Contributions: This work generalizes the invertible $1\times 1$ convolution to an invertible $n\times n$ convolution by reformulating the convolution layer using our proposed invertible shift function. Our contributions can be summarized as follows:

• Firstly, by analyzing the standard convolution layer, we reformulate its equation into a form such that, rather than shifting the kernels during the convolution process, shifting the input provides equivalent results.

• Secondly, we propose a novel invertible shift function that mathematically helps to reduce the computational cost of the standard convolution while keeping the range of the receptive fields. The determinant of the Jacobian matrix produced by this shift function can be computed efficiently.

• Thirdly, evaluations on several datasets covering both objects and faces demonstrate the generalization of the proposed $n\times n$ convolution built on our invertible shift function.

2 Related Work

Generative models can be divided into two groups: Generative Adversarial Networks and Flow-based Generative Models. In the first group, Generative Adversarial Networks Goodfellow et al. (2014) provide an appropriate solution to model the data generation. The discriminative model learns to distinguish the real data from the fake samples produced by a generative model. The two models are trained as if they were playing a mini-max game. Meanwhile, in the second group, the Flow-based Generative Models Kingma and Dhariwal (2018); Dinh et al. (2017, 2015) are constructed using a sequence of invertible and tractable transformations. Unlike GANs, the model explicitly learns the data distribution $p(\mathbf{x})$, and therefore the loss function is simply the negative log-likelihood.

In this section, we discuss several types of flow-based layers that are commonly used in flow-based generative models. An overview of several invertible functions is provided in Table 2. In particular, each function has an easily obtained inverse and a tractable Jacobian determinant. The symbols $\odot$ and $/$ denote element-wise multiplication and division. $h,w$ denote the height and width of the input/output. $c,i,j$ are the depth channel index and spatial indices, respectively.

Comparative invertible functions in several generative normalizing flows.

| Description | Function | Reverse Function | Log-Determinant |
|---|---|---|---|
| ActNorm Kingma and Dhariwal (2018) | $\mathbf{y}=\mathbf{x}\odot\gamma+\beta$ | $\mathbf{x}=(\mathbf{y}-\beta)/\gamma$ | $\sum\log\lvert\gamma\rvert$ |
| Affine Coupling Dinh et al. (2017) | $\mathbf{x}=[\mathbf{x}_{a},\mathbf{x}_{b}]$; $\mathbf{y}_{a}=\mathbf{x}_{a}\odot s(\mathbf{x}_{b})+t(\mathbf{x}_{b})$; $\mathbf{y}=[\mathbf{y}_{a},\mathbf{x}_{b}]$ | $\mathbf{y}=[\mathbf{y}_{a},\mathbf{y}_{b}]$; $\mathbf{x}_{a}=[\mathbf{y}_{a}-t(\mathbf{y}_{b})]/s(\mathbf{y}_{b})$; $\mathbf{x}=[\mathbf{x}_{a},\mathbf{y}_{b}]$ | $\sum\log\lvert s(\mathbf{x}_{b})\rvert$ |
| $1\times 1$ conv Kingma and Dhariwal (2018) | $\mathbf{y}_{:,i,j}=\mathbf{W}\mathbf{x}_{:,i,j}$ | $\mathbf{x}_{:,i,j}=\mathbf{W}^{-1}\mathbf{y}_{:,i,j}$ | $h\cdot w\cdot\log\lvert\operatorname{det}\mathbf{W}\rvert$ |
| Our Shift Function | $\mathbf{y}_{c,i,j}=\alpha_{c}\mathbf{x}_{c,i,j}+\beta_{c}$ | $\mathbf{x}_{c,i,j}=[\mathbf{y}_{c,i,j}-\beta_{c}]/\alpha_{c}$ | $h\cdot w\cdot\sum_{c}\log\lvert\alpha_{c}\rvert$ |

Coupling Layers: NICE Dinh et al. (2015) and RealNVP Dinh et al. (2017) presented coupling layers with a normalizing flow by stacking a sequence of invertible bijective transformation functions. The bijective function is designed as an affine coupling layer, which has a tractable Jacobian determinant. RealNVP can work in a multi-scale architecture to build a more efficient model for large inputs. To further improve the propagation step, the authors applied batch normalization and weight normalization during training. Later, Ho et al. (2019) presented a continuous mixture cumulative distribution function to improve the density modeling of coupling layers. In addition to improving the expressiveness of transformations of coupling layers, Ho et al. (2019) utilized multi-head self-attention layers Vaswani et al. (2017) in the transformations.
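As a rough illustration of the affine coupling row in the table above, the following NumPy sketch applies the forward map, the reverse map and the log-determinant to a toy vector. The functions `toy_scale` and `toy_shift` are hypothetical stand-ins for the learned networks $s$ and $t$; they are not taken from the paper.

```python
import numpy as np


def toy_scale(x_b):
    # Hypothetical stand-in for the learned network s(.); always positive.
    return np.exp(np.tanh(x_b))


def toy_shift(x_b):
    # Hypothetical stand-in for the learned network t(.).
    return 0.5 * x_b


def coupling_forward(x_a, x_b, s_fn, t_fn):
    """y_a = x_a * s(x_b) + t(x_b); x_b passes through unchanged."""
    s, t = s_fn(x_b), t_fn(x_b)
    y_a = x_a * s + t
    log_det = np.sum(np.log(np.abs(s)))
    return y_a, x_b, log_det


def coupling_inverse(y_a, y_b, s_fn, t_fn):
    """x_a = (y_a - t(y_b)) / s(y_b); y_b passes through unchanged."""
    s, t = s_fn(y_b), t_fn(y_b)
    return (y_a - t) / s, y_b


rng = np.random.default_rng(0)
x_a, x_b = rng.normal(size=4), rng.normal(size=4)
y_a, y_b, log_det = coupling_forward(x_a, x_b, toy_scale, toy_shift)
x_a_rec, x_b_rec = coupling_inverse(y_a, y_b, toy_scale, toy_shift)
```

Because $\mathbf{x}_{b}$ is copied to the output, the inverse can recompute $s$ and $t$ without inverting either network, which is why the layer stays cheap to reverse.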

Inverse Autoregressive Convolution: Germain et al. (2015) introduced autoregressive autoencoders by constructing an extension of a non-variational autoencoder that can estimate distributions and is straightforward in computing their Jacobian determinant. Masked autoregressive flow Papamakarios et al. (2017) is a type of normalizing flow, where the transformation layer is built as an autoregressive neural network. Inverse autoregressive flow Kingma et al. (2016) formulates the conditional probability of the target variable as an autoregressive model.

Invertible $1\times 1$ Convolution: Kingma and Dhariwal (2018) proposed simplifying the architecture via invertible $1\times 1$ convolutions. Learning a permutation matrix is a discrete optimization that is not amenable to gradient ascent. However, the permutation operation is simply a special case of a linear transformation with a square matrix. This can be realized with convolutional neural networks, as permuting the channels is equivalent to a $1\times 1$ convolution operation with an equal number of input and output channels. Therefore, the authors replace the fixed permutation with learned $1\times 1$ convolution operations.
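The claim that a channel permutation is a special case of a $1\times 1$ convolution can be checked numerically. The sketch below is a minimal NumPy illustration (the helper names `conv1x1` and `conv1x1_log_det` are ours, not the paper's): it builds a permutation matrix, applies it as a $1\times 1$ convolution, and evaluates the log-determinant $h\cdot w\cdot\log|\operatorname{det}\mathbf{W}|$ from the table.

```python
import numpy as np


def conv1x1(x, w):
    """Apply a 1x1 convolution: x is (C, H, W), w is (C, C)."""
    return np.einsum("dc,chw->dhw", w, x)


def conv1x1_log_det(w, height, width):
    """Log-determinant of the 1x1 convolution: h * w * log|det W|."""
    _, log_abs_det = np.linalg.slogdet(w)
    return height * width * log_abs_det


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 5))

# Permutation matrix: output channel d reads input channel perm[d].
perm = np.array([2, 0, 1])
p = np.eye(3)[perm]

y = conv1x1(x, p)                        # same as x[perm]
x_rec = conv1x1(y, np.linalg.inv(p))     # reverse pass uses W^{-1}
perm_log_det = conv1x1_log_det(p, 4, 5)  # a permutation preserves volume
```

A learned $\mathbf{W}$ simply replaces `p` with a general invertible matrix, in which case the log-determinant is no longer zero and must be added to the likelihood.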

Activation Normalization: Kingma and Dhariwal (2018) performs an affine transformation using scale and bias parameters per channel. This layer simply shifts and scales the activations with data-dependent initialization that normalizes the activations given an initial minibatch of data. This allows the scaling down of the minibatch size to 1 (for large images) and the scaling up of the size of the model.

Invertible $n\times n$ Convolution: Since the invertible $1\times 1$ convolution is not flexible, Hoogeboom et al. (2019) proposed an invertible $n\times n$ convolution generalized from the $1\times 1$ convolutions. The authors presented two methods to produce the invertible convolutions: (1) Emerging Convolution and (2) Invertible Periodic Convolutions. Emerging Convolution is obtained by chaining specific invertible autoregressive convolutions Kingma et al. (2016) and speeding up this layer through the use of an accelerated parallel inversion module implemented in Cython. Invertible Periodic Convolutions transform data to the frequency domain via Fourier transform; this alternative convolution has a tractable Jacobian determinant and inverse. However, these invertible $n\times n$ convolutions require more parameters; therefore, these have an additional computational cost compared to our proposed method.

Lipschitz Constant: Behrmann et al. (2019) developed a theory that any residual block satisfying a Lipschitz constraint (a Lipschitz constant smaller than one for the residual branch) is invertible. Hence, Behrmann et al. proposed an invertible residual network (i-ResNet) as a normalizing flow-based model. Similar to Hoogeboom et al. (2019); Dinh et al. (2017, 2015); Kingma and Dhariwal (2018), i-ResNet is learned by optimizing the negative log-likelihood, in which the inverse flow and Jacobian determinant of the residual block can be efficiently approximated by stochastic methods. Building on the success of this Lipschitz theory, Kim et al. (2021) proposed an $L_{2}$ self-attention that allows the self-attention of the Transformer networks Vaswani et al. (2017) to be invertible.

3 Background

3.1 Flow-Based Generative Model

Let $\mathbf{x}\in\mathcal{X}$ be a high-dimensional vector with unknown true distribution $\mathbf{x}\sim p_{\mathcal{X}}(\mathbf{x})$. Given a simple prior probability distribution $p_{\mathcal{Z}}$ on a latent variable $\mathbf{z}\in\mathcal{Z}$ and a bijection $f:\mathcal{X}\to\mathcal{Z}$, the change of variable formula defines a model distribution on $\mathcal{X}$ as shown in Equation (1).

$$p_{\mathcal{X}}(\mathbf{x})=p_{\mathcal{Z}}(\mathbf{z})\left|\operatorname{det}\left(\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}}\right)\right| \tag{1}$$

where $\frac{\partial f(x)}{\partial x}$ is the Jacobian of $f$ at $\mathbf{x}$ . The log-likelihood objective is then equivalent to minimizing:

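$$\begin{aligned}\mathcal{L}(\mathcal{X})&=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\log p_{\mathcal{X}}(\mathbf{x})\\ &=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z})+\log\left|\operatorname{det}\left(\frac{\partial f(\mathbf{x})}{\partial\mathbf{x}}\right)\right|\right]\end{aligned} \tag{2}$$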

Since the data $\mathbf{x}$ are discrete, we add random uniform noise $u\sim\mathcal{U}(0,a)$, where $a$ is determined by the discretization level of the data, to make $\mathbf{x}$ continuous. The generative process is defined in Equation (3).

$$\mathbf{z}\sim p_{\mathcal{Z}}(\mathbf{z}),\qquad\mathbf{x}=f^{-1}(\mathbf{z}) \tag{3}$$

The bijection function $f$ is constructed from a sequence of invertible and tractable Jacobian determinant transformations: $f=f_{1}\circ f_{2}\circ...\circ f_{K}$ ( $K$ is the number of transformations). Such a sequence of invertible transformations is also called a normalizing flow. Here, Equation (2) can be written as in Equation (4).

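$$\begin{aligned}\mathcal{L}(\mathcal{X})&=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\log p_{\mathcal{X}}(\mathbf{x})\\ &=-\mathbb{E}_{\mathbf{x}\in\mathcal{X}}\left[\log p_{\mathcal{Z}}(\mathbf{z})+\sum_{k=1}^{K}\log\left|\operatorname{det}\left(\frac{\partial\mathbf{h}_{k}}{\partial\mathbf{h}_{k-1}}\right)\right|\right]\end{aligned} \tag{4}$$

where $\mathbf{h}_{k}=f_{1}\circ f_{2}\circ\cdots\circ f_{k}(\mathbf{h}_{0})$ with $\mathbf{h}_{0}=\mathbf{x}$.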
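To make Equation (4) concrete, the sketch below evaluates the log-likelihood of a toy one-dimensional flow made of two affine layers and checks it against the closed-form Gaussian density that the composition implies. The layer values and the helper names are illustrative assumptions, and we assume that $f_{1}$ is applied first, so that $\mathbf{h}_{1}=f_{1}(\mathbf{h}_{0})$.

```python
import numpy as np


def standard_normal_log_pdf(z):
    """Log-density of the prior p_Z = N(0, 1), evaluated element-wise."""
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))


def flow_log_likelihood(x, layers):
    """log p_X(x) for a chain of scalar affine layers h -> scale * h + bias.

    Each layer contributes log|scale| to the summed log-determinant,
    which is the sum over k in Equation (4).
    """
    h = np.asarray(x, dtype=float)
    total_log_det = np.zeros_like(h)
    for scale, bias in layers:
        h = scale * h + bias
        total_log_det += np.log(abs(scale))
    return standard_normal_log_pdf(h) + total_log_det


# Illustrative layer parameters (not from the paper).
layers = [(2.0, 1.0), (-0.5, 3.0)]
x = np.array([-1.0, 0.0, 0.7])
log_px = flow_log_likelihood(x, layers)
nll = -np.mean(log_px)  # the objective minimized during training

# Closed form: z = A x + B with z ~ N(0, 1) implies x ~ N(-B / A, 1 / A^2).
A = 2.0 * -0.5
B = -0.5 * 1.0 + 3.0
mu, sigma = -B / A, 1.0 / abs(A)
expected = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
```

The two computations agree because the log-determinant terms account exactly for how each layer stretches or compresses the density.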

3.2 Standard $n\times n$ Convolution

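In this section, we revisit the standard $n\times n$ convolution. Let $\mathbf{X}$ be a $C\times H\times W$ input and $\mathbf{W}$ a $D\times C\times K$ kernel, where $K=n^{2}$ is the number of spatial positions in the kernel. The convolution can be expressed as follows:

$$\begin{aligned}\mathbf{Y}&=\mathbf{W}\star\mathbf{X}=\Big[\mathbf{W}_{:,:,1}\;\mathbf{W}_{:,:,2}\;\cdots\;\mathbf{W}_{:,:,K}\Big]\times\begin{bmatrix}\mathbf{X}^{1}_{:,:,:}\\ \mathbf{X}^{2}_{:,:,:}\\ \vdots\\ \mathbf{X}^{K}_{:,:,:}\end{bmatrix}\\ &=\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathbf{X}^{k}_{:,:,:}=\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}_{k}(\mathbf{X})\end{aligned} \tag{5}$$

where $\mathbf{X}^{k}_{:,:,:}$ is a $C\times H\times W$ matrix that represents a spatially shifted version of the input matrix $\mathbf{X}$ with shift amount $(i_{k},j_{k})$, ${\mathbf{W}}_{:,:,k}$ represents the $D\times C$ matrix corresponding to the kernel index $k$, and the symbol $\star$ denotes the convolution operator.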

In Equation (5), the standard convolution is simply a sum of $1\times 1$ convolutions on shifted inputs. The function $\mathcal{S}_{k}$ maps the input $\mathbf{X}$ to the corresponding shifted input $\mathbf{X}^{k}_{:,:,:}$. The standard convolution uses shifted inputs with integer-valued shift amounts for each index $k$. Figure 2 illustrates our reformulated $n\times n$ convolution. If we can share the shifted inputs regardless of the kernel index, that is, $\mathcal{S}_{k}(\mathbf{X})=\mathcal{S}(\mathbf{X})$, the standard convolution simplifies to a $1\times 1$ convolution as shown in Equation (6). In this paper, we propose a shift function $\mathcal{S}$ that is invertible and has a tractable Jacobian determinant.

$$\begin{aligned}\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}_{k}(\mathbf{X})&=\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\times\mathcal{S}(\mathbf{X})\\ &=\left(\sum_{k=1}^{K}\mathbf{W}_{:,:,k}\right)\times\mathcal{S}(\mathbf{X})\end{aligned} \tag{6}$$
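The reformulation in Equations (5) and (6) can be verified numerically. The following NumPy sketch, written for a $3\times 3$ kernel with zero padding (both are our assumptions for the illustration), computes a convolution as a sum of $1\times 1$ convolutions over shifted copies of the input and compares it with a conventional sliding-window implementation. It then checks the collapse in Equation (6) for an arbitrary shared input.

```python
import numpy as np


def conv_as_shifted_1x1(x, weight, offsets):
    """Equation (5): sum over k of W[:, :, k] applied to the k-th shifted input.

    x is (C, H, W), weight is (D, C, K), offsets lists K integer shifts (di, dj).
    """
    pad = max(max(abs(di), abs(dj)) for di, dj in offsets)
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    y = np.zeros((weight.shape[0], h, w))
    for k, (di, dj) in enumerate(offsets):
        # S_k(X): the input shifted by (di, dj), zero-filled at the border.
        shifted = padded[:, pad + di : pad + di + h, pad + dj : pad + dj + w]
        y += np.einsum("dc,chw->dhw", weight[:, :, k], shifted)
    return y


def sliding_window_conv3x3(x, weight):
    """Reference 3x3 convolution (cross-correlation form) with zero padding."""
    d, c, _ = weight.shape
    kernel = weight.reshape(d, c, 3, 3)
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    y = np.zeros((d, h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[:, i : i + 3, j : j + 3]
            y[:, i, j] = np.einsum("dcab,cab->d", kernel, patch)
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 6))
weight = rng.normal(size=(4, 2, 9))
offsets = [(a - 1, b - 1) for a in range(3) for b in range(3)]

y_shifted = conv_as_shifted_1x1(x, weight, offsets)
y_window = sliding_window_conv3x3(x, weight)

# Equation (6): with one shared input S(X), the K kernels collapse into one.
shared = 0.5 * x + 0.1
lhs = sum(np.einsum("dc,chw->dhw", weight[:, :, k], shared) for k in range(9))
rhs = np.einsum("dc,chw->dhw", weight.sum(axis=2), shared)
```

The second check is just linearity, but it shows why sharing the shifted input reduces the layer to a single $D\times C$ matrix.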

4 Invertible $n\times n$ Convolution

In this section, we first introduce our proposed Invertible Shift Function and then present the invertible $n\times n$ convolution in detail.

4.1 Invertible Shift Function

The shift function $\mathcal{S}$ approximates all shifted inputs $\mathbf{X}^{k}_{:,:,:}$ ( $1\leq k\leq K$ ). We design $\mathcal{S}$ as a linear transformation per channel: the learnable variables $\alpha_{c}$ and $\beta_{c}$ ( $1\leq c\leq C$ ) are the scale and translation parameters of channel $c$, respectively. The shift function $\mathcal{S}$ can be formulated as follows:

$$\mathcal{S}(\mathbf{X}_{c,i,j})=\alpha_{c}\mathbf{X}_{c,i,j}+\beta_{c} \tag{7}$$

where $c,i,j$ are the depth channel index and spatial indices, respectively. The inverse of $\mathcal{S}$ is easy to obtain:

$$\mathbf{X}_{c,i,j}=\frac{\mathcal{S}(\mathbf{X}_{c,i,j})-\beta_{c}}{\alpha_{c}} \tag{8}$$

Thanks to Equation (7), the value of $\mathcal{S}(\mathbf{X}_{c,i,j})$ only depends on $\mathbf{X}_{c,i,j}$ and the Jacobian matrix will be in the form of the diagonal matrix as follows:

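$$\mathbf{J}=\frac{\partial\mathcal{S}(\mathbf{X})}{\partial\mathbf{X}}=\begin{bmatrix}\frac{\partial\mathcal{S}(\mathbf{X}_{1,1,1})}{\partial\mathbf{X}_{1,1,1}}&0&\cdots&0\\ 0&\frac{\partial\mathcal{S}(\mathbf{X}_{1,1,2})}{\partial\mathbf{X}_{1,1,2}}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\frac{\partial\mathcal{S}(\mathbf{X}_{C,H,W})}{\partial\mathbf{X}_{C,H,W}}\end{bmatrix}=\begin{bmatrix}\alpha_{1}&0&\cdots&0\\ 0&\alpha_{1}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\alpha_{C}\end{bmatrix} \tag{9}$$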

Therefore, the determinant of Equation (9) is the product of all elements in the diagonal of the matrix $\mathbf{J}$ as in Equation (10).

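$$\begin{aligned}\operatorname{det}\left(\frac{\partial\mathcal{S}(\mathbf{X})}{\partial\mathbf{X}}\right)&=\prod_{c=1}^{C}\alpha_{c}^{H\times W}\\ \log\left|\operatorname{det}\left(\frac{\partial\mathcal{S}(\mathbf{X})}{\partial\mathbf{X}}\right)\right|&=H\times W\times\sum_{c=1}^{C}\log|\alpha_{c}|\end{aligned} \tag{10}$$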
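The shift function and its log-determinant are small enough to check by brute force. The sketch below is a NumPy illustration with arbitrary example values for $\alpha_{c}$ and $\beta_{c}$ (assumptions for the demo, not trained parameters). It applies Equations (7) and (8), builds the full Jacobian of Equation (9) column by column, and compares its log-determinant with the closed form in Equation (10).

```python
import numpy as np


def shift_forward(x, alpha, beta):
    """Equation (7): per-channel scale and translation of a (C, H, W) input."""
    return alpha[:, None, None] * x + beta[:, None, None]


def shift_inverse(y, alpha, beta):
    """Equation (8): exact inverse, valid only when every alpha_c is non-zero."""
    return (y - beta[:, None, None]) / alpha[:, None, None]


def shift_log_det(alpha, height, width):
    """Equation (10): H * W * sum_c log|alpha_c|."""
    return height * width * np.sum(np.log(np.abs(alpha)))


rng = np.random.default_rng(0)
C, H, W = 3, 2, 2
x = rng.normal(size=(C, H, W))
alpha = np.array([1.5, -0.8, 2.0])   # example values only
beta = np.array([0.1, -0.3, 0.0])    # example values only

y = shift_forward(x, alpha, beta)
x_rec = shift_inverse(y, alpha, beta)

# Build the Jacobian explicitly. S is affine, so perturbing input entry n by
# one unit changes the output by exactly column n of the Jacobian.
n = x.size
basis = np.eye(n)
jacobian = np.stack(
    [
        (shift_forward((x.ravel() + basis[k]).reshape(x.shape), alpha, beta) - y).ravel()
        for k in range(n)
    ],
    axis=1,
)
_, explicit_log_det = np.linalg.slogdet(jacobian)
closed_form_log_det = shift_log_det(alpha, H, W)
```

The explicit Jacobian costs $O((CHW)^{2})$ memory, whereas the closed form only needs the $C$ scale parameters, which is what makes the layer practical for images.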

4.2 Invertible $n\times n$ Convolution

Kingma and Dhariwal (2018) proposed the invertible $1\times 1$ convolution as a smart way to learn the permutation matrix instead of using a fixed permutation Dinh et al. (2015, 2017). However, the $1\times 1$ convolution suffers from limited flexibility compared to the standard convolution. In particular, the receptive field of a $1\times 1$ convolution is limited. When the network goes deeper, the receptive fields of $1\times 1$ convolutions still cover only small areas; they therefore cannot generalize to or model large objects in high-dimensional data. However, the $1\times 1$ convolution has its own advantages compared to the standard convolution. First, the $1\times 1$ convolution allows the network to compress the data of the input volume to be smaller. Second, the $1\times 1$ convolution suffers less from over-fitting due to its small kernel size. Therefore, in our proposal, we still take advantage of the $1\times 1$ convolution. Specifically, we adopt the successful invertible $1\times 1$ convolution of Glow Kingma and Dhariwal (2018) in our design.

In the previous subsection, we showed that the shift function $\mathcal{S}$ is invertible and that its Jacobian determinant is tractable. In Section 3.2, we indicated that if we can share shifted inputs regardless of the kernel index via the shift function $\mathcal{S}$, we can simplify the standard $n\times n$ convolution to the composition of $\mathcal{S}$ and a $1\times 1$ convolution. Therefore, the invertible $n\times n$ convolution is equivalent to the combination of the invertible shift function $\mathcal{S}$ and the invertible $1\times 1$ convolution. Specifically, the input is first passed through the shift function $\mathcal{S}$ and then convolved with the $1\times 1$ filter. Algorithm 1 illustrates the pseudo code of the invertible $n\times n$ convolution.
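Algorithm 1 is not reproduced in this text, so the following is only a minimal NumPy sketch of the layer as the paragraph describes it: the shift function followed by an invertible $1\times 1$ convolution, with the two log-determinants added. The function names are ours, and the random orthogonal initialization of $\mathbf{W}$ is an assumption borrowed from common Glow-style practice rather than something this section specifies.

```python
import numpy as np


def invertible_conv_forward(x, alpha, beta, w):
    """Shift function S, then a 1x1 convolution with the square matrix w.

    Returns the output and the log|det| of the whole layer, i.e. the sum of
    H*W*sum_c log|alpha_c| (shift) and H*W*log|det w| (1x1 convolution).
    """
    _, height, width = x.shape
    shifted = alpha[:, None, None] * x + beta[:, None, None]
    y = np.einsum("dc,chw->dhw", w, shifted)
    log_det = height * width * (
        np.sum(np.log(np.abs(alpha))) + np.linalg.slogdet(w)[1]
    )
    return y, log_det


def invertible_conv_inverse(y, alpha, beta, w):
    """Undo the 1x1 convolution with w^{-1}, then undo the shift function."""
    shifted = np.einsum("cd,dhw->chw", np.linalg.inv(w), y)
    return (shifted - beta[:, None, None]) / alpha[:, None, None]


rng = np.random.default_rng(0)
C, H, W = 4, 4, 4
x = rng.normal(size=(C, H, W))

# alpha = 1 and beta = 0 at the start of training, as in the paper's setup.
alpha_init, beta_init = np.ones(C), np.zeros(C)
# Assumed initialization: a random orthogonal matrix obtained from QR.
w_init, _ = np.linalg.qr(rng.normal(size=(C, C)))

y, log_det_init = invertible_conv_forward(x, alpha_init, beta_init, w_init)
x_rec = invertible_conv_inverse(y, alpha_init, beta_init, w_init)

# After some (hypothetical) training the parameters move away from identity.
alpha_t = np.array([1.2, -0.7, 0.9, 2.0])
beta_t = np.array([0.3, 0.0, -0.1, 0.5])
y_t, log_det_t = invertible_conv_forward(x, alpha_t, beta_t, w_init)
x_rec_t = invertible_conv_inverse(y_t, alpha_t, beta_t, w_init)
```

With this initialization the layer starts as a volume-preserving map, so its log-determinant is zero until training changes $\alpha_{c}$ or $\mathbf{W}$.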

Figure 3a illustrates one step of our flow. We adopt the common design of a flow step Kingma and Dhariwal (2018); Hoogeboom et al. (2019); Truong_EUNVP in our design. Our proposal can be easily integrated into the multi-scale architecture designed by Dinh et al. (2017) (Figure 3b). With our proposal, we can generalize the invertible $1\times 1$ convolution to the invertible $n\times n$ convolution through the shift function $\mathcal{S}$. It can help to encourage the filters to learn a more efficient data representation and embed more useful latent features than the invertible $1\times 1$ convolution used in Glow Kingma and Dhariwal (2018). In addition, we use fewer parameters and have a shorter inference time than the standard $n\times n$ convolutions.
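The parameter saving can be made explicit with a short count. This is a back-of-the-envelope derivation from the shapes given in Sections 3.2 and 4.1, ignoring any bias terms of the standard convolution and assuming $D=C$ (required for invertibility); the values $C=48$ and $n=3$ are illustrative only.

$$\begin{aligned}\text{standard } n\times n \text{ convolution:}\quad & D\times C\times K=C^{2}n^{2}=48^{2}\times 9=20736\\ \text{shift function } \mathcal{S} \text{ plus } 1\times 1 \text{ convolution:}\quad & 2C+C^{2}=96+2304=2400\end{aligned}$$

Under these assumptions the proposed layer needs roughly $1/n^{2}$ of the weights of a standard convolution, plus $2C$ scale and translation parameters.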

5 Experiments

In this section, we present our experimental results on CIFAR-10, ImageNet and Celeb-HQ datasets. Firstly, in Section 5.1, we compare log-likelihood against the previous flow-based models, that is, RealNVP Dinh et al. (2017), Glow Kingma and Dhariwal (2018) and Emerging Convolution Hoogeboom et al. (2019). Finally, in Section 5.2, we show our qualitative results trained on the Celeb-HQ dataset.

5.1 Quantitative Experiments

Datasets and Metric: We evaluate our invertible $n\times n$ convolution on CIFAR-10 (Figure 4a) and ImageNet (Figure 4b) with $32\times 32$ and $64\times 64$ image sizes. We use bits per dimension as the criterion with which to evaluate models. We compare our method against RealNVP Dinh et al. (2017), Glow Kingma and Dhariwal (2018) and Emerging Convolution Hoogeboom et al. (2019). We adopt the network structures of Glow and replace all invertible $1\times 1$ convolutions of Glow with our invertible $n\times n$ convolutions. For the data preprocessing, we follow the same process as in RealNVP Dinh et al. (2017).
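For readers unfamiliar with the metric, bits per dimension is the negative log-likelihood of an image converted from nats to bits and divided by the number of dimensions. The sketch below shows only this standard conversion; the example likelihood value is made up, and any constant offsets introduced by a particular preprocessing pipeline (for instance, rescaling pixel values) are not modeled here.

```python
import math


def bits_per_dimension(nll_nats, num_dimensions):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dimensions * math.log(2.0))


# A 32x32 RGB image has 3 * 32 * 32 dimensions.
dims_32 = 3 * 32 * 32

# Made-up example: an image with a negative log-likelihood of 8500 nats.
example_bpd = bits_per_dimension(8500.0, dims_32)

# Going the other way: total nats per image that correspond to 3.96 bits/dim.
nats_at_3_96 = 3.96 * dims_32 * math.log(2.0)
```

Lower values are better, since they mean the model assigns higher likelihood to the test images.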

Network Configurations: In the CIFAR experiment, the depth of flow $K$ and the number of levels $L$ are set to $32$ and $3$, respectively. In the ImageNet experiments, the depth of flow is set to $48$, and the numbers of levels for ImageNet $32\times 32$ and ImageNet $64\times 64$ are set to $3$ and $4$, respectively. We use the Adam optimizer Kingma and Ba (2015) to optimize the networks, with the batch size and learning rate set to $64$ (per GPU) and $0.001$, respectively. We choose the Normal Distribution as the prior distribution, $p_{\mathcal{Z}}(\mathbf{z})=\mathcal{N}(\mathbf{z};0,\mathbf{I})$, in all experiments.

The shift function $\mathcal{S}$ is not invertible if $\alpha_{c}=0$ for any $c\in[1...C]$. Hence, at the start of training, we initialize $\alpha_{c}=1\text{ and }\beta_{c}=0$ ( $1\leq c\leq C$ ). During learning, we keep every $\alpha_{c}$ ( $1\leq c\leq C$ ) different from zero to guarantee that the shift function $\mathcal{S}$ is invertible and that the Jacobian determinant remains tractable. Training models on high-dimensional data requires large memory. To be able to train with a large batch size, we trained the models in a distributed fashion on four GPUs simultaneously via the Horovod (https://github.com/horovod/horovod) and TensorFlow (https://tensorflow.org) frameworks.

Results: Table 5.1 shows our experimental results. Our proposal improves the generative models on the ImageNet $32\times 32$ and ImageNet $64\times 64$ datasets, which are more challenging than CIFAR-10. In particular, our proposed method achieves the best results among the compared methods, with 3.96 and 3.74 bits per dimension on ImageNet $32\times 32$ and ImageNet $64\times 64$, respectively. In comparison, Emerging Convolution Hoogeboom et al. (2019) and Glow achieve the same results on both the ImageNet $32\times 32$ and ImageNet $64\times 64$ benchmarks, namely 4.09 and 3.81, respectively. Meanwhile, the corresponding results of RealNVP on these benchmarks are 4.28 and 3.98, respectively. As shown by the results, our proposed invertible $n\times n$ convolution provides a better generative capability than the stand-alone invertible $1\times 1$ convolution. Since Emerging Convolution uses invertible auto-regressive convolutions, our proposal is less complicated and has faster inference than Emerging Convolution. On the CIFAR-10 benchmark, our model (3.50) does not perform as well as Glow Kingma and Dhariwal (2018) (3.35) and Emerging Convolution Hoogeboom et al. (2019) (3.34); nevertheless, we find it interesting that our method obtains competitive results with a small number of modifications. The gap in performance is partially caused by the small amount of CIFAR-10 data, which is insufficient for training a well-generalized convolution.

Comparative results (bits per dimension, lower is better) of the proposed invertible $n\times n$ convolution compared to RealNVP, Glow and Emerging Convolution.

| Models | CIFAR-10 | ImageNet $32\times 32$ | ImageNet $64\times 64$ |
|---|---|---|---|
| RealNVP | 3.49 | 4.28 | 3.98 |
| Glow | 3.35 | 4.09 | 3.81 |
| Emerging Conv | 3.34 | 4.09 | 3.81 |
| Ours | 3.50 | 3.96 | 3.74 |

5.2 Qualitative Experiments

The CelebA-HQ dataset Liu et al. (2015) was selected to train the model using the architectures defined in the previous section at a higher resolution ( $256\times 256$ image sizes). The depth of flow $K$ and the number of levels $L$ were set to $32$ and $6$, respectively. Since high-dimensional data requires large memory, we reduced the batch size to $1$ (per GPU) and trained on eight GPUs. The qualitative experiment aims to study how well the model scales up to high-resolution images, synthesizes realistic images, and provides a meaningful latent space. Figure 4c shows examples from the CelebA-HQ dataset. We trained our model on 5-bit images in order to improve visual quality with a slight trade-off in color fidelity. As shown by the synthetic images in Figure 5, our model can generate realistic high-dimensional images.

6 Conclusions and Future Work

This paper has presented a novel invertible $n\times n$ convolution approach. By reformulating the convolution layer, we propose to use the shift function to shift inputs instead of kernels. We show that our shift function is invertible and that its Jacobian determinant is tractable. The method combines the shift function and the invertible $1\times 1$ convolution to generalize to the invertible $n\times n$ convolution. In our experiments, the proposal achieves the best bits per dimension among the compared flow-based models on ImageNet and is able to generate realistic images at high resolution.

Several challenges remain to be addressed in future work. First, when the model scales up to high-resolution images, training requires a large amount of GPU memory, in particular for back-propagation. Second, maintaining the rotation-matrix property of the invertible $1\times 1$ convolution weights when training on a large dataset is challenging, since the stochastic gradient updates of back-propagation can easily drive the weight matrix toward a non-invertible (singular) matrix. Addressing this issue is an interesting direction for future work.
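The invertibility concern can be illustrated with a Glow-style invertible $1\times 1$ convolution, whose log-determinant for an $H\times W$ feature map is $H\cdot W\cdot\log|\det\mathbf{W}|$. The sketch below is a minimal NumPy illustration under stated assumptions: the weight is initialized as a random rotation (orthogonal) matrix via QR decomposition, and the condition-number threshold used to flag a near-singular weight is an arbitrary illustrative value. None of the function names come from the paper.

```python
import numpy as np


def init_rotation(channels: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((channels, channels)))
    return q


def conv1x1_forward(x: np.ndarray, w: np.ndarray):
    """Apply w to the channel vector at every pixel of x, shaped (H, W, C).

    Returns the output and the log-determinant of the Jacobian,
    H * W * log|det w|.
    """
    height, width, _ = x.shape
    z = x @ w.T
    _, logabsdet = np.linalg.slogdet(w)
    return z, height * width * logabsdet


def conv1x1_inverse(z: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Invert the 1x1 convolution by applying the inverse weight matrix."""
    return z @ np.linalg.inv(w).T


def is_safely_invertible(w: np.ndarray, max_cond: float = 1e6) -> bool:
    """Flag weights that have drifted close to singular (threshold is illustrative)."""
    return bool(np.linalg.cond(w) < max_cond)


if __name__ == "__main__":
    w = init_rotation(channels=3)
    x = np.random.default_rng(1).standard_normal((4, 4, 3))
    z, logdet = conv1x1_forward(x, w)
    x_rec = conv1x1_inverse(z, w)
    print("max reconstruction error:", np.abs(x - x_rec).max())
    print("log-determinant at initialization:", logdet)
    print("safely invertible:", is_safely_invertible(w))
```

A rotation matrix has $|\det\mathbf{W}|=1$, so the log-determinant starts near zero. Monitoring the condition number during training is one possible way to detect when gradient updates have moved the weight toward a singular matrix.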

Author Contributions

Conceptualization: T.-D.T. and C.N.D. Methodology: T.-D.T., C.N.D. and K.L. Review and Editing: K.L., M.-T.T., and N.L. Supervision: K.L. and M.-T.T. All authors have read and agreed to the published version of the manuscript.

Data Availability

The CIFAR dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html, the ImageNet dataset at https://image-net.org/, and the CelebA-HQ dataset at https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
3. Sun, S.; Pang, J.; Shi, J.; Yi, S.; Ouyang, W. Fish Net: A Versatile Backbone for Image, Region, and Pixel Level Prediction. Adv. Neural Inf. Proc. Syst. 2018, 760–770.
4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
5. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
6. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
7. Luu, K.; Seshadri, K.; Savvides, M.; Bui, T.; Suen, C. Contourlet Appearance Model for Facial Age Estimation. In International Joint Conference on Biometrics (IJCB), 2011; pp. 1–7.
8. Le, H.; Seshadri, K.; Luu, K.; Savvides, M. Facial Aging and Asymmetry Decomposition Based Approaches to Identification of Twins. Journal of Pattern Recognition 2015, 48, 3843–3856.
