Prediction and Sampling with Local Graph Transforms for Quasi-Lossless Light Field Compression
Mira Rizkallah, Thomas Maugey, Christine Guillemot

TL;DR
This paper introduces sampling and prediction schemes with local graph transforms to efficiently compress light fields by capturing both local and long-term dependencies, achieving near lossless results.
Contribution
It proposes novel sampling and prediction methods with local graph transforms that effectively exploit long-range dependencies in high-dimensional light field data.
Findings
High compression efficiency demonstrated for light fields.
Effective exploitation of long-term dependencies beyond local support.
Suitable for quasi-lossless light field compression.
| Light Fields | Energy percentage in predicted bands (non separable case) | Energy percentage in predicted bands (separable case) |
|---|---|---|
| Flower 2 | 99.15 % | 99.02 % |
| Cars | 99.27 % | 99.34 % |
| Rock | 98.63 % | 98.45 % |
| Seahorse | 99.17 % | 98.73 % |
| Stone Pillars Inside | 98.90 % | 98.26 % |
| Friends | 99.76 % | 99.80 % |
| Light Fields | HEVC-Intra coding of set of reference samples | Entropy coding of |
|---|---|---|
| Flower 2 | 1.23 Mbits | 1.59 Mbits |
| Cars | 1.45 Mbits | 1.64 Mbits |
| Rock | 1.31 Mbits | 1.49 Mbits |
| Seahorse | 0.74 Mbits | 1.38 Mbits |
| Stone Pillars Inside | 1.92 Mbits | 1.84 Mbits |
| Friends | 1.86 Mbits | 2.07 Mbits |
| Light Fields | HEVC-Intra of reference view | Entropy coding of |
|---|---|---|
| Flower 2 | 1.0 Mbits | 1.48 Mbits |
| Cars | 1.06 Mbits | 1.54 Mbits |
| Rock | 1.02 Mbits | 1.38 Mbits |
| Seahorse | 0.72 Mbits | 1.22 Mbits |
| Stone Pillars Inside | 1.81 Mbits | 1.66 Mbits |
| Friends | 1.62 Mbits | 2.04 Mbits |
| Light Fields | HEVC-Inter (QP=0) Raster Scan | Non Separable Scheme (Q = 0.5) | Non Separable Scheme (Q = 1) | Separable Scheme (Q = 1) |
|---|---|---|---|---|
| Flower 2 | 3.3129 bpp (54.2033 dB) | 2.4470 bpp (60.4656 dB) | 2.4457 bpp (52.9393 dB) | 2.4799 bpp (55.1969 dB) |
| Cars | 3.6688 bpp (54.0812 dB) | 2.7759 bpp (60.5035 dB) | 2.7801 bpp (53.0268 dB) | 2.6258 bpp (55.2009 dB) |
| Rock | 3.2700 bpp (53.7601 dB) | 2.0423 bpp (60.2994 dB) | 2.0545 bpp (52.6230 dB) | 2.0162 bpp (54.7765 dB) |
| Seahorse | 2.4751 bpp (54.3804 dB) | 1.8224 bpp (60.4474 dB) | 1.7849 bpp (53.0111 dB) | 1.9762 bpp (55.2844 dB) |
| Stone Pillars Inside | 4.9017 bpp (52.1036 dB) | 2.5559 bpp (59.7134 dB) | 1.5269 bpp (52.3953 dB) | 3.3094 bpp (55.0022 dB) |
| Friends | 3.5400 bpp (52.7986 dB) | 1.9327 bpp (59.7657 dB) | 1.9311 bpp (52.4402 dB) | 2.4436 bpp (54.8196 dB) |
Prediction and Sampling with Local Graph Transforms for Quasi-Lossless Light Field Compression
Mira Rizkallah, Thomas Maugey, and Christine Guillemot
INRIA Rennes Bretagne Atlantique, Rennes, France
This work has been supported by the EU H2020 Research and Innovation Programme under grant agreement No 694122 (ERC advanced grant CLIM).
Abstract
Graph-based transforms have been shown to be powerful tools in terms of image energy compaction. However, when the support increases to best capture signal dependencies, the computation of the basis functions becomes rapidly untractable. This problem is in particular compelling for high dimensional imaging data such as light fields. The use of local transforms with limited supports is a way to cope with this computational difficulty. Unfortunately, the locality of the support may not allow us to fully exploit long term signal dependencies present in both the spatial and angular dimensions in the case of light fields. This paper describes sampling and prediction schemes with local graph-based transforms enabling to efficiently compact the signal energy and exploit dependencies beyond the local graph support. The proposed approach is investigated and is shown to be very efficient in the context of spatio-angular transforms for quasi-lossless compression of light fields.
Keywords Light Fields, Energy Compaction, Transform coding, Super-rays, Graph Fourier Transform, Prediction, Sampling
1 Introduction
A graph is a useful tool to describe intrinsic image structures. The graph can then be used as a support for defining and computing de-correlation transforms, which is a critical step in image compression. Fourier-like transforms, called graph Fourier transforms (GFT) [1], and many variants [2, 3, 4, 5, 6, 7] have been shown to be powerful tools for coding piecewise smooth and natural 2D images. An interesting review can be found in [8]. However, when the dimension of the signal increases, the dimension of the graph increases and the computation of the GFT basis functions rapidly becomes intractable. This is obviously the case for light field data, which makes a complete graph connecting all light rays unsuitable for this task.
To cope with this difficulty, we consider instead local transforms with limited supports. In order to take into account the scene geometry, the support of the graph is defined by super-rays. Super-rays were first introduced in [9] as an extension of super-pixels to the 3D domain, grouping light rays coming from the same 3D object, i.e. grouping pixels that have similar color values and are spatially close in 3D space. While the locality of the support allows us to reduce the computational complexity of the basis functions, it does not allow us to capture long-term spatial dependencies of the signal, unlike the efficient predictive schemes used in state-of-the-art codecs (e.g. HEVC). In particular, the correlation between different super-rays is not exploited.
In this paper, we introduce sampling and prediction schemes to exploit correlation beyond the limits of the local graph transform support. More precisely, based on graph sampling theory, the proposed methods take advantage of the good energy compaction property of graph transforms on local supports, i.e. with a limited complexity, while benefiting from well-established and powerful prediction mechanisms in the pixel domain. The idea is to first sample the light field data and to encode these reference samples with any image coder having powerful Intra prediction mechanisms. The local graph transform is then computed, but only its high frequency coefficients are coded and transmitted. We derive the equations allowing us to recover the low frequency coefficients of the local graph transform from its coded high frequency coefficients and from the encoded reference samples. The encoding of these reference samples is a way to efficiently encode the low frequency coefficients containing most of the light field energy, using Intra prediction mechanisms of state-of-the-art coders. In the experiments, we used HEVC Intra (HM 16.10).
In this general framework, one key question to address is the best choice of the reference samples. The most natural way would be to take all the pixels of a reference view. However, due to matrix conditioning problems, which we discuss later in the paper, the recovered low frequencies are, in that case, very sensitive to the quantization of the high frequency coefficients. In order to overcome this issue, we sample the graph in each super-ray, across views, and project the samples into one reference image. Although this approach gives good performance in terms of energy compaction and quasi-lossless compression of the light field, using a complete graph per super-ray still suffers from complexity limitations and high sensitivity to the quantization noise present in the high frequency coefficients.
To further decrease the basis function computational complexity, we then consider separable local graph transforms applying first a spatial followed by an angular transform. Unlike in the non separable case, the prediction equations do not suffer from numerical instabilities and from the presence of quantization noise in the high frequencies. The reference samples can thus be taken from a light field view. This second approach keeps the advantages of both the reduced basis function computational complexity due to the limited support and of the structured set of reference samples (one entire view) that can be easily coded with intra-prediction mechanisms. It however keeps only in part the advantage of the energy compaction of the graph transform since the recovered frequencies do not necessarily correspond to the low frequencies. This second approach is partly based on the spatio-angular prediction scheme described in [10] for non separable spatio-angular graph transforms.
The proposed methods can be seen as graph-based prediction schemes deriving low frequency spatio-angular coefficients from one single compressed reference image (e.g. the projected set of reference samples in the non-separable case, or the top-left view in the separable case) and from the high frequency coefficients. The methods have been assessed in the context of quasi-lossless encoding of light fields. Experimental results show that, when coupled with a powerful intra-prediction tool, the graph-based spatio-angular prediction brings a substantial gain in bitrate reaching almost .
2 Related work
In this section, we first review the basics of graph-based transforms and of graph sampling theory. We then give a quick overview of the approaches considered so far for light field compression.
2.1 Graph transforms
A graph has been shown to be a useful tool to describe the intrinsic image structure, hence to capture correlation, which is necessary for image compression. An interesting review of graph spectral image processing can be found in [8].
For image compression, the signal is defined on an undirected connected graph which consists of a finite set of vertices corresponding to the pixels. A set of edges connects each pixel to its 4 nearest neighbors in the spatial domain. By encoding pixel similarities into the weights associated with the edges, the undirected graph encodes the image structure. A Fourier-like transform for graph signals called graph Fourier transform (GFT) [1] and many variants [2, 3, 4, 5, 6, 7] have been used as adaptive transforms for coding piecewise smooth and natural images.
A spectrum of graph frequencies can be defined through the eigen-decomposition of the graph Laplacian matrix defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is a diagonal degree matrix whose $i$-th diagonal element $D_{ii}$ is equal to the sum of the weights of all edges incident to node $i$. The matrix $\mathbf{A}$ is the adjacency matrix with entries $A_{mn} > 0$ if there is an edge between two vertices $m$ and $n$, and $A_{mn} = 0$ otherwise.
The Laplacian matrix is symmetric positive semi-definite and can therefore be diagonalized as:
$$\mathbf{L} = \mathbf{U}^\top \boldsymbol{\Lambda} \mathbf{U}$$
where $\mathbf{U}$ is the matrix whose rows are the eigenvectors of the graph Laplacian and $\boldsymbol{\Lambda}$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues. The eigenvectors of the Laplacian of the graph are analogous to the Fourier bases in the Euclidean domain and allow representing the signals residing on the graph as a linear combination of eigenfunctions akin to Fourier analysis [1]. This is known as the graph Fourier transform.
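As an illustration of the definitions above, the following minimal sketch builds the Laplacian of a small unweighted 4-connected grid and computes its graph Fourier transform with NumPy. The helper names (`grid_laplacian`, `gft_basis`), the unit edge weights and the toy signal are assumptions made for this example only. Note that `numpy.linalg.eigh` returns the eigenvectors as columns, so the matrix whose rows are the eigenvectors is the transpose of the returned matrix.

```python
import numpy as np

def grid_laplacian(h, w):
    """Laplacian L = D - A of an unweighted 4-connected h x w pixel grid."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:                      # edge to the right neighbor
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < h:                      # edge to the bottom neighbor
                A[i, i + w] = A[i + w, i] = 1.0
    D = np.diag(A.sum(axis=1))                 # degree matrix
    return D - A

def gft_basis(L):
    """Eigenvalues in ascending order and eigenvectors (as columns) of L."""
    return np.linalg.eigh(L)

L = grid_laplacian(3, 4)
lam, U = gft_basis(L)

x = np.arange(12, dtype=float)   # toy signal living on the 12 vertices
x_hat = U.T @ x                  # forward graph Fourier transform
x_rec = U @ x_hat                # inverse graph Fourier transform
```

The eigen-decomposition costs on the order of the cube of the number of vertices, which is why the transform is only applied on local supports in the rest of the paper.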
2.2 Graph Sampling
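Let us consider a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ made of $N$ vertices associated with a Laplacian $\mathbf{L}$. It has a complete set of eigenvalues $\{\lambda_1 \le \dots \le \lambda_N\}$ and eigenvectors $\{\mathbf{u}_1, \dots, \mathbf{u}_N\}$. A graph signal $\mathbf{f}$ is bandlimited and has a bandwidth $\omega = \lambda_K$ if it can be expressed as a linear combination of only the first $K$ eigenvectors of $\mathbf{L}$. The space of $\omega$-bandlimited signals is called a Paley-Wiener space and is denoted as $PW_\omega(\mathcal{G})$.
A subset of vertices $\mathcal{S} \subset \mathcal{V}$ is a uniqueness set [11] for signals in $PW_\omega(\mathcal{G})$ if $\forall\, \mathbf{f}, \mathbf{g} \in PW_\omega(\mathcal{G}),\ \mathbf{f}(\mathcal{S}) = \mathbf{g}(\mathcal{S}) \Rightarrow \mathbf{f} = \mathbf{g}$. It is also shown that $\mathcal{S}$ is a uniqueness set for all signals $\mathbf{f} \in PW_\omega(\mathcal{G})$ if and only if $\{\mathbf{u}_1(\mathcal{S}), \dots, \mathbf{u}_K(\mathcal{S})\}$ are linearly independent, where $\lambda_1$ is the smallest eigenvalue of $\mathbf{L}$ and $\mathbf{u}_i(\mathcal{S})$ is a reduced eigenvector. The term reduced implies taking the rows of the eigenvectors corresponding to the indices of the sampling set [12]. It can also be shown that for any minimum uniqueness set $\mathcal{S}_K$ of size $K$ for signals in $PW_{\lambda_K}(\mathcal{G})$, there is always at least one node $i \notin \mathcal{S}_K$ such that $\mathcal{S}_K \cup \{i\}$ is a uniqueness set of size $K+1$ for signals in $PW_{\lambda_{K+1}}(\mathcal{G})$ [12]. This property will be useful for iteratively selecting the set of reference samples from the input light field data.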
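After building a uniqueness set, a simple way to reconstruct the missing samples is to solve a least-squares problem in the spectral domain [11]. Observing that the signal can be written as
$$\begin{bmatrix} \mathbf{f}(\mathcal{S}) \\ \mathbf{f}(\mathcal{S}^c) \end{bmatrix} = \begin{bmatrix} \mathbf{u}_1(\mathcal{S}) & \cdots & \mathbf{u}_K(\mathcal{S}) \\ \mathbf{u}_1(\mathcal{S}^c) & \cdots & \mathbf{u}_K(\mathcal{S}^c) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_K \end{bmatrix} = \begin{bmatrix} \mathbf{U}_{\mathcal{S}K} \\ \mathbf{U}_{\mathcal{S}^cK} \end{bmatrix} \boldsymbol{\alpha}$$
the vector $\boldsymbol{\alpha}$ can be retrieved by searching for the least-squares solution of the upper part of the above system as:
$$\boldsymbol{\alpha}^{*} = \left(\mathbf{U}_{\mathcal{S}K}^\top \mathbf{U}_{\mathcal{S}K}\right)^{-1} \mathbf{U}_{\mathcal{S}K}^\top \mathbf{f}(\mathcal{S})$$
then the missing samples are reconstructed as follows:
$$\mathbf{f}(\mathcal{S}^c) = \mathbf{U}_{\mathcal{S}^cK} \left(\mathbf{U}_{\mathcal{S}K}^\top \mathbf{U}_{\mathcal{S}K}\right)^{-1} \mathbf{U}_{\mathcal{S}K}^\top \mathbf{f}(\mathcal{S}) \tag{12}$$
where the columns of $\mathbf{U}_{\mathcal{V}K}$ are the first $K$ eigenvectors of the Laplacian $\mathbf{L}$, and $\mathbf{U}_{\mathcal{S}K}$ and $\mathbf{U}_{\mathcal{S}^cK}$ denote its rows indexed by $\mathcal{S}$ and $\mathcal{S}^c$ respectively.
In the special case where $\mathcal{S}$ is of size $K$ ($\mathcal{S}$ is therefore a minimum uniqueness set [12] for signals in $PW_{\lambda_K}(\mathcal{G})$), $\mathbf{U}_{\mathcal{S}K}$ is a square invertible matrix. Equipped with the aforesaid arguments, the formulation in Equation 12 can be further simplified to:
$$\mathbf{f}(\mathcal{S}^c) = \mathbf{U}_{\mathcal{S}^cK}\, \mathbf{U}_{\mathcal{S}K}^{-1}\, \mathbf{f}(\mathcal{S})$$
While the aforementioned sampling theorem [11] has been proposed for band-limited signals, we extend those equations to our problem in the following section. More precisely, we deal with signals (i.e. color signals) that are not necessarily band-limited on the underlying graph supports (i.e. super-rays).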
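The least-squares reconstruction recalled in this section can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: the graph is a 6-vertex path, the bandwidth is set by `K = 3`, the sampling set `S` is picked by hand, and the function name `reconstruct_bandlimited` is hypothetical.

```python
import numpy as np

def reconstruct_bandlimited(U_K, sample_idx, f_samples):
    """Recover a signal lying in the span of the columns of U_K from its
    values on the vertices listed in sample_idx (least-squares fit)."""
    U_SK = U_K[sample_idx, :]                       # reduced eigenvectors
    alpha, *_ = np.linalg.lstsq(U_SK, f_samples, rcond=None)
    return U_K @ alpha                              # values on all vertices

# Path graph with 6 vertices (assumed toy support).
n, K = 6, 3
A = np.diag(np.ones(n - 1), 1)
A = A + A.T
L = np.diag(A.sum(axis=1)) - A
_, U = np.linalg.eigh(L)
U_K = U[:, :K]                                      # first K eigenvectors

f = U_K @ np.array([2.0, -1.0, 0.5])                # a K-bandlimited signal
S = [0, 2, 5]                                       # hand-picked sampling set
f_rec = reconstruct_bandlimited(U_K, S, f[S])
```

Because `S` has exactly `K` vertices and the reduced eigenvectors are linearly independent here, the least-squares solution coincides with the direct inversion of the square sub-matrix.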
2.3 Light field compression
The availability of commercial light field cameras has given momentum to the development of light field compression algorithms. Many solutions proposed so far adapt standardized image and video compression solutions (in particular HEVC) to light field data. This is the case e.g. in [13, 14, 15, 16], where the authors extend HEVC intra coding modes by adding new prediction modes to exploit similarity between lenslet images. This is also the case in [17, 18, 19], where the views are encoded as pseudo video sequences using HEVC or the latest JEM software, or in [20] where HEVC is extended for coding an array of views.
Low rank models as well as local Gaussian mixture models in the 4D ray space are proposed in [21], [22] and [23] respectively. View synthesis based predictive coding has also been investigated in [24], where the authors use a linear approximation computed with Matching Pursuit for disparity based view prediction. The authors in [25] and [26] use instead the convolutional neural network (CNN) architecture proposed in [27] for view synthesis and prediction. The prediction residue is then coded using HEVC [25], or using local residue transforms (SA-DCT) and coding [26]. The authors in [28] use a depth based segmentation of the light field into 4D spatio-angular blocks with prediction followed by JPEG-2000. View synthesis followed by predictive coding is the approach followed in JPEG-Pleno [29]. While all prior work mentioned above has been dedicated to lossy compression, much less effort has been dedicated to lossless coding of light fields. One can however mention the approach proposed in [30] using differential prediction.
3 Super-ray based graph transforms
Let us consider the 4D representation of light fields proposed in [31] and [32], which describes the radiance along rays by a function of two angular and two spatial coordinates. Based on this representation, the light field can be regarded as an array of views indexed by their angular positions, each view being composed of pixels indexed by their spatial coordinates. In the sequel, to denote one view, we will use either the pair of angular indices or a single index $v$ to simplify the notation.
Such light fields represent very large volumes of high dimensional data. If we consider one unique graph connecting all light rays spatially and across views for the entire light field, the graph, and in particular the computation of its basis functions, rapidly becomes intractable. To overcome this difficulty, we consider here graphs with limited supports defined by super-rays in order to follow the scene geometry.
3.1 Super-ray construction
The concept of super-ray was initially introduced in [9] as an extension of super-pixels to address the computational complexity issue in light field image processing tasks. The term super-pixel, first coined in [33], can be seen as the clustering of image pixels into a set of perceptually uniform regions. Similarly, super-rays can be seen as the clustering of rays of the light field within and across views, hence corresponding to the same set of 3D points of the imaged scene.
To construct super-rays, we proceed as follows. We first compute super-pixels in the top-left view using the SLIC algorithm [34] as well as its disparity map using the method in [35]. An example of an original image with its segmentation is shown in Fig 1.
Then, using the disparity map, we compute the median disparity per super-pixel and use this median disparity to project the segmentation labels to all the other views. The algorithm proceeds row by row. In the first row of views, we perform horizontal projections from the top-left view to the views next to it. For each other row of views, a vertical projection is first carried out from the top-left view to recover the segmentation on the first view of that row, then horizontal projections from this view to the other views of the row are performed. At the end of each projection, some labels are projected in all the views without interfering with others. Those typically represent flat regions inside objects. Others, mainly consisting of occluded and occluding segments, end up superposed in some views. In this case, the occluded pixels are assigned the label of the neighboring super-ray corresponding to the foreground objects (i.e. having the higher disparity). As for appearing pixels, they are clustered with the background super-rays (i.e. having the lower disparity).
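A single horizontal projection step of this kind can be sketched as below. This is a simplified illustration under several assumptions that are not specified in the text: the function name `project_labels` is hypothetical, the shift is the rounded median disparity multiplied by the view offset `du`, the sign convention of the shift is arbitrary, and appearing (disoccluded) pixels are simply marked with `-1` instead of being merged with a background super-ray.

```python
import numpy as np

def project_labels(labels, disparity, du):
    """Shift each super-pixel of the reference view horizontally by its
    median disparity times the view offset du. Foreground segments
    (higher disparity) are written last so that they win over occluded
    ones. Pixels receiving no label are set to -1."""
    h, w = labels.shape
    out = np.full((h, w), -1, dtype=int)
    ids = np.unique(labels)
    med = {k: float(np.median(disparity[labels == k])) for k in ids}
    for k in sorted(ids, key=lambda k: med[k]):      # background first
        rows, cols = np.nonzero(labels == k)
        new_cols = cols + int(round(med[k] * du))
        keep = (new_cols >= 0) & (new_cols < w)      # drop pixels leaving the view
        out[rows[keep], new_cols[keep]] = k
    return out

# Toy reference view: background label 0 (disparity 0) on the left,
# foreground label 1 (disparity 1) on the right.
labels = np.zeros((4, 6), dtype=int)
labels[:, 3:] = 1
disparity = np.where(labels == 1, 1.0, 0.0)
projected = project_labels(labels, disparity, du=-1)
```

In this toy case the foreground segment moves one pixel to the left and covers a background column, while the rightmost column becomes a disoccluded area.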
3.2 Local graph transforms
We will denote the luminance values of all the light rays (i.e. pixels across all the views) in the $k$-th super-ray by the vector $\mathbf{x}_k \in \mathbb{R}^{N_k}$, where $N_k$ is the number of rays in the super-ray. The super-ray is formed by a set of super-pixels (corresponding super-pixels across the different views). The super-pixel of super-ray $k$ in view $v$ will be denoted, in vectorized form, $\mathbf{x}_k^v$.
We build a 4-connected graph inside each super-ray, i.e. each pixel is connected to its 4 nearest neighbors (horizontal and vertical neighbors). The graph transform of the super-ray is defined as
$$\hat{\mathbf{x}}_k = \mathbf{U}_k^\top \mathbf{x}_k$$
where the columns of $\mathbf{U}_k$ are the eigenvectors of the local graph Laplacian inside the super-ray. The inverse graph Fourier transform is then given by
$$\mathbf{x}_k = \mathbf{U}_k \hat{\mathbf{x}}_k$$
However, computing the transform on a local support does not allow us to exploit spatial signal dependencies outside the support, resulting in some loss in compression efficiency. To exploit these dependencies, some form of prediction across super-rays would be needed. Nevertheless, the super-rays being of arbitrary shapes, developing inter super-ray prediction mechanisms is not an easy task. The idea we develop here consists instead in encoding a selected set of samples, using powerful prediction mechanisms available in state-of-the-art coders (e.g. HEVC), and then to recover the low frequency coefficients of the local graph transforms from its coded high frequency coefficients and the encoded reference samples as seen in Fig 2.
4 Super-ray based graph prediction and sampling
4.1 Graph-based prediction
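For a given super-ray, let us denote $\mathcal{V}$ the set of pixel indices in the super-ray and $\mathcal{S} \subset \mathcal{V}$ those belonging to the sampling set. Let $\mathcal{S}^c = \mathcal{V} \setminus \mathcal{S}$ be the set of all other pixel indices of the super-ray. Let $K$ be the cardinal of $\mathcal{S}$. We denote $\hat{\mathbf{x}}_{LF}$ the set of lowest frequency coefficients and $\hat{\mathbf{x}}_{HF}$ the rest of the frequency coefficients.
Due to the high level of correlation between the different pixels forming a super-ray, the energy of the transformed coefficients is highly compacted in the low frequencies $\hat{\mathbf{x}}_{LF}$. However, we might still end up with some non-zero high frequencies $\hat{\mathbf{x}}_{HF}$. If we choose an appropriate uniqueness sampling set $\mathcal{S}$ in the super-ray, then the inverse graph transform is defined under appropriate permutation as
$$\begin{bmatrix} \mathbf{x}_{\mathcal{S}} \\ \mathbf{x}_{\mathcal{S}^c} \end{bmatrix} = \begin{bmatrix} \mathbf{U}_{\mathcal{S},LF} & \mathbf{U}_{\mathcal{S},HF} \\ \mathbf{U}_{\mathcal{S}^c,LF} & \mathbf{U}_{\mathcal{S}^c,HF} \end{bmatrix} \begin{bmatrix} \hat{\mathbf{x}}_{LF} \\ \hat{\mathbf{x}}_{HF} \end{bmatrix}$$
i.e., as
$$\mathbf{x}_{\mathcal{S}} = \mathbf{U}_{\mathcal{S},LF}\, \hat{\mathbf{x}}_{LF} + \mathbf{U}_{\mathcal{S},HF}\, \hat{\mathbf{x}}_{HF}, \qquad \mathbf{x}_{\mathcal{S}^c} = \mathbf{U}_{\mathcal{S}^c,LF}\, \hat{\mathbf{x}}_{LF} + \mathbf{U}_{\mathcal{S}^c,HF}\, \hat{\mathbf{x}}_{HF}$$
where each block is the sub-matrix of the eigenvector matrix restricted to the rows indexed by $\mathcal{S}$ or $\mathcal{S}^c$ and to the columns of the low ($LF$) or high ($HF$) frequency bands. If the signal samples $\mathbf{x}_{\mathcal{S}}$ are transmitted separately, $\mathbf{x}_{\mathcal{S}}$ is available at the decoder. If we impose that $\hat{\mathbf{x}}_{LF}$ contains exactly $K$ coefficients, then $\mathbf{U}_{\mathcal{S},LF}$ is a square invertible matrix. Furthermore, if we only transmit $\hat{\mathbf{x}}_{HF}$, then we are able to recover $\hat{\mathbf{x}}_{LF}$ from the following equation:
$$\hat{\mathbf{x}}_{LF} = \mathbf{U}_{\mathcal{S},LF}^{-1} \left( \mathbf{x}_{\mathcal{S}} - \mathbf{U}_{\mathcal{S},HF}\, \hat{\mathbf{x}}_{HF} \right) \tag{19}$$
Equation (19) is our so-called graph-based spatio-angular prediction. First, $\mathbf{x}_{\mathcal{S}}$ can be seen as a signal composed of a $K$-band-limited part plus some high frequencies. In this equation, we are actually removing the high frequencies to retrieve the band-limited signal (i.e. $\mathbf{x}_{\mathcal{S}} - \mathbf{U}_{\mathcal{S},HF}\, \hat{\mathbf{x}}_{HF}$). Using the least squares reconstruction method in (12), we find the low frequency transformed coefficients $\hat{\mathbf{x}}_{LF}$.
Moreover, the high-frequency coefficients can also be seen as prediction coefficients, transmitted to recover the exact light field at the decoder. The basis of the linear prediction is the graph-transform basis, which makes these coefficients low in energy and thus cheap to transmit.
The signal values at $\mathcal{S}^c$ are then retrieved as
$$\mathbf{x}_{\mathcal{S}^c} = \mathbf{U}_{\mathcal{S}^c,LF}\, \mathbf{U}_{\mathcal{S},LF}^{-1}\, \mathbf{x}_{\mathcal{S}} + \left( \mathbf{U}_{\mathcal{S}^c,HF} - \mathbf{U}_{\mathcal{S}^c,LF}\, \mathbf{U}_{\mathcal{S},LF}^{-1}\, \mathbf{U}_{\mathcal{S},HF} \right) \hat{\mathbf{x}}_{HF}$$
where the first term is equivalent to the $K$-band-limited signal recovered on $\mathcal{S}^c$ and the second term is added in order to take into account the high frequency components.
To be able to carry out our graph-based spatio-angular prediction, we should first determine the appropriate sampling set. More precisely, we want to find the set $\mathcal{S}$ that results in the best conditioning of the sub-matrix $\mathbf{U}_{\mathcal{S},LF}$, which guarantees a small reconstruction error. Simultaneously, we seek a sampling set that can be wrapped onto one single view to be coded with efficient prediction mechanisms.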
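The prediction described in this section can be illustrated on a toy support. The sketch below is a minimal example under stated assumptions: the local graph is an 8-vertex path rather than a super-ray graph, the sampling set `S` is chosen by hand, no quantization is applied, and the function names `encode` and `decode` are hypothetical. The encoder keeps the reference samples and the high frequency coefficients, and the decoder recovers the low frequency coefficients and then the whole signal.

```python
import numpy as np

def path_laplacian(n):
    """Laplacian of an unweighted path graph with n vertices."""
    A = np.diag(np.ones(n - 1), 1)
    A = A + A.T
    return np.diag(A.sum(axis=1)) - A

def encode(U, S, K, x):
    """Return the reference samples x_S and the high frequency coefficients."""
    x_hat = U.T @ x
    return x[S], x_hat[K:]

def decode(U, S, K, x_S, x_hat_HF):
    """Recover the K low frequency coefficients from the reference samples
    and the high frequencies, then rebuild the whole signal."""
    U_S_LF, U_S_HF = U[S, :K], U[S, K:]
    x_hat_LF = np.linalg.solve(U_S_LF, x_S - U_S_HF @ x_hat_HF)
    x_hat = np.concatenate([x_hat_LF, x_hat_HF])
    return x_hat_LF, U @ x_hat

n, K = 8, 3
_, U = np.linalg.eigh(path_laplacian(n))   # eigenvectors as columns
S = [0, 3, 7]                               # hand-picked sampling set, |S| = K
x = np.array([10.0, 11.0, 11.0, 12.0, 30.0, 31.0, 29.0, 30.0])

x_S, x_hat_HF = encode(U, S, K, x)
x_hat_LF, x_rec = decode(U, S, K, x_S, x_hat_HF)
cond = np.linalg.cond(U[S, :K])             # conditioning of the sub-matrix
```

Without quantization the reconstruction is exact. When the high frequency coefficients are quantized, the error on the recovered low frequencies is amplified by the inverse of the sub-matrix, which is why its conditioning (`cond` above) drives the choice of the sampling set.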
4.2 Graph sampling for light fields
A first intuitive way to define the sampling set per super-ray would be to choose the set of pixels that reside in the reference view, which can subsequently be coded with HEVC Intra. In our experiments however, we have found that the resulting sub-matrix $\mathbf{U}_{\mathcal{S},LF}$ is ill-conditioned for non-consistent super-rays. We choose instead to use an adapted version of the algorithm described in [12].
More precisely, for each super-ray, we specify the band-limit frequency as $\omega = \lambda_K$, where $K$ is the number of samples retained for that super-ray. We seek to find the optimal sampling set $\mathcal{S}$ that guarantees the exact reconstruction of any signal in $PW_\omega(\mathcal{G})$. We know that we have a correspondence between the size of the minimum uniqueness set and the signal bandwidth. We therefore want to find a set of $K$ samples. In order to find the vertices that belong to this set, we have to find $K$ linearly independent rows from the matrix formed by the first $K$ eigenvectors of the local Laplacian. We follow the same reasoning as in [12] but with slightly different constraints to adapt it to our coding problem. In summary, the algorithm takes as input the graphs of all super-rays and the number of samples per super-ray. At the output of this stage, we want to wrap all the samples in a reference view to be efficiently coded with HEVC.
Although this method allows an optimal sampling per super-ray, it does not guarantee that the output vector is well structured. Nothing ensures that the samples of neighboring super-rays will be efficiently de-correlated using the intra-prediction mechanisms of an efficient coder. In extreme cases, we might end up with noisy samples that are very difficult to code. We thus propose to wrap our samples into one reference view, taking into account the geometrical information given by our local graph.
We first observe that our graph Laplacian is a sum of two Laplacians, $\mathbf{L} = \mathbf{L}_s + \mathbf{L}_a$: the first one, $\mathbf{L}_s$ (s for spatial), includes the connections inside views, and the other, $\mathbf{L}_a$ (a for angular), is made of edges between pixels in different views. The graph of $\mathbf{L}_a$ is actually composed of various connected components, each one corresponding to a 3D point in the scene.
Using the angular information provided by $\mathbf{L}_a$, we define a correspondence matrix $\mathbf{C}$ where each element gives the correspondence between a pixel of the super-ray in the first view and any other pixel in the super-ray. Consider a pixel $i$ in the first view. If we can access a pixel $j$ from $i$ following the graph connections in $\mathbf{L}_a$, then the entry $C_{ij} = 1$, otherwise $C_{ij} = 0$.
For each sample corresponding to a point $j$, we find the corresponding point $i$ in the set of pixels belonging to the super-ray in the first view, i.e. such that $C_{ij} = 1$. The best case scenario is when each sample has a correspondence to a different pixel in the first view. In this case, the projection is easy following the graph links. In the worst case, more than one sample might have a correspondence with the same point in the first view. In this case, the rule is first found, first served. The others are considered as disocclusions and, like the pixels having no correspondence in the first view, are projected into the remaining available positions. The complete algorithm is summarized in Algorithm 1.
Examples of images obtained after the sampling and projection of the selected sample set on a reference view (considering here the luminance pixel values) are shown in Figure 3. Despite the non-optimality of this method, we have ascertained that the selected set of reference samples can be efficiently compressed with HEVC using lossless compression settings. Once we have the samples in hand, they can be sent as prediction information to the decoder side, instead of sending the low frequency coefficients. This strategy is a way to efficiently compress the low frequency graph transform coefficients containing about 99% of the light field energy, as shown in Table 1 (left column).
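The search for linearly independent rows described in this section can be illustrated with a simple greedy heuristic. The sketch below is not the algorithm of [12] nor the adapted version used in the paper: it is a generic stand-in that, at each step, adds the vertex maximizing the smallest singular value of the selected rows. The function name `greedy_sampling_set`, the 3 x 3 grid support and the value `K = 4` are assumptions for the example, and the wrapping of the samples onto a reference view is not covered.

```python
import numpy as np

def greedy_sampling_set(U_K):
    """Greedily pick K rows of U_K (n x K, first K eigenvectors as columns)
    so that the selected rows stay as well conditioned as possible."""
    n, K = U_K.shape
    selected = []
    for _ in range(K):
        best, best_score = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            rows = U_K[selected + [i], :]
            # Smallest singular value: zero if the rows are dependent.
            score = np.linalg.svd(rows, compute_uv=False).min()
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Laplacian of a 3 x 3 4-connected grid, built as a Kronecker sum of paths.
A = np.diag(np.ones(2), 1)
A = A + A.T
P = np.diag(A.sum(axis=1)) - A
L = np.kron(P, np.eye(3)) + np.kron(np.eye(3), P)
_, U = np.linalg.eigh(L)

K = 4
S = greedy_sampling_set(U[:, :K])
cond = np.linalg.cond(U[S, :K])   # conditioning of the square sub-matrix
```

The exhaustive scoring makes this sketch quadratic in the number of vertices per selected sample, which is only reasonable for small supports.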
5 Prediction based on spatio-angular separable graph transforms
To further decrease the basis function computational complexity, we now consider the case of a separable spatio-angular transform, i.e., applying first a spatial followed by an angular transform, where the set of reference samples is one reference view. We will see that, in addition to a reduced complexity, unlike in the non separable case, the prediction equations do not suffer from numerical instabilities and from the presence of quantization noise in the high frequencies.
5.1 Separable spatio-angular graph transform
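The spatial graph transform coefficients for each spatial graph are obtained by calculating:
$$\hat{\mathbf{x}}_k^v = {\mathbf{U}_{k,s}^{v}}^{\top}\, \mathbf{x}_k^v$$
where the columns of $\mathbf{U}_{k,s}^{v}$ are the eigenvectors of the spatial Laplacian and $\mathbf{x}_k^v$ are the luminance values of the super-ray $k$ in view $v$. Inversely, the luminance values of the pixels belonging to the graph are retrieved from
$$\mathbf{x}_k^v = \mathbf{U}_{k,s}^{v}\, \hat{\mathbf{x}}_k^v$$
For each super-ray $k$, an angular transform is then used to capture similarities between the transformed coefficients of each band $b$ of the spatial transform coefficients $\hat{x}_k^v(b)$ across the views $v$. For that, the spatial-band vector is denoted $\hat{\mathbf{x}}_k^b = [\hat{x}_k^1(b), \dots, \hat{x}_k^{N_v}(b)]^\top$, where $N_v$ is the total number of views. The angular transform coefficients are then obtained by calculating
$$\tilde{\mathbf{x}}_k^b = {\mathbf{U}_{k,a}^{b}}^{\top}\, \hat{\mathbf{x}}_k^b$$
where $\mathbf{U}_{k,a}^{b}$ is a matrix whose columns are the eigenvectors of the Laplacian of the angular graph for the band $b$.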
5.2 Separable graph-based spatio-angular prediction
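Let us assume that view $1$ is coded as a reference. In order to perform the prediction, we follow the same reasoning as in the previous (non separable graph) case but we apply it to each band that exists in view $1$. For a given super-ray $k$, the spatial transform in view $1$ is $\hat{\mathbf{x}}_k^1$ according to the notations introduced before.
We choose one sample for each band. It corresponds to the vertex that is in the reference view (labeled by $1$ in our case). For a given band $b$, the inverse angular transform is defined as
$$\begin{bmatrix} \hat{x}_k^1(b) \\ \hat{\mathbf{x}}_k^{\bar{1}}(b) \end{bmatrix} = \begin{bmatrix} U_{1,LF} & \mathbf{U}_{1,HF} \\ \mathbf{U}_{\bar{1},LF} & \mathbf{U}_{\bar{1},HF} \end{bmatrix} \begin{bmatrix} \tilde{x}_{LF} \\ \tilde{\mathbf{x}}_{HF} \end{bmatrix}$$
i.e., as
$$\hat{x}_k^1(b) = U_{1,LF}\, \tilde{x}_{LF} + \mathbf{U}_{1,HF}\, \tilde{\mathbf{x}}_{HF}, \qquad \hat{\mathbf{x}}_k^{\bar{1}}(b) = \mathbf{U}_{\bar{1},LF}\, \tilde{x}_{LF} + \mathbf{U}_{\bar{1},HF}\, \tilde{\mathbf{x}}_{HF}$$
where the angular eigenvector matrix of band $b$ is partitioned into the row of the reference view (index $1$) and the rows of the other views (index $\bar{1}$), and into its first column ($LF$) and its remaining columns ($HF$), and where $N_b$ denotes the number of views where the band $b$ of the super-ray $k$ is defined, so that $\tilde{x}_{LF}$ is a scalar and $\tilde{\mathbf{x}}_{HF}$ has $N_b - 1$ entries. Since the view $1$ is transmitted separately, $\hat{x}_k^1(b)$ is available at the decoder. If we only transmit $\tilde{\mathbf{x}}_{HF}$, then we are able to retrieve $\tilde{x}_{LF}$ from the following equation:
$$\tilde{x}_{LF} = \frac{1}{U_{1,LF}} \left( \hat{x}_k^1(b) - \mathbf{U}_{1,HF}\, \tilde{\mathbf{x}}_{HF} \right) \tag{26}$$
Equation (26) is our graph-based spatio-angular prediction for the separable case. The spatial coefficients of all the views are then retrieved from the following equation
$$\hat{\mathbf{x}}_k^{\bar{1}}(b) = \mathbf{U}_{\bar{1},LF}\, \tilde{x}_{LF} + \mathbf{U}_{\bar{1},HF}\, \tilde{\mathbf{x}}_{HF}$$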
Once the decoder has recovered the first spatial graph transform coefficients in all the views, it can reconstruct the whole light field by applying a simple spatial inverse GFT since it has access to the graph supports and coefficients.
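The separable prediction for one spatial band can be sketched as follows. This is a minimal illustration under stated assumptions: the angular graph is taken to be a path over 5 views, the reference view is stored at index 0, the coefficient values are made up, and the function name `angular_predict` is hypothetical. For a connected angular graph the first eigenvector is constant, so the scalar used in the division has magnitude $1/\sqrt{N_b}$ and the division is well conditioned.

```python
import numpy as np

def angular_predict(U_b, ref_coeff, hf):
    """Recover the angular DC coefficient of one band from the spatial
    coefficient of the reference view (index 0) and the transmitted
    angular high frequencies, then rebuild the other views."""
    lf = (ref_coeff - U_b[0, 1:] @ hf) / U_b[0, 0]
    others = U_b[1:, 0] * lf + U_b[1:, 1:] @ hf
    return lf, others

# Assumed angular graph: a path connecting 5 views.
n_views = 5
A = np.diag(np.ones(n_views - 1), 1)
A = A + A.T
L_a = np.diag(A.sum(axis=1)) - A
_, U_b = np.linalg.eigh(L_a)                 # eigenvectors as columns

# Made-up spatial coefficients of one band across the 5 views.
band = np.array([100.0, 101.0, 99.0, 102.0, 100.5])
coeffs = U_b.T @ band                        # angular transform of the band

lf, others = angular_predict(U_b, band[0], coeffs[1:])
```

Only `band[0]` (available from the coded reference view) and `coeffs[1:]` are needed at the decoder in this sketch.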
6 Proposed coding schemes
6.1 Coding scheme with the non separable graph transform
Fig. 4 gives an overview of the coding scheme for the non separable case. The top-left view is segmented into super-pixels, i.e. uniform regions, using the SLIC algorithm [34], and its disparity map is estimated with [35]. The disparity values are encoded using a simple arithmetic coder. The segmentation is coded with the arithmetic edge coder (AEC) [36]. Using both the segmentation map and the geometrical (disparity) information, we can build consistent super-rays and graphs in and across all views, as explained above, at both the encoder and decoder sides. Once the local graphs are computed, we can find the optimal sampling sets (their actual positions in the light field and the corresponding luminance values) as explained in Section 4.2. Those samples are reorganized in a reference image coded with HEVC Intra and sent as prediction information to the decoder.
We apply the non separable graph transform on the coded version of the reference image (quasi-lossless coding) and the original values of all other samples to compact their energy in fewer coefficients. Since the reference image is coded with a very small QP, we are almost sure that we are not adding angular incoherence between the different views. Once we have the graph transform coefficients, instead of sending the whole spectrum with simple arithmetic coding, we use the proposed graph-based prediction to derive, at the decoder side, the low frequency spatio-angular coefficients from the coded reference set of samples and the high frequency coefficients.
We thus send, for all super-rays, the AC coefficients, i.e., the () last bands obtained with the non separable graph transform. ( and are the number of pixels belonging to the super-ray and those only residing in view respectively). Specifically, after applying the spatio-angular graph transforms on all super-rays, all frequency coefficients are grouped into a two-dimensional array where is the transformed coefficient for the super-ray . Using the natural scanning order (increasing order of eigenvalues), we assign a class number to each super-ray. For a class , the high frequencies are defined as the last coefficients where is the total number of coefficients. Each super-ray belongs to class if it does not belong to class and the mean energy per high frequency is less than . More precisely, we start by finding the super-rays in the first class then remove them from the search space before finding the other classes, and similarly for the following steps. We code a flag with an arithmetic coder to give the information of the class of super-rays to the decoder side. In class , the last coefficients of each super-ray are discarded. The remaining high frequency spatio-angular coefficients are quantized uniformly with a small step size . They are grouped into uniform groups to be arithmetically encoded.
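The uniform quantization of the high frequency coefficients can be illustrated with the short sketch below. It is a generic mid-tread quantizer given as an assumption, since the text does not detail the quantizer design. The step sizes 0.5 and 1 are the values of Q reported in the results table, the coefficient values are made up, and the function names are hypothetical. The class assignment and the arithmetic coding of the indices are not covered.

```python
import numpy as np

def quantize(coeffs, step):
    """Uniform quantization: map each coefficient to an integer index."""
    return np.round(coeffs / step).astype(int)

def dequantize(indices, step):
    """Reconstruct coefficient values from the quantization indices."""
    return indices * step

# Made-up high frequency coefficients of one super-ray.
hf = np.array([3.2, -0.7, 0.26, 1.49, -2.05, 0.1])

errors = {}
for step in (0.5, 1.0):
    idx = quantize(hf, step)            # integers sent to the entropy coder
    hf_rec = dequantize(idx, step)
    errors[step] = np.max(np.abs(hf - hf_rec))
```

With such a quantizer the error on each transmitted coefficient is at most half the step size, which is consistent with the quasi-lossless regime targeted here.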
6.2 Coding scheme with the separable graph Transform
The major difference between the coding scheme in the separable case (see Fig. 5) and the non separable case resides in the fact that the set of reference samples coded using HEVC-Intra is the top-left view. From this reference view and the corresponding disparity map, which are both transmitted, the decoder can compute the segmentation into super-pixels using the SLIC algorithm, then derive the super-rays used to construct the local graphs, and compute the corresponding local graph transforms.
The spatio-angular high frequency graph transform coefficients are coded as in the previous scheme. Using the received reference view and the high frequency coefficients, the decoder can reconstruct all the views as explained in Section 5.2.
7 Experimental analysis and results
We applied both coding schemes on real light fields captured by plenoptic cameras from the datasets used in [27] and [37]. To avoid the strong vignetting and distortion problems on the views at the periphery of the light field, we only consider the central sub-aperture images cropped to in [27], and cropped to from [37]. Some of the light fields considered are shown in Figure 6. The full set of light fields considered for the test is: Flower2, Cars, Rock and Seahorse from the dataset in [27] and StonePillarInside and Friends from the dataset of ICIP challenge 2017 and used in [37]. The method used to estimate the disparity of the top-left views is described in [35]. Examples of disparity maps provided are shown in Fig. 7. A sparse set of disparity values and the segmentation map of the reference view are computed with SLIC [34], and used to construct super-rays, i.e., the local graph supports as described in Section 3.1. We set the number of super-rays to to cope with the complexity issue of the non separable graph transform and show the effect of our graph prediction (the importance of exploring the long term dependencies between super-rays).
7.1 Non Separable vs Separable Graph Prediction
7.1.1 Energy compaction
As explained before, we aim to compact most of the light field energy into a few coefficients and then to predict these coefficients at the decoder, so that they need not be transmitted. The prediction uses a coded reference image together with the high frequency graph transform coefficients, which must be transmitted but at a small cost since they carry little information. Table 1 gives the percentage of the total energy that resides in the predicted DC spatio-angular bands for both the non separable () and separable () cases. For every test light field, more than 98 % of the energy is compacted in the DC spatio-angular bands, which shows the efficiency of the graph transforms in terms of spatio-angular de-correlation.
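As a minimal sketch of how an energy percentage of the kind reported in Table 1 can be computed, assume the transform coefficients of a light field are available as a flat list and that the indices of the predicted bands are known. The function name and its arguments are illustrative and not taken from the paper.

```python
def energy_percentage(coeffs, predicted_indices):
    """Percentage of the total squared magnitude held by the predicted bands."""
    total = sum(c * c for c in coeffs)
    if total == 0.0:
        return 0.0
    predicted = sum(coeffs[i] * coeffs[i] for i in predicted_indices)
    return 100.0 * predicted / total
```

A value close to 100 means that almost all of the energy sits in coefficients that are predicted rather than transmitted.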
The non separable prediction benefits from the low energy of the high frequency graph transform coefficients, which also need to be coded. The separable graph transform, in some cases, loses this benefit, since we predict the DC angular coefficients (i.e. those obtained after the transform across the views) of all spatial bands. These low angular frequency coefficients may not contain all the energy otherwise captured by the lower spatio-angular frequency coefficients of the non separable case, although the separable transform remains quite efficient in terms of energy compaction, as Table 1 shows.
To further illustrate the energy compaction of the transforms, we plot in Fig. 8, for two different super-rays, the transform coefficients (except the first one, which holds most of the energy) following the coding order (learned order of frequencies) for both the separable and non separable graph transforms. In the non separable case, the low frequencies that are predicted at the decoder side (the red dots) correspond to the first frequencies and thus to those that hold most of the energy. In the separable case, however, the predicted coefficients do not necessarily exhibit the highest energy. This is quite clear in the second example, where the red dots in the separable case are assigned to very low values, and some high values still need to be sent to the decoder side. In the non separable case, the decorrelation is very effective, and very little energy remains in the coefficients shown in the plot.
7.1.2 Compressibility of the reference view
Thanks to the prediction equations introduced in Section 5.2, an efficient encoding of the top-left view in the separable case or the reference view in the non separable case (using any classical encoder with efficient spatial predictors) can be seen as a way to encode those DC spatio-angular frequency coefficients which contain most of the light field energy.
The separable graph transform based prediction takes advantage of the natural structure of the reference view, as can be seen in Fig. 3. This view is thus efficiently coded using intra-prediction tools. For the non separable graph prediction, however, this is not the case, since the optimal sampling does not fully guarantee that the samples are well structured in each super-ray of the 2D reference view. Yet, the super-ray segmentation preserves, to some extent, the natural structure of the reference view.
In our experiments, we choose HEVC-Intra to encode this information (i.e. the top-left view or the reference view). Tables 2 and 3 give the bit rate obtained when encoding the reference view (from which the DC spatio-angular frequency coefficients are derived) with HEVC-Intra (with QP set to [math]). The bit rates are compared with those obtained when a simple arithmetic coder directly encodes the spatio-angular DC coefficients. To apply the arithmetic coder, for each frequency band, we first group all the coefficients of the super-rays in which this band exists, and we code them with an arithmetic coder independently of the other bands. The tables show the rate gain obtained by encoding the set of reference samples with HEVC-Intra, thanks to its ability to capture dependencies between super-rays.
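The per-band grouping used for the arithmetic coding baseline can be sketched as below. The grouping follows the description above; the zero-order empirical entropy is only a rough proxy for the cost of an arithmetic coder and is our assumption, not the coder used in the experiments. Function names are illustrative.

```python
import math
from collections import Counter


def group_by_band(coeffs_per_super_ray):
    """Collect, for each band index, the coefficients of every super-ray
    that is large enough to contain that band."""
    n_bands = max((len(c) for c in coeffs_per_super_ray), default=0)
    bands = [[] for _ in range(n_bands)]
    for coeffs in coeffs_per_super_ray:
        for band, value in enumerate(coeffs):
            bands[band].append(value)
    return bands


def empirical_entropy_bits(symbols):
    """Zero-order entropy in bits per symbol of a list of quantized values."""
    if not symbols:
        return 0.0
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())
```

Each band is treated independently here, which is exactly why this baseline cannot exploit dependencies between neighboring super-rays the way an intra coder can.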
7.1.3 Robustness of the Prediction
In order to assess the efficiency of our prediction and of the light field sampling algorithm in the non separable case, we plot in Fig. 9 the condition number in log base of the matrix for all super-rays in all the datasets. The condition number indicates how sensitive our prediction in Equation (19) is to a small change in $\Big(\mathbf{x}_{k}(\mathcal{S})-\mathbf{U}_{k}(\mathcal{S},\mathcal{T}_{c})\hat{\mathbf{x}}_{k}(\mathcal{T}_{c})\Big)$. The condition numbers are first computed without sampling, i.e. assuming that the reference samples are those in the top-left view; these are shown in red. The results in blue correspond to the condition numbers obtained with the actual samples found by Algorithm 1. A major difference is visible in log scale, where the sampling has reduced the condition number from to a maximum of around . Without sampling, the prediction fails, since a tiny change in the high frequency coefficients (even a small rounding error) can result in a huge loss in reconstruction quality.
For the prediction based on the separable graph transform, no matrix inversion is needed. We only need to invert a number whose minimum corresponds to . This inversion has a smaller impact than the one in the non separable case, which largely explains the PSNR difference between our two schemes for a fixed quantization step size in Table 4.
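To make the role of the condition number concrete, the sketch below computes the 2-norm condition number of a 2x2 matrix in closed form from its singular values. It is a toy stand-in for the matrix inverted in the prediction, whose size depends on the super-ray; for larger matrices a library routine such as `numpy.linalg.cond` would normally be used. The function names are ours.

```python
import math


def cond_2x2(a, b, c, d):
    """2-norm condition number of [[a, b], [c, d]] (ratio of singular values)."""
    s = a * a + b * b + c * c + d * d  # squared Frobenius norm
    det = a * d - b * c
    if det == 0.0:
        return math.inf  # singular matrix: the inversion is not defined
    disc = math.sqrt(max(s * s - 4.0 * det * det, 0.0))
    sigma_max = math.sqrt((s + disc) / 2.0)
    # sigma_max * sigma_min == |det|, which avoids cancellation when the
    # matrix is close to singular.
    sigma_min = abs(det) / sigma_max
    return sigma_max / sigma_min


def log10_cond_2x2(a, b, c, d):
    """Condition number on a base-10 log scale."""
    return math.log10(cond_2x2(a, b, c, d))
```

The relative error of the solution of a linear system can be amplified by up to the condition number, so a matrix such as `[[1, 1], [1, 1.0001]]`, whose rows are almost identical, turns a rounding error on the right-hand side into a much larger reconstruction error.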
7.2 RD Comparison against state of the art coders
We assess the proposed graph-based spatio-angular prediction methods in the context of quasi-lossless light field coding, in comparison with a complete HEVC-based scheme with a QP set to [math] and a GOP size of . The HEVC version used in the tests is HM-16.10. The light fields are coded following a raster scan order, starting with the top-left view as a reference Intra-coded frame. Results are reported in Table 4, where we mainly compare the bit rate needed to code a light field in a quasi-lossless setting (we consider a PSNR higher than dB as quasi-lossless compression). A substantial gain in bit rate is observed while preserving a high quality of the reconstructed light fields. This can be explained by the efficiency of the proposed spatio-angular graph transforms in terms of energy compaction, along with the ability of HEVC-Intra to effectively exploit spatial correlation in the reference view.
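For reference, the PSNR used as the quality criterion can be computed as in the following sketch. It assumes samples stored as flat sequences of equal length and an 8-bit peak value of 255; both are assumptions of this illustration. The quasi-lossless threshold is left as a caller-supplied parameter because its value is not reproduced in the text above.

```python
import math


def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    if mse == 0.0:
        return math.inf  # identical signals: strictly lossless
    return 10.0 * math.log10(peak * peak / mse)


def is_quasi_lossless(original, decoded, threshold_db, peak=255.0):
    """True when the PSNR exceeds a caller-chosen threshold (hypothetical value)."""
    return psnr(original, decoded, peak) > threshold_db
```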
8 Conclusion
In this paper, we have proposed sampling and prediction methods with local graph transforms for light field energy compaction and compression. Based on graph sampling theory, the proposed methods take advantage of the good energy compaction property of the graph transform on local supports with limited complexity, while benefiting from well established and powerful prediction mechanisms in the pixel domain. We considered both a super-ray based non separable graph transform and a simplified, spatio-angular separable version. Two coding schemes have been described, based on the non separable and separable graph transforms respectively. The schemes have been assessed for high quality (quasi-lossless) coding. Both proposed approaches are very efficient when the quantization noise on the reference set of samples is low, hence for quasi-lossless compression. If the reference set of samples is too coarsely quantized, drift and noise amplification may appear during the prediction step. This is because, in Equation (26), the prediction uses the spatial transform coefficients estimated on the reference set of samples available at the decoder side. Further study will be dedicated to addressing this problem in the case of lossy compression.
References
- [1] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
- [2] G. Shen, W.-S. Kim, S. K. Narang, A. Ortega, J. Lee, and H. Wey, “Edge-adaptive transforms for efficient depth map coding,” in Picture Coding Symposium (PCS), 2010. IEEE, 2010, pp. 566–569.
- [3] W. Hu, G. Cheung, X. Li, and O. Au, “Depth map compression using multi-resolution graph-based transform for depth-image-based rendering,” in IEEE International Conference on Image Processing (ICIP), Sept. 2012.
- [4] W.-S. Kim, S. K. Narang, and A. Ortega, “Graph based transforms for depth video coding,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 813–816.
- [5] C. Zhang and D. Florêncio, “Analyzing the optimality of predictive transform coding using graph-based models,” IEEE Signal Processing Letters, vol. 20, no. 1, pp. 106–109, 2013.
- [6] W. Hu, G. Cheung, A. Ortega, and O. C. Au, “Multiresolution graph Fourier transform for compression of piecewise smooth images,” IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 419–433, 2015.
- [7] W.-T. Su, G. Cheung, and C.-W. Lin, “Graph Fourier transform with negative edges for depth image coding,” in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 1682–1686.
- [8] G. Cheung, E. Magli, Y. Tanaka, and M. Ng, “Graph spectral image processing,” Proceedings of the IEEE, vol. 106, no. 5, pp. 907–930, 2018.
