Geometry-Aware Graph Transforms for Light Field Compact Representation
Mira Rizkallah, Xin Su, Thomas Maugey, Christine Guillemot

TL;DR
This paper introduces geometry-aware local graph transforms for 4D light fields, optimizing energy compaction and correlation preservation through a novel transform method, and demonstrates improved rate-distortion performance over existing codecs.
Contribution
It proposes a new geometry-aware transform optimization method for non-isometric super-pixels, enhancing energy compaction in light field compression.
Findings
Improved energy compaction with the proposed transforms.
Enhanced correlation preservation for non-isometric super-pixels.
Better rate-distortion performance compared to HEVC and JPEG Pleno VM 1.1.
Rate allocation (in ) for the opt-separable GBT scheme

| Light Field | Overall bitrate | Segmentation | Disparity | Coefficients |
|---|---|---|---|---|
| Cars | bpp (PSNR = dB) | | | |
| | bpp (PSNR = dB) | | | |
| Flower2 | bpp (PSNR = dB) | | | |
| | bpp (PSNR = dB) | | | |
| Rock | bpp (PSNR = dB) | | | |
| | bpp (PSNR = dB) | | | |
| Seahorse | bpp (PSNR = dB) | | | |
| | bpp (PSNR = dB) | | | |
| Friends | bpp (PSNR = dB) | | | |
| | bpp (PSNR = dB) | | | |
| StonePillarInside | bpp (PSNR = dB) | | | |
| | bpp (PSNR = dB) | | | |
| FountainVincent | bpp (PSNR = dB) | | | |
| | bpp (PSNR = dB) | | | |
Mira Rizkallah∗, Xin Su†, Thomas Maugey†, and Christine Guillemot†
∗ IRISA, Campus Universitaire de Beaulieu, 35042 Rennes, France
† INRIA Rennes Bretagne Atlantique, Rennes, France
This work has been supported in part by the EU H2020 Research and Innovation Programme under grant agreement No 694122 (ERC advanced grant CLIM).
Abstract
The paper addresses the problem of energy compaction of dense 4D light fields by designing geometry-aware local graph-based transforms. Local graphs are constructed on super-rays that can be seen as a grouping of spatially and geometry-dependent angularly correlated pixels. Both non separable and separable transforms are considered. Despite the local support of limited size defined by the super-rays, the Laplacian matrix of the non separable graph remains of high dimension and its diagonalization to compute the transform eigenvectors remains computationally expensive. To solve this problem, we then perform the local spatio-angular transform in a separable manner. We show that when the shape of corresponding super-pixels in the different views is not isometric, the basis functions of the spatial transforms are not coherent, resulting in decreased correlation between spatial transform coefficients. We hence propose a novel transform optimization method that aims at preserving angular correlation even when the shapes of the super-pixels are not isometric. Experimental results show the benefit of the approach in terms of energy compaction. A coding scheme is also described to assess the rate-distortion performance of the proposed transforms and is compared to state-of-the-art encoders, namely HEVC and JPEG Pleno VM 1.1.
Keywords Light Fields, Energy Compaction, Transform coding, Super-rays, Graph Fourier Transform
1 Introduction
Recently, there has been a growing interest in light field imaging. By sampling the radiance of light rays emitted by the scene along several directions, light fields enable a variety of post-capture processing techniques such as refocusing, changing perspectives and viewpoints, depth estimation, simulating captures with different depths of field, and 3D reconstruction [1, 2, 3]. This however comes at the expense of collecting large volumes of redundant high-dimensional data, which appears to be one key downside of light fields.
Research effort has been recently dedicated to the design of light field compression algorithms, by either adapting standardized solutions (in particular HEVC) to light field data (e.g. [4] [5] [6]), by proposing homography-based low rank models for reducing the angular dimension [7], or by investigating local Gaussian mixture models in the 4D ray space [8]. The authors in [9] use a depth-based segmentation of the light field into 4D spatio-angular blocks with prediction followed by JPEG-2000.
In this paper, we address the problem of graph transform optimization for light field energy compaction and compact representation. Indeed, light fields record illumination of light rays emitted by a scene in different orientations. The captured data for a static light field is represented by a 4D function , and contains redundant information in both the spatial and angular dimensions. Those correlations could in principle be represented by a huge non separable graph connecting pixels within and across views of the entire light field. The basis functions of a graph Fourier transform [10] could then be used to decorrelate the color signal residing on the graph vertices. However, such a graph would have a very high number of vertices, each vertex corresponding to a light ray. This makes the diagonalization of the Laplacian matrix infeasible, and hence the computation of the graph Fourier transform impractical.
To lower the dimensionality of the problem, we propose to partition the big graph structure into smaller ones that are coherent and correlated inside and across the views. This can be viewed as cutting unreliable edges from the global graph. To perform this partitioning, we group similar pixels within and across views based on the concept of super-rays defining the supports of the set of local graph transforms. The concept of super-ray has been introduced in [11] as an extension to light fields of the concept of super-pixels.
The authors in [12] used super-rays as the supports of separable shape-adaptive Discrete Cosine Transform (DCT). Super-pixels are used in [13] as the supports of local graph transforms, and tested in a predictive scheme based on view synthesis. The angular transform is however applied on super-pixels that are co-located on all views, hence not exploiting scene geometry, because of the difficulty of designing separable graph transforms that at the same time follow the scene geometrical information and preserve angular correlations. We come back to this point below.
In this paper, we address the problem of designing local super-ray based non separable and separable graph transforms following the scene geometry. Towards this goal, we first propose a specific super-ray construction method to limit shape variations of the super-pixels forming a given super-ray. Despite the local support of limited size defined by the super-rays, the Laplacian matrix remains of high dimension and its diagonalization to compute the transform eigenvectors is computationally expensive. An intuitive way to solve this problem is to perform the transform in a separable manner: a first spatial transform applied per super-pixel inside each view, then an angular transform between corresponding super-pixels across the views to capture angular dependencies. We have however observed that if the shape of the super-ray undergoes a slight change between views, the basis functions computed from the graph Laplacian have very different forms from one super-pixel to the corresponding ones in the other views, resulting in a decreased correlation between spatial transform coefficients.
The difficulty is therefore how to optimize the spatial transforms applied on each super-pixel of the super-ray in such a way that the angular correlation is well preserved. Preserving angular correlation is important in order to best compact the light field energy. The angular correlation is preserved only if the eigenvectors of the spatial transforms computed independently on different shapes (the super-pixels forming the super-ray) are reasonably consistent, i.e. only when the shapes of the transform supports are approximately isometric. We propose in this paper a novel method to optimize the spatial transforms in such a way that the basis functions approximately diagonalize their respective Laplacians while being coherent across the views, given the scene geometry.
Experimental results show that the proposed super-ray construction method yields, for the light fields considered in the tests, up to percent coherent supports out of all super-rays, which facilitates the application of a separable graph transform. The results also show that the optimized separable graph transform yields higher energy compaction, and significant rate-distortion performance gains, compared to the non optimized separable transform, when some super-rays are shape-varying across the views. The proposed simple coding scheme based on these local separable transforms is shown to outperform light field coding schemes based on HEVC and JPEG Pleno at high bitrate following the common test conditions.
In summary our contributions are as follows:
- We propose local graph transforms based on the concept of super-rays adapted to scene geometry. To define the supports of the local transforms, we first propose (section 3) a new algorithm to segment the light field into super-rays. The method takes as input only the top-left color image and a sparse set of disparities. The resulting segmentation defines the supports of local graph transforms.
- We then introduce (section 4) a novel method to optimize the spatial transforms in such a way that the basis functions are coherent across the views, given the scene geometry.
- We analyze the properties in terms of energy compaction of the proposed super-rays based graph transforms.
- A complete coding scheme (section 5) is also described to assess the rate-distortion performances of these novel transforms on a set of real light fields.
2 Related work
We first briefly review prior work on graph transform design for signal (and in particular image) energy compaction, a problem at the core of this paper. For the sake of completeness, since the proposed transforms are validated in a complete coding scheme, we also give a brief overview of recent work on light field compression.
2.1 Graph Transforms
Recently, graph signal processing has been applied to different image and video coding applications, especially for piecewise smooth images. In [14, 15], the authors propose a graph-based coding method where the graph weights are defined considering pairwise similarities between pixel intensities. Another efficient graph construction method has been proposed in [16] for piecewise smooth images. For each signal in a block, they select the Graph Fourier Transform minimizing the rate distortion cost. A signed graph Fourier transform has also been proposed in [17] for depth map coding, accounting for negative weights between pixels.
For natural images, most of the work has focused on designing sparse graphs or using graph templates that capture principal gradient-based structures in images [18][19]. This is mostly useful in textured images. While most of the aforementioned transform coding strategies did not account for the graph coding cost, in a later work [20], a rate-distortion optimized graph learning approach has been proposed to code natural images while taking into account both the sparsity of the transformed coefficients and the graph coding cost. Several graph based approaches have also been proposed to code intra and inter predicted residual blocks in video compression, using generalized graph Fourier transform [21], simplified graph templates transforms [22], or separate line graph based transforms [23].
In this paper, we build graphs that follow the scene geometry and we then propose separable graph based transforms that best exploit light fields spatial and angular correlation.
2.2 Light Fields Compression
Existing light fields compression solutions can be broadly classified into two categories: approaches directly compressing the lenslet images or approaches coding the views extracted from the raw data. Methods proposed for compressing the lenslet images mostly extend HEVC intra coding modes by adding new prediction modes to exploit similarity between lenslet images (e.g. [24], [25], [5], [6]). The authors in [9] propose a lenslet-based compression scheme that uses depth, disparity and sparse prediction followed by JPEG-2000 residue coding.
A second category of methods consists in encoding the set of views which can be extracted from the lenslet images after de-vignetting, demosaicing and alignment of the micro-lens array on the sensor, following e.g. the raw data decoding pipeline in [26]. Several methods code the views as pseudo video sequences using HEVC [4], [27], or the latest JEM coder [28], or extend HEVC to multi-view coding [29]. Low rank models as well as local Gaussian mixture models in the 4D ray space are proposed in [7] and [30] respectively. View synthesis based predictive coding has also been investigated in [31] where the authors use a linear approximation computed with Matching Pursuit for disparity based view prediction. The authors in [32] and [33] instead use the convolutional neural network (CNN) architecture proposed in [34] for view synthesis and prediction. The prediction residue is then coded using HEVC [32], or using local residue transforms (SA-DCT) and coding [33]. The proposed transforms could also be used for residue coding. However, to best assess their de-correlation advantage, in the experiments reported below, they are directly applied on the color values of the entire 4D light field data.
3 Super-rays and Graph construction
The compression efficiency of any coder based on block partitioning and transform coding does undeniably depend on the way the partitioning is done, and on how the resulting segmentation adheres to object boundaries. While traditional transforms such as 2D DCT applied on a square or rectangular support may fail due to high frequencies captured on the object boundaries, here we rely on a segmentation of the entire 4D light field into super-rays.
3.1 Light field Segmentation in Super-Rays
The concept of super-ray has been introduced in [35] as an extension of super-pixels [36] to group light rays coming from the same 3D object, i.e. to group pixels having similar color values and being close spatially in the 3D space. The method performs a k-means clustering of all light rays based on color and distance in the 3D space. To deal with dis-occlusions, a slightly modified formulation is proposed in [12] where the dense depth information is also used in the clustering. When the depth information is not fully reliable, this method results in inconsistent super-rays across views. In addition, the signalling cost of such a global light field segmentation is high. In order to make the super-rays more consistent across the views, we suggest a modified version where we compute super-pixels in the top-left view as shown in Figure 1. Then, using the disparity map, we project the segmentation labels to all the other views. Namely, having a segmentation map in the top left view and the corresponding disparity map, we compute the median disparity per super-pixel, and use it to project the segmentation mask to the other views. More precisely, the algorithm proceeds row by row. In the first row of views, we perform horizontal projections from the top-left to the views next to it. For each other row of views, a vertical projection is first carried out from the top view to recover the segmentation on view , then horizontal projections from to the other views are performed, as shown in Figure 2.
An example of segmentation is shown in Figure 2, where the background consists of two yellow super-pixels, and two foreground objects are labeled with red and pink. The disparity of the two objects is equal to , while the background is almost fixed with a disparity equal to [math]. At the end of each projection, some shapes are projected in all the views without interfering with others. Those typically represent flat regions inside objects (for example, the object labeled in pink). Others, mainly consisting of occluded and occluding segments, end up superposed in some views, for example the red object occluding pixels from the yellow background. In this case, the occluded pixels are assigned the label (e.g. red) of the neighboring super-ray corresponding to the foreground object (i.e. the one having the higher disparity). As for appearing pixels, for example between the yellow background and the pink object, they are clustered with the background super-rays (i.e. those having the lower disparity, e.g. yellow). The super-rays that end up with different shapes in the views are marked with a dashed contour.
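The projection step described above can be sketched as follows for a single horizontal projection. This is a minimal illustration and not the authors' implementation: the function names, the `du` view-offset parameter, the sign convention of the shift, and the row-wise hole filling are all assumptions. Each super-pixel is shifted by its median disparity, overlaps are resolved in favour of the higher disparity (foreground), and appearing pixels take the label of the neighbouring region with the lower disparity (background).

```python
import numpy as np

def project_labels(labels, disparity, du):
    """Project a label map to a view `du` steps away (hypothetical helper).

    labels    : (H, W) int array of super-pixel labels in the reference view
    disparity : (H, W) float array of disparities in the reference view
    Returns the projected labels (-1 where nothing landed) and, per pixel,
    the disparity of the super-pixel that owns it.
    """
    H, W = labels.shape
    out = np.full((H, W), -1, dtype=int)
    best = np.full((H, W), -np.inf)
    for lab in np.unique(labels):
        mask = labels == lab
        d = float(np.median(disparity[mask]))   # one disparity per super-pixel
        shift = int(round(d * du))              # assumed sign convention
        ys, xs = np.nonzero(mask)
        xs2 = xs + shift
        ok = (xs2 >= 0) & (xs2 < W)
        ys, xs2 = ys[ok], xs2[ok]
        wins = d > best[ys, xs2]                # higher disparity occludes
        out[ys[wins], xs2[wins]] = lab
        best[ys[wins], xs2[wins]] = d
    return out, best

def fill_disocclusions(out, best):
    """Give each appearing pixel the label of its nearest left/right
    neighbour in the row, preferring the one with the lower disparity."""
    filled = out.copy()
    for y in range(out.shape[0]):
        for x in np.nonzero(out[y] < 0)[0]:
            cands = []
            left = np.nonzero(out[y, :x] >= 0)[0]
            right = np.nonzero(out[y, x + 1:] >= 0)[0]
            if left.size:
                cands.append(left[-1])
            if right.size:
                cands.append(x + 1 + right[0])
            if cands:
                src = min(cands, key=lambda c: best[y, c])
                filled[y, x] = out[y, src]
    return filled

labels = np.array([[0, 0, 1, 1, 0, 0]])
disp = np.array([[0.0, 0.0, 1.0, 1.0, 0.0, 0.0]])
projected, owner_disp = project_labels(labels, disp, 1)
completed = fill_disocclusions(projected, owner_disp)
```

A vertical projection would follow the same pattern with the shift applied to the row index instead of the column index.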
3.2 Graph Construction
In order to jointly capture spatial and angular correlations between pixels in the light field, we first consider a local non separable graph per super-ray. More precisely, if we consider the luminance values in the whole light field and a segmentation map , the super-ray can be represented by a signal defined on an undirected connected graph which consists of a finite set of vertices corresponding to the pixels at positions such that . A set of edges connect each pixel and its 4-nearest neighbors in the spatial domain (i.e. in each view), and to its corresponding pixels, found by disparity based projection, in the 4 nearest neighboring views. An example of graph built inside a super-ray is shown in Figure 3.
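As a rough illustration of the spatial part of this construction, the sketch below builds an unweighted 4-neighbour adjacency matrix for the pixels of one super-pixel given as a boolean mask. The function name and the mask representation are assumptions; the inter-view edges would be added in the same way from disparity-projected correspondences and are omitted here.

```python
import numpy as np

def spatial_adjacency(mask):
    """Unweighted 4-neighbour adjacency of the pixels where `mask` is True."""
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    index = -np.ones(mask.shape, dtype=int)
    index[ys, xs] = np.arange(ys.size)          # pixel -> vertex number
    A = np.zeros((ys.size, ys.size))
    for y, x in zip(ys, xs):
        # Looking only right and down visits every undirected edge once.
        for dy, dx in ((0, 1), (1, 0)):
            y2, x2 = y + dy, x + dx
            if y2 < mask.shape[0] and x2 < mask.shape[1] and mask[y2, x2]:
                i, j = index[y, x], index[y2, x2]
                A[i, j] = A[j, i] = 1.0
    return A

A_square = spatial_adjacency([[1, 1], [1, 1]])   # 4 pixels, 4 edges
A_corner = spatial_adjacency([[1, 1], [1, 0]])   # 3 pixels, 2 edges
```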
4 Graph Transforms
In this section, we focus on the design of suitable transforms for the signals (color or residues) residing on the local graphs defined above.
4.1 Non Separable Graph Transform
Let us consider a super-ray and its corresponding local graph $\mathcal{G}$. We start by defining its adjacency matrix $\mathbf{A}$ with entries $A_{mn} = 1$ if there is an edge between two vertices $v_m$ and $v_n$, and $A_{mn} = 0$ otherwise. The adjacency matrix is used to compute the Laplacian matrix $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is a diagonal degree matrix whose $m$-th diagonal element is equal to the sum of the weights of all edges incident to node $v_m$. The resulting Laplacian matrix is symmetric positive semi-definite and therefore can be diagonalized as:
$$\mathbf{L} = \mathbf{U}^{\top} \boldsymbol{\Lambda} \mathbf{U} \qquad (1)$$
where $\mathbf{U}$ is the matrix whose rows are the eigenvectors of the graph Laplacian and $\boldsymbol{\Lambda}$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues. The Laplacian eigenbases are analogous to the Fourier bases in the Euclidean domain and allow representing the signals residing on the graph as a linear combination of eigenfunctions akin to Fourier analysis. This is known as the Graph Fourier transform. For the signal $\mathbf{f}$ defined on the vertices of the local graph, the transformed coefficients vector $\hat{\mathbf{f}}$ is defined in [10] as:
$$\hat{\mathbf{f}} = \mathbf{U} \mathbf{f} \qquad (2)$$
The inverse graph Fourier transform is then given by
$$\mathbf{f} = \mathbf{U}^{\top} \hat{\mathbf{f}} \qquad (3)$$
Although this would be the ideal decorrelating transform for the signal, the Laplacian of such graph, despite the locality, remains of high dimension (almost nodes per super-ray) leading to a high transform computational cost. To limit the computational cost, we then consider separable local transforms.
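The transform described in this subsection can be illustrated on a toy graph. The sketch below assumes an unweighted adjacency matrix and stores the eigenvectors as rows, matching the description above; the 4-node path graph and the variable names are illustrative only.

```python
import numpy as np

def graph_fourier_basis(A):
    """Return the eigenvalues and a basis whose rows are Laplacian eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A       # combinatorial Laplacian D - A
    eigvals, eigvecs = np.linalg.eigh(L) # eigh returns eigenvectors as columns
    return eigvals, eigvecs.T, L

# Toy example: a path graph on 4 vertices.
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0

lam, U, L = graph_fourier_basis(A)
f = np.array([1.0, 2.0, 2.5, 3.0])       # signal on the vertices
f_hat = U @ f                            # forward graph Fourier transform
f_rec = U.T @ f_hat                      # inverse transform
```

The cost of `eigh` grows roughly cubically with the number of vertices, which is why the dimension of the non separable Laplacian matters in practice.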
4.2 Coherent Separable Graph Transform
The separable graph transform is defined by a first spatial transform followed by a second angular transform as detailed below.
4.2.1 First spatial graph transform
If we consider the luminance values in only one sub-aperture image of the light field and a segmentation map , the super-ray can be represented by a signal defined on a local spatial graph with only connections in the spatial domain (i.e. between the neighboring pixels in a super-pixel, but not across the views in a super-ray). The matrix , formed by the eigenvectors of the spatial Laplacian , is used to compute the first spatial graph transform. For the signal defined on the vertices of the graph, the transformed coefficients vector is defined in [10] as:
[TABLE]
The inverse spatial graph Fourier transform is then given by
[TABLE]
4.2.2 Second angular graph transform
In order to capture inter-view dependencies and compact the energy into fewer coefficients, we perform a second graph based transform, in the angular dimension. Note that, for a given super-ray, we do not necessarily have the same number of pixels, hence coefficients resulting from the spatial transforms, in all the views. For a given band (coefficients corresponding to the eigenvectors of the spatial transforms), we construct a graph of vertices corresponding to the views where the band exists. Edges are drawn between each node and its direct four neighbors. Isolated nodes are connected to their nearest neighbor.
The Adjacency is used to compute the inter-view angular unweighted Laplacian as with the degree matrix. can be diagonalized as:
[TABLE]
For a specific band number and super-pixel , the band signal is defined as . The angular Graph Transform consists of projecting the signal onto the eigenvectors of as:
[TABLE]
The inverse angular Graph Transform is then given by
[TABLE]
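The angular graph construction and transform for one band can be sketched as below. This is a minimal sketch under assumptions: the views are indexed on a 2D grid, `present` marks the views in which the band exists, and "nearest neighbor" is taken in Euclidean distance on that grid. The function names are hypothetical.

```python
import numpy as np

def angular_adjacency(present):
    """present: (Nu, Nv) boolean grid, True where the band exists in a view."""
    coords = np.argwhere(present)
    n = len(coords)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.abs(coords[i] - coords[j]).sum() == 1:   # direct 4-neighbour
                A[i, j] = A[j, i] = 1.0
    # Connect each isolated view to its nearest view carrying the band.
    for i in np.nonzero(A.sum(axis=1) == 0)[0]:
        if n > 1:
            dist = np.linalg.norm(coords - coords[i], axis=1)
            dist[i] = np.inf
            j = int(np.argmin(dist))
            A[i, j] = A[j, i] = 1.0
    return A, coords

def angular_transform(A, band):
    """Project the band signal onto the eigenvectors of the angular Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    _, V = np.linalg.eigh(L)
    return V.T @ band

present = np.zeros((3, 3), dtype=bool)
present[0, 0] = present[0, 1] = present[2, 2] = True
A_ang, view_coords = angular_adjacency(present)
coef = angular_transform(A_ang, np.array([2.0, 2.0, 2.0]))
```

For a band that is constant across views, all the energy ends up in the first angular coefficient, which is the behaviour the separable scheme relies on.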
4.2.3 Coherence of spatial graph transforms in corresponding super-pixels
The spatial graphs in the different super-pixels forming one super-ray may not have the same shape. Furthermore, we have observed that for a specific super-ray, when the spatial graph topology in the corresponding super-pixels undergoes a slight change, the basis functions of each spatial graph transform are different and thus incompatible with each other (refer to Figure 4 before optimization), resulting in decreased correlation of the spatial transform coefficients across views. This is shown below to severely decrease the efficiency of the angular transform.
Basically, during the diagonalization procedure, the eigenfunctions are only defined up to sign flips for Laplacians having a simple spectrum (if the eigenvalues have a multiplicity of 1, for example connected graphs). Therefore, even having the same shape in two different views, we may end up with two opposite eigenvectors for a specific eigenvalue during the diagonalization.
Moreover, eigenvectors computed independently on two different shapes (i.e. corresponding to two different Laplacians) can be expected to be reasonably consistent only when the shapes are approximately isometric. Whenever this assumption is violated, it is impossible to expect that the eigenvector of a Laplacian in view will correspond to the eigenvector of another Laplacian in view . If the basis functions do not behave consistently on the corresponding points of the two shapes, the two signals defined on those two Laplacians will be projected onto incompatible basis functions (see Figure 4), and therefore we cannot guarantee any correlation to be preserved after performing the first spatial graph transform.
4.2.4 Coherent spatial graph transform
In order to overcome those limitations, we consider an approach which aims at finding coupled basis functions.
More precisely, suppose that, in a super-ray in a reference view [math] and a target view , we have two Laplacians and with size and respectively. They can be diagonalized as:
[TABLE]
If the two Laplacians are equal, we make sure that their eigenvectors are compatible by applying sign flips where needed: we check the first value of each eigenvector and flip its sign if the value is negative.
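The sign convention above can be written in a few lines. The sketch assumes the eigenvectors are stored as the rows of a matrix; the function name is hypothetical.

```python
import numpy as np

def fix_signs(U):
    """Flip every eigenvector (row of U) whose first entry is negative."""
    signs = np.where(U[:, 0] < 0, -1.0, 1.0)
    return U * signs[:, None]

# Path graph on 3 vertices: two solvers may return eigenvectors of opposite sign.
L = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
_, V = np.linalg.eigh(L)
U_a = V.T          # one possible output of the diagonalization
U_b = -V.T         # the same eigenvectors with every sign flipped
```

After `fix_signs`, both versions give the same basis, so identical super-pixel shapes in two views yield identical spatial transforms.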
In the case where the super-pixel shapes in the sub-aperture images are not isometric, we propose to diagonalize one specific spatial graph Laplacian and find . Then, we search for basis vectors that approximately diagonalize any other spatial graph Laplacian and at the same time preserve correlations after the transform. Inspired by the work of [37], we pose the problem as
[TABLE]
where we seek to minimize the weighted sum of two terms subject to the orthonormality constraint of the computed basis functions . The first term is a diagonalization term that aims at minimizing the energy residing on off-diagonal entries. The second term aims at enforcing coherence between the two spatial graph transforms and is defined as follows.
Based on the geometry information we have in hand, we can actually define, a priori, a set of correspondences between and . More precisely, we suppose that we have a set of corresponding functions represented by matrices and of sizes and respectively. An example of and is shown in Figure 5.
The basis functions of both Laplacians are supposed to be consistent if the Fourier coefficients of the functions and on and are approximately equal i.e. if . To avoid over-determining the problem, we use the farthest point sampling technique restricting the correspondence points to a maximum of points.
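Farthest point sampling itself is a standard greedy procedure; a minimal sketch is given below. How the authors seed it and which distance they use is not stated, so the Euclidean distance on pixel coordinates and the `start` index here are assumptions.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedily pick up to k point indices, each farthest from those already chosen."""
    points = np.asarray(points, dtype=float)
    k = min(k, len(points))
    chosen = [start]
    # Distance from every point to its closest chosen point so far.
    dist = np.linalg.norm(points - points[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return chosen

pts = [[0, 0], [1, 0], [2, 0], [10, 0]]
picked = farthest_point_sampling(pts, 3)
```

The selected indices spread over the support, which keeps the number of correspondence constraints small without clustering them in one region.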
If we parametrize the new basis functions of as being a linear combination of the old basis functions, we can write where is a matrix of combination coefficients that reflects and rotates the original basis vectors in so that they align as well as possible with while almost diagonalizing the Laplacian . Using the diagonalizing property of , we can re-write Equation (10) as
[TABLE]
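The effect of this substitution on the diagonalization term can be checked numerically. The sketch below is an illustration under assumed notation: `U` holds the original eigenvectors as rows, `Q` is an orthogonal matrix of combination coefficients, and `Phi = Q @ U` is the new basis. With these conventions the off-diagonal energy of `Phi @ L @ Phi.T` equals that of `Q @ diag(lam) @ Q.T`, so the Laplacian no longer appears explicitly.

```python
import numpy as np

def off_diagonal_energy(M):
    """Sum of squared off-diagonal entries of a square matrix."""
    off = M - np.diag(np.diag(M))
    return float(np.sum(off ** 2))

# Small example graph: a 5-node path with one extra chord (illustrative only).
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

lam, V = np.linalg.eigh(L)
U = V.T                                            # rows are eigenvectors

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # an arbitrary orthogonal matrix
Phi = Q @ U                                        # new basis vectors as rows

full = off_diagonal_energy(Phi @ L @ Phi.T)
reduced = off_diagonal_energy(Q @ np.diag(lam) @ Q.T)
```

Working with `Q` and the eigenvalues alone keeps the optimization variables small, which is also what makes splitting into groups of eigenvectors practical.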
It is important to note that the first term of the above problem does not guarantee a preserved increasing order of the eigenfunctions. It is therefore more convenient to use an alternative penalty equal to that relates not only to the diagonalization property, but also to the distribution of the energies across the basis functions after the optimization.
[TABLE]
The problem in Equation (12) is a non linear optimization problem with an orthogonality constraint, which can be solved by iterative minimization algorithms. In our case, we used the MATLAB Optimization Toolbox (interior-point method of the fmincon function) to solve it. The gradients of the cost function terms are given in appendix A.
Since we are dealing with large datasets and a large number of super-rays, it is convenient to use parallel computing to independently compute eigen-basis for the different super-rays. Also, in order to reduce the complexity of the problem, we propose to split it into smaller problems that are independent: we pick a small number of eigenvectors to be optimized at a time. Then, for each disjoint group of eigenvectors in , we formulate a sub-problem by expressing new eigenvectors as a linear combination of old eigenvectors. Noticing that and
[TABLE]
For each group of eigenvectors, we find of size that will minimize the objective function on the subset of eigenvectors.
[TABLE]
At the end of the optimization stage, most of the eigenvectors are thereby compatible across views and the transform will necessarily preserve any correlation already observed between views. An example of the second eigenvector of a super-ray before and after optimization is shown in Figure 4. While eigenvectors corresponding to higher frequencies are harder to adjust, the low frequency eigenvectors can be easily optimized. In our application, this is not a big problem since we have a high energy compaction in lower frequency bands, and those are the bands that matter the most for reconstruction. After performing the segmentation and two transforms, most of the energy of the color signal is indeed expected to be concentrated in a very small number of coefficients. In the following section, we aim at exploiting this energy compaction property to efficiently code the redundant information present in the light field using the tools introduced above.
5 Light Field Coding Scheme
The overall steps of the compression algorithm are shown in Figure 6. The top left view of the Light Field is separated into uniform regions using the SLIC algorithm to segment the image into super-pixels [36], and its disparity map is estimated. Using both the segmentation map and the geometry information, we construct consistent super-rays in all views as explained in section 3. The non separable and separable transforms described above are then locally applied on each super-ray. The transformed coefficients are then quantized and encoded to be stored or transmitted. The segmentation map of the reference view and a disparity value per super-ray also need to be transmitted as side information to the decoder.
5.0.1 Segmentation map and disparity values coding
The segmentation map of the reference view is encoded using the arithmetic edge coder proposed in [38]. The contours are first represented by differential chaincode [39] and divided into segments. Then, to efficiently encode a sequence of symbols in a segment, AEC uses a linear regression model to estimate probabilities, which are subsequently used by the arithmetic coder. Disparity values are encoded using an arithmetic coder.
5.0.2 Grouping and transform coefficients coding
The energy compaction is not the same in all super-rays. This can be explained by the fact that the segmentation may not adhere well to object boundaries, resulting in high angular frequencies after optimization of the first spatial transform.
To optimize the coding performance, we divide the set of super-rays into four classes, where each class is defined according to an energy compaction criterion.
First, we learn a scanning order. More precisely, at the end of the two graph transform stages, coefficients are grouped into a three-dimensional array where is the transformed coefficient of the band for the super-ray . Using the observations on all the super-rays in some training datasets (Flower1, Friends), we can find the best ordering for scanning and quantization. We sort the variances of coefficients with enough observations in decreasing order and we follow this decreasing order during the scanning process.
Then, each super-ray with coefficients belongs to class if the mean energy per high frequency coefficient is less than , where the high frequency coefficients are the last coefficients following the scanning order of the super-rays coefficients defined previously. We start by finding the super-rays in the first class, then remove them from the search space before finding the other classes, and proceed likewise for the remaining classes. We code a flag with an arithmetic coder to give the class of each super-ray to the decoder. In class , the last coefficients of each super-ray are discarded. The rest of the coefficients are grouped into uniform groups. The quantization step sizes in groups are defined with a rate-distortion optimization taking into account a large number of observed coefficients. At the end of this stage, for each class, each group is coded using the Context Adaptive Binary Arithmetic Coder (CABAC) from the HEVC H.265 reference coder.
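The variance-based scanning order can be sketched as follows. For simplicity the coefficient array is flattened to one row per super-ray and one column per coefficient position, with NaN where a coefficient does not exist; this layout and the `min_obs` threshold are assumptions, since the exact criterion for "enough observations" is not specified.

```python
import numpy as np

def learn_scan_order(coeffs, min_obs=2):
    """Order coefficient positions by decreasing variance over the training set.

    coeffs  : (n_super_rays, n_positions) array, NaN where a coefficient is missing
    min_obs : minimum number of observations for a position to be ranked
    """
    coeffs = np.asarray(coeffs, dtype=float)
    counts = np.sum(~np.isnan(coeffs), axis=0)
    valid = np.nonzero(counts >= min_obs)[0]
    var = np.nanvar(coeffs[:, valid], axis=0)
    return valid[np.argsort(-var, kind="stable")]

training = np.array([[1.0, 10.0, np.nan],
                     [-1.0, -10.0, 5.0],
                     [1.0, 10.0, np.nan]])
order = learn_scan_order(training)
```

In this toy example the third position is observed only once and is therefore left out of the learned order.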
6 Experimental analysis
For performance evaluation, we consider real light fields captured by plenoptic cameras from the datasets used in [34] and [40]. We consider the central sub-aperture images cropped to in [34], and cropped to from [40] in order to avoid the strong vignetting and distortion problems on the views at the periphery of the light field. The disparity map of the top left view of each light field has been estimated using the method in [41]. The estimated disparity map is used to construct super-rays as described in Section 3.
6.1 Assessment of the proposed super-ray construction method
In this section, we assess how the proposed super-ray construction method deals with occluded and dis-occluded parts, and to what extent the super-rays are consistent despite uncertainty on the disparity information. Figure 7 shows examples of super-rays obtained with different real light fields captured by a Lytro Illum camera (Flower 2, Rock used in [34], and FountainVincent, StonePillarInside used in [40]). In the first three columns, we show the original top-left corner view, its corresponding disparity map, and its super-pixel segmentation using the SLIC algorithm [36], respectively. In the fourth column, we show horizontal and vertical epipolar segments taken both from the 4D light field color information and from our final segmentation in specific regions of the image (the red blocks). We can see that the segmentation follows the object borders well, especially when the disparity map is reliable. Also, we always attain a high percentage of coherent super-rays across views (higher than as measured with Cons() in the fifth column). More precisely, Cons() gives the percentage of coherent super-rays: a super-ray is coherent when it is made of super-pixels having the same shape in all the views, with or without a displacement.
At the end of this segmentation stage, we obtain a segmentation map with consistent super-rays in flat objects and shape-varying super-rays mainly on the borders.
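The coherence measure Cons() is defined only in words here. One possible reading can be sketched as below, assuming each super-pixel is given as a non-empty set of (row, column) pixel coordinates per view; the names `normalized_shape`, `is_coherent` and `consistency_percentage` are introduced for illustration and do not come from the paper.

```python
def normalized_shape(pixels):
    """Translate a set of (row, col) pixels so its bounding box starts at (0, 0)."""
    min_r = min(r for r, _ in pixels)
    min_c = min(c for _, c in pixels)
    return frozenset((r - min_r, c - min_c) for r, c in pixels)


def is_coherent(super_ray):
    """super_ray: dict view -> set of (row, col) pixels of its super-pixel.

    Coherent means every view holds the same shape, up to a displacement.
    """
    shapes = {normalized_shape(pixels) for pixels in super_ray.values()}
    return len(shapes) == 1


def consistency_percentage(super_rays):
    """Percentage of coherent super-rays in a list of super-rays."""
    coherent = sum(is_coherent(sr) for sr in super_rays)
    return 100.0 * coherent / len(super_rays)


# Same L-shaped super-pixel, shifted by two columns in view 1: coherent.
shifted = {0: {(0, 0), (0, 1), (1, 0)}, 1: {(0, 2), (0, 3), (1, 2)}}
# One pixel differs in view 1: shape-varying super-ray.
varying = {0: {(0, 0), (0, 1), (1, 0)}, 1: {(0, 2), (0, 3), (1, 3)}}
print(consistency_percentage([shifted, varying]))  # -> 50.0
```

Normalizing by the bounding-box corner makes the comparison invariant to the disparity-induced displacement, so only genuine shape changes are counted as incoherent.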
6.2 Analysis of proposed graph based optimized transforms
In this section, we analyze the performance of our optimization process described in Section 4.2 and its effect on the transform coding efficiency. In all the experiments, for each super-ray we find the super-pixel that lies in the top-left-most view of the light field, and fix it as the reference for the coupling process. We therefore optimize the maximum number of eigenvectors defined as with being the number of pixels in the reference super-pixel. An example of input and output of the coupling process for a shape-varying super-ray is illustrated in Figure 8.
We see that the consistency of eigenvectors in the different graphs is much better after our optimization. If we project the light field signal residing in the super-ray on the optimized coupled eigenvectors, the inter-view correlation is better preserved compared to the non optimized eigenvectors.
6.2.1 Energy Compaction of the spatial transform
Figure 9 shows the energy compaction observed in the spatial transform domain, then in the spatio-angular transform domain, i.e. after performing the first spatial transform and after performing both spatial and angular transforms on the color signal of the light fields. The energy compaction is computed for both optimized and non optimized cases. It denotes the percentage of energy that is retained when only some of the coefficients are kept and the others are discarded. For the spatial transform, we gather the transform coefficients of all super-pixels, and then scan them in the intuitive increasing order of the Laplacian eigenvalues to compute the compaction. For the spatio-angular compaction, we follow the learned sub-optimal scanning order using different observations from the different datasets as explained in section 5.0.2.
If we compare the energy compaction of the spatial transforms only (red and blue curves) for different datasets, we observe that we may lose some energy compaction for some datasets after optimization. In order to explain this loss, we analyze how the graphs vary under the new basis functions after optimization. An example is shown in Figure 10 where edges between highlighted nodes are added implicitly in the graph after coupling. The new underlying Laplacian is computed as .
The underlying assumption behind the optimization procedure is that the signal can be modeled by a modified Gaussian distribution (Gaussian Markov Random Field) with a modified precision matrix which is equivalent to the new Laplacian matrix with some added small weights. Since this procedure modifies the original graph structure, it may, in some cases, introduce some high frequencies.
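As a small worked example of the compaction measure, the sketch below computes the cumulative percentage of energy retained when keeping the first coefficients of a scan. The helper names `scan_by_eigenvalue` and `energy_compaction` are assumptions made for illustration, and the input values are toy numbers, not measurements from the light fields.

```python
from itertools import accumulate


def scan_by_eigenvalue(pairs):
    """Order (eigenvalue, coefficient) pairs gathered from all super-pixels
    by increasing Laplacian eigenvalue and return the coefficients."""
    return [coeff for _, coeff in sorted(pairs, key=lambda p: p[0])]


def energy_compaction(coeffs_in_scan_order):
    """Entry k is the percentage of the total energy kept when only the
    first k + 1 coefficients of the scan are retained."""
    energies = [c * c for c in coeffs_in_scan_order]
    total = sum(energies)
    return [100.0 * e / total for e in accumulate(energies)]


pairs = [(0.9, 1.0), (0.0, 3.0), (0.2, -2.0), (1.5, 1.0), (1.1, 1.0)]
scan = scan_by_eigenvalue(pairs)
print(scan)                     # -> [3.0, -2.0, 1.0, 1.0, 1.0]
print(energy_compaction(scan))  # -> [56.25, 81.25, 87.5, 93.75, 100.0]
```

A transform compacts energy better when this curve rises faster, i.e. when fewer leading coefficients are needed to reach a given percentage.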
6.2.2 Correlation and Energy Compaction after angular transform
The gain in compaction after the spatio-angular transform is clear in Figure 9 when we perform the optimization. This is due to the fact that we are able to preserve angular correlations after the spatial transform, which will be subsequently exploited by the angular transform.
In order to assess the performance of our coupling process in preserving the correlation, we show in Figure 11 the correlation matrices and the covariance matrices for some bands after the first transform with shape-varying super-rays. If we restrict our attention to the first column, we see that after the first transform that is not optimized, we have uncorrelated transform coefficients due to the perturbation of eigenvectors computed on super-pixels having slightly different shapes. This problem is almost resolved with our coupling procedure in the second column, where we can observe more correlation between the coefficients of the same band in neighboring views. Furthermore, the logarithm of the variances (values lying on the diagonal of the covariance matrices), being higher in the first low frequency bands and decreasing when moving further from the DC, shows the energy compaction of the first transform. As for the values of the off-diagonal elements of the covariance matrices, they show how correlated the transformed coefficients are after the first transform inside the views. If we observe the off-diagonal values and compare them with and without optimization, we find that the optimization performs better for low frequencies than for high frequencies and is therefore better able to retrieve coherent basis functions there.
After the second angular transform per band, for both cases with or without optimization, we compute the logarithm of the coefficients' variances after the second transform and illustrate it in the third row, where the x-axis and y-axis correspond to the band number and the view number respectively. A compaction of the energy in fewer coefficients is observed in the optimized case compared to the non-optimized case, especially when we focus on the top-left region. Some inter-view high frequencies sometimes remain and might be due to the presence of some super-rays made of super-pixels that adhere well to borders in some views while not adhering in some others due to disparity rounding effects.
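The estimation of such matrices is not detailed here. One plausible way to obtain them, assuming sample statistics are taken over super-rays for a fixed band, is sketched below; `covariance_matrix` and `correlation_matrix` are illustrative helpers, and the sketch assumes that no view has constant coefficients (otherwise the normalization divides by zero).

```python
from math import sqrt


def covariance_matrix(samples):
    """samples: list of rows, one per super-ray; row[v] is the spatial
    transform coefficient of the chosen band in view v."""
    n = len(samples)
    n_views = len(samples[0])
    means = [sum(row[v] for row in samples) / n for v in range(n_views)]
    return [
        [
            sum((row[a] - means[a]) * (row[b] - means[b]) for row in samples) / n
            for b in range(n_views)
        ]
        for a in range(n_views)
    ]


def correlation_matrix(samples):
    """Normalize the covariance matrix by the per-view standard deviations."""
    cov = covariance_matrix(samples)
    n_views = len(cov)
    return [
        [cov[a][b] / sqrt(cov[a][a] * cov[b][b]) for b in range(n_views)]
        for a in range(n_views)
    ]


# Toy data: three super-rays, two views whose coefficients are proportional.
band_coeffs = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
corr = correlation_matrix(band_coeffs)
print(round(corr[0][1], 6))  # close to 1.0 for perfectly correlated views
```

Off-diagonal entries close to 1 in magnitude indicate that the coefficients of the band remain predictable from one view to another, which is what the subsequent angular transform exploits.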
6.2.3 Impact of disparity errors
When the disparity information is not reliable, dis-occluded pixels may be clustered with a wrong super-ray, resulting in high frequencies, hence poor energy compaction, after the spatial transforms in those specific regions. As explained before, we overcome this problem by dividing the super-rays into classes.
6.2.4 Impact of super-rays size
The size of super-rays may have an impact on the rate-distortion performance, especially when the disparity information is reliable and there are many homogeneous objects. If we have large objects, we might want to merge some small super-rays, which makes a non separable graph transform practically infeasible. This is where an optimized separable graph transform is advantageous: one can define the number of eigenvectors to be optimized depending on the homogeneity of the shape-varying super-rays inside the views. In this case, the segmentation and disparity costs are also likely to drop, since there are fewer contours and values to code.
In our experiments, however, we use a uniform segmentation into super-pixels. We fix the number of super-rays to for the light fields in [34], and for the light fields in [40]. We have observed that when we have a small number of super-rays, the disparity errors may have an impact on the compensation and therefore result in a decreased PSNR-Rate performance. On the other hand, having a very large number of super-rays increases the rate needed for segmentation and limits the dimension of each super-ray, resulting in a smaller benefit in terms of de-correlation of the proposed spatio-angular transform.
6.3 Comparative assessment
We assess the compression performance obtained with our graph based transform coding schemes against two schemes: direct encoding of the views as a video sequence following a lozenge order (HEVC lozenge) [27], and the JPEG Pleno VM 1.1 software used as anchor in [40].
In the simulations, the basic configuration files of JPEG Pleno VM have been used with small changes in order to be applied on views. For HEVC-lozenge, the base QPs are set to , , , and a GOP of is used. The HEVC version used in the tests is HM-16.10.
In Figure 12, our coding scheme based on both non separable and separable graph transforms is investigated against HEVC-lozenge and JPEG Pleno VM 1.1 for three light fields with views, from the ICIP 2017 Grand Challenge [40]. Further experiments are also depicted in Figure 13 for light fields. For the separable case, we compare the optimized and the non optimized graph transform. In Table 1, we restrict our attention to the optimized separable graph based transform, which we denote the opt-separable GBT scheme and which can be applied no matter how big the super-rays are. The table shows the rate allocation of our method, at low and high bitrates, for the different light fields.
We can observe that, for most of the light fields used in our tests, the non separable graph transform yields a better rate-distortion performance compared to the separable case for a fixed number of super-rays. While the non optimized graph transform fails to compact the energy of the light field, the optimized graph transform performs better and sometimes almost matches the non separable case. One major advantage of the separable optimized case is that it can be applied on super-rays of large dimensions without facing the computational complexity of computing the basis functions in the non separable case. Furthermore, the number of eigenvectors to be optimized can be defined by the encoder and does not necessarily have to be large.
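To give a rough sense of the complexity gap, the following back-of-the-envelope sketch compares eigendecomposition costs. It assumes a dense eigendecomposition costing on the order of n^3 operations for an n-node graph, one spatial graph per view and one angular graph per band; the sizes used (100 pixels per super-pixel, 81 views) are purely illustrative and are not the settings of the experiments.

```python
def eig_cost(n_nodes):
    """Order-of-magnitude cost of a dense eigendecomposition (assumed cubic)."""
    return n_nodes ** 3


def non_separable_cost(n_pixels, n_views):
    """One graph spanning every pixel of the super-ray in every view."""
    return eig_cost(n_pixels * n_views)


def separable_cost(n_pixels, n_views):
    """One spatial graph per view, then one angular graph per band
    (at most n_pixels bands, each defined on n_views nodes)."""
    return n_views * eig_cost(n_pixels) + n_pixels * eig_cost(n_views)


n_pixels, n_views = 100, 81  # illustrative sizes only
print(non_separable_cost(n_pixels, n_views))  # -> 531441000000
print(separable_cost(n_pixels, n_views))      # -> 134144100
```

Under these assumptions the separable scheme is cheaper by more than three orders of magnitude, before counting the additional cost of the basis optimization for shape-varying super-rays.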
Moreover, we can observe a better performance of our method at high bitrate compared to JPEG Pleno VM 1.1 and HEVC lozenge. At low bitrate, the prediction in the HEVC and JPEG Pleno based schemes is better than our disparity compensation of super-rays. Also, the bitrate allocated to the segmentation and disparity is very large, especially at low bitrate (almost reaching percent for most datasets) and could be further reduced.
Note that the decoder needs to compute the optimized basis functions for the non consistent super-rays, inducing some computational complexity. However, the optimization can be performed independently on each super-ray, in a parallel manner.
7 Conclusion
In this paper, we have addressed the problem of local geometry-aware graph transform design for light field energy compaction and compact representation. The transform support is based on super-rays constructed in a way that their shape remains coherent across the different views. We have first considered non separable graph transforms.
Despite the limited size of the transform support, the Laplacian matrix of such graph remains of high dimension and its diagonalization to compute the transform eigenvectors is computationally expensive.
To solve this problem, we then considered a separable spatio-angular transform. We have shown that, when the shape of corresponding super-pixels in the different views undergoes small changes, the basis functions of the spatial transforms are not coherent, resulting in a decreased correlation between spatial transform coefficients. We hence proposed a novel transform optimization method that aims at preserving angular correlation even when the shapes of corresponding super-pixels (i.e. forming one super-ray) are not isometric. This procedure has been shown to increase the energy compaction of the separable spatio-angular graph transforms and to bring substantial rate-distortion performance gains compared to a non optimized case. The proposed optimized spatio-angular graph transforms can be applied on either color or residual signals and can be easily parallelized to reduce the complexity on the decoder side.
Acknowledgment
The authors would like to thank Elian Dib, Navid Mahmoudian Bidgoli and Pierre Allain from Inria for their help in extracting and running the JPEG Pleno VM 1.1 and HEVC CABAC encoder. We would also like to thank Effrosyni Simou and Eda Bayram from EPFL, who were of great help in various discussions about the subject.
8 Gradients of the objective function terms
The gradients of the two terms in the optimization of equation 12 are provided below:
[TABLE]
As for the coupling term, with a similar derivation as the first gradient and using the trace derivation properties in [42], we get:
[TABLE]
References
- [1] B. Wilburn, N. Joshi, V. Vaish, E.-V. Talvala, E. Antunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy, “High performance imaging using large camera arrays,” ACM Trans. Graph., vol. 24, no. 3, pp. 765–776, Jul. 2005.
- [2] R. Ng, “Light field photography,” Ph.D. dissertation, Stanford University, 2006.
- [3] T. Georgiev and A. Lumsdaine, “Focused plenoptic camera and rendering,” J. of Electronic Imaging, vol. 19, no. 2, Apr. 2010.
- [4] D. Liu, L. Wang, L. Li, Z. Xiong, F. Wu, and W. Zeng, “Pseudo-sequence-based light field image compression,” in 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2016, pp. 1–4.
- [5] C. Conti, L. D. Soares, and P. Nunes, “HEVC-based 3D holoscopic video coding using self-similarity compensated prediction,” Signal Processing: Image Communication, vol. 42, pp. 59–78, 2016.
- [6] Y. Li, R. Olsson, and M. Sjöström, “Compression of unfocused plenoptic images using a displacement intra prediction,” in 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2016, pp. 1–4.
- [7] X. Jiang, M. Le Pendu, R. A. Farrugia, and C. Guillemot, “Light field compression with homography-based low-rank approximation,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 7, pp. 1132–1145, 2017.
- [8] R. Verhack, T. Sikora, L. Lange, R. Jongebloed, G. Van Wallendael, and P. Lambert, “Steered mixture-of-experts for light field coding, depth estimation, and processing,” in 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 1183–1188.
