Methodology for accurately assessing the quality perceived by users on 360VR contents

Lara Muñoz; César Díaz; Marta Orduna; José Ignacio Ronda; Pablo Pérez; Ignacio Benito; Narciso García

arXiv:1905.03508·cs.MM·May 21, 2020

TL;DR

This paper introduces a flexible methodology for accurately assessing user-perceived quality in 360VR content, considering viewport-specific quality over time for fairer evaluation of encoding and transmission strategies.

Contribution

It presents a novel, geometry-based assessment method that can be used both offline and online to evaluate viewport quality in 360VR, enabling fairer comparisons.

Findings

01

Provides a robust, geometry-based quality assessment framework.

02

Supports both offline and online evaluation modes.

03

Facilitates fair comparison of 360VR encoding strategies.

Abstract

To properly evaluate the performance of 360VR-specific encoding and transmission schemes, and particularly of the solutions based on viewport adaptation, it is necessary to consider not only the bandwidth saved, but also the quality of the portion of the scene actually seen by users over time. With this motivation, we propose a robust, yet flexible methodology for accurately assessing the quality within the viewport along the visualization session. This procedure is based on a complete analysis of the geometric relations involved. Moreover, the designed methodology allows for both offline and online usage thanks to the use of different approximations. In this way, our methodology can be used regardless of the approach to properly evaluate the implemented strategy, obtaining a fairer comparison between them.

Tables

Table 1. Mean relative error for different geometric approximations

Number of masks	Mean relative error
3x6	3.78%
5x10	2.16%
10x20	0.69%
20x40	0.29%

Table 2. Aggregated spatial and temporal quality pooling for user 32, content ’game’ and different segment lengths. $T_{\text{Q}}=0.8$

Segment length	$q_{window}$	$f_{window}$
500 ms	0.9650	95.89%
2000 ms	0.8679	74.11%
6000 ms	0.7232	53.72%

Table 3. Aggregated spatial and temporal quality pooling for user 28, segment length of 2000 ms and different contents. $T_{\text{Q}}=0.8$

Content	$q_{window}$	$f_{window}$
coaster	0.9700	95.94%
game	0.8956	79.33%
landscape	0.8137	64.39%

Table 4. Aggregated spatial and temporal quality pooling for content ’coaster2’, segment length of 2000 ms and different users. $T_{\text{Q}}=0.8$

User	$q_{window}$	$f_{window}$
user 8	0.7999	68.22%
user 42	0.8689	81.67%
user 11	0.9654	98.00%

Table 5. Average temporal and spatial quality pooling per content and segment length. $T_{\text{Q}}=0.8$

Content	$q_{window}$ (500 ms)	$f_{window}$ (500 ms)	$q_{window}$ (2000 ms)	$f_{window}$ (2000 ms)	$q_{window}$ (6000 ms)	$f_{window}$ (6000 ms)
coaster	0.9779	98.48%	0.9250	89.81%	0.8472	79.55%
coaster2	0.9787	98.47%	0.9258	89.57%	0.8396	76.85%
diving	0.9753	98.53%	0.9075	86.10%	0.7619	63.45%
drive	0.9697	97.50%	0.8845	81.95%	0.7609	64.43%
game	0.9767	97.92%	0.9041	85.84%	0.8231	74.98%
landscape	0.9718	97.30%	0.8707	78.21%	0.7358	60.65%
pacman	0.9787	97.80%	0.9230	88.37%	0.8440	77.61%
panel	0.9670	97.33%	0.8785	80.59%	0.7014	57.58%
ride	0.9751	97.87%	0.9037	85.81%	0.8075	70.37%
sport	0.9731	98.28%	0.9048	86.04%	0.7787	64.49%
Average	0.9744	97.95%	0.9028	85.23%	0.7904	68.99%

Equations

$$\begin{cases}\theta=\arctan\left(x/y\right)\\ \phi=\arctan\left(\sqrt{x^{2}+y^{2}}/z\right)\end{cases}\qquad\begin{cases}x=\sin\phi\,\sin\theta\\ y=\sin\phi\,\cos\theta\\ z=\cos\phi\end{cases}$$

$$SA=4\arcsin\left(\sin\frac{\phi_{\text{VP}}}{2}\,\sin\frac{\theta_{\text{VP}}}{2}\right)$$

$$N_{\text{picture}}=\sum_{i,j}a(\theta_{i},\phi_{j})=N_{H}\sum_{j}\sin\phi_{j}=\frac{2}{\pi}N_{H}N_{V}$$

$$N_{\text{viewport}}=\frac{2}{\pi^{2}}N_{H}N_{V}\arcsin\left(\sin\frac{\phi_{\text{VP}}}{2}\,\sin\frac{\theta_{\text{VP}}}{2}\right)$$

$$\begin{cases}E=(180^{o},\,90^{o}-\phi_{\text{VP}}/2)\\ F=(180^{o},\,90^{o}+\phi_{\text{VP}}/2)\\ G=(180^{o}-\theta_{\text{VP}}/2,\,90^{o})\\ H=(180^{o}+\theta_{\text{VP}}/2,\,90^{o})\end{cases}$$

$$\begin{cases}\phi_{A}=\phi_{B}=90^{o}-2\arctan\left(\tan\frac{\phi_{\text{VP}}}{2}\,\cos\frac{\theta_{\text{VP}}}{2}\right)\\ \phi_{C}=\phi_{D}=90^{o}+2\arctan\left(\tan\frac{\phi_{\text{VP}}}{2}\,\cos\frac{\theta_{\text{VP}}}{2}\right)\end{cases}$$

$$\left|\begin{array}{ccc}x_{L}&y_{L}&z_{L}\\ x_{A}&y_{A}&z_{A}\\ x_{B}&y_{B}&z_{B}\end{array}\right|=0$$

$$\phi_{L}=\arctan\left(-\frac{\gamma}{\alpha\sin\theta_{L}+\beta\cos\theta_{L}}\right)$$

$$\begin{cases}\alpha=\sin\phi_{A}\cos\theta_{A}\cos\phi_{B}-\sin\phi_{B}\cos\theta_{B}\cos\phi_{A}\\ \beta=-\sin\phi_{A}\sin\theta_{A}\cos\phi_{B}+\sin\phi_{B}\sin\theta_{B}\cos\phi_{A}\\ \gamma=\sin\phi_{A}\sin\theta_{A}\sin\phi_{B}\cos\theta_{B}-\sin\phi_{B}\sin\theta_{B}\sin\phi_{A}\cos\theta_{A}\end{cases}$$

$$\left[\begin{array}{ccc}1&0&0\\ 0&\cos(90^{o}-\phi)&\sin(90^{o}-\phi)\\ 0&-\sin(90^{o}-\phi)&\cos(90^{o}-\phi)\end{array}\right]$$

$$\tilde{m}_{i,p}=m_{i,p}\cdot w_{p}^{a}\quad\forall p\in M_{i}$$

$$Q_{i}=M_{i}\circ V_{i}$$

$$q_{\text{frame},i}=\frac{1}{N_{\text{viewport}}}\sum_{p\in M_{i}}q_{i,p}$$

$$q_{\text{frame},i}=\frac{1}{N_{\text{viewport}}}\sum_{p\in Q_{i}}q_{i,p}$$

$$q_{\text{window}}=\frac{1}{N_{f}}\sum_{i=0}^{N_{f}-1}q_{\text{frame},i}$$

$$f_{\text{window}}=\frac{1}{N_{f}}\sum_{i=0}^{N_{f}-1}\left[q_{\text{frame},i}>T_{\text{Q}}\right]$$

$$Q_{i}^{\prime}=M_{i}^{\prime}\circ V_{i}^{\prime}$$

Full text

Methodology for accurately assessing the quality perceived by users on 360VR contents

Lara Muñoz, César Díaz, Marta Orduna, José Ignacio Ronda, Pablo Pérez, Ignacio Benito, and Narciso García

L. Muñoz, C. Díaz, M. Orduna, J. I. Ronda and N. García are with the Grupo de Tratamiento de Imágenes, Information Processing and Telecommunications Center and ETSI Telecomunicación, Universidad Politécnica de Madrid, Madrid, Spain. P. Pérez and I. Benito are with Nokia Bell Labs, Madrid, Spain.

Index Terms:

360VR, video quality, viewport, QoE

I Introduction

During the last few years, interest in Virtual Reality (VR) has grown exponentially. Every day, more VR-related applications appear, and the number of VR-ready devices, particularly head-mounted displays (HMD), is quickly expanding as they become appealing and affordable to an increasing number of users. One of the most common VR applications is the visualization via streaming of 360 videos in non-interactive environments, covering a wide range of applications such as education [1], medical treatments [2], and simulators [3].

The transmission of this type of content is particularly challenging. The main reason is that, due to the nature of the environment and the features of the associated presentation systems, the requirements in terms of image resolution and quality to offer a truly immersive experience to the user are especially demanding [4, 5]. This results in sequences that require very high bit rates for their transmission. Thus, to provide smooth playback and a good quality service, it is necessary to incorporate coding and transmission management schemes that take advantage of the fundamental fact that only a fraction of the image can be seen by the user at a given moment (Figure 1). This fraction depends on the HMD, whose design sets the field of view (FoV) and, therefore, determines the viewport, the picture area shown to the user. As the FoV is usually a right rectangular pyramid, defined by the two angles between opposing planes (dihedral angles), the projection of the viewport on the sphere is a spherical rectangle centered on the point of gaze (PoG) of the user, that is, where the user is looking.

These coding and transmission schemes aim to offer high quality to the users while saving bits. This can be achieved by providing higher quality to the area that will presumably be visible to the user and lower quality to the areas with a lower probability of being visible. In this way, they can deliver a good quality of experience (QoE) while saving bandwidth. For this purpose, the raw image is tessellated into rectangular subimages, which are compressed and managed differently. This procedure paves the way for the adaptation of the content presented to the viewer. In this sense, both encoding and transmission modules need to be ready to enable viewport adaptation. Regarding encoding, the technique most commonly employed to process different areas of the image independently is the use of tiles, a tool included in the H.265/HEVC standard that enables the partition of the picture into independently decodable regions with some shared header information [6]. Before the appearance of tiles, H.264/AVC’s Flexible Macroblock Ordering (FMO) [7] could be used instead to distribute the quality heterogeneously across the image. However, this tool was neither efficient nor widely implemented. Regarding transmission, HTTP/TCP-based adaptive bit rate (ABR) streaming techniques [8] are commonly used to deliver omnidirectional video, due to their adaptability. In this scheme, content is encoded at different resolutions and bit rates and divided temporally into self-contained segments of equal duration that invariably start with an Instantaneous Decoding Refresh (IDR) frame. Therefore, the segment that best suits both the system state (available channel bandwidth, terminal capabilities…) and the viewer’s PoG at a particular moment is delivered to the client to be decoded and presented to the user.

The configuration used for certain parameters, such as the number of partitions, the encoding parameter values in each image subdivision or the segment length, and the transmission scheme will decisively influence several aspects of the system: quality provided and perceived by the user based on his/her behavior, bandwidth used, storage needs, intelligence requirements in different elements of the system, etc.

Most of the proposed methods are evaluated considering only the bandwidth they save. For example, Ghaznavi-Youvalari et al. [9] compare the bandwidth saved using different strategies: SHVC, regions of interest (RoIs), etc. In the work by Hosseini et al. [10], the bandwidth savings of two different proposals are computed with respect to the non-viewport-adaptive version of the content. Furthermore, Zare et al. [11] compare the compression and bitrate performance of different grid tiling sizes. However, one key problem of viewport-adaptive approaches is not strictly related to bandwidth consumption but to the need for a system that adapts quickly to the movements of the user. That is, it is essential that the high quality viewport that corresponds to the new position of the user is presented as soon as possible. Otherwise, the user will perceive low quality areas that will decrease his/her QoE. Therefore, many strategies use short segments and small buffers. The drawback is that short segments imply lower coding efficiency [12], so very low quality or even black areas [13] are necessary to be able to save bandwidth. Alternatively, some authors [14] propose using several streams with shifted IDR frames to allow quick switching between versions prepared for different viewports and representations while keeping a large IDR period. However, this proposal has its own disadvantages, such as greater complexity at the client side and a larger number of versions of the content at the server side. In either case, these changes are still not instantaneous, as they require the download and playback of a new IDR, so the user could perceive low quality areas if he/she moves quickly. Furthermore, the size of the high quality areas significantly influences the QoE: the larger the areas are, the lower the probability of seeing low quality areas, but also the lower the bandwidth saving for an equivalent quality in the viewport.

Therefore, whichever scheme is implemented, to measure its performance beyond the bandwidth required to transmit the content, we need a way to compute the actual quality provided to the user. In this way, we can test the design of the strategy and, when appropriate, improve it by fine-tuning the values of the parameters that characterize it. In this respect, some works measure the quality within the viewport using objective metrics, either in their original version or in a 360VR-aware one. In particular, Ozcinar et al. [15] calculate both the PSNR and SSIM [16] metrics inside the viewport, whereas Timmerer et al. [17] use the V-PSNR metric to determine the quality seen by the users over time. Others, such as the proposal by Xie et al. [18], estimate the perceived quality considering the results of subjective tests performed previously. These experiments focus on collecting opinion scores on different variations of the content, and the resulting mean opinion score (MOS) is used to model the impact of the quality variations. Finally, there is yet another group of works that use approximations to determine the quality that is really perceived by the users along the session. For example, Petrangeli et al. [19] use the percentage of time that the user looks at the high quality areas.

However, these proposals do not properly meet the requirements for adequately assessing the quality of omnidirectional content. Either their viewport projection is inaccurate or their implementation is not detailed enough. Hence, we present a detailed methodology, named VAQM (Viewport Adaptive Quality Method), to accurately calculate the viewport projection on the equirectangular image and, thus, to enable its use in every scheme looking for an overall quality metric value on the image seen by the user. As any standard metric (e.g. PSNR, SSIM, VMAF, MOS-related…) can be used, a complete solution for objective quality assessment is provided. Additionally, we provide a simplified version of the procedure for operation under strict computing time restrictions. In this version, certain approximations are used for computing the quality, such as a set of pre-calculated viewport projections.

The rest of the paper is organized as follows. First, in Section II, the viewport projection is explained in detail. Then, in Section III, the procedure to obtain a figure of merit reflecting the quality perceived by the user is presented. In Section IV, we describe the full method and the approximated version developed to obtain the quality of the session. The description of the experiments carried out to assess the performance of the method and its results and the corresponding analysis are included in Section V. Finally, the paper is concluded in Section VI.

II Viewport projection

Let us consider the coordinate system used for the definition of the PoG, the FoV, and the spherical rectangle that results from projecting the viewport on the sphere. Figure 2 presents the Cartesian and spherical coordinates, where the origins of the two spherical angles follow the usual choices for HMDs.

Therefore, the transformation between them is described by the sets of equations:

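$$\begin{cases}\theta=\arctan\left(x/y\right)\\ \phi=\arctan\left(\sqrt{x^{2}+y^{2}}/z\right)\end{cases}\qquad\begin{cases}x=\sin\phi\,\sin\theta\\ y=\sin\phi\,\cos\theta\\ z=\cos\phi\end{cases}$$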
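The transformation between spherical and Cartesian coordinates can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, angles are taken in degrees, and `atan2` is used in place of the plain arctangent so that the correct quadrant is recovered, which is an assumption of this sketch rather than something stated in the text.

```python
import math

def spherical_to_cartesian(theta_deg, phi_deg):
    """Map (theta, phi) in degrees to a point (x, y, z) on the unit sphere."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    return (math.sin(phi) * math.sin(theta),
            math.sin(phi) * math.cos(theta),
            math.cos(phi))

def cartesian_to_spherical(x, y, z):
    """Inverse mapping; returns theta in [0, 360) and phi in [0, 180] degrees."""
    # atan2 replaces the plain arctan so that the quadrant is resolved (assumption).
    theta = math.degrees(math.atan2(x, y)) % 360.0
    phi = math.degrees(math.atan2(math.hypot(x, y), z))
    return theta, phi

# Centre of the equirectangular image, O = (180 deg, 90 deg).
centre = spherical_to_cartesian(180.0, 90.0)
```

With these conventions the image centre maps, up to floating-point error, to the point $(0,-1,0)$ on the unit sphere.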

Several planar mappings of the sphere (also called projections) have been considered for the representation of 360 video content: equirectangular, cubemap, pyramidal, equiangular cubemap… [20, 21, 22]. Since no map of the sphere to the plane can be both conformal and area-preserving, each mapping affects the quality of the different areas of the 360 video content in a different way. Among the projections in use, the most common is the equirectangular projection and, therefore, it is the one considered here. Its main advantage lies in the simple transformation equations between the spherical and the planar coordinates. Assuming that the upper-left corner of the frame is the origin of the planar coordinates, the values of $\theta$ and $\phi$ can be scaled directly to obtain their planar counterparts. Therefore, if the 360 video contents are represented in an $N_{H}\times N_{V}$ frame, the pixel coordinates of a point $(\theta,\phi)$ on the sphere are $((\theta/360)N_{H},(\phi/180)N_{V})$ and, thus, we will keep the $(\theta,\phi)$ addressing for the equirectangular plane. Nevertheless, it must be stressed that any other projection can be considered, as the methodology proposed here can be applied to any geometry.
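As a small illustration of this scaling, the following sketch converts between $(\theta,\phi)$ in degrees and fractional pixel coordinates. The function names are hypothetical, and how fractional coordinates are rounded to integer pixel indices is left open, since the text does not specify it.

```python
def sphere_to_pixel(theta_deg, phi_deg, n_h, n_v):
    """Scale (theta, phi) in degrees to (column, row) on an n_h x n_v frame."""
    return (theta_deg / 360.0) * n_h, (phi_deg / 180.0) * n_v

def pixel_to_sphere(col, row, n_h, n_v):
    """Inverse scaling from (column, row) back to (theta, phi) in degrees."""
    return (col / n_h) * 360.0, (row / n_v) * 180.0

# The image centre (180 deg, 90 deg) on a 3840 x 1920 frame.
col, row = sphere_to_pixel(180.0, 90.0, 3840, 1920)
```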

II-A FoV, viewport, and projected viewport

Let us consider that the FoV is a right rectangular pyramid whose vertex is located at the center of the viewing sphere of the HMD. Thus, the viewport on the sphere is a spherical rectangle centered around the PoG, and each of its four sides is a great circle arc. If the FoV is defined by its two dihedral angles $(\theta_{\text{VP}},\phi_{\text{VP}})$, the solid angle $SA$ subtended at the center of the sphere by the FoV is [23]:

$$SA=4\arcsin\left(\sin\frac{\phi_{\text{VP}}}{2}\,\sin\frac{\theta_{\text{VP}}}{2}\right)$$

Any value of $(\theta_{\text{VP}},\phi_{\text{VP}})$ is acceptable. However, to help explain the procedure and perform the associated experiments, we have chosen $(\theta_{\text{VP}},\phi_{\text{VP}})=(100^{o},85^{o})$, since they represent the average value of the FoV parameters found in the most common HMDs. Thus, the solid angle is 2.15 steradians, roughly 1/6 of the surface of the sphere, which implies a large deformation, where the four spherical angles are clearly larger than $90^{o}$. Therefore, linear approximations, commonly employed in the literature, cannot be used.

Although the shape of the projected viewport in the equirectangular image varies significantly according to the location of the viewport on the sphere, let us remember that the size (subtended solid angle) of the viewport is constant. So, it is useful to obtain this constant value in pixel related units. If the sampling rate at the equator of the equirectangular image is taken as reference, each one of the pixels lying on the equator covers a unit area, either in the equirectangular image or on the sphere. However, pixels located outside the equator cover less area on the sphere as the sampling rate along parallels increases with the latitude.

As said before, the origin of coordinates of the equirectangular image is located in the upper-left corner, as shown in Figure 3. The area covered by each pixel in the above mentioned area units is $a(\theta,\phi)=\sin{\phi}$ . We call the equivalent number of pixels the area of the region expressed in these area units. Thus, the equivalent number of pixels of the whole frame is $N_{\text{picture}}$ :

$$N_{\text{picture}}=\sum_{i,j}a(\theta_{i},\phi_{j})=N_{H}\sum_{j}\sin\phi_{j}=\frac{2}{\pi}N_{H}N_{V}$$
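The closed form $\frac{2}{\pi}N_{H}N_{V}$ for the equivalent number of pixels of the whole frame can be checked numerically by summing $\sin\phi_{j}$ over the rows. The sketch below assumes that $\phi_{j}$ is sampled at the centre of row $j$, which is a choice made here for illustration; the function name is hypothetical.

```python
import math

def equivalent_pixels_picture(n_h, n_v):
    """Sum the per-pixel area a = sin(phi_j) over the whole frame."""
    # Assumption: phi_j is taken at the centre of row j.
    row_weights = (math.sin((j + 0.5) * math.pi / n_v) for j in range(n_v))
    return n_h * sum(row_weights)

n_h, n_v = 3840, 1920
discrete = equivalent_pixels_picture(n_h, n_v)
closed_form = 2.0 / math.pi * n_h * n_v
relative_gap = abs(discrete - closed_form) / closed_form
```

For frame sizes of this order the discrete sum and the closed form agree very closely, so the closed form is a convenient stand-in for the sum.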

As the whole frame covers the whole surface of the sphere, the solid angle subtended is $4\pi$ steradians. Thus, the equivalent number of pixels of the viewport, $N_{\text{viewport}}$ , can be obtained as a proportion of the solid angles subtended:

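$$N_{\text{viewport}}=\frac{2}{\pi^{2}}N_{H}N_{V}\arcsin\left(\sin\frac{\phi_{\text{VP}}}{2}\,\sin\frac{\theta_{\text{VP}}}{2}\right)$$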

This result should be rounded if an integer value is required.

Furthermore, this expression of the equivalent number of pixels of the viewport sets the maximum effective resolution that can be achieved by the HMD display. As an example, for the usual values considered in this paper, $(\theta_{\text{VP}},\phi_{\text{VP}})=(100^{o},85^{o})$ and $N_{H}\times N_{V}=3840\times 1920$, we obtain $N_{\text{viewport}}=802871$ pixels, clearly lower than 1 Mpixel.
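The chain from the dihedral angles to the equivalent number of pixels of the viewport can be sketched as follows. The function names are hypothetical, and the sketch simply evaluates the solid-angle expression and scales the frame's equivalent pixel count by the ratio of solid angles. No expected output is asserted for the paper's configuration, since the exact figure depends on the FoV values and rounding used by the authors.

```python
import math

def solid_angle(theta_vp_deg, phi_vp_deg):
    """Solid angle (steradians) of a right rectangular pyramid with the given dihedral angles."""
    half_theta = math.radians(theta_vp_deg) / 2.0
    half_phi = math.radians(phi_vp_deg) / 2.0
    return 4.0 * math.asin(math.sin(half_phi) * math.sin(half_theta))

def equivalent_pixels_viewport(theta_vp_deg, phi_vp_deg, n_h, n_v):
    """Scale the frame's equivalent pixel count by the fraction of the sphere covered."""
    n_picture = 2.0 / math.pi * n_h * n_v
    return n_picture * solid_angle(theta_vp_deg, phi_vp_deg) / (4.0 * math.pi)

# Configuration used in the text: FoV of (100, 85) degrees on a 3840 x 1920 frame.
n_viewport = equivalent_pixels_viewport(100.0, 85.0, 3840, 1920)
```

A quick sanity check is the limiting case of a FoV of $(180^{o},180^{o})$, which covers a hemisphere: the solid angle becomes $2\pi$ and the viewport takes half of the equivalent pixels of the frame.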

II-B Procedure for computing the projected viewport

Looking for a simpler set of operations to obtain the shape of the projected viewport, we decompose the computation of the projected viewport around the PoG of the user into a three-step procedure. First, we consider a base viewport centered on the central point of the equirectangular image and compute its vertices and several points along its four sides that will help define a piecewise linear approximation of those sides. Then, we rotate this set of points to place them around the PoG of the user. Finally, we obtain the desired projection by connecting those points, thus generating a closed region. Pixels within the boundaries marked by these connections belong to the projection and the set of these pixels is called the mask. These steps are explained in detail below.

II-C Base projected viewport

Every projected viewport is characterized by the location of its four spherical vertices. As stated, we first determine those of the base viewport, which correspond to the initial viewing experience. Throughout the paper and in our experiments, we assume that the user begins looking at the center of the equirectangular image, which corresponds to $O=(180^{o},90^{o})$ . However, the proposed methodology can be adapted to any other desired initial PoG located at the equator, due to the special features of the base viewport that are described below.

As the base viewport is centered on the equator, the two vertical sides follow two meridians. However, the two horizontal sides do not follow any parallel. The analysis begins by determining the coordinates of the midpoints of the four sides of the projected viewport ($E$, $F$, $G$ and $H$ in Figure 3):

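$$\begin{cases}E=(180^{o},\,90^{o}-\phi_{\text{VP}}/2)\\ F=(180^{o},\,90^{o}+\phi_{\text{VP}}/2)\\ G=(180^{o}-\theta_{\text{VP}}/2,\,90^{o})\\ H=(180^{o}+\theta_{\text{VP}}/2,\,90^{o})\end{cases}$$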

As the viewport sides $AC$ and $BD$ follow two meridians, they are projected as vertical straight lines on the equirectangular image. Therefore their abscissas are equal to those of their midpoints $G$ and $H$ respectively. However, the same does not apply for the other two lines $AB$ and $CD$ , requiring the analysis of the projection of the viewport on the sphere to obtain the values of their ordinates:

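$$\begin{cases}\phi_{A}=\phi_{B}=90^{o}-2\arctan\left(\tan\frac{\phi_{\text{VP}}}{2}\,\cos\frac{\theta_{\text{VP}}}{2}\right)\\ \phi_{C}=\phi_{D}=90^{o}+2\arctan\left(\tan\frac{\phi_{\text{VP}}}{2}\,\cos\frac{\theta_{\text{VP}}}{2}\right)\end{cases}$$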

Let us now consider the location of the points along the sides. Thus, let $AB$ be a great circle arc, $(x_{A},y_{A},z_{A})$ the Cartesian and $(\theta_{A},\phi_{A})$ the spherical coordinates of point $A$ , and $(x_{B},y_{B},z_{B})$ the Cartesian and $(\theta_{B},\phi_{B})$ the spherical coordinates of point $B$ . Additionally, let $L$ be another point in the same great circle arc defined by $A$ and $B$ on the unit sphere and let $(x_{L},y_{L},z_{L})$ be its Cartesian and $(\theta_{L},\phi_{L})$ its spherical coordinates. Then, since the three points belong to the same great circle arc, and so to the same plane, the determinant of the matrix built with their Cartesian coordinates is zero:

$$\left|\begin{array}{ccc}x_{L}&y_{L}&z_{L}\\ x_{A}&y_{A}&z_{A}\\ x_{B}&y_{B}&z_{B}\end{array}\right|=0$$

Solving the equation and taking into account that the point $L$ lies on the surface of the unit sphere, the equation of the line joining the two vertices $A$ and $B$ is:

$$\phi_{L}=\arctan\left(-\frac{\gamma}{\alpha\sin\theta_{L}+\beta\cos\theta_{L}}\right)$$

where

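$$\begin{cases}\alpha=\sin\phi_{A}\cos\theta_{A}\cos\phi_{B}-\sin\phi_{B}\cos\theta_{B}\cos\phi_{A}\\ \beta=-\sin\phi_{A}\sin\theta_{A}\cos\phi_{B}+\sin\phi_{B}\sin\theta_{B}\cos\phi_{A}\\ \gamma=\sin\phi_{A}\sin\theta_{A}\sin\phi_{B}\cos\theta_{B}-\sin\phi_{B}\sin\theta_{B}\sin\phi_{A}\cos\theta_{A}\end{cases}$$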

We sample the equation for several $\theta_{L}$ values to obtain their corresponding $\phi_{L}$ . The resulting $L$ points will be connected later using a piecewise linear function. Therefore, depending on the number of values used, the line approximation will be coarser (low number of points) or more accurate (high number of points). Figure 4 shows the results obtained in this step.
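The sampling of a non-meridian side can be sketched as below. The coefficients $\alpha$, $\beta$ and $\gamma$ are the cofactors of the first row of the determinant built from $L$, $A$ and $B$, and $\phi_{L}$ follows from $\tan\phi_{L}=-\gamma/(\alpha\sin\theta_{L}+\beta\cos\theta_{L})$. The function names, the uniform spacing of $\theta_{L}$ and the use of `atan2` modulo $\pi$ to keep $\phi_{L}$ within $[0^{o},180^{o})$ are assumptions of this sketch. It does not apply to sides that follow a meridian, for which $\theta$ is constant.

```python
import math

def great_circle_coefficients(a, b):
    """Return (alpha, beta, gamma) for vertices a and b given as (theta, phi) in degrees."""
    ta, pa = (math.radians(v) for v in a)
    tb, pb = (math.radians(v) for v in b)
    alpha = (math.sin(pa) * math.cos(ta) * math.cos(pb)
             - math.sin(pb) * math.cos(tb) * math.cos(pa))
    beta = (-math.sin(pa) * math.sin(ta) * math.cos(pb)
            + math.sin(pb) * math.sin(tb) * math.cos(pa))
    gamma = (math.sin(pa) * math.sin(ta) * math.sin(pb) * math.cos(tb)
             - math.sin(pb) * math.sin(tb) * math.sin(pa) * math.cos(ta))
    return alpha, beta, gamma

def sample_side(a, b, n_points):
    """Sample n_points >= 2 points (theta, phi), in degrees, on the arc from a to b."""
    alpha, beta, gamma = great_circle_coefficients(a, b)
    points = []
    for k in range(n_points):
        theta_deg = a[0] + (b[0] - a[0]) * k / (n_points - 1)
        theta = math.radians(theta_deg)
        denom = alpha * math.sin(theta) + beta * math.cos(theta)
        # atan2 modulo pi keeps phi in [0, 180) degrees (assumption of this sketch).
        phi = math.atan2(-gamma, denom) % math.pi
        points.append((theta_deg, math.degrees(phi)))
    return points

# Hypothetical vertices, both 30 degrees above the equator.
side = sample_side((130.0, 60.0), (230.0, 60.0), 5)
```

In this example the sampled points bend toward the pole between the two vertices, which illustrates why the horizontal sides do not follow a parallel. Using more points yields a finer piecewise linear approximation.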

II-D Rotation of central viewport

To simplify the operations in the second step, the viewport is first rotated along $\phi$ and then along $\theta$ . Roll movements are not considered, since they are extremely small.

For the rotation along $\phi$, as the central viewport is so far assumed to be centered at $(0,-1,0)$, it is performed about the $-x$ axis, as can be seen in Figure 5.

Therefore, the corresponding rotation matrix is:

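$$\left[\begin{array}{ccc}1&0&0\\ 0&\cos(90^{o}-\phi)&\sin(90^{o}-\phi)\\ 0&-\sin(90^{o}-\phi)&\cos(90^{o}-\phi)\end{array}\right]$$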

The second part of the rotation is performed as follows. Once the viewport has been rotated vertically, it is moved horizontally on the equirectangular image. In Figure 6, the points of the viewport in (a) are rotated $\theta$ degrees horizontally, obtaining the points shown in (b). More examples of the results at the end of this second step are shown in Figure 7.
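The two-part rotation can be sketched for a single point as follows. The sketch assumes that the rotation matrix is applied to column vectors $(x,y,z)^{T}$, that $\phi$ in the matrix is the $\phi$ coordinate of the PoG, and that the horizontal move is a shift of $\theta$ by the PoG's offset from $180^{o}$ with wrap-around. The function name is hypothetical.

```python
import math

def rotate_to_pog(theta_deg, phi_deg, pog_theta_deg, pog_phi_deg):
    """Move a point of the base viewport so that the base centre lands on the PoG."""
    # Spherical to Cartesian on the unit sphere.
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    x = math.sin(phi) * math.sin(theta)
    y = math.sin(phi) * math.cos(theta)
    z = math.cos(phi)
    # Vertical rotation by (90 deg - phi_PoG), applied to a column vector (assumption).
    angle = math.radians(90.0 - pog_phi_deg)
    c, s = math.cos(angle), math.sin(angle)
    y_rot = c * y + s * z
    z_rot = -s * y + c * z
    # Back to spherical coordinates.
    theta_rot = math.degrees(math.atan2(x, y_rot)) % 360.0
    phi_rot = math.degrees(math.atan2(math.hypot(x, y_rot), z_rot))
    # Horizontal shift on the equirectangular image, with wrap-around.
    theta_rot = (theta_rot + pog_theta_deg - 180.0) % 360.0
    return theta_rot, phi_rot

# The base centre (180 deg, 90 deg) moved to a hypothetical PoG of (250 deg, 40 deg).
moved = rotate_to_pog(180.0, 90.0, 250.0, 40.0)
```

Applying the same function to every vertex and sampled side point of the base viewport gives the set of points around the PoG that is later connected to form the mask.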

II-E Mask generation

After the first two steps, the obtained points are joined with straight lines, that is, using a piecewise linear function, generating a closed region. This closed region is then filled to obtain the desired mask. Different possible situations must be taken into account to correctly identify the region of the image within the mask. For example, Figure 8 presents a special case where the mask is not totally connected, but divided into two parts due to circular shifts. Additional examples of the obtained masks are shown in Figure 9.
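One possible way to fill the closed region is an even-odd (ray casting) test per pixel, sketched below. This is a swapped-in generic technique, not necessarily the filling procedure used by the authors. It assumes pixel centres as sample positions and does not handle the wrap-around case in which the mask is split into two parts. The function names are hypothetical.

```python
def point_in_polygon(theta, phi, polygon):
    """Even-odd test for a closed polygon given as a list of (theta, phi) vertices."""
    inside = False
    n = len(polygon)
    for k in range(n):
        t1, p1 = polygon[k]
        t2, p2 = polygon[(k + 1) % n]
        # Does this edge cross the horizontal line through phi, to the right of theta?
        if (p1 > phi) != (p2 > phi):
            t_cross = t1 + (phi - p1) * (t2 - t1) / (p2 - p1)
            if theta < t_cross:
                inside = not inside
    return inside

def build_mask(polygon, n_h, n_v):
    """Binary mask (list of rows) with 1 where the pixel centre lies inside the polygon."""
    mask = []
    for row in range(n_v):
        phi = (row + 0.5) * 180.0 / n_v
        mask.append([
            1 if point_in_polygon((col + 0.5) * 360.0 / n_h, phi, polygon) else 0
            for col in range(n_h)
        ])
    return mask

# Toy polygon on a tiny 8 x 4 frame, for illustration only.
square = [(90.0, 45.0), (270.0, 45.0), (270.0, 135.0), (90.0, 135.0)]
mask = build_mask(square, 8, 4)
```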

At this point, we have a binary mask, $M_{i}$, where the non-zero elements represent the viewport projection on frame $i$. However, to compensate for the unequal sampling of the sphere by the equirectangular projection, the values of the elements in $M_{i}$ are weighted according to the area they cover on the sphere. Each of these weights depends exclusively on the latitude and its value is equal to the sine of its corresponding $\phi$ value ($w^{a}_{j}=\sin\phi_{j}$).

$$\tilde{m}_{i,p}=m_{i,p}\cdot w_{p}^{a}\quad\forall p\in M_{i}$$
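The area weighting of the binary mask can be sketched as a per-row scaling by $\sin\phi_{j}$. The function name is hypothetical, and $\phi_{j}$ is assumed to be sampled at the centre of row $j$.

```python
import math

def apply_area_weights(mask):
    """Scale each row of a binary mask by sin(phi_j), the area weight of that row."""
    n_v = len(mask)
    weighted = []
    for j, row in enumerate(mask):
        # Assumption: phi_j is taken at the centre of row j.
        w = math.sin((j + 0.5) * math.pi / n_v)
        weighted.append([m * w for m in row])
    return weighted

# Toy 4 x 2 binary mask, for illustration only.
mask = [[0, 1], [1, 1], [1, 1], [0, 1]]
weighted = apply_area_weights(mask)
```

Pixels outside the mask stay at zero, while rows close to the poles contribute less than rows near the equator.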

Furthermore, the values of the corresponding pixels can be weighted additionally to provide more importance to the more relevant pixels within the viewport, i.e. the central area with respect to the viewport edges. In this case, the pixels in the viewport are weighted considering the distance to the PoG $w^{c}_{j}=f\left(d\left(j,j_{\text{PoG}}\right)\right)$ and normalized accordingly.

III Proposed Methodology

Viewport adaptation implies that the quality is not uniformly distributed on the image, which is instead composed of areas of different qualities. Figure 10 shows the difference between non-viewport-adaptive and viewport-adaptive methods, where LQ, MQ and HQ correspond to low quality, medium quality and high quality, respectively. Although the figure presents only two qualities for the non-viewport-adaptive method, more qualities can be used as long as the bitrate is preserved. This non-uniform quality distribution means that users may observe more than one quality at the same time. The information about the quality of each of the areas of the image is represented by a grade matrix $V$, where each entry represents the quality value of one pixel of the image. As mentioned in the introduction, these non-uniform distributions can be implemented thanks to the use of tiles, since each of them may have a different quality.

The main idea of the proposed methodology is to provide a figure of merit that reflects the quality really perceived by the user along a temporal window. Therefore, we need a representative value of the quality seen at each frame within this temporal scope. To that end, we define the quality $q_{i,p}$ of each pixel $p$ as the product of a geometric-related component, $m_{i,p}$ , and a grade-related one, $v_{i,p}$ . The matrix $M_{i}$ , outcome of the previous section as the mask representing the viewport projection, contains the geometric-related component of each pixel at frame $i$ . A second matrix, $V_{i}$ , contains the grade-related components of each pixel of the image of the viewport-adaptive content presented to the user at that particular moment. Thus, the resulting matrix $Q_{i}$ can be formulated as the Hadamard product of the geometric-related and the grade-related matrices:

$$Q_{i}=M_{i}\circ V_{i}$$

A spatial quality pooling can be obtained for every frame and a temporal quality pooling can be computed to obtain an overall quality figure for the considered temporal window.

Regarding the spatial pooling, the quality value $q_{\text{frame},i}$ for frame $i$ is computed as the average of all the quality values within $Q_{i}$ of the pixels in the viewport projection,

$$q_{\text{frame},i}=\frac{1}{N_{\text{viewport}}}\sum_{p\in M_{i}}q_{i,p}$$

where $N_{\text{viewport}}$ is the equivalent number of pixels of the viewport obtained in the previous section, and $q_{i,p}$ is the element in matrix $Q_{i}$ representing the quality of pixel $p$ . Since all the values outside the viewport projection are null, the summation can be extended to the whole image, delivering the same value:

$$q_{\text{frame},i}=\frac{1}{N_{\text{viewport}}}\sum_{p\in Q_{i}}q_{i,p}$$

The proposed methodology up to the output of the spatial quality pooling is illustrated in Figure 11.
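The spatial pooling for one frame can be sketched as the sum of the element-wise product of the weighted mask and the grade matrix, divided by the equivalent number of pixels of the viewport. The function name and the toy values are hypothetical. In the toy example the sum of the mask weights stands in for $N_{\text{viewport}}$, whereas the methodology uses the closed-form value.

```python
def frame_quality(mask, grades, n_viewport):
    """Sum of the Hadamard product of mask and grades, normalised by n_viewport."""
    total = 0.0
    for mask_row, grade_row in zip(mask, grades):
        for m, v in zip(mask_row, grade_row):
            total += m * v
    return total / n_viewport

# Toy area-weighted mask and binary grade matrix (1 = high quality area).
mask = [[0.0, 0.5, 0.5, 0.0],
        [0.0, 1.0, 1.0, 0.0]]
grades = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
n_viewport = sum(sum(row) for row in mask)  # toy stand-in for the closed-form value
q_frame = frame_quality(mask, grades, n_viewport)
```

Here half of the weighted viewport area falls on the high quality region, so the frame quality is 0.5. Because entries outside the mask are zero, iterating over the whole image gives the same value as iterating over the viewport only.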

Regarding the temporal pooling, we have defined two different approaches: the mean and the fraction of time above a threshold. The first one reflects the average quality shown to the user during a temporal window of the streaming session and is obtained by a uniform or weighted average of the spatial quality over time, leading to $q_{\text{window}}$ ,

$$q_{\text{window}}=\frac{1}{N_{f}}\sum_{i=0}^{N_{f}-1}q_{\text{frame},i}$$

where $N_{f}$ is the number of frames in the analyzed temporal window. The second approach gives a figure of the fulfillment of a minimum spatial quality along the analyzed window and is computed as the percentage of frames with a quality value higher than a threshold $T_{\text{Q}}$ :

$$f_{\text{window}}=\frac{1}{N_{f}}\sum_{i=0}^{N_{f}-1}\left[q_{\text{frame},i}>T_{\text{Q}}\right]$$

where $\left[P\right]$ is the Iverson bracket, i.e. 1 if $P$ is true and 0 otherwise. The threshold $T_{\text{Q}}$ can be set to any value between 0 and 1 that designers consider best suited to their goals. The greater $T_{\text{Q}}$ is, the stricter the imposed requirements are in terms of maintained quality over time, taking into account the specifics of the session (type of content, user behavior, setup…).
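Both temporal pooling figures can be sketched directly from a list of per-frame qualities. The function names and the sample values are hypothetical, and the uniform (unweighted) average is used for $q_{\text{window}}$.

```python
def window_quality(frame_qualities):
    """Uniform average of the per-frame qualities over the temporal window."""
    return sum(frame_qualities) / len(frame_qualities)

def fraction_above_threshold(frame_qualities, t_q):
    """Fraction of frames whose quality is strictly greater than the threshold t_q."""
    return sum(1 for q in frame_qualities if q > t_q) / len(frame_qualities)

# Toy window of four frames with a threshold of 0.8.
frame_qualities = [0.9, 0.7, 0.85, 0.6]
q_window = window_quality(frame_qualities)
f_window = fraction_above_threshold(frame_qualities, 0.8)
```

In this toy window two of the four frames exceed the threshold, so $f_{\text{window}}$ is 0.5, while $q_{\text{window}}$ is the plain mean of the four values.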

There are two approaches for populating the grade matrix $V_{i}$ used in the quality pooling. On the one hand, if the objective is to evaluate the proportion of high quality area that is presented to the user along the considered temporal window, the entries of matrix $V_{i}$ must be set either to one, if they belong to the high quality area, or to zero, otherwise. On the other hand, if the goal is an objective quality assessment, any metric that provides a value per pixel can be used for populating matrix $V_{i}$.

Additionally, non-pixel-based objective metrics such as MSE, PSNR, SSIM or VMAF, which provide a single value per frame, can be used by computing $V_{i}$ on a per-area basis. In this case, we can exploit the fact that the equirectangular image can be divided into tiles for encoding and therefore we can apply the desired technique to each of them individually or to sets of them (Figure 12). In this way, all the pixels belonging to the same tile or set of tiles will have the same value in the grade matrix. Although the matrix $V_{i}$ is computed in a different way, the rest of the proposed methodology is unchanged.
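A minimal sketch of this per-area construction follows. It assumes a regular tile grid and per-tile scores already normalized to the range from 0 to 1; the function name and the numbers are hypothetical.

```python
def grades_from_tiles(tile_scores, height, width):
    """Expand a grid of per-tile scores into a per-pixel grade matrix.

    Every pixel inherits the score of the tile that contains it.
    """
    rows, cols = len(tile_scores), len(tile_scores[0])
    return [[tile_scores[y * rows // height][x * cols // width]
             for x in range(width)]
            for y in range(height)]


# Made-up scores for a 2x2 tiling of a tiny 4x6 image.
tile_scores = [[0.9, 0.8],
               [0.4, 0.3]]
v_matrix = grades_from_tiles(tile_scores, height=4, width=6)
```

The resulting matrix has the size of the image, so it can be combined with the mask exactly as in the per-pixel case.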

IV Methods

The methodology presented in the previous section requires the computation of matrices $M_{i}$ and $V_{i}$ for all the frames in the session. However, computing requirements might be a burden for lightweight real-time applications. Thus, we have defined two methods: the full method, called Viewport Adaptive Quality Method (VAQM), and a lighter version, called Approximated Viewport Adaptive Quality Method (AVAQM), where matrices $M_{i}$ and $V_{i}$ are selected from pre-computed sets of masks and grades to speed up the process. Both methods are described next.

IV-A VAQM (Viewport Adaptive Quality Method)

The scheme followed in the full method is shown in Figure 13. During the session, we constantly collect information about the content that the user is watching and the PoG of the user at each moment. Afterwards, the viewport projection is computed for all the collected samples, obtaining a new mask for each instant of time. Moreover, the matrix $V_{i}$ is generated by applying the desired quality metric (FR or not) to the whole image (e.g. MSE, PSNR, SSIM or VMAF). Thus, this matrix is of the same size as the transmitted video images. In summary, this approach achieves very high accuracy at the expense of a greater computational cost.

IV-B AVAQM (Approximated Viewport Adaptive Quality Method)

This approach arises from the fact that, in some scenarios, time restrictions may not allow us to apply the previous method directly since, depending on the available resources, it could be computationally too costly. It is also suitable when accuracy requirements are more relaxed. The scheme followed by this approach is shown in Figure 14.

Approximations might be carried out independently on two fronts: in the geometric-related part of the procedure to generate matrix $M_{i}$ , and in the grade-related part to obtain $V_{i}$ .

Regarding the geometric part, $M_{i}$ can be approximated using a finite set of pre-calculated masks with centers uniformly distributed throughout the equirectangular image, as represented by the yellow circles in Figure 15. The selected pre-calculated mask for frame $i$, $M^{\prime}_{i}$ (in green in the same figure), is the one whose center $\left(\theta^{\prime}_{i},\phi^{\prime}_{i}\right)$ is the nearest to the projected PoG of the user $\left(\theta_{i},\phi_{i}\right)$ (in red in the same figure). On the plus side, the geometric approximation greatly decreases the computational load required to perform the viewport projection. On the downside, there is a certain loss of accuracy and a slight increase in storage requirements. Both drawbacks depend heavily on the number of pre-calculated masks used.
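The mask selection can be sketched as a nearest-center lookup. Several details here are assumptions made for illustration: angles are in degrees with $\theta$ as longitude in $[-180, 180)$ and $\phi$ as latitude in $[-90, 90]$, centers sit in the middle of a regular grid of cells, and "nearest" is measured with the great-circle distance so that the wrap-around in longitude is handled. The function names are hypothetical.

```python
import math


def mask_centers(n_rows, n_cols):
    """Centers (theta, phi), in degrees, of an n_rows x n_cols grid of cells."""
    centers = []
    for r in range(n_rows):
        phi = 90.0 - (r + 0.5) * 180.0 / n_rows
        for c in range(n_cols):
            theta = -180.0 + (c + 0.5) * 360.0 / n_cols
            centers.append((theta, phi))
    return centers


def angular_distance(a, b):
    """Great-circle distance in radians between two (theta, phi) points."""
    t1, p1 = math.radians(a[0]), math.radians(a[1])
    t2, p2 = math.radians(b[0]), math.radians(b[1])
    cos_d = (math.sin(p1) * math.sin(p2)
             + math.cos(p1) * math.cos(p2) * math.cos(t1 - t2))
    return math.acos(max(-1.0, min(1.0, cos_d)))  # clamp rounding errors


def nearest_mask_index(pog, centers):
    """Index of the pre-calculated mask whose center is closest to the PoG."""
    return min(range(len(centers)),
               key=lambda k: angular_distance(pog, centers[k]))


centers = mask_centers(10, 20)                    # 200 pre-calculated masks
idx = nearest_mask_index((10.0, 5.0), centers)    # mask used for this PoG
```

With this layout, the per-frame cost reduces to a lookup among the stored centers instead of a full viewport projection.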

With respect to matrix $V_{i}$, the entries of its approximation $V^{\prime}_{i}$ can be computed from encoding parameters that do not directly express the resulting image quality but provide a sufficiently accurate indication of it, like the Quantization Parameter (QP) used to encode the basic processing units in the image (e.g. Coding Tree Units, CTUs, in H.265/HEVC). In this particular case, lower values correspond to better qualities. The advantage of this approximation lies in the ease and speed with which such values can be obtained and mapped. The drawback is that these values represent the quality perceived by users less faithfully than the values produced by a proper quality metric.
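As an illustration, one possible way of turning QP values into grades between 0 and 1 is a linear, inverted mapping. This mapping is an assumption of this sketch and is not prescribed by the methodology; the default bounds correspond to the 0 to 51 QP range of 8-bit H.265/HEVC, and the function name is hypothetical.

```python
def qp_to_grade(qp, qp_min=0, qp_max=51):
    """Map a QP value to a grade in [0, 1]; lower QP gives a higher grade.

    Assumption: a linear mapping, chosen only for simplicity.
    """
    qp = min(max(qp, qp_min), qp_max)  # clamp to the valid range
    return (qp_max - qp) / (qp_max - qp_min)


# Made-up QP values for a high quality and a low quality area.
high_quality_grade = qp_to_grade(22)
low_quality_grade = qp_to_grade(37)
```

Any monotonically decreasing mapping would preserve the ordering of qualities; the choice affects only the absolute scale of the pooled values.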

Finally, the approximated matrix $Q^{\prime}_{i}$ is computed in an analogous way as before as the Hadamard product of matrices $M^{\prime}_{i}$ and $V^{\prime}_{i}$ :

$$Q^{\prime}_{i}=M^{\prime}_{i}\circ V^{\prime}_{i}$$

We include in Table I the mean relative error that results from using different numbers of pre-calculated masks. The study has been carried out with videos of UHD-4K resolution (3840x1920 pixels). Moreover, the QP value used to encode a set of pixels is used as the quality value of these pixels. Nevertheless, this choice has no real impact on the results since, as mentioned, the study aims at measuring relative values. To help interpret these values, recall that the viewport covers around 1/6 of the surface of the sphere.

Based on the results shown in the table, it can be concluded that the approximation with 10x20 masks constitutes a good trade-off between accuracy (mean relative error of less than 1%) and storage requirements.
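For reference, a mean relative error of this kind could be computed as sketched below. The exact definition used to produce Table I is not spelled out in the text, so the formula (absolute difference between the approximated and the exact pooled value, divided by the exact value and averaged over frames) is an assumption, as are the function name and the sample numbers.

```python
def mean_relative_error(exact, approx):
    """Average of |approx - exact| / |exact| over all frames.

    The exact values are expected to be non-zero.
    """
    return sum(abs(a - e) / abs(e) for e, a in zip(exact, approx)) / len(exact)


# Made-up pooled qualities: full method versus pre-calculated masks.
exact_values = [0.80, 0.50]
approx_values = [0.82, 0.49]
error = mean_relative_error(exact_values, approx_values)
```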

V Experiments and results

The methodology explained in the previous section has been tested through two main sets of experiments. The first one considers the effect of the length of the video segments, whereas the second one is focused on the impact of the movements of the user. Before describing them in depth, we present the test features that are common to them.

The videos used in the experiments come from the public database 360 Video Viewing Dataset in Head-Mounted Virtual Reality [24]. This database includes 10 one-minute-long videos of 3840x1920 pixels at 30 fps. For each source, it also includes the trajectory followed by 50 different users at one sample per frame. Regarding the content preparation, the sequences were H.265/HEVC encoded with 5x8 tiles, a trade-off between coding efficiency and sufficient granularity to generate efficient viewport-adaptive content. Each viewport-oriented sequence, that is, the version of the content created to be presented to the user when his/her PoG lies in a specific area of the sphere, is the result of encoding the source content with a given distribution of qualities. Each of these sequences is associated with one of the non-overlapping areas in the equirectangular image depicted in Figure 16, where the upper and lower stripes of tiles have each been merged into a single area. Thus, we have generated 26 viewport-oriented sequences per video source, which are later segmented to be used in an ABR platform. Additionally, we assume that virtually the whole motion-to-photon latency of the system corresponds to the time needed for the segment currently being decoded and presented (which corresponds to the previous PoG) to finish, so that the segment correctly adapted to the current PoG can start the same process. Thus, we assume that the time required to download segments is negligible. As a consequence, there are no stalls.
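The count of 26 areas follows from the tile layout if "5x8" is read as 5 rows by 8 columns, an assumption that is consistent with merging the upper and lower stripes: $3\times 8+2=26$. The sketch below makes this mapping explicit; the area numbering and the function name are hypothetical.

```python
def area_of_tile(row, col, tile_rows=5, tile_cols=8):
    """Area index of a tile when the top and bottom stripes are merged."""
    if row == 0:
        return 0                                   # whole upper stripe
    if row == tile_rows - 1:
        return (tile_rows - 2) * tile_cols + 1     # whole lower stripe
    return 1 + (row - 1) * tile_cols + col         # one area per middle tile


# Collect the distinct area indices over the whole tile grid.
n_areas = len({area_of_tile(r, c) for r in range(5) for c in range(8)})
```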

Regarding the configuration related to the proposed methodology, the experiments have been performed using the approximated version of the methodology, that is, with pre-calculated masks for the viewport projections and a set of pre-calculated qualities for each of the areas in the image. As said before, we have assumed a FoV of 100° horizontally and 85° vertically. Based on the results shown in Table I, we use 10x20 pre-calculated masks. Finally, for the set of pre-calculated qualities, we have considered the use of the values 1 and 0 for the high and low quality, respectively. Four examples of the distribution of both values in the areas in as many viewport-oriented sequences are depicted in Figure 17.

Finally, the timeline considered for each session is that of the duration of the sequence presented to the user. Therefore, the window for the temporal pooling comprises 1800 frames (one minute at 30 fps). Furthermore, besides the average quality provided within the viewport throughout the considered temporal window ($q_{\text{window}}$), we assess the quality of the session by computing the percentage of frames with a quality value higher than 80% of the maximum quality ($f_{\text{window}}$ with $T_{\text{Q}}=0.8$).

V-A Impact of the segment length

For the first set of experiments, the described methodology has been used to compute the observed quality over time during a specific session as a function of the segment length. To that end, we have simulated the use of segments of three different lengths: 500 ms, 2000 ms and 6000 ms. Figure 18 shows the quality presented to the same user (user 32 in the dataset) visualizing the same content (content ’game’ in the dataset) if different segment lengths were used. Additionally, the results of the two temporal quality pooling approaches proposed in Section III are included in Table II.

As can be observed, the longer the segment, the lower the average quality and the lower the percentage of time above the threshold. This is due to the slower adaptation of the viewport-oriented content to the user’s new PoG.
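The effect of the segment length on adaptation delay can be illustrated with a toy simulation. It assumes, as a simplification, that the viewport-oriented sequence can only be switched at segment boundaries and is chosen from the PoG area at the start of each segment; the share of frames in which the served area matches the actual PoG area is a crude proxy, not the pooled quality of the methodology. The function names and the trajectory are hypothetical.

```python
def frames_per_segment(segment_ms, fps=30):
    """Number of frames in one segment, e.g. 500 ms at 30 fps gives 15."""
    return segment_ms * fps // 1000


def served_areas(pog_areas, segment_ms, fps=30):
    """Area each frame is oriented to, fixed at the start of its segment."""
    step = frames_per_segment(segment_ms, fps)
    return [pog_areas[(i // step) * step] for i in range(len(pog_areas))]


def match_ratio(pog_areas, segment_ms, fps=30):
    """Share of frames whose served area equals the actual PoG area."""
    served = served_areas(pog_areas, segment_ms, fps)
    return sum(s == a for s, a in zip(served, pog_areas)) / len(pog_areas)


# Toy trajectory: the PoG stays in area 3 for 100 frames, then moves to area 4.
trajectory = [3] * 100 + [4] * 260
ratios = {ms: match_ratio(trajectory, ms) for ms in (500, 2000, 6000)}
```

For this particular trajectory the match ratio decreases as the segment gets longer, because the switch to the new area has to wait for the next segment boundary.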

V-B Impact of the amount of exploration

The amount of exploration, and thus of variation of the user PoG, during a session depends basically on two factors: the type of content (how driven or exploratory it is) and the user’s own nature (how active or calm he/she is when viewing 360VR content). Therefore, the analysis of these elements is key for designers to properly build and tune encoding and transmission strategies that provide good QoE.

In this subsection, we use the proposed method to accurately study this dependency. First, we analyze the degree to which the nature of the content boosts exploration across the equirectangular image, which, as mentioned, can notably impact the quality perceived by users in viewport-adaptive schemes. To that end, we compute the spatial quality pooling along time obtained for a number of sessions. Each session consists of the same user watching a different content. We have selected three sessions with rather different contents on the driven/exploratory dimension. The results for these sessions are included in Figure 19. Furthermore, we have included the aggregated spatial and temporal quality pooling to provide a global quality value per session. These results are included in Table III.

As expected, the more exploratory the content presented to the user is, the more it encourages him/her to move, which causes more quality changes, as reflected in the figure. These movements result in a lower average quality and a lower percentage of time above the threshold, as shown in the table.

Next, we study the influence of the user’s behavior on the degree of exploration. Again, we compute the spatial quality pooling along time obtained for a number of sessions. Each of these sessions consists of the content being presented to a different user. Figure 20 depicts the results for three users that could be classified as curious, medium and quiet. Furthermore, we have included the aggregated spatial and temporal quality pooling to provide a global quality value for each of these sessions. These results are included in Table IV.

The results show that the more active the user is, the more he/she moves, and so the more quality changes occur along the session. As before, this type of session correlates with a lower average quality and a lower percentage of time above the threshold.

V-C Summary of results

Finally, we include the average aggregated temporal and spatial quality pooling per content and per segment length in Table V. As expected, on average, there is a clear difference between the results obtained for different segment lengths, with shorter segments allowing a better adaptation and so a better global quality.

Focusing on the shortest segment length, 500 ms, we can observe that the values obtained after performing the temporal quality pooling do not vary much with the content. The reason is that the use of very short segments provides quick adaptation, thus preventing the user’s PoG from moving far from its position at the beginning of the segment, when it was last updated, regardless of the nature of the content and, as a matter of fact, of the behavior of the user. However, the longer the segments used in the session, the lower the quality perceived by users on average and the higher the variations between contents. The overall quality drop is a consequence of the increasing time the user has to explore the scene before the viewport is updated, regardless of the content. The variability in the amplitude of the quality drop reflects the influence of the characteristics of the content on the amount of exploration. In this respect, the more exploratory the presented content, the lower the temporal quality pooling. Both tendencies reinforce the analysis presented in previous subsections.

VI Conclusions

The accurate assessment of the quality perceived by users throughout a 360VR video visualization session is key in the design of robust specific encoding and transmission strategies. In particular, the strict requirements to provide 360VR content with good quality have led to the development of many different viewport-adaptation strategies aiming at offering the best possible quality while saving bitrate. To properly evaluate these schemes, not only the saved bitrate, but also the quality of the portion of the scene actually presented through the HMD at all times should be considered. In this paper, we have proposed a methodology to accurately assess the quality inside the viewport around the user’s point of gaze at every moment. This methodology has been made possible thanks to a complete analysis of the geometric relations involved in this particular environment, also detailed in the paper.

The proposed procedure is highly flexible and allows for any trade-off between accuracy and computational load. This is done by selecting the degree of approximations that best suits the specific requirements of the scenario. These options enable the use of the proposed methodology both offline and online, depending on the needs of the system.

Finally, we have shown its operation through a set of descriptive experiments. In particular, we have tested the effect of different essential factors on the observed quality, such as the length of the segments and the amount of movement of the user along the session. The analysis of the results validates the capability of the proposed methods to assess the quality perceived by users from different perspectives.

Bibliography

[1] L. Freina and M. Ott, “A literature review on immersive virtual reality in education: State of the art and perspectives,” eLearning & Software for Education, no. 1, 2015.
[2] J. H. Seo, B. M. Smith, M. Cook, E. Malone, M. Pine, S. Leal, Z. Bai, and J. Suh, “Anatomy Builder VR: Applying a constructive learning method in the virtual reality canine skeletal system,” in International Conference on Applied Human Factors and Ergonomics. Springer, 2017, pp. 245–252.
[3] K. Brunnström, M. Sjöström, M. Imran, M. Pettersson, and M. Johanson, “Quality of experience for a virtual reality simulator,” in Human Vision and Electronic Imaging (HVEI), Burlingame, California, USA, 28 January-2 February 2018.
[4] R. S. Allison, K. Brunnström, D. M. Chandler, H. Colett, P. Corriveau, S. Daly, J. Goel, J. Y. Long, L. M. Wilcox, Y. Yaacob, S.-N. Yang, and Y. Zhang, “Perspectives on the definition of visually lossless quality for mobile and large format displays,” Journal of Electronic Imaging, vol. 27, no. 5, pp. 1–23, Oct. 2018.
[5] R. G. de A. Azevedo, N. Birkbeck, F. De Simone, I. Janatra, B. Adsumilli, and P. Frossard, “Visual distortions in 360-degree videos,” arXiv:1901.01848, Jan. 2019.
[6] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[7] P. Lambert, W. De Neve, Y. Dhondt, and R. Van de Walle, “Flexible macroblock ordering in H.264/AVC,” Journal of Visual Communication and Image Representation, vol. 17, no. 2, pp. 358–375, 2006.
[8] A. Bentaleb, B. Taani, A. Begen, C. Timmerer, and R. Zimmermann, “A Survey on Bitrate Adaptation Schemes for Streaming Media over HTTP,” IEEE Communications Surveys & Tutorials, 2018, Early Access.