Overt visual attention on rendered 3D objects
Oleksii Sidorov, Joshua S. Harvey, Hannah E. Smithson, Jon Y. Hardeberg

TL;DR
This study investigates how different material appearances of 3D objects influence visual attention, using eye-tracking and a novel gaze projection technique to analyze fixation patterns on rendered surfaces.
Contribution
We introduce a new method for projecting gaze fixations directly onto 3D object surfaces and demonstrate its effectiveness in studying material-dependent attention.
Findings
Material appearance significantly affects visual attention patterns.
The novel gaze projection technique improves accuracy of attention map visualization.
Different materials like glossy, matte, and gold alter fixation distributions.
Table 1. Within-subject comparison scores for pairs of materials (a, b).
| Score | a: Rough, b: Smooth | a: Smooth, b: Coating | a: Coating, b: Rough |
|---|---|---|---|
| Difference of mean WSS | 11.41 | -6.39 | -2.53 |
| CC(a,b) | 0.348 | 0.328 | 0.349 |
| SIM(a,b) | 0.431 | 0.418 | 0.435 |
Overt visual attention on rendered 3D objects
Oleksii Sidorov
The Norwegian Colour Lab, NTNU, Teknologivegen 22, 2815 Gjøvik, Norway
Joshua S. Harvey
Hannah E. Smithson
Dept. of Experimental Psychology, University of Oxford, Oxford, UK
Jon Y. Hardeberg
The Norwegian Colour Lab, NTNU, Teknologivegen 22, 2815 Gjøvik, Norway
(2019)
Abstract.
This work covers multiple aspects of overt visual attention on 3D renders: measurement, projection, visualization, and application to studying the influence of material appearance on looking behaviour. In the scope of this work, we ran an eye-tracking experiment in which observers were presented with animations of rotating 3D objects. The objects were rendered to simulate different metallic appearances, specifically smooth (glossy), rough (matte), and coated gold. The eye-tracking results illustrate how material appearance itself influences the observer’s attention while all other parameters remain unchanged. In order to make the visualization of attention maps more natural and the analysis more accurate, we develop a novel technique for projecting gaze fixations onto the 3D surface of the figure itself, instead of the conventional 2D plane of the screen. The proposed methodology will be useful for further studies of attention and saliency in the computer graphics domain.
Keywords: Attention, saliency, eye-tracking, visualization, material, appearance, perception, gaze
Conference: The 12th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia, November 17 – 20, 2019, Brisbane, Australia
CCS concepts: Human-centered computing: Visualization; Computing methodologies: Computer graphics; Computing methodologies: Perception
1. Introduction
Research on human attention and visual saliency has gained wide popularity due to the numerous tasks where it can be applied and produce useful results (Jacob and Karn, 2003)(Bergstrom and Schall, 2014)(Rayner et al., 2001)(Cowen et al., 2002). Such studies are particularly important for multimedia and graphic design. Measurement and prediction of the areas where the observer is likely to look helps to create content oriented specifically to the human visual system and allows low-level cognitive processes to be considered, resulting in more inclusive and efficient design. In addition, the study of saliency in computer graphics (CG) is important for understanding fundamental psychological aspects of human-computer interaction. Although the gap between rendered and natural scenes is being reduced very rapidly, in most cases, visual perception of computer graphics is still significantly different from that of natural scenes. This inevitably influences the distribution of attention which in turn complicates the transfer of well-known psychological principles from real-world to synthetic environments.
In this work, we develop a new methodology for more efficient projection and visualization of gaze fixations on rendered 3D meshes, and demonstrate its application to the study of material appearance. The classical eye-tracking work of Yarbus (Yarbus, 1967) illustrates how gaze direction is affected by the task given to the observer (or the question asked), so we also perform the experiment under a set of different tasks and analyze their impact on overt visual attention.
The paper is organized in the following way: Section 2 presents existing works on the study of attention in 3D as well as studies on material perception, Section 3 describes the theory behind the newly developed methodology, Section 4 presents the experimental setup and rendering parameters, and Section 5 reports eye-tracking results from the experiment and their analysis.
2. Related works
The novelty of our work lies in the fact that no previous works study the questions we try to answer. Very few recent works are dedicated to overt visual attention on 3D objects. (Leek et al., 2012) conducted a study of eye movement patterns on 3D meshes of abstract shapes in the context of shape perception and the influence of features such as edges, curvatures, and concavities. In contrast to our work, the stimuli used in that study were simple geometric shapes rendered in MATLAB, and eye-tracking results were analyzed in 2D. The recent work of (Wang et al., 2018) was aimed at assessing visual attention on the surface of real 3D objects, not CG renders. Moreover, the goal of that work was to create a large public dataset of gaze fixations on 3D objects and to train predictive machine-learning algorithms (the resulting saliency maps were two-dimensional). Similarly to our approach, these authors project fixation maps directly onto the surface of the 3D object. In the case of real objects, this process becomes a classical 3D reconstruction task, which can be solved accurately when the geometry of the experiment is known and the system is calibrated (as is the case in (Wang et al., 2018)). However, in the case of CG, when objects are "beyond" the screen and the geometry cannot be measured, such an approach cannot be used.
Several other eye-tracking studies on 3D renders have been conducted: (Howlett et al., 2005) measured gaze fixations on low-resolution models (5000-8000 faces), (Kim et al., 2010) evaluated the performance of 2D saliency prediction algorithms on high-res 3D models (projected onto a plane), and (Mantiuk et al., 2013) extended the work to include experimental stimuli that were moving 3D animations. The closest match with our method can be found in work of (Lavoué et al., 2018). The goal of the work is to create a dataset of 3D renders for benchmarking of computational algorithms of saliency prediction. As part of this endeavour, the authors perform a comprehensive study of attention under different conditions, including static and dynamic scenes. They also visualize saliency maps on the rendered 3D surfaces. According to the authors, in order to determine which surface point on the 3D mesh corresponds to the fixation pixel, they compute the ray emitted by the camera pinhole that passes through the pixel on the image, and then calculate the closest point of intersection with the 3D surface. However, it was not disclosed how the coordinates of the intersection point on a plane were translated to the three-dimensional point cloud. Doing this requires the availability of a mapping function in analytical form or a corresponding lookup table (LUT) which may substitute this function. We demonstrate specifically how one can estimate such a LUT and eventually perform projection on the mesh.
Existing work on material appearance is rather sparse, largely because it is a relatively new research field. Color is not the only characteristic that defines material perception; objects may also be characterized in terms of glossiness, translucency, transparency, roughness, bumpiness, etc. A number of works have been directed towards studying the low-level visual features that support material perception (Anderson, 2011)(Landy, 2007)(Fleming, 2014). In the majority of cases, computer graphics is used as a tool to simulate gradations of particular appearance parameters such as glossiness or transparency. Other works have investigated the neural processes involved in material perception. In particular, the cortical regions responsible for the perception of glossiness were studied independently by (Kentridge et al., 2012) and (Wada et al., 2014). A comprehensive review of advances in glossiness perception can be found in (Chadwick and Kentridge, 2015). (Toscani et al., 2013) have shown a causal link between gaze behaviour and lightness judgements, and that fixations on intense pixels in rendered objects are particularly informative about surface reflectance. However, to the best of our knowledge, there are no studies of visual saliency in relation to an object’s appearance, and glossiness in particular. (Qi et al., 2018) collect fixation data from different materials with the aim of predicting saliency maps followed by material classification in computer vision, while (Leonards et al., 2007) demonstrate how materials of different appearance are used in cultural heritage to direct the observers’ attention. These works are also good examples of applications of saliency studies in computer graphics: they are crucial for good-quality simulation of particular materials, as well as for controlling observers’ gaze and emphasizing certain regions in a rendered scene.
3. Projection and visualization of fixations maps
Let us consider a 3D scene consisting of the points $P = \{p_i\}$, $p_i = (x_i, y_i, z_i)$. Rendering the scene projects all the points of $P$ to the image plane: $f: \mathbb{R}^3 \to \mathbb{R}^2$, $f(p_i) = (u_i, v_i)$. The eye-tracker produces its output as a set of fixation points in the 2D plane of the screen, which correspond to fixations on pixels of the rendered scene at the same location (provided that the eye-tracker is calibrated). Each fixation point may also be interpreted as the intersection of a ray from the camera pinhole with the image plane. Therefore, the task is to find a function $f^{-1}$ which re-projects fixation coordinates to the 3D space of the initial scene and results in a 3D cloud of fixations wrapped around the scene’s mesh.
If the projective function $f$ is known in analytical form, the task reduces to finding its inverse (which, however, may not exist). In all other cases, $f^{-1}$ should be estimated. Considering that the number of points is finite, $f^{-1}$ may also be replaced with a lookup table without losing any information. Next, we propose one possible solution to this task: computing a lookup table using three-dimensional color-encoding of the coordinates.
The approach is based on the dimensionality of RGB color space, which is equal to the dimensionality of Euclidean space, and that therefore allows translating coordinates between them. The limitation is that RGB color space is finite (0-255 or 0-65535 depending on bit-depth), unlike Euclidean space. However, rendered 3D-scenes also occupy a strictly defined volume, which allows them to be scaled linearly to fit the limits of RGB space:
$$(R, G, B) = k \cdot (x, y, z), \qquad 0 \le R, G, B \le 2^{16} - 1, \tag{1}$$
considering that RGB space is a 16-bit cube (which allows a spatial resolution of $2^{16}$ levels along each axis). This operation forms the basis of the next step: rendering an additional map of the scene with a texture that stores the 3D coordinates as pixel colors. This map can also be interpreted as a 2D LUT. Therefore, a pixel’s coordinates $(u, v)$ on the image correspond to a unique color vector $I(u, v)$ which can be used to recover 3D information:
$$(x, y, z) = \frac{1}{k} \cdot I(u, v), \tag{2}$$
where the scaling coefficients $k$ for the given scene are stored separately (translation is omitted). In this case, $I$ is a conventional RGB image which can also be transformed to sRGB for visualization (Fig. 2a). In our experiments, the maps were created using the Cycles render engine with a custom "Emission" surface material configured according to (1). It is worth noting that only the points captured by a single camera view can be reconstructed from one image, similarly to real-world 3D reconstruction. Reconstruction of a whole 360° point cloud requires multi-view geometry. For example, in our experiments observers were shown only the front and sides of the object, so the rear side was not rendered and consequently was not reconstructed afterwards.
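The color-encoding idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Cycles setup used in the experiments: the helper names `encode_point` and `decode_color`, the bounding-box corners, and the explicit offset (which the text omits) are all hypothetical choices made for the example.

```python
MAX_16BIT = 2**16 - 1  # largest value of a 16-bit color channel


def encode_point(point, lower, upper):
    """Scale a 3D point linearly into 16-bit RGB (hypothetical helper).

    `lower` and `upper` are the corners of the scene's bounding box;
    they play the role of the separately stored scaling coefficients.
    """
    return tuple(
        round((p - lo) / (hi - lo) * MAX_16BIT)
        for p, lo, hi in zip(point, lower, upper)
    )


def decode_color(color, lower, upper):
    """Invert encode_point: recover 3D coordinates from a pixel color."""
    return tuple(
        lo + c / MAX_16BIT * (hi - lo)
        for c, lo, hi in zip(color, lower, upper)
    )


# A coordinate map acts as a 2D lookup table: pixel (row, col) -> color.
# The bounding box and the two points are made-up illustration values.
lower, upper = (-1.0, -1.0, 0.0), (1.0, 1.0, 2.0)
coordinate_map = {
    (0, 0): encode_point((0.25, -0.5, 1.0), lower, upper),
    (0, 1): encode_point((0.30, -0.5, 1.1), lower, upper),
}

# Re-project a fixation measured at pixel (0, 1) into the 3D scene.
fixation_3d = decode_color(coordinate_map[(0, 1)], lower, upper)
```

The recovered point differs from the original only by the quantization step of the 16-bit encoding, which is why the bit depth bounds the achievable spatial resolution.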
3.1. Saliency maps visualization
The initial stage of visualization is a reconstruction of the surface that was shown to the observers. This can be achieved by applying Eq. (2) to all the pixels of the image (Fig. 3a) and performing triangulation to create a mesh from the point cloud. The same transformation can be applied to a measured set of fixation points, which produces the result shown in Fig. 3b. As may be seen, the result meets expectations and resembles traces of human gaze on the object’s surface. In order to collate individual points into an estimate of their spatial distribution, 3D filtering with a Gaussian kernel can be performed analogously to conventional 2D filtering of flat maps. To simplify computations in continuous space, the points were assigned to a discrete 3D grid of voxels. The grid may have an arbitrary size, which in turn defines the resolution of the output map. In MATLAB, the obtained voxel values can be used as colormap indices for the faces (or vertices) of the reconstructed surface in order to visualize the density of the distribution via color (Fig. 3c). Another option for visualization is to use the voxel values as a 3D LUT for the RGB values of a texture in the rendering engine, in the same way that the maps of coordinates (Fig. 2a) are created. The resulting colorized meshes are convenient for intuitive visualization of the gaze distribution, and allow more accurate spatial analysis of the measured fixations than 2D maps.
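The voxel assignment and Gaussian filtering steps can be illustrated with a small standard-library sketch. It is not the MATLAB pipeline used in the paper: the function names, the voxel size, the kernel width `sigma`, the truncation `radius`, and the sample fixations are assumptions chosen only to keep the example short.

```python
import math
from collections import Counter


def voxelize(points, voxel_size):
    """Count fixation points per voxel of a regular 3D grid."""
    return Counter(
        tuple(math.floor(c / voxel_size) for c in p) for p in points
    )


def gaussian_density(counts, sigma, radius):
    """Blur voxel counts with a truncated 3D Gaussian kernel.

    `sigma` and `radius` are in voxel units; the result is normalized
    to sum to 1 so that it can be read as a probability map.
    """
    density = Counter()
    offsets = range(-radius, radius + 1)
    for (i, j, k), n in counts.items():
        for di in offsets:
            for dj in offsets:
                for dk in offsets:
                    w = math.exp(-(di**2 + dj**2 + dk**2) / (2 * sigma**2))
                    density[(i + di, j + dj, k + dk)] += n * w
    total = sum(density.values())
    return {voxel: v / total for voxel, v in density.items()}


# Three made-up 3D fixations; the first two fall into the same voxel.
fixations = [(0.11, 0.52, 0.32), (0.12, 0.51, 0.33), (0.82, 0.12, 0.47)]
counts = voxelize(fixations, voxel_size=0.05)
density = gaussian_density(counts, sigma=1.0, radius=2)
peak_voxel = max(density, key=density.get)
```

The values in `density` are what would be mapped to colormap indices on the mesh; a coarser `voxel_size` trades spatial resolution for a smoother map.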
4. Experimental
The presented technique of processing of eye-tracking results in 3D was applied for a study of the impact of material appearance on human attention in computer graphics.
4.1. Stimuli
Observers were presented with short animations of classical sculptures rotating about their vertical axis (video demonstration: [link]). Rotation of the statues was intended to enhance the perception of shape and material appearance by providing motion parallax, including moving specularities and shadows. 3D models of classical statues were taken from a collection of high-resolution laser scans created by Oliver Laric (http://threedscans.com). Images were created using the Mitsuba physically-based rendering engine, configured for hyperspectral rendering with 31 10-nm wavelength bins between 395 and 705 nm. The spectral data were then converted to linear RGB using the measured spectral power distribution of the display. The three material conditions "smooth gold", "rough gold", and "imitation gold coating" were configured as follows. For the "smooth gold" condition, the gold Mitsuba material was used with roughness set to zero. For the "rough gold" condition, the roughness of the GGX distribution was set to 0.13. For the "imitation gold coating" condition, the silver Mitsuba material was used with zero roughness, set beneath a thin, bumpy coating. The coating was created by scaling up each model by 1% and applying a displacement modifier configured with Stucco noise. The coating was assigned a refractive index of 1.4, and specified as a "homogeneous medium" with an RGB absorbance of [0, 0.10, 0.75].
Animations were created with each sculpture rotating about its vertical axis with a sinusoidal motion spanning a full angle of 50°, split into 61 frames. The frame rate of the animation was 30 fps. The light probe used is the "Overcast day/building site" environment light probe made by Bernhard Vogel (http://dativ.at/lightprobes/). Rendering was carried out with 256 samples using the "Extended volumetric path tracer", at a spatial resolution of 1024 × 1024 pixels.
4.2. Experimental setup
The stimuli were shown to observers on a Cambridge Research Systems (CRS) Display++ in ViSaGe mode, located at a distance of 175 cm from the eyes. Fixations were measured using an EyeLink 1000 eye-tracker (with 2000 Hz extension).
The study was approved by the Medical Sciences Interdivisional Research Ethics Committee at the University of Oxford, in accordance with the Declaration of Helsinki. Observers were recruited from the staff and students of the university. There were 12 participants in total (5 males and 7 females), aged 20 to 29 years, all with normal color vision.
The animations of sculptures were presented under a set of different tasks (questions):
- (1) Free viewing.
- (2) Describe the material the object is made from.
- (3) What is the age of the person?
- (4) Surmise what the person is thinking.
- (5) What clothes is the person wearing?
- (6) Remember and describe the pose of the person.
We presented three statues in three different materials. Questions 1-2 are general and were asked for all nine stimuli in random order, whereas questions 3-6 are related to semantic information, thus were asked only once for each statue (using different materials for different observers). The 8-second animations were followed by a response screen that was displayed while participants gave oral answers (not analysed here).
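The trial structure described above can be made concrete with a short Python sketch. The statue names are those used in the results section; the material labels, the counterbalancing rule that rotates materials across observers, and the shuffling of all trials together are assumptions for illustration, since the text does not specify them.

```python
import itertools
import random

STATUES = ["Hermes", "Bressant", "Penelope"]
MATERIALS = ["smooth", "rough", "coating"]
GENERAL_QUESTIONS = [1, 2]          # asked for all nine stimuli
SEMANTIC_QUESTIONS = [3, 4, 5, 6]   # asked once per statue


def build_trials(observer_index, seed=0):
    """Return a shuffled list of (question, statue, material) trials."""
    trials = [
        (q, statue, material)
        for q in GENERAL_QUESTIONS
        for statue, material in itertools.product(STATUES, MATERIALS)
    ]
    for q in SEMANTIC_QUESTIONS:
        for s, statue in enumerate(STATUES):
            # Hypothetical counterbalancing: rotate the material by observer.
            material = MATERIALS[(observer_index + q + s) % len(MATERIALS)]
            trials.append((q, statue, material))
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials(observer_index=0)

# All question x statue x material combinations pooled over observers.
conditions = list(itertools.product(range(1, 7), STATUES, MATERIALS))
```

Under these assumptions each observer sees 2 × 9 + 4 × 3 = 30 animations, while pooling over observers covers all 6 × 3 × 3 = 54 combinations of question, statue, and material.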
5. Results and discussion
According to the design of the experiment, multiple factors influence the result: question, model, and material. Different combinations of these parameters produce 54 ($6 \times 3 \times 3$) output maps, which may be accessed online: [link]. A few of them (question (1), Hermes) are presented in Fig. 1.
5.1. Impact of a given task
The classical eye-tracking work of Yarbus demonstrates how the question asked about the painting influences what part of the painting the observers are more likely to look at (Yarbus, 1967). In this section, we want to evaluate how this translates to 3D computer graphics.
The reader may notice that the experimental questions (1)-(6) are designed in the following way: questions (1)-(2) are general, questions (3)-(4) encourage the observer to look at the face of the sculpture, whereas questions (5)-(6) are specifically oriented towards the body of the sculpture. Thus, we expect to see a large difference in the number of fixations located on the head and on the body between questions (3)-(4) and (5)-(6). In order to evaluate this quantitatively, we segment the statue into two parts ("head" and "body") and count the total number of fixations from all the observers in each region. Figure 4 presents a table that summarizes our results. We report the fraction of fixations on the "head"; the "body" fraction is its complement to 1. The bar chart in the background is intended to make the results easier to read.
As may be seen, the results agree with the findings of Yarbus. Average "head" fractions for questions (3) and (4) are 0.613 and 0.400 respectively, whereas the corresponding values for questions (5) and (6) are 0.155 and 0.202. There is no obvious dependence on the model or the material. The only observation is that the Hermes sculpture has fewer fixations on the head than Bressant and Penelope, which may be related to the fact that the head of this sculpture is displaced from the axis of rotation, so it travels a significantly larger distance and is harder to follow.
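The head/body count can be illustrated with a minimal Python sketch. It assumes, purely for the example, that the segmentation is a horizontal cut at a hypothetical `neck_height` along a vertical z axis; the paper does not state how the statues were segmented, and the fixation coordinates below are made up.

```python
def head_fraction(fixations, neck_height):
    """Fraction of 3D fixations lying at or above the cut plane."""
    on_head = sum(1 for (_, _, z) in fixations if z >= neck_height)
    return on_head / len(fixations)


# Made-up fixations (x, y, z) pooled over observers; z is the vertical axis.
fixations = [
    (0.0, 0.1, 1.70),
    (0.1, 0.0, 1.65),
    (0.0, 0.0, 1.80),
    (0.2, 0.1, 1.10),
    (0.1, 0.2, 0.90),
]
head = head_fraction(fixations, neck_height=1.5)
body = 1.0 - head  # the "body" fraction is the complement to 1
```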
5.2. Statistics of fixations distribution
Apart from localization in semantically distinct regions, the obtained maps also differ in the shape of their distribution and the density of fixation points. In particular, some fixations are densely localized, while others are sparse. We characterize these effects with a set of standard metrics. The compactness of a distribution may be described by a classical cluster-analysis metric, the Within-cluster Sum of Squares (WSS). Assuming the fixation points belong to one cluster, its centroid and mean WSS score can be found as follows:
$$c = \frac{1}{N}\sum_{i=1}^{N} p_i, \qquad \overline{\mathrm{WSS}} = \frac{1}{N}\sum_{i=1}^{N} \lVert p_i - c \rVert^2, \tag{3}$$
where $p_i = (x_i, y_i, z_i)$, and $N$ is the total number of fixation points. We also report the variance of each spatial coordinate of the fixations, $\sigma_x^2$, $\sigma_y^2$, and $\sigma_z^2$, the sum of which is equal to the mean WSS (Eq. 3). After the fixation map is blurred with a Gaussian kernel and forms a probability map (saliency map), its values comprehensively describe the density of the fixation distribution. Thus, we also report the maximum value of the distribution (peak density) and the mean value.
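The relation between the mean WSS and the per-axis variances can be checked numerically. The sketch below uses the population variance (division by the number of points), which is the convention under which the two quantities coincide; the function names and sample points are illustrative only.

```python
def centroid(points):
    """Mean position of a set of 3D points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))


def mean_wss(points):
    """Mean squared distance of the points to their centroid."""
    c = centroid(points)
    return sum(
        sum((p[d] - c[d]) ** 2 for d in range(3)) for p in points
    ) / len(points)


def axis_variances(points):
    """Population variance of the x, y, and z coordinates."""
    c = centroid(points)
    n = len(points)
    return tuple(
        sum((p[d] - c[d]) ** 2 for p in points) / n for d in range(3)
    )


# Made-up fixation coordinates.
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 1.0), (1.0, 3.0, 2.0), (1.0, 1.0, 1.0)]
wss = mean_wss(points)
var_x, var_y, var_z = axis_variances(points)
```

A smaller `wss` indicates a more compact fixation cloud, and the three variances show along which axis the spread occurs.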
The results for each combination of experimental parameters are presented in Fig. 5. Each box contains mean WSS (large number, left row), maximum and the mean of the blurred map (top and bottom numbers in the middle column), and variance of x, y, and z coordinates (top to bottom in the right column correspondingly).
It may be seen that maps concentrated at the face of a sculpture (Question 3) are the most localized. The difference of the values for rough (matte) and smooth (glossy) materials is not systematic and does not allow general conclusions to be drawn about the impact of material appearance on gaze behaviour.
5.3. Within-subject comparison
The previous results were obtained using the fixations of all observers pooled in a single map. This section presents an analysis of the difference in the gaze behaviour of each individual observer with specific regard to the rendered material. Also, to simplify the comparison, we consider only data from question (1) as the most general and natural task.
Table 1 presents the computed scores. Each score was computed for a single observer and then averaged over all the observers and all the sculptures. The first score is the difference in compactness of the fixation distributions (mean WSS), such that a value of 0 means that both distributions are equally compact. The second and third scores are standard metrics for measuring the similarity of saliency maps: the linear Correlation Coefficient (CC) (Le Meur et al., 2007)(Bylinskii et al., 2018), where 1 or -1 means that the maps are perfectly correlated and 0 means that they are uncorrelated; and Similarity (SIM), also referred to as histogram intersection (Rubner et al., 2000)(Riche et al., 2013), where a value of 1 means that the maps are identical and 0 means that they do not overlap.
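For reference, both similarity metrics can be written compactly. The following is a minimal sketch on flattened maps stored as Python lists, using the standard definitions of the Pearson correlation and histogram intersection; it is not the code used to produce Table 1, and the example maps are made up.

```python
import math


def normalize(values):
    """Scale a non-negative map so that its values sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def cc(a, b):
    """Linear (Pearson) correlation coefficient between two maps."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def sim(a, b):
    """Histogram intersection of two maps normalized to sum to 1."""
    return sum(min(x, y) for x, y in zip(normalize(a), normalize(b)))


# Two made-up saliency maps flattened to 1D.
map_a = [0.0, 2.0, 6.0, 2.0]
map_b = [2.0, 6.0, 2.0, 0.0]
similarity = sim(map_a, map_b)
correlation = cc(map_a, map_b)
```

For 3D maps the same functions apply once the voxel values of both maps are listed in the same order.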
Interestingly, both the CC and SIM metrics show almost the same low level of similarity for each of the three pairs, which implies that the fixation distributions on all three materials are equally distinct and correlate only weakly with each other. The WSS score shows that the largest difference in compactness is between the Rough (matte) and Smooth (glossy) materials; thus, we can conclude that using matte renders helps to concentrate the overt attention of the observer, while smooth, shiny surfaces make it more sparsely distributed. The Coating material with high bumpiness produces an intermediate level of compactness. However, these results only describe average trends, and particular cases may deviate from them significantly.
6. Conclusions
This work presents a new method for re-projecting gaze fixations measured on a 2D screen onto the 3D surface of rendered objects. This improves the accuracy of spatial analysis of fixations and creates intuitively understandable 3D visualizations of the overt attention map. We apply this methodology to studying the relation between gaze behaviour and material appearance in computer graphics. We illustrate that the same 3D models rendered with different appearance parameters have very different overt attention distributions. The common similarity metrics CC and SIM confirm low correlation between the maps. However, we were not able to detect systematic, generalizable patterns of influence of different materials across statues. One practical conclusion is that rough matte surfaces concentrate observers’ gaze slightly more densely than glossy surfaces, whereas the latter distribute gaze more sparsely. Another is that the task given to observers has a substantial influence on the direction of their gaze, which agrees with findings from conventional 2D eye-tracking.
Further development of this work will allow us to collect more gaze data, and to perform more detailed analysis with consideration of other semantic features for the characterization of fixation localization.
Acknowledgements.
This work was supported by the Arts and Humanities Research Council grant AH/N001222/1 to HES. OS is funded through the Master Color in Science and Industry (COSI). JSH is funded through the Andrew W. Mellon Foundation and Clarendon Fund.
References
- Barton L Anderson. 2011. Visual perception of materials and surfaces. Current Biology 21, 24 (2011), R978–R983.
- Jennifer Romano Bergstrom and Andrew Schall. 2014. Eye tracking in user experience design. Elsevier.
- Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. 2018. What Do Different Evaluation Metrics Tell Us About Saliency Models? IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2018), 740–757.
- AC Chadwick and RW Kentridge. 2015. The perception of gloss: A review. Vision Research 109 (2015), 221–235.
- Laura Cowen, Linden Js Ball, and Judy Delin. 2002. An eye movement analysis of web page usability. In People and Computers XVI-Memorable Yet Invisible. Springer, 317–335.
- Roland W Fleming. 2014. Visual perception of materials and their properties. Vision Research 94 (2014), 62–75.
- Sarah Howlett, John Hamill, and Carol O’Sullivan. 2005. Predicting and evaluating saliency for simplified polygonal models. ACM Transactions on Applied Perception (TAP) 2, 3 (2005), 286–308.
