U4D: Unsupervised 4D Dynamic Scene Understanding
Armin Mustafa, Chris Russell, Adrian Hilton

TL;DR
This paper presents an unsupervised approach for 4D dynamic scene understanding that jointly reconstructs, segments, and tracks multiple interacting people in complex scenes from multi-view video, achieving significant accuracy improvements.
Contribution
It introduces the first unsupervised method that combines 4D reconstruction, semantic segmentation, and motion analysis for dynamic scenes with multiple people.
Findings
Achieves approximately 40% improvement in semantic segmentation accuracy.
Demonstrates effective joint 4D reconstruction and segmentation in complex scenes.
Outperforms state-of-the-art methods on indoor and outdoor sequences.
| Datasets | Resolution | Cameras | Baseline | L | KF | Tracks |
|---|---|---|---|---|---|---|
| Handshake[26] | | 8(all S) | - | 125 | 15 | 1945 |
| Meetup[17] | | 16(all S) | - | 100 | 9 | 1341 |
| Juggler2[4] | | 6(all M) | - | 300 | 16 | 1278 |
| Handstand[51] | | 8(all S) | - | 174 | 12 | 1056 |
| Rachel[2] | | 16(all S) | - | 270 | 15 | 1978 |
| Juggler1[2] | | 8(2 M) | - | 253 | 17 | 2083 |
| Dance[1] | | 8(all S) | - | 60 | 7 | 732 |
| Magician[4] | | 6(all M) | - | 300 | 10 | 1312 |
| Human3.6[23] | | 4(all S) | - | 250 | 14 | 994 |
| MagicianLF[39] | | 25(all S) | - | 350 | 5 | 1312 |
| WalkLF[39] | | 20(all S) | - | 221 | 7 | 1934 |
| Dataset | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Outdoor | 1.2 | 0.5 | 0.5 | 0.4 | 1.0 | 5.0 | 0.6 | 7.5 |
| I, | 1.0 | 0.7 | 0.5 | 0.6 | 0.4 | 5.0 | 0.4 | 7.5 |
| I, | 1.0 | 0.7 | 0.2 | 0.4 | 0.4 | 5.0 | 0.4 | 5.0 |
| I, | 1.0 | 1.0 | 0.5 | 0.5 | 0.2 | 5.0 | 0.4 | 5.0 |
| Methods | Handshake | Handstand | Rachel | Juggler1 | Juggler2 | Magician | Dance | Meetup | Human3.6 | MagicianLF | WalkLF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PRSM [52] | 1.56 | 1.79 | 1.51 | 1.57 | 1.68 | 1.72 | 1.79 | 1.98 | 2.01 | 1.59 | 1.41 |
| LS [44] | 1.24 | 1.38 | 1.15 | 1.21 | 1.18 | 1.33 | 1.46 | 1.47 | 1.64 | 1.20 | 1.23 |
| SMVS [29] | 0.84 | 0.97 | 0.73 | 0.75 | 0.85 | 0.92 | 0.85 | 0.96 | 1.19 | 0.94 | 0.88 |
| SCSR [36] | 0.70 | 0.84 | 0.67 | 0.69 | 0.73 | 0.78 | 0.77 | 0.87 | 0.92 | 0.77 | 0.71 |
| | 0.73 | 0.87 | 0.65 | 0.70 | 0.71 | 0.75 | 0.74 | 0.88 | 0.90 | 0.78 | 0.70 |
| | 0.71 | 0.85 | 0.64 | 0.68 | 0.69 | 0.73 | 0.72 | 0.85 | 0.87 | 0.75 | 0.68 |
| | 0.57 | 0.71 | 0.56 | 0.59 | 0.61 | 0.64 | 0.62 | 0.75 | 0.77 | 0.67 | 0.63 |
| | 0.59 | 0.69 | 0.59 | 0.57 | 0.63 | 0.66 | 0.60 | 0.73 | 0.76 | 0.65 | 0.60 |
| | 0.55 | 0.68 | 0.55 | 0.54 | 0.59 | 0.61 | 0.59 | 0.74 | 0.73 | 0.62 | 0.59 |
| Proposed | 0.46 | 0.55 | 0.47 | 0.49 | 0.51 | 0.53 | 0.55 | 0.57 | 0.60 | 0.49 | 0.44 |
| Methods | Handshake | Handstand | Rachel | Juggler1 | Juggler2 | Magician | Dance | Meetup | Human3.6 | MagicianLF | WalkLF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CRFRNN [60] | 62.7 | 55.8 | 61.6 | 40.5 | 68.7 | 52.4 | 49.3 | 41.1 | 42.9 | 60.8 | 63.6 |
| Segnet [3] | 47.9 | 51.1 | 55.2 | 45.1 | 61.9 | 55.3 | 53.9 | 43.9 | 49.4 | 59.3 | 65.9 |
| JSR [17] | 67.8 | 58.7 | 58.4 | 56.2 | 66.0 | 61.3 | 57.9 | 50.2 | 53.4 | 62.3 | 68.9 |
| SCV [48] | 56.4 | 52.6 | 48.8 | 49.5 | 59.1 | 59.2 | 56.7 | 42.0 | 49.1 | 58.2 | 65.7 |
| Dv3+ [9] | 63.8 | 58.9 | 64.0 | 48.8 | 69.7 | 58.9 | 57.6 | 48.4 | 54.8 | 69.6 | 69.1 |
| MRCNN [21] | 65.2 | 59.6 | 67.4 | 50.3 | 70.5 | 60.5 | 58.7 | 47.2 | 53.4 | 69.5 | 70.2 |
| PSP [59] | 74.7 | 64.5 | 75.5 | 67.9 | 81.2 | 73.4 | 71.5 | 62.6 | 65.3 | 74.6 | 82.5 |
| SCSR [36] | 81.8 | 75.2 | 78.4 | 81.4 | 89.3 | 88.2 | 85.1 | 78.9 | 70.4 | 82.2 | 86.7 |
| | 85.7 | 75.9 | 78.6 | 81.8 | 89.6 | 88.5 | 85.5 | 79.2 | 70.6 | 82.9 | 87.5 |
| | 86.3 | 77.4 | 80.7 | 82.6 | 90.1 | 89.1 | 87.6 | 80.8 | 76.3 | 86.1 | 89.3 |
| | 87.6 | 79.1 | 81.7 | 83.5 | 90.5 | 89.6 | 86.4 | 81.9 | 75.4 | 85.2 | 88.1 |
| Proposed | 89.6 | 83.3 | 85.8 | 88.2 | 91.1 | 90.9 | 88.5 | 84.7 | 81.1 | 89.4 | 91.8 |
| Methods | Handshake | Handstand | Rachel | Juggler1 | Juggler2 | Magician | Dance | Meetup | Human3.6 | MagicianLF | WalkLF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PRSM [52] | 1.80 | 2.15 | 1.54 | 1.65 | 1.79 | 1.96 | 1.87 | 2.11 | 2.34 | 1.87 | 1.52 |
| Deepflow [54] | 1.15 | 1.48 | 1.01 | 1.08 | 1.16 | 1.27 | 1.21 | 1.37 | 1.52 | 1.05 | 0.81 |
| DCFlow [57] | 0.90 | 1.17 | 0.97 | 0.87 | 0.93 | 1.03 | 0.96 | 1.12 | 1.21 | 0.83 | 0.79 |
| 4DMatch [38] | 0.79 | 0.98 | 0.75 | 0.69 | 0.87 | 0.81 | 0.77 | 0.87 | 0.94 | 0.80 | 0.77 |
| | 0.75 | 1.01 | 0.85 | 0.78 | 0.91 | 0.93 | 0.86 | 0.99 | 1.07 | 0.81 | 0.78 |
| | 0.71 | 0.93 | 0.80 | 0.73 | 0.84 | 0.87 | 0.78 | 0.92 | 0.99 | 0.76 | 0.73 |
| | 0.64 | 0.77 | 0.63 | 0.61 | 0.65 | 0.72 | 0.65 | 0.76 | 0.81 | 0.64 | 0.61 |
| Proposed | 0.51 | 0.61 | 0.48 | 0.49 | 0.52 | 0.58 | 0.55 | 0.63 | 0.68 | 0.53 | 0.44 |
U4D: Unsupervised 4D Dynamic Scene Understanding
Armin Mustafa Chris Russell Adrian Hilton
CVSSP, University of Surrey, United Kingdom
{a.mustafa, c.russell, a.hilton}@surrey.ac.uk
Abstract
We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per-person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (approx. 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.
1 Introduction
With the advent of autonomous vehicles and rising demand for immersive content in augmented and virtual reality, understanding dynamic scenes has become increasingly important. In this paper we propose an unsupervised framework for 4D dynamic scene understanding to address this demand. By “4D Scene understanding” we refer to a unified framework that describes: 3D modelling; motion/flow estimation; and semantic instance segmentation on a per frame basis for an entire sequence. Recent advances in pose estimation [8, 46] and recognition [21, 56, 10] using deep learning have achieved excellent performance for complex images. We exploit these advances to obtain 3D human-pose and an initial semantic instance segmentation from multiple view videos to bootstrap the detailed 4D understanding and modelling of complex dynamic scenes captured with multiple static or moving cameras (see Figure 1). Joint 4D reconstruction allows us to understand how people move and interact, giving contextual information in general scenes.
Existing multi-task methods for scene understanding perform per frame joint reconstruction and semantic instance segmentation from a single image [25], showing that joint estimation can improve each task. Other methods have fused semantic segmentation with reconstruction [36] or flow estimation [42] demonstrating significant improvement in both semantic segmentation and reconstruction/scene flow. We exploit the joint estimation to understand dynamic scenes by simultaneous reconstruction, flow and segmentation estimation from multiple view video.
The first category of joint estimation methods for dynamic scenes generates segmentation and reconstruction from multi-view [37] and monocular video [16, 30] without any scene flow estimate. The second category segments and estimates motion in 2D [42], or gives spatio-temporally aligned segmentation [11, 34, 12] from multiple views without retrieving the shape of the objects. The third category, 4D temporally coherent reconstruction, either aligns meshes using correspondence information between consecutive frames [58] or extracts the scene flow by estimating the pairwise surface correspondence between reconstructions at successive frames [53, 5]. However, methods in these three categories do not exploit semantic information of the scene. The fourth category exploits semantic information by introducing joint semantic segmentation and reconstruction for general dynamic scenes [19, 56, 27, 49, 36] and street scenes [13, 50]. However, these methods give per-frame semantic segmentation and reconstruction with no motion estimate, leading to unaligned geometry and pixel-level incoherence in both segmentation and reconstruction for dynamic sequences. Other methods for semantic video segmentation classify objects by exploiting spatio-temporal semantic information [48, 34, 11] but do not perform reconstruction. We address this gap in the literature by proposing a novel unsupervised framework for joint multi-view 4D temporally coherent reconstruction, semantic instance segmentation and flow estimation for general dynamic scenes.
Methods in the literature have exploited human-pose information to improve results in semantic segmentation [55] and reconstruction [22]. However, existing joint methods for dynamic scenes with multiple people do not exploit human-pose information and often detect interacting people as a single object [36]. Table 1 shows a comparison between the tasks performed by state-of-the-art methods. We exploit advances in 3D human-pose estimation to propose the first approach for 4D (3D in time) human-pose based scene understanding of general dynamic scenes with multiple interacting dynamic objects (people) with complex non-rigid motion. 3D human-pose estimation makes full use of multi-view information and is used as a prior to constrain the shape, segmentation and motion in space and time in the joint scene understanding estimation. Our contributions are:
- High-level 4D scene understanding for general dynamic scenes from multi-view video.
- Joint instance-level segmentation, temporally coherent reconstruction and scene flow with human-pose priors.
- Robust 4D temporal coherence and per-pixel semantic coherence for dynamic scenes containing interactions.
- An extensive performance evaluation against 15 state-of-the-art methods demonstrating improved semantic segmentation, reconstruction and motion estimation.
2 Joint 4D dynamic scene understanding
This section describes our approach to joint 4D scene understanding, with the different stages shown in Figure 2. The input to the joint optimisation is multi-view video, per-view initial semantic instance segmentation [21] and 3D human-pose estimation [47]. To achieve stable long-term 4D understanding, a set of unique key-frames is detected by exploiting multi-view information. Sparse temporal feature tracks are obtained per view between key-frames to initialise the joint estimation. This allows robust 4D understanding in the presence of large non-rigid motion between frames. An initial reconstruction is obtained for each object in the scene by combining the initial semantic instance segmentation with the sparse reconstruction [36]. The initial reconstruction and semantic instance segmentation are refined for each object instance through a novel joint optimisation of segmentation, shape, and motion constrained by 3D human-pose (Section 2.1). Key-frames are used to introduce robust temporal coherence in the joint estimation across long sequences with large non-rigid deformation. Depth, motion and semantic instance segmentation are combined across views between frames for 4D temporally coherent reconstruction and dense per-pixel semantic coherence, giving the final 4D understanding of scenes (Section 3).
2.1 Joint per-view optimisation
Existing methods for semantic segmentation do not give instance-level segmentation of the scene. Previous approaches either segment the image and then classify each segment by object category [35, 18], compute deep per-pixel CNN features followed by per-pixel classification in the image [15, 20], or predict semantic segmentation from raw pixels [32] followed by conditional random fields [28, 60]. A recent state-of-the-art method gives a good estimate of initial semantic instance segmentation masks from an image of a complex sequence [21]. We employ this approach, with parameters pre-trained on MS-COCO [31] and PASCAL VOC12 [14], to predict the initial semantic instance segmentation for each view. Per-view semantic instance segmentation is combined across views with sparse reconstruction to obtain an initial reconstruction for each frame [36]; this is refined through a joint scene understanding optimisation.
The goal of the joint estimation is to refine initial semantic instance segmentation and reconstruction by assigning a label from a set of classes obtained from initial semantic instance segmentation ( is the total number of classes), a depth value from a set of depth values (each depth value is sampled on the ray from camera and is an unknown depth value to handle occlusions), and a motion flow field simultaneously for the region of each object per view. is the pre-defined discrete flow-fields for pixel in image by in time. Joint semantic instance segmentation, reconstruction and motion estimation is achieved by global optimisation of a cost function over unary and pairwise terms, defined as:
[Equation 1]
where, is the depth, is the class label, and is the motion at pixel . Novel terms are introduced for flow , motion regularisation and human-pose costs, explained in Sections 2.1.3 and 2.1.2 respectively. Results of the joint optimisation with and without pose () and motion ( , ) information are presented in Figure 3, showing the improvement in results. The ablation analysis of individual costs in Section 4 shows the improvement in performance from introducing motion and pose constraints into the joint optimisation. Standard unary terms for depth (), semantic (), and appearance () costs are used [36], explained in Section 2.1.5. The standard pairwise terms are explained in Appendix A of the supplementary material: the colour contrast term () assists segmentation, and the smoothness term () ensures that depth varies smoothly within a neighbourhood.
Global optimisation of Equation 1 is performed over all terms simultaneously, using the α-expansion algorithm, which iterates through the set of labels [7]. Each iteration is solved by graph-cut using the min-cut/max-flow algorithm [6]. Convergence is achieved in 7-8 iterations.
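As a rough illustration of the kind of cost being minimised, the sketch below evaluates a unary-plus-pairwise energy on a small label grid and lowers it by greedy per-pixel relabelling (iterated conditional modes, ICM). ICM is a deliberately simpler stand-in for the graph-cut expansion solver used in the paper and gives weaker guarantees. The Potts pairwise penalty, the weight `lam`, the toy costs and all function names are assumptions made for this example only.

```python
# Offsets of an 8-connected neighbourhood used for the pairwise term.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def local_cost(labels, unary, lam, y, x, label):
    """Unary cost of `label` at (y, x) plus `lam` per neighbour with a different label."""
    h, w = len(labels), len(labels[0])
    cost = unary[y][x][label]
    for dy, dx in NEIGHBOURS:
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != label:
            cost += lam
    return cost


def total_energy(labels, unary, lam):
    """Sum of unary costs plus a Potts penalty, counting each neighbouring pair once."""
    h, w = len(labels), len(labels[0])
    energy = 0.0
    for y in range(h):
        for x in range(w):
            energy += unary[y][x][labels[y][x]]
            for dy, dx in NEIGHBOURS:
                ny, nx = y + dy, x + dx
                in_bounds = 0 <= ny < h and 0 <= nx < w
                if in_bounds and (ny, nx) > (y, x) and labels[ny][nx] != labels[y][x]:
                    energy += lam
    return energy


def icm(labels, unary, lam, max_iters=10):
    """Greedy per-pixel relabelling; stops when a full sweep changes nothing."""
    labels = [row[:] for row in labels]
    n_labels = len(unary[0][0])
    for _ in range(max_iters):
        changed = False
        for y in range(len(labels)):
            for x in range(len(labels[0])):
                best_cost = local_cost(labels, unary, lam, y, x, labels[y][x])
                for label in range(n_labels):
                    cost = local_cost(labels, unary, lam, y, x, label)
                    if cost < best_cost:  # accept strict improvements only
                        best_cost, labels[y][x], changed = cost, label, True
        if not changed:
            break
    return labels


# Toy 3x3 problem: every pixel prefers label 0 except a noisy centre pixel.
unary = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
unary[1][1] = [0.6, 0.0]
initial = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
refined = icm(initial, unary, lam=0.5)
```

In this toy case the pairwise penalty outweighs the weak unary preference of the centre pixel, so the isolated label is smoothed away and the total energy drops. The real optimisation assigns class, depth and motion labels jointly rather than a single binary label.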
2.1.1 Spatio-temporal coherence in the optimisation
Constraints are applied on the spatial and temporal neighborhood to enforce consistency in the appearance, semantic label, 3D human pose and motion across views and time.
Spatial coherence: Multi-view spatial coherence is enforced in the optimisation such that the motion, shape, appearance, 3D pose and class labels are consistent across views using an 8-connected spatial neighbourhood for each camera view such that the set of pixel pairs belong to the same frame.
Temporal coherence: Temporal coherence is enforced in the joint optimisation by enforcing coherence across key-frames to handle large non-rigid motion and to reduce errors in sequential alignment for long sequences in the 4D scene understanding. Sparse temporal feature correspondences are used for key-frame detection and robust initialisation of the joint optimisation. They measure the similarity between frames and unlike optical flow are robust to large motions and visual ambiguity. To achieve robust temporal coherence in the 4D scene understanding framework for large non-rigid motion, sparse temporal feature correspondences in 3D are obtained across the sequence.
The temporal neighbourhood is defined for each frame between its respective key-frames. Sparse temporal correspondence tracks define the temporal neighbourhood ; where and is the displacement vector from image to .
2.1.2 Human-pose constraints
We use 3D human-pose to constrain joint optimisation and improve the flow, reconstruction and instance segmentation, in both 2D and 3D for dynamic scenes with multiple interacting people (see Figure 1). 3D human-pose is used as it is consistent across multiple views unlike 2D human-pose. A state-of-the-art method for 3D human-pose estimation from multiple cameras [47] is used in the paper. Previous work on 3D pose estimation [46] iteratively builds a 3D model of human-pose consistent with 2D estimates of joint locations and prior knowledge of natural body pose. In [47], multiple cameras are used when estimating the 3D model; this then feeds back into new estimates of the 2D joint locations in each image. This approach allows us to take full advantage of 3D estimates of pose, consistent across all cameras when finding fine grained 2D correspondences between images, and leading to more lifelike, vivid human reconstructions.
Initial semantic reconstruction is updated if the 3D pose of the person lies outside the region by dilating the boundary to include the missing joints. This allows for more robust and complete reconstruction and segmentation. We use a standard set of 17 joints [47] defined as . A circle is placed around the joint position in 2D and a sphere is placed around the joint position in 3D based on the confidence map to identify the nearest neighbour vertices for every joint .
[TABLE]
[TABLE]
[TABLE]
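The joint neighbourhoods described above (a circle in 2D, a sphere in 3D) can be sketched as a simple radius query. This is a minimal sketch: it assumes a radius per joint has already been derived from the pose confidence, which the text does not specify in detail, and the function name and sample data are hypothetical.

```python
import math


def joint_neighbourhoods(points, joints, radii):
    """Return, for each joint, the indices of the points inside its radius.

    Works for 2D pixels (circle) and 3D vertices (sphere) alike.
    """
    return [
        [i for i, p in enumerate(points) if math.dist(p, joint) <= radius]
        for joint, radius in zip(joints, radii)
    ]


# Hypothetical data: two joints and four mesh vertices.
vertices = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)]
joints = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.1)]
radii = [0.2, 0.3]  # assumed to come from the pose confidence
near = joint_neighbourhoods(vertices, joints, radii)
```

A brute-force scan is enough for illustration; for full meshes a spatial index such as a k-d tree would be the usual choice.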
3D shape term: This term constrains the reconstruction in 3D such that the neighbourhood points around the joints do not move far from the respective joints, and is defined as:
[TABLE]
where is the 3D projection of pixel . The Frobenius norm is applied on the 3D points in all directions to obtain the ‘net’ motion at each pixel within and .
3D motion term: This enforces as rigid as possible [43] constraint on 3D points in the neighbourhood of each joint in space and time. An optimal rotation matrix is estimated for each by minimising the energy defined as:
[TABLE]
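For readers unfamiliar with the as-rigid-as-possible formulation, the per-neighbourhood rotation is typically obtained in closed form from a singular value decomposition of the point covariance. The sketch below shows that standard SVD (Kabsch-style) solution using NumPy. It assumes unweighted points and is not the paper's exact energy; `optimal_rotation` and the sample data are names introduced for illustration.

```python
import numpy as np


def optimal_rotation(source, target):
    """Least-squares rotation R with R @ (source - mean) close to (target - mean)."""
    src = np.asarray(source, dtype=float)
    tgt = np.asarray(target, dtype=float)
    src_c = src - src.mean(axis=0)
    tgt_c = tgt - tgt.mean(axis=0)
    covariance = src_c.T @ tgt_c  # 3x3 cross-covariance of the two point sets
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    return vt.T @ np.diag([1.0, 1.0, sign]) @ u.T


# Points around a joint at time t, then rotated 30 degrees about z and translated.
angle = np.deg2rad(30.0)
true_rotation = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
points_t = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
points_t1 = points_t @ true_rotation.T + np.array([0.5, -0.2, 1.0])
estimated = optimal_rotation(points_t, points_t1)
```

Centring both point sets removes the translation, so only the rotation is estimated. Deviation of the observed motion from this rigid fit is what an as-rigid-as-possible penalty measures.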
2D term: 3D poses are back-projected in each view to constrain per view appearance (), semantic segmentation () and motion estimation () in 2D. If ,
[TABLE]
where, is the back-projection of 3D poses to 2D, is the number of nearest neighbours, and, and is defined similarly. and ensures that the pixels around projected 3D pose have the same semantic label and appearance across views () and time () thereby ensuring spatio-temporal appearance and semantic consistency respectively.
2.1.3 Motion constraints
Flow term: This term is obtained by integrating the sum of three penalisers over the reference image domain inspired from [45], defined as:
where, penalises deviation from the brightness constancy assumption in a temporal neighbourhood for the same view; penalises deviation in appearance from the brightness constancy assumption between the reference view and other views at other time instants; and which forces the flow to be close to nearby sparse temporal correspondences. is the intensity at point at time in camera . The flow vector is located within a window from a sparse constraint at and it forces the flow to approximate the sparse 2D temporal correspondences.
Motion regularisation term: This penalises the absolute difference of the flow field to enforce motion smoothness and handle occlusions in areas with low confidence [45].
where and;
else [math]. We compute (semantic regularisation) and (appearance regularisation) as the minimum subtracted from the mean energy within the search window for each pixel .
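To make the brightness constancy penaliser more concrete, the sketch below scores a small set of discrete candidate flow vectors for one pixel by comparing a patch in one frame with the displaced patch in the next frame of the same view. It covers only the same-view penaliser, not the cross-view or sparse-correspondence ones. The mean absolute difference, the window size and the function names are assumptions for this example.

```python
def brightness_cost(img_t, img_t1, y, x, flow, window=1):
    """Mean absolute intensity difference between the patch around (y, x) in
    img_t and the same patch displaced by flow = (dy, dx) in img_t1.
    Pixel pairs falling outside either image are skipped."""
    h, w = len(img_t), len(img_t[0])
    dy, dx = flow
    total, count = 0.0, 0
    for py in range(y - window, y + window + 1):
        for px in range(x - window, x + window + 1):
            qy, qx = py + dy, px + dx
            if 0 <= py < h and 0 <= px < w and 0 <= qy < h and 0 <= qx < w:
                total += abs(img_t[py][px] - img_t1[qy][qx])
                count += 1
    return total / count if count else float("inf")


def best_flow(img_t, img_t1, y, x, candidates):
    """Pick the candidate flow with the lowest brightness constancy cost."""
    return min(candidates, key=lambda f: brightness_cost(img_t, img_t1, y, x, f))


# Toy frames: the second frame is the first shifted one pixel to the right.
frame_t = [[5 * r + c for c in range(5)] for r in range(5)]
frame_t1 = [[0] + row[:-1] for row in frame_t]
candidates = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]
flow = best_flow(frame_t, frame_t1, 2, 2, candidates)
```

In the joint optimisation this data cost is not minimised independently per pixel as done here; it is one term among several and is balanced against the regularisation described above.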
2.1.4 Long-term temporal coherence
Sparse temporal correspondences: The sparse 3D points projected in all views are matched between frames and key-frames across the sequence using nearest neighbour matching [33] followed by a symmetry test which employs forward and backward match consistency by performing two-way matching to remove the inconsistent correspondences. This gives sparse temporal feature correspondence tracks per frame for each object: , where . are the 3D points visible at each frame . Exhaustive matching is done, such that each frame is matched to every other frame to handle appearance, reappearance and disappearance of points between frames.
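The symmetry test can be sketched as mutual nearest-neighbour matching: a pair is kept only if each feature is the other's nearest neighbour. This is a minimal sketch that assumes features are fixed-length descriptor vectors compared by Euclidean distance; the function names and sample descriptors are hypothetical.

```python
import math


def nearest(query, candidates):
    """Index of the candidate descriptor closest to `query`."""
    return min(range(len(candidates)), key=lambda j: math.dist(query, candidates[j]))


def symmetric_matches(desc_a, desc_b):
    """Keep (i, j) only when j is the nearest neighbour of i and i is the nearest of j."""
    if not desc_a or not desc_b:
        return []
    forward = [nearest(d, desc_b) for d in desc_a]
    backward = [nearest(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(forward) if backward[j] == i]


# Hypothetical 2D descriptors: the third feature in frame A has no true partner.
frame_a = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
frame_b = [(0.1, 0.0), (10.0, 10.2)]
matches = symmetric_matches(frame_a, frame_b)
```

The unmatched feature still has a forward nearest neighbour, but the backward match points elsewhere, so the two-way check discards it.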
Key-frame detection: Previous work [40, 39] showed that sparse key-frames allow robust long-term correspondence for 4D reconstruction. In this work we introduce the additional use of pose in the detection and sparse temporal feature correspondence across key-frames to prevent the accumulation of errors in long sequences. 4D scene alignment between key-frames is explained in Section 3.
Key-frame similarity metric is defined as:
[TABLE]
Key-frame detection exploits sparse correspondence (), pose (), shape (), semantic () and distance () information across views between frame and for each object in view , to improve the long-term temporal coherence of the proposed method, using similar frames across the sequence, illustrated in Figure 4. All frames with similarity in a sequence are selected as key-frames defined as where is the number of key-frames and is the number of frames between and . All the metrics used in 5 and an ablation study for key-frame detection is given in detail in Appendix B of supplementary material.
Features at view frame , are matched to features at view to frames to give correspondences for all the frames with key-frame . The corresponding joint locations from the 3D pose are back-projected in each view and added to sparse temporal tracks in between key-frames. Any new point-tracks are added to the list of point tracks for key-frame .
2.1.5 Unary terms
Depth term: This gives a measure of photo-consistency between views , defined as:
[TABLE]
where is the fixed cost of labelling pixel unknown and denotes the projection of the hypothesised point ( point along the optical ray passing through pixel located at a distance from the camera) in an auxiliary camera. is the set of the most photo-consistent pairs with reference camera and is inspired from [37].
Appearance term: This term is computed using the negative log likelihood [6] of the colour models (GMMs with 10 components) learned from the initial semantic mask in the temporal neighbourhood and the foreground markers obtained from the sparse 3D features for the dynamic objects. It is defined as:
where denotes the probability of pixel belonging to layer .
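As an illustration of a GMM-based appearance cost, the sketch below computes the negative log likelihood of a pixel colour under a mixture with given parameters. It assumes diagonal covariances and uses two components instead of ten for brevity; fitting the mixture from the initial semantic mask is not shown, and all names and values are hypothetical.

```python
import math


def gmm_neg_log_likelihood(colour, weights, means, variances):
    """Return -log sum_k w_k N(colour; mean_k, diag(variance_k))."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for c, m, v in zip(colour, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (c - m) ** 2 / v)
        log_terms.append(log_p)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


# Hypothetical colour model for one object: a reddish and a bluish component.
weights = [0.5, 0.5]
means = [(200.0, 30.0, 30.0), (30.0, 30.0, 200.0)]
variances = [(100.0, 100.0, 100.0), (100.0, 100.0, 100.0)]
cost_red = gmm_neg_log_likelihood((195.0, 35.0, 28.0), weights, means, variances)
cost_green = gmm_neg_log_likelihood((30.0, 200.0, 30.0), weights, means, variances)
```

A colour close to one of the component means receives a low cost, while a colour the model has not seen receives a high one, which is what makes the term usable as a unary cost.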
Semantic term: This term is based on the probability of the class labels at each pixel based on [10], defined as:
where denotes the probability of pixel being in layer in the reference image obtained from initial semantic instance segmentation [21].
3 4D scene understanding
The final 4D scene model fuses the semantic instance segmentation, depth information and dense flow across views and in time between frames () and key-frames (). The initial instance segmentation, human pose and motion information for each object is combined to obtain final instance segmentation of the scene. The depth information is combined across views using Poisson surface reconstruction [24] to obtain a mesh for each object in the scene. 4D temporally coherent meshes are obtained by combining the most consistent motion information from all views for each 3D point. This is combined with spatial semantic instance information to give per-pixel semantic and temporal coherence. Appearing, disappearing, and reappearing regions are handled by using the sparse temporal tracks and their respective motion estimate. The dense flow and semantic instance segmentation together with 3D models of each object in the scene gives the final 4D understanding of the scenes. Examples are shown in Figure 1 and 5 on two datasets, where objects are coloured in one key-frame and colours are propagated reliably between frames and key-frames across the sequence for robust 4D scene modelling.
4 Results and evaluation
Joint semantic instance segmentation, reconstruction and flow estimation (Section 2) is evaluated quantitatively and qualitatively against state-of-the-art methods on a variety of publicly available multi-view indoor and outdoor dynamic scene datasets, detailed in Table 2. More results are provided in Appendix C of the supplementary material.
Algorithm parameters listed in Table 3 are the same for all outdoor datasets, and for indoor datasets parameters depend on the number of cameras (). Pairwise costs are constant , for all datasets.
4.1 Reconstruction evaluation
The proposed approach is compared against state-of-the-art approaches for semantic co-segmentation and reconstruction (SCSR) [36], piecewise scene flow (PRSM) [52], multi-view stereo (SMVS) [29], and deep learning based stereo (LocalStereo) [44]. A qualitative comparison, with two views of the proposed method's results, is shown in Figure 6. Pre-trained parameters were used for LocalStereo and per-view depth maps were fused using Poisson reconstruction. The quality of the surface obtained using the proposed method is improved compared to state-of-the-art methods. In contrast to previous approaches, limbs of people are reliably reconstructed because human-pose and temporal (motion) information is exploited in the joint optimisation.
For quantitative comparison to state-of-the-art methods, we project the reconstruction onto different views and compute the projection errors shown in Table 4. A significant improvement is obtained in projected surface completeness with the proposed approach.
4.2 Segmentation evaluation
Our approach is evaluated against a variety of state-of-the-art multi-view (SCV [48], SCSR [36], and JSR [17]) and single-view (Dv3+ [9], MRCNN [21], PSP [59], CRF RNN [60], and Segnet [3]) segmentation methods, shown in Figure 7. For fair evaluation against single-view semantic segmentation methods, multi-view consistency is applied to the segmentation estimated from each view, using dense multi-view correspondence, to obtain multi-view consistent semantic segmentation. Colours in the results are kept from the original papers. Only MRCNN and the proposed approach give instance segmentation.
Quantitative evaluation against state-of-the-art methods is measured by Intersection-over-Union with ground-truth, shown in Table 5. Ground-truth is available on-line for most of the datasets and obtained by manual labelling for other datasets. Pre-trained parameters were used for semantic segmentation methods. The semantic instance segmentation results from the joint optimisation are significantly better compared to the state-of-the-art methods ().
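For reference, Intersection-over-Union for a single binary mask can be computed as in the sketch below. The function name, the mask layout (nested lists of 0/1) and the convention of returning 1.0 for two empty masks are assumptions for this example; how scores are averaged over instances, views and frames is not specified here.

```python
def intersection_over_union(predicted, ground_truth):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    intersection = union = 0
    for row_p, row_g in zip(predicted, ground_truth):
        for p, g in zip(row_p, row_g):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
iou = intersection_over_union(predicted, ground_truth)
```

Multiplying the result by 100 gives a percentage on the same scale as the tabulated segmentation scores.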
4.3 Motion evaluation
Flow from the joint estimation is evaluated against state-of-the-art methods: (a) dense flow algorithms DCFlow [57] and Deepflow [54]; (b) the scene flow method PRSM [52]; and (c) non-sequential alignment of partial surfaces, 4DMatch [38] (which requires a prior 3D mesh of the object as input for 4D reconstruction). The key-frames of the sequence are coloured and the colour is propagated throughout the sequence using dense flow from the joint optimisation. The red regions in the 2D dense flow in Figure 8 are regions for which reliable correspondences are not found. This demonstrates improved performance using the proposed method. The colours in the 4D alignment in Figure 9 are not reliably propagated by DCFlow for limbs.
We also compare the silhouette overlap error () across frames, key-frames and views to evaluate long-term temporal coherence in Table 6 for all datasets. This is defined as . Dense flow in time is used to obtain the propagated mask for each image. The propagated mask is overlapped with semantic segmentation at each time instant to evaluate the accuracy of the propagated mask. The lower the the better. Our approach gives the lowest error demonstrating higher accuracy compared to the state-of-the-art methods.
4.4 Ablation study on Equation 1
We perform an ablation study on Equation 1, such that we remove motion , pose and semantic constraints from the equation, defining and . Reconstruction, flow and semantic segmentation are obtained with the constraints removed, and the results are shown in Tables 4, 6 and 5 respectively. The proposed approach gives the best performance with joint pose, motion and semantic constraints.
4.5 Limitations
Gross errors in initial semantic instance segmentation and 3D pose estimation lead to degradation in the quality of results (e.g. the cars in Juggler2 - Figure 7). Although 3D human pose helps in robust 4D reconstruction of interacting people in dynamic scenes, current 3D pose estimation is unreliable for highly crowded environments resulting in degradation of the proposed approach.
5 Conclusions
This paper introduced the first method for unsupervised 4D dynamic scene understanding from multi-view video. A novel joint flow, reconstruction and semantic instance segmentation estimation framework is introduced, exploiting 2D/3D human-pose, motion, semantic, shape and appearance information in space and time. An ablation study on the joint optimisation demonstrates the effectiveness of the proposed scene understanding framework for general scenes with multiple interacting people. The semantic, motion and depth information per view is fused spatially across views for 4D semantically and temporally coherent scene understanding. Extensive evaluation against state-of-the-art methods on a variety of complex indoor and outdoor datasets with large non-rigid deformations demonstrates a significant improvement in the accuracy of semantic segmentation, reconstruction, motion estimation and 4D alignment.
References
- [1] 4d repository, http://4drepository.inrialpes.fr/. In Institut national de recherche en informatique et en automatique (INRIA) Rhone Alpes.
- [2] Multiview video repository, http://cvssp.org/data/cvssp 3d/. In Centre for Vision Speech and Signal Processing, University of Surrey, UK.
- [3] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
- [4] L. Ballan, G. J. Brostow, J. Puwein, and M. Pollefeys. Unstructured video-based rendering: Interactive exploration of casually captured videos. ACM Trans. Graph., 29(4):1–11, 2010.
- [5] T. Basha, Y. Moses, and N. Kiryati. Multi-view scene flow estimation: A view centered variational approach. In CVPR, pages 1506–1513, 2010.
- [6] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 26(11):1124–1137, 2004.
- [7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 23(11):1222–1239, 2001.
- [8] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
