U4D: Unsupervised 4D Dynamic Scene Understanding

Armin Mustafa; Chris Russell; Adrian Hilton

arXiv:1907.09905·cs.CV·July 24, 2019

TL;DR

This paper presents an unsupervised approach for 4D dynamic scene understanding that jointly reconstructs, segments, and tracks multiple interacting people in complex scenes from multi-view video, reporting an approximately 40% improvement in semantic segmentation, reconstruction and scene flow accuracy over state-of-the-art methods.

Contribution

It introduces the first unsupervised method that combines 4D reconstruction, semantic segmentation, and motion analysis for dynamic scenes with multiple people.

Findings

01

Achieves approximately 40% improvement in semantic segmentation accuracy.

02

Demonstrates effective joint 4D reconstruction and segmentation in complex scenes.

03

Outperforms state-of-the-art methods on indoor and outdoor sequences.

Abstract

We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (approx 40%)…

Tables

Table 1: Comparison of the tasks solved by state-of-the-art methods and by the proposed method.

Methods	Semantic	Segment	Instance	3D	Motion	Pose
[25, 49, 13]	✓	✓	✓	✓	$\times$	$\times$
[42]	✓	✓	✓	$\times$	✓	$\times$
[36, 19, 27]	✓	✓	$\times$	✓	$\times$	$\times$
[55]	✓	✓	✓	$\times$	$\times$	✓
[22]	$\times$	$\times$	$\times$	✓	✓	✓
[16]	✓	✓	$\times$	✓	✓	$\times$
[30, 41]	$\times$	$\times$	✓	✓	✓	$\times$
[37]	$\times$	✓	$\times$	✓	✓	$\times$
[48, 34, 11]	✓	✓	$\times$	$\times$	✓	$\times$
Proposed	✓	✓	✓	✓	✓	✓

Table 2: Properties of all datasets: $N_{v}$ is the number of views, L is the sequence length, KF gives the number of key-frames, and Tracks gives the number of sparse temporal correspondence tracks averaged over the entire sequence for each object (S stands for static cameras and M for moving cameras).

Datasets	Resolution	$N_{v}$	Baseline	L	KF	Tracks
Handshake [26]	$1920 \times 1080$	8 (all S)	$15^{\circ}$-$30^{\circ}$	125	15	1945
Meetup [17]	$1920 \times 1080$	16 (all S)	$25^{\circ}$-$35^{\circ}$	100	9	1341
Juggler2 [4]	$960 \times 544$	6 (all M)	$15^{\circ}$-$45^{\circ}$	300	16	1278
Handstand [51]	$1600 \times 1200$	8 (all S)	$25^{\circ}$-$45^{\circ}$	174	12	1056
Rachel [2]	$3840 \times 2160$	16 (all S)	$20^{\circ}$-$30^{\circ}$	270	15	1978
Juggler1 [2]	$1920 \times 1080$	8 (2 M)	$15^{\circ}$-$30^{\circ}$	253	17	2083
Dance [1]	$780 \times 582$	8 (all S)	$35^{\circ}$-$45^{\circ}$	60	7	732
Magician [4]	$960 \times 544$	6 (all M)	$15^{\circ}$-$45^{\circ}$	300	10	1312
Human3.6 [23]	$1000 \times 1000$	4 (all S)	$25^{\circ}$-$30^{\circ}$	250	14	994
MagicianLF [39]	$2048 \times 2048$	25 (all S)	$5^{\circ}$-$8^{\circ}$	350	5	1312
WalkLF [39]	$2048 \times 2048$	20 (all S)	$5^{\circ}$-$8^{\circ}$	221	7	1934

Table 3: Parameters for all datasets. I denotes indoor.

Datasets	$\lambda_{d}$	$\lambda_{a}$	$\lambda_{sem}$	$\lambda_{f}$	$\lambda_{s}^{t} / \lambda_{s}^{s}$	$\lambda_{ca} / \lambda_{cl}$	$\lambda_{r}^{L} / \lambda_{r}^{C}$	$\lambda_{2d} / \lambda_{3d}$
Outdoor	1.2	0.5	0.5	0.4	1.0	5.0	0.6	7.5
I, $N_{v} < 6$	1.0	0.7	0.5	0.6	0.4	5.0	0.4	7.5
I, $6 \leq N_{v} < 20$	1.0	0.7	0.2	0.4	0.4	5.0	0.4	5.0
I, $N_{v} \geq 20$	1.0	1.0	0.5	0.5	0.2	5.0	0.4	5.0

Table 4: Reconstruction evaluation: Projection error across views against state-of-the-art methods, LS is LocalStereo. $P_{P}=E-E_{p}$, $P_{M}=E-E_{f}-E_{r}$, $P_{PM}=E-E_{f}-E_{r}-E_{p}$, $P_{S}=E-E_{sem}$ and $P_{PS}=E-E_{sem}-E_{p}$, where $E$ is defined in Equation 1.

Methods	Handshake	Handstand	Rachel	Juggler1	Juggler2	Magician	Dance	Meetup	Human3.6	MagicianLF	WalkLF
PRSM [52]	1.56	1.79	1.51	1.57	1.68	1.72	1.79	1.98	2.01	1.59	1.41
LS [44]	1.24	1.38	1.15	1.21	1.18	1.33	1.46	1.47	1.64	1.20	1.23
SMVS [29]	0.84	0.97	0.73	0.75	0.85	0.92	0.85	0.96	1.19	0.94	0.88
SCSR [36]	0.70	0.84	0.67	0.69	0.73	0.78	0.77	0.87	0.92	0.77	0.71
$P_{P S}$	0.73	0.87	0.65	0.70	0.71	0.75	0.74	0.88	0.90	0.78	0.70
$P_{P M}$	0.71	0.85	0.64	0.68	0.69	0.73	0.72	0.85	0.87	0.75	0.68
$P_{P}$	0.57	0.71	0.56	0.59	0.61	0.64	0.62	0.75	0.77	0.67	0.63
$P_{S}$	0.59	0.69	0.59	0.57	0.63	0.66	0.60	0.73	0.76	0.65	0.60
$P_{M}$	0.55	0.68	0.55	0.54	0.59	0.61	0.59	0.74	0.73	0.62	0.59
Proposed	0.46	0.55	0.47	0.49	0.51	0.53	0.55	0.57	0.60	0.49	0.44

Table 5: Segmentation comparison against state-of-the-art methods using the Intersection-over-Union metric.

Methods	Handshake	Handstand	Rachel	Juggler1	Juggler2	Magician	Dance	Meetup	Human3.6	MagicianLF	WalkLF
CRFRNN [60]	62.7	55.8	61.6	40.5	68.7	52.4	49.3	41.1	42.9	60.8	63.6
Segnet [3]	47.9	51.1	55.2	45.1	61.9	55.3	53.9	43.9	49.4	59.3	65.9
JSR [17]	67.8	58.7	58.4	56.2	66.0	61.3	57.9	50.2	53.4	62.3	68.9
SCV [48]	56.4	52.6	48.8	49.5	59.1	59.2	56.7	42.0	49.1	58.2	65.7
Dv3+ [9]	63.8	58.9	64.0	48.8	69.7	58.9	57.6	48.4	54.8	69.6	69.1
MRCNN [21]	65.2	59.6	67.4	50.3	70.5	60.5	58.7	47.2	53.4	69.5	70.2
PSP [59]	74.7	64.5	75.5	67.9	81.2	73.4	71.5	62.6	65.3	74.6	82.5
SCSR [36]	81.8	75.2	78.4	81.4	89.3	88.2	85.1	78.9	70.4	82.2	86.7
$P_{P M}$	85.7	75.9	78.6	81.8	89.6	88.5	85.5	79.2	70.6	82.9	87.5
$P_{P}$	86.3	77.4	80.7	82.6	90.1	89.1	87.6	80.8	76.3	86.1	89.3
$P_{M}$	87.6	79.1	81.7	83.5	90.5	89.6	86.4	81.9	75.4	85.2	88.1
Proposed	89.6	83.3	85.8	88.2	91.1	90.9	88.5	84.7	81.1	89.4	91.8

Table 6: Silhouette overlap error for multi-view datasets for evaluation of long-term temporal coherence.

Methods	Handshake	Handstand	Rachel	Juggler1	Juggler2	Magician	Dance	Meetup	Human3.6	MagicianLF	WalkLF
PRSM [57]	1.80	2.15	1.54	1.65	1.79	1.96	1.87	2.11	2.34	1.87	1.52
Deepflow [54]	1.15	1.48	1.01	1.08	1.16	1.27	1.21	1.37	1.52	1.05	0.81
DCFlow [52]	0.90	1.17	0.97	0.87	0.93	1.03	0.96	1.12	1.21	0.83	0.79
4DMatch [38]	0.79	0.98	0.75	0.69	0.87	0.81	0.77	0.87	0.94	0.80	0.77
$P_{P S}$	0.75	1.01	0.85	0.78	0.91	0.93	0.86	0.99	1.07	0.81	0.78
$P_{P}$	0.71	0.93	0.80	0.73	0.84	0.87	0.78	0.92	0.99	0.76	0.73
$P_{S}$	0.64	0.77	0.63	0.61	0.65	0.72	0.65	0.76	0.81	0.64	0.61
Proposed	0.51	0.61	0.48	0.49	0.52	0.58	0.55	0.63	0.68	0.53	0.44

Equations

$E(l,d,m)=E_{unary}(l,d,m)+E_{pair}(l,d,m)$

$E_{unary}=\lambda_{d}E_{d}(d)+\lambda_{a}E_{a}(l)+\lambda_{sem}E_{sem}(l)+\lambda_{f}E_{f}(m)$

$E_{pair}=\lambda_{s}E_{s}(l,d)+\lambda_{c}E_{c}(l)+\lambda_{r}E_{r}(l,m)+\lambda_{p}E_{p}(l,d,m)$

$E_{p}(l,d,m)=\sum_{b_{i}\in\mathscr{B}}\lambda_{2d}e_{2d}(l,m)+\lambda_{3d}e_{3d}(d)$

$e_{2d}(l,m)=e_{2d}^{L}(l)+e_{2d}^{S}(l)+e_{2d}^{M}(m)$

$e_{3d}(d)=\begin{cases}e_{3d}^{M}(d)+e_{3d}^{S}(d), & \text{if }d_{p}\neq\mathscr{U}\\ 0, & \text{otherwise}\end{cases}$

$e_{3d}^{S}(d)=\exp\left(-\frac{1}{\left|\sigma_{S_{D}}\right|}\sum_{\Phi(p)\in\mathscr{S}_{i}}\left\|O\right\|_{F}^{2}\right)$

$e_{3d}^{M}(d)=\sum_{\Phi(p)\in\mathscr{S}_{i}}\left\|(b_{i}^{t+1}-\Phi(p)^{t+1})-R_{i}(b_{i}^{t}-\Phi(p)^{t})\right\|_{2}^{2}+\lambda_{3d}^{p}\left\|p-e_{3d}^{M}\right\|_{2}^{2}$

$e_{2d}^{L}(l)=\exp\left(-\sum_{p\in\psi_{S}}\sum_{p\in\psi_{T}}\frac{\left\|I(\Pi(b_{i}))-I(p)\right\|^{2}}{\left|\sigma_{S_{L}}\right|}\right)$

$e_{2d}^{S}(l)=\exp\left(-\sum_{p\in\psi_{S}}\sum_{p\in\psi_{T}}\frac{\left\|\Pi(b_{i})-p\right\|^{2}}{\left|\sigma_{S_{S}}\right|}\right)$

$e_{2d}^{M}(m)=\exp\left(-\sum_{p\in\psi_{S}}\sum_{k\in\psi_{T}}\frac{\left\|\vartheta_{p,\Pi(b_{i}^{k})}-\vartheta_{p+m_{p},\Pi(b_{i}^{k+1})}\right\|^{2}}{\left|\sigma_{S_{M}}\right|}\right)$

$KS_{i,j}=1-\frac{1}{5N_{v}}\sum_{c=1}^{N_{v}}\left(M_{i,j}^{c}+L_{i,j}^{c}+D_{i,j}^{c}+P_{i,j}^{c}+I_{i,j}^{c}\right)$

$e_{d}(p,d_{p})=\begin{cases}M(p,q)=\sum_{i\in\mathscr{O}_{k}}m(p,q), & \text{if }d_{p}\neq\mathscr{U}\\ M_{\mathscr{U}}, & \text{if }d_{p}=\mathscr{U}\end{cases}$

Full text

U4D: Unsupervised 4D Dynamic Scene Understanding

Armin Mustafa Chris Russell Adrian Hilton

CVSSP, University of Surrey, United Kingdom

{a.mustafa, c.russell, a.hilton}@surrey.ac.uk

Abstract

We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant ( $\approx 40\%$ ) improvement in semantic segmentation, reconstruction and scene flow accuracy.

1 Introduction

With the advent of autonomous vehicles and rising demand for immersive content in augmented and virtual reality, understanding dynamic scenes has become increasingly important. In this paper we propose an unsupervised framework for 4D dynamic scene understanding to address this demand. By “4D Scene understanding” we refer to a unified framework that describes: 3D modelling; motion/flow estimation; and semantic instance segmentation on a per frame basis for an entire sequence. Recent advances in pose estimation [8, 46] and recognition [21, 56, 10] using deep learning have achieved excellent performance for complex images. We exploit these advances to obtain 3D human-pose and an initial semantic instance segmentation from multiple view videos to bootstrap the detailed 4D understanding and modelling of complex dynamic scenes captured with multiple static or moving cameras (see Figure 1). Joint 4D reconstruction allows us to understand how people move and interact, giving contextual information in general scenes.

Existing multi-task methods for scene understanding perform per frame joint reconstruction and semantic instance segmentation from a single image [25], showing that joint estimation can improve each task. Other methods have fused semantic segmentation with reconstruction [36] or flow estimation [42] demonstrating significant improvement in both semantic segmentation and reconstruction/scene flow. We exploit the joint estimation to understand dynamic scenes by simultaneous reconstruction, flow and segmentation estimation from multiple view video.

The first category of joint estimation methods for dynamic scenes generates segmentation and reconstruction from multi-view [37] and monocular video [16, 30] without any output scene flow estimate. The second category segments and estimates motion in 2D [42], or gives spatio-temporally aligned segmentation [11, 34, 12] from multiple views without retrieving the shape of the objects. The third category, 4D temporally coherent reconstruction, either aligns meshes using correspondence information between consecutive frames [58] or extracts the scene flow by estimating the pairwise surface correspondence between reconstructions at successive frames [53, 5]. However, methods in these three categories do not exploit semantic information of the scene. The fourth category exploits semantic information by introducing joint semantic segmentation and reconstruction for general dynamic scenes [19, 56, 27, 49, 36] and street scenes [13, 50]. However, these methods give per-frame semantic segmentation and reconstruction with no motion estimate, leading to unaligned geometry and pixel-level incoherence in both segmentation and reconstruction for dynamic sequences. Other methods for semantic video segmentation classify objects by exploiting spatio-temporal semantic information [48, 34, 11] but do not perform reconstruction. We address this gap in the literature by proposing a novel unsupervised framework for joint multi-view 4D temporally coherent reconstruction, semantic instance segmentation and flow estimation for general dynamic scenes.

Methods in the literature have exploited human-pose information to improve results in semantic segmentation [55] and reconstruction [22]. However, existing joint methods for dynamic scenes (with multiple people) do not exploit human-pose information, often detecting interacting people as a single object [36]. Table 1 shows a comparison between the tasks performed by state-of-the-art methods. We exploit advances in 3D human-pose estimation to propose the first approach for 4D (3D in time) human-pose based scene understanding of general dynamic scenes with multiple interacting dynamic objects (people) with complex non-rigid motion. 3D human-pose estimation makes full use of multi-view information and is used as a prior to constrain the shape, segmentation and motion in space and time in the joint scene understanding estimation to improve the results. Our contributions are:

• High-level 4D scene understanding for general dynamic scenes from multi-view video.
• Joint instance-level segmentation, temporally coherent reconstruction and scene flow with human-pose priors.
• Robust 4D temporal coherence and per-pixel semantic coherence for dynamic scenes containing interactions.
• An extensive performance evaluation against 15 state-of-the-art methods demonstrating improved semantic segmentation, reconstruction and motion estimation.

2 Joint 4D dynamic scene understanding

This section describes our approach to joint 4D scene understanding, with the different stages shown in Figure 2. The input to the joint optimisation is multi-view video, per-view initial semantic instance segmentation [21] and 3D human-pose estimation [47]. To achieve stable long-term 4D understanding, a set of unique key-frames is detected by exploiting multi-view information. Sparse temporal feature tracks are obtained per view between key-frames to initialise the joint estimation. This allows robust 4D understanding in the presence of large non-rigid motion between frames. An initial reconstruction is obtained for each object in the scene by combining the initial semantic instance segmentation with the sparse reconstruction [36]. The initial reconstruction and semantic instance segmentation are refined for each object instance through a novel joint optimisation of segmentation, shape, and motion constrained by 3D human-pose (Section 2.1). Key-frames are used to introduce robust temporal coherence in the joint estimation across long sequences with large non-rigid deformation. Depth, motion and semantic instance segmentation are combined across views between frames for 4D temporally coherent reconstruction and dense per-pixel semantic coherence for the final 4D understanding of scenes (Section 3).

2.1 Joint per-view optimisation

Existing methods for semantic segmentation do not give instance-level segmentation of the scene. Previous approaches either segment the image followed by a per-segment object category classification [35, 18], give deep per-pixel CNN features followed by per-pixel classification in the image [15, 20], or predict semantic segmentation from raw pixels [32] followed by conditional random fields [28, 60]. A recent state-of-the-art method gives a good estimate of initial semantic instance segmentation masks from an image of a complex sequence [21]. We employ this approach, with parameters pre-trained on MS-COCO [31] and PASCAL VOC12 [14], to predict the initial semantic instance segmentation for each view. Per-view semantic instance segmentation is combined across views with sparse reconstruction to obtain an initial reconstruction for each frame [36]; this is refined through a joint scene understanding optimisation.

The goal of the joint estimation is to refine initial semantic instance segmentation and reconstruction by assigning a label from a set of classes obtained from initial semantic instance segmentation $\mathscr{L}=\left\{l_{1},...,l_{\left|\mathscr{L}\right|}\right\}$ ( $\left|\mathscr{L}\right|$ is the total number of classes), a depth value from a set of depth values $\mathscr{D}=\left\{d_{1},...,d_{\left|\mathscr{D}\right|-1},\mathscr{U}\right\}$ (each depth value is sampled on the ray from the camera and $\mathscr{U}$ is an unknown depth value to handle occlusions), and a motion flow field from the set $\mathscr{M}=\left\{m_{1},...,m_{\left|\mathscr{M}\right|}\right\}$ simultaneously for the region $\mathscr{R}$ of each object per view. $\left|\mathscr{M}\right|$ is the number of pre-defined discrete flow fields, each of which displaces pixel $p=(x,y)$ in image $I$ by $m=(\delta x,\delta y)$ in time. Joint semantic instance segmentation, reconstruction and motion estimation is achieved by global optimisation of a cost function over unary $E_{unary}$ and pairwise $E_{pair}$ terms, defined as:

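$E(l,d,m)=E_{unary}(l,d,m)+E_{pair}(l,d,m)$

$E_{unary}=\lambda_{d}E_{d}(d)+\lambda_{a}E_{a}(l)+\lambda_{sem}E_{sem}(l)+\lambda_{f}E_{f}(m)$

$E_{pair}=\lambda_{s}E_{s}(l,d)+\lambda_{c}E_{c}(l)+\lambda_{r}E_{r}(l,m)+\lambda_{p}E_{p}(l,d,m)$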

where $d$ is the depth, $l$ is the class label, and $m$ is the motion at pixel $p$ . Novel terms are introduced for the flow $E_{f}$ and motion regularisation $E_{r}$ costs (Section 2.1.3) and for the human-pose cost $E_{p}$ (Section 2.1.2). Results of the joint optimisation with and without pose ( $E_{p}$ ) and motion ( $E_{f}$ , $E_{r}$ ) information are presented in Figure 3, showing the improvement in results. The ablative analysis of individual costs in Section 4 shows the improvement in performance from introducing motion and pose constraints in the joint optimisation. Standard unary terms for depth ( $E_{d}$ ), semantic ( $E_{sem}$ ), and appearance ( $E_{a}$ ) costs are used [36], explained in Section 2.1.5. Two standard pairwise terms are used: the colour contrast cost ( $E_{c}$ ) assists segmentation, and the smoothness cost ( $E_{s}$ ) ensures that depth varies smoothly in a neighbourhood; both are explained in Appendix A of the supplementary material.

Global optimisation of Equation 1 is performed over all terms simultaneously, using the $\alpha$ -expansion algorithm by iterating through the set of labels in $\mathscr{L}\times\mathscr{D}\times\mathscr{M}$ [7]. Each iteration is solved by graph-cut using the min-cut/max-flow algorithm [6]. Convergence is achieved in 7-8 iterations.
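As a minimal sketch of how the cost function combines its terms, the snippet below evaluates the weighted sum $E=E_{unary}+E_{pair}$ for a single candidate labelling. It does not implement $\alpha$-expansion or graph-cut. The function name and the per-term values are hypothetical placeholders; the unary weights are taken from the Outdoor row of Table 3 and the pairwise weights from the constants stated in Section 4.

```python
def total_energy(unary_terms, pair_terms, unary_weights, pair_weights):
    """Weighted sum E = E_unary + E_pair for one candidate labelling."""
    e_unary = sum(unary_weights[k] * unary_terms[k] for k in unary_weights)
    e_pair = sum(pair_weights[k] * pair_terms[k] for k in pair_weights)
    return e_unary + e_pair


# Weights: lambda_d, lambda_a, lambda_sem, lambda_f (Table 3, Outdoor row).
UNARY_W = {"d": 1.2, "a": 0.5, "sem": 0.5, "f": 0.4}
# Weights: lambda_s, lambda_c, lambda_r, lambda_p (constants from Section 4).
PAIR_W = {"s": 0.5, "c": 0.5, "r": 0.5, "p": 0.9}

# Hypothetical term values for one labelling (illustration only).
unary = {"d": 1.0, "a": 2.0, "sem": 2.0, "f": 5.0}
pair = {"s": 2.0, "c": 4.0, "r": 2.0, "p": 10.0}

energy = total_energy(unary, pair, UNARY_W, PAIR_W)
print(round(energy, 6))  # 5.2 (unary) + 13.0 (pairwise) = 18.2
```

An optimiser such as $\alpha$-expansion would compare this value across candidate labellings and keep the move that lowers it.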

2.1.1 Spatio-temporal coherence in the optimisation

Constraints are applied on the spatial and temporal neighbourhood to enforce consistency in the appearance, semantic label, 3D human pose and motion across views and time.

Spatial coherence: Multi-view spatial coherence is enforced in the optimisation such that the motion, shape, appearance, 3D pose and class labels are consistent across views using an 8-connected spatial neighbourhood $\psi_{S}$ for each camera view such that the set of pixel pairs $(p;q)$ belong to the same frame.

Temporal coherence: Temporal coherence is enforced in the joint optimisation by enforcing coherence across key-frames to handle large non-rigid motion and to reduce errors in sequential alignment for long sequences in the 4D scene understanding. Sparse temporal feature correspondences are used for key-frame detection and robust initialisation of the joint optimisation. They measure the similarity between frames and unlike optical flow are robust to large motions and visual ambiguity. To achieve robust temporal coherence in the 4D scene understanding framework for large non-rigid motion, sparse temporal feature correspondences in 3D are obtained across the sequence.

The temporal neighbourhood is defined for each frame between its respective key-frames. Sparse temporal correspondence tracks define the temporal neighbourhood $\psi_{T}=\left\{\left(p,q\right)\mid q=p+e_{i,j}\right\}$ ; where $j=\left\{t-1,t+1\right\}$ and $e_{i,j}$ is the displacement vector from image $i$ to $j$ .

2.1.2 Human-pose constraints $E_{p}(l,d,m)$

We use 3D human-pose to constrain joint optimisation and improve the flow, reconstruction and instance segmentation, in both 2D and 3D for dynamic scenes with multiple interacting people (see Figure 1). 3D human-pose is used as it is consistent across multiple views unlike 2D human-pose. A state-of-the-art method for 3D human-pose estimation from multiple cameras [47] is used in the paper. Previous work on 3D pose estimation [46] iteratively builds a 3D model of human-pose consistent with 2D estimates of joint locations and prior knowledge of natural body pose. In [47], multiple cameras are used when estimating the 3D model; this then feeds back into new estimates of the 2D joint locations in each image. This approach allows us to take full advantage of 3D estimates of pose, consistent across all cameras when finding fine grained 2D correspondences between images, and leading to more lifelike, vivid human reconstructions.

Initial semantic reconstruction is updated if the 3D pose of the person lies outside the region $\mathscr{R}$ by dilating the boundary to include the missing joints. This allows for more robust and complete reconstruction and segmentation. We use a standard set of 17 joints [47] defined as $\mathscr{B}$ . A circle $\mathscr{C}_{i}$ is placed around the joint position in 2D and a sphere $\mathscr{S}_{i}$ is placed around the joint position in 3D based on the confidence map to identify the nearest neighbour vertices for every joint $b_{i}$ .

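$E_{p}(l,d,m)=\sum_{b_{i}\in\mathscr{B}}\lambda_{2d}e_{2d}(l,m)+\lambda_{3d}e_{3d}(d)$

$e_{2d}(l,m)=e_{2d}^{L}(l)+e_{2d}^{S}(l)+e_{2d}^{M}(m)$

$e_{3d}(d)=\begin{cases}e_{3d}^{M}(d)+e_{3d}^{S}(d), & \text{if }d_{p}\neq\mathscr{U}\\ 0, & \text{otherwise}\end{cases}$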

3D shape term: This term constrains the reconstruction in 3D such that the neighbourhood points around the joints do not move far from the respective joints, and is defined as:

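$e_{3d}^{S}(d)=\exp\left(-\frac{1}{\left|\sigma_{S_{D}}\right|}\sum_{\Phi(p)\in\mathscr{S}_{i}}\left\|O\right\|_{F}^{2}\right)$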

where $\Phi(p)$ is the 3D projection of pixel $p$ . The Frobenius norm $\left\|O\right\|_{F}=\left\|\begin{bmatrix}\Phi(p)&b_{i}\end{bmatrix}\right\|_{F}$ is applied on the 3D points in all directions to obtain the ‘net’ motion at each pixel within $\mathscr{S}_{i}$ and $\sigma_{S_{D}}=\left\langle\frac{\left\|O\right\|_{F}^{2}}{\vartheta_{\Phi(p),b_{i}}}\right\rangle$ .

3D motion term: This enforces an as-rigid-as-possible [43] constraint on 3D points in the neighbourhood of each joint $b_{i}$ in space and time. An optimal rotation matrix $R_{i}$ is estimated for each $b_{i}$ by minimising the energy defined as:

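$e_{3d}^{M}(d)=\sum_{\Phi(p)\in\mathscr{S}_{i}}\left\|(b_{i}^{t+1}-\Phi(p)^{t+1})-R_{i}(b_{i}^{t}-\Phi(p)^{t})\right\|_{2}^{2}+\lambda_{3d}^{p}\left\|p-e_{3d}^{M}\right\|_{2}^{2}$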

2D term: 3D poses are back-projected in each view to constrain per view appearance ( $e_{2d}^{L}$ ), semantic segmentation ( $e_{2d}^{S}$ ) and motion estimation ( $e_{2d}^{M}$ ) in 2D. If $p\in\mathscr{C}_{i}$ ,

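$e_{2d}^{L}(l)=\exp\left(-\sum_{p\in\psi_{S}}\sum_{p\in\psi_{T}}\frac{\left\|I(\Pi(b_{i}))-I(p)\right\|^{2}}{\left|\sigma_{S_{L}}\right|}\right)$

$e_{2d}^{S}(l)=\exp\left(-\sum_{p\in\psi_{S}}\sum_{p\in\psi_{T}}\frac{\left\|\Pi(b_{i})-p\right\|^{2}}{\left|\sigma_{S_{S}}\right|}\right)$

$e_{2d}^{M}(m)=\exp\left(-\sum_{p\in\psi_{S}}\sum_{k\in\psi_{T}}\frac{\left\|\vartheta_{p,\Pi(b_{i}^{k})}-\vartheta_{p+m_{p},\Pi(b_{i}^{k+1})}\right\|^{2}}{\left|\sigma_{S_{M}}\right|}\right)$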

where $\Pi$ is the back-projection of 3D poses to 2D, $N_{pose}$ is the number of nearest neighbours, $\sigma_{S_{L}}=\left\langle\frac{\left\|\Pi(b_{i})-q\right\|^{2}}{\vartheta_{\Pi(b_{i}),q}}\right\rangle$ , and $\sigma_{S_{S}}$ and $\sigma_{S_{M}}$ are defined similarly. $e_{2d}^{L}(l)$ and $e_{2d}^{S}(l)$ ensure that the pixels around the projected 3D pose $\Pi(b_{i})$ have the same appearance and semantic label across views ( $\psi_{S}$ ) and time ( $\psi_{T}$ ), thereby ensuring spatio-temporal appearance and semantic consistency respectively.

2.1.3 Motion constraints - $E_{f}(m)$ and $E_{r}(l,m)$

Flow term: This term is obtained by integrating the sum of three penalisers over the reference image domain inspired from [45], defined as:

$E_{f}({p,m_{p}})=e_{F}^{T}({p,m_{p}})+e_{F}^{V}({p,m_{p}})+e_{F}^{S}({p,m_{p}})$

where $e_{F}^{T}({p,m_{p}})=\sum_{i=1}^{N_{v}}\left\|(I_{i}(p,t)-I_{i}(p+m_{p},t+1))\right\|^{2}$ penalises deviation from the brightness constancy assumption in a temporal neighbourhood for the same view; $e_{F}^{V}({p,m_{p}})=\sum_{t\in\psi_{T}}\sum_{i=2}^{N_{v}}\left\|(I_{1}(p,t)-I_{i}(p+m_{p},t))\right\|^{2}$ penalises deviation in appearance from the brightness constancy assumption between the reference view and other views at other time instants; and $e_{F}^{S}({p,m_{p}})=0\text{ if }p\in N\text{ otherwise }\infty$ forces the flow to be close to nearby sparse temporal correspondences. $I_{i}(p,t)$ is the intensity at point $p$ at time $t$ in camera $i$ . The flow vector $m$ is located within a window from a sparse constraint at $p$ , which forces the flow to approximate the sparse 2D temporal correspondences.
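The temporal penaliser $e_{F}^{T}$ can be illustrated with a small sketch. It assumes greyscale images stored as nested lists indexed `[row][column]`, integer flow vectors, and no bounds checking; the function name and the toy images are hypothetical and are not taken from the paper.

```python
def temporal_brightness_cost(frames_t, frames_t1, p, m):
    """e_F^T: sum over views of the squared intensity change along flow m at pixel p.

    frames_t / frames_t1 hold one greyscale image per view at times t and t+1.
    p = (x, y) is the pixel, m = (dx, dy) is the candidate integer flow.
    """
    x, y = p
    dx, dy = m
    cost = 0.0
    for img_t, img_t1 in zip(frames_t, frames_t1):
        diff = img_t[y][x] - img_t1[y + dy][x + dx]
        cost += diff * diff
    return cost


# Toy data: two identical views whose content shifts one pixel to the right.
view_t = [[10, 20, 30], [40, 50, 60]]
view_t1 = [[0, 10, 20], [0, 40, 50]]
frames_t = [view_t, view_t]
frames_t1 = [view_t1, view_t1]

cost_true_flow = temporal_brightness_cost(frames_t, frames_t1, (1, 0), (1, 0))
cost_zero_flow = temporal_brightness_cost(frames_t, frames_t1, (1, 0), (0, 0))
print(cost_true_flow, cost_zero_flow)  # 0.0 200.0
```

The flow that matches the true displacement gives zero cost, while the zero-flow hypothesis is penalised in both views.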

Motion regularisation term: This penalises the absolute difference of the flow field to enforce motion smoothness and handle occlusions in areas with low confidence [45].

$E_{r}({l,m})=\sum_{p,q\in N_{p}}\left\|\Delta m\right\|^{2}\lambda_{r}^{L}e_{r}^{L}(p,q,m_{p},m_{q},l_{p},l_{q})+\lambda_{r}^{A}e_{r}^{A}(p,q,m_{p},m_{q},l_{p},l_{q})$

where $\Delta m=m_{p}-m_{q}$ and

$e_{r}^{X}=\underset{l_{p}=l_{q}}{\forall}\text{ }\underset{q\in N_{p}}{\text{mean}}\text{ }E_{X}({q,m_{q}})-\underset{q\in N_{p}}{\min}E_{X}({q,m_{q}})$ else [math]. We compute $e_{r}^{L}$ (semantic regularisation) and $e_{r}^{A}$ (appearance regularisation) as the minimum subtracted from the mean energy within the search window $N_{p}$ for each pixel $p$ .

2.1.4 Long-term temporal coherence

Sparse temporal correspondences: The sparse 3D points projected in all views are matched between frames $N_{f}^{i}$ and key-frames across the sequence using nearest neighbour matching [33] followed by a symmetry test which employs forward and backward match consistency by performing two-way matching to remove the inconsistent correspondences. This gives sparse temporal feature correspondence tracks per frame for each object: $F^{c}_{i}=\{{f^{c}_{1},f^{c}_{2},...,f^{c}_{R_{i}^{c}}}\}$ , where $c={1\text{ to }N_{v}}$ . $R_{i}^{c}$ are the 3D points visible at each frame $i$ . Exhaustive matching is done, such that each frame is matched to every other frame to handle appearance, reappearance and disappearance of points between frames.
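The two-way symmetry test described above can be sketched as follows. This is an illustrative brute-force version, assuming features are represented by small numeric descriptor tuples compared with Euclidean distance; the function names and descriptor values are hypothetical, and the matcher of [33] is not reproduced here.

```python
import math


def nearest(query, candidates):
    """Index of the candidate descriptor closest to query (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: math.dist(query, candidates[j]))


def symmetric_matches(desc_a, desc_b):
    """Keep (i, j) only if j is the nearest neighbour of i and i is the nearest of j."""
    matches = []
    for i, d in enumerate(desc_a):
        j = nearest(d, desc_b)                # forward match
        if nearest(desc_b[j], desc_a) == i:   # backward match must agree
            matches.append((i, j))
    return matches


# Hypothetical descriptors for two frames.
desc_a = [(0.0, 0.0), (5.0, 5.0), (0.4, 0.0)]
desc_b = [(0.1, 0.0), (5.2, 5.0)]

pairs = symmetric_matches(desc_a, desc_b)
print(pairs)  # [(0, 0), (1, 1)]
```

The third feature in `desc_a` is dropped because its nearest neighbour in `desc_b` prefers the first feature, which is the kind of inconsistent correspondence the symmetry test removes.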

Key-frame detection: Previous work [40, 39] showed that sparse key-frames allow robust long-term correspondence for 4D reconstruction. In this work we introduce the additional use of pose in the detection and sparse temporal feature correspondence across key-frames to prevent the accumulation of errors in long sequences. 4D scene alignment between key-frames is explained in Section 3.

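The key-frame similarity metric is defined as:

$KS_{i,j}=1-\frac{1}{5N_{v}}\sum_{c=1}^{N_{v}}\left(M_{i,j}^{c}+L_{i,j}^{c}+D_{i,j}^{c}+P_{i,j}^{c}+I_{i,j}^{c}\right)$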

Key-frame detection exploits sparse correspondence ( $M_{i,j}^{c}$ ), pose ( $P_{i,j}^{c}$ ), shape ( $I_{i,j}^{c}$ ), semantic ( $L_{i,j}^{c}$ ) and distance ( $D_{i,j}^{c}$ ) information across the $N_{v}$ views between frames $i$ and $j$ for each object in view $c$ , to improve the long-term temporal coherence of the proposed method using similar frames across the sequence, as illustrated in Figure 4. All frames with similarity $>0.75$ in a sequence are selected as key-frames, defined as $K=\{{k^{1},k^{2},...,k^{N_{k}}}\}$ , where $N_{k}$ is the number of key-frames and $N_{f}^{i}$ is the number of frames between $K_{i}$ and $K_{i+1}$ . All the metrics used in 5 and an ablation study for key-frame detection are given in detail in Appendix B of the supplementary material.
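A minimal sketch of the key-frame similarity and the $0.75$ threshold follows. It assumes each of the five per-view metrics is a dissimilarity already normalised to $[0,1]$; that normalisation, the function names, and the sample values are assumptions made for illustration, since the metrics themselves are defined in the supplementary material.

```python
def keyframe_similarity(per_view_metrics):
    """KS_{i,j} = 1 - (1 / (5 N_v)) * sum over views of (M + L + D + P + I).

    per_view_metrics holds one 5-tuple (M, L, D, P, I) per camera view.
    """
    n_views = len(per_view_metrics)
    total = sum(sum(view) for view in per_view_metrics)
    return 1.0 - total / (5.0 * n_views)


def passes_keyframe_test(per_view_metrics, threshold=0.75):
    """Apply the similarity threshold stated in the text."""
    return keyframe_similarity(per_view_metrics) > threshold


# Hypothetical metric values for a frame pair seen from two views.
metrics = [
    (0.1, 0.2, 0.1, 0.0, 0.1),
    (0.2, 0.1, 0.1, 0.1, 0.0),
]

ks = keyframe_similarity(metrics)
print(round(ks, 6), passes_keyframe_test(metrics))  # 0.9 True
```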

Features $F^{c}_{i}$ at view $c$ and frame $i$ are matched to features in the same view at frames $j=\{{i+1,...,N_{f}^{i}}\}$ to give correspondences between key-frame $K_{i}$ and all the frames $N_{f}^{i}$ . The corresponding joint locations from the 3D pose are back-projected in each view and added to the sparse temporal tracks in between key-frames. Any new point tracks are added to the list of point tracks for key-frame $K_{i}$ .

2.1.5 Unary terms - $E_{unary}(l,d,m)$

Depth term: This gives a measure of photo-consistency between views $E_{d}(d)=\sum_{p\in\psi_{S}}e_{d}(p,d_{p})$ , defined as:

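$e_{d}(p,d_{p})=\begin{cases}M(p,q)=\sum_{i\in\mathscr{O}_{k}}m(p,q), & \text{if }d_{p}\neq\mathscr{U}\\ M_{\mathscr{U}}, & \text{if }d_{p}=\mathscr{U}\end{cases}$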

where $M_{\mathscr{U}}$ is the fixed cost of labelling pixel unknown and $q$ denotes the projection of the hypothesised point $P$ ( $3D$ point along the optical ray passing through pixel $p$ located at a distance $d_{p}$ from the camera) in an auxiliary camera. $\mathscr{O}_{k}$ is the set of the $k$ most photo-consistent pairs with reference camera and $m(p,q)$ is inspired from [37].
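The structure of the depth term can be sketched as below, assuming that $m(p,q)$ is a dissimilarity so that the "most photo-consistent" pairs are those with the lowest cost. The function name, the use of `None` for the unknown label $\mathscr{U}$, and all numeric values are hypothetical.

```python
UNKNOWN = None  # stands in for the unknown depth label U


def depth_cost(pair_costs, d_p, k, unknown_cost):
    """e_d(p, d_p): sum of the k lowest pairwise matching costs, or M_U if unknown.

    pair_costs holds m(p, q) for each auxiliary camera paired with the reference.
    """
    if d_p is UNKNOWN:
        return unknown_cost
    return sum(sorted(pair_costs)[:k])


# Hypothetical matching costs against four auxiliary cameras.
costs = [0.9, 0.2, 0.5, 0.1]

known = depth_cost(costs, d_p=0.35, k=2, unknown_cost=1.0)
unknown = depth_cost(costs, d_p=UNKNOWN, k=2, unknown_cost=1.0)
print(round(known, 6), unknown)  # 0.3 1.0
```

Restricting the sum to the best $k$ pairs means a few occluded or mismatched views do not dominate the cost at a pixel.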

Appearance term: This term is computed using the negative log likelihood [6] of the colour models (GMMs with 10 components) learned from the initial semantic mask in the temporal neighbourhood $\psi_{T}$ and the foreground markers obtained from the sparse 3D features for the dynamic objects. It is defined as:

$E_{a}(l)=\sum_{p\in\psi_{T}}\sum_{p\in\psi_{S}}-\log P(I_{p}\rvert l_{p})$

where $P(I_{p}\rvert l_{p}=l_{i})$ denotes the probability of pixel $p$ belonging to layer $l_{i}$ .
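The negative log-likelihood form of the appearance term can be illustrated as follows. The sketch assumes the per-pixel probabilities $P(I_{p}\rvert l_{p})$ have already been evaluated, so the GMM fitting is omitted; the function name and the small clamp `eps`, added to avoid taking the logarithm of zero, are assumptions.

```python
import math


def appearance_energy(probabilities, eps=1e-12):
    """E_a: sum of -log P(I_p | l_p) over the supplied pixels."""
    return sum(-math.log(max(p, eps)) for p in probabilities)


# Hypothetical likelihoods of three pixels under their assigned labels.
probs = [1.0, 0.5, 0.25]

e_a = appearance_energy(probs)
print(round(e_a, 4))  # 0 + ln 2 + 2 ln 2 = 2.0794
```

Pixels that fit the colour model of their label well contribute almost nothing, while unlikely assignments are penalised heavily.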

Semantic term: This term is based on the probability of the class labels at each pixel based on [10], defined as:

$E_{sem}(l)=\sum_{p\in\psi_{T}}\sum_{p\in\psi_{S}}-\log P_{sem}(I_{p}\rvert l_{p})$

where $P_{sem}(I_{p}\rvert l_{p}=l_{i})$ denotes the probability of pixel $p$ being in layer $l_{i}$ in the reference image obtained from initial semantic instance segmentation [21].

3 4D scene understanding

The final 4D scene model fuses the semantic instance segmentation, depth information and dense flow across views and in time between frames ( $N_{f}^{i}$ ) and key-frames ( $K_{i}$ ). The initial instance segmentation, human pose and motion information for each object are combined to obtain the final instance segmentation of the scene. The depth information is combined across views using Poisson surface reconstruction [24] to obtain a mesh for each object in the scene. 4D temporally coherent meshes are obtained by combining the most consistent motion information from all views for each 3D point. This is combined with spatial semantic instance information to give per-pixel semantic and temporal coherence. Appearing, disappearing, and reappearing regions are handled by using the sparse temporal tracks and their respective motion estimates. The dense flow and semantic instance segmentation together with 3D models of each object in the scene give the final 4D understanding of the scenes. Examples are shown in Figures 1 and 5 on two datasets, where objects are coloured in one key-frame and colours are propagated reliably between frames and key-frames across the sequence for robust 4D scene modelling.

4 Results and evaluation

Joint semantic instance segmentation, reconstruction and flow estimation (Section 2) is evaluated quantitatively and qualitatively against $15$ state-of-the-art methods on a variety of publicly available multi-view indoor and outdoor dynamic scene datasets, detailed in Table 2. More results are provided in the supplementary material, Appendix C.

Algorithm parameters listed in Table 3 are the same for all outdoor datasets, and for indoor datasets parameters depend on the number of cameras ( $N_{v}$ ). Pairwise costs are constant $\lambda_{p}=0.9$ , $\lambda_{c}=\lambda_{s}=\lambda_{r}=0.5$ for all datasets.

4.1 Reconstruction evaluation

The proposed approach is compared against state-of-the-art approaches for semantic co-segmentation and reconstruction (SCSR) [36], piecewise scene flow (PRSM) [52], multi-view stereo (SMVS) [29], and deep learning based stereo (LocalStereo) [44]. Qualitative comparisons for two views are shown in Figure 6. Pre-trained parameters were used for LocalStereo, and per-view depth maps were fused using Poisson reconstruction. The surface quality obtained with the proposed method is improved compared to the state-of-the-art methods. In contrast to previous approaches, the limbs of people are reliably reconstructed because human-pose and temporal (motion) information is exploited in the joint optimisation.

For quantitative comparison to state-of-the-art methods, we project the reconstruction onto different views and compute the projection errors shown in Table 4. A significant improvement is obtained in projected surface completeness with the proposed approach.

4.2 Segmentation evaluation

Our approach is evaluated against a variety of state-of-the-art multi-view (SCV [48], SCSR [36], and JSR [17]) and single-view (Dv3+ [9], MRCNN [21], PSP [59], CRF RNN [60], and Segnet [3]) segmentation methods, shown in Figure 7. For fair evaluation against single-view semantic segmentation methods, multi-view consistency is applied to the segmentation estimated from each view, using dense multi-view correspondence to obtain a multi-view consistent semantic segmentation. Colours in the results are kept from the original papers. Only MRCNN and the proposed approach give instance segmentation.

Quantitative evaluation against state-of-the-art methods is measured by Intersection-over-Union with ground-truth, shown in Table 5. Ground-truth is available online for most of the datasets and was obtained by manual labelling for the others. Pre-trained parameters were used for the semantic segmentation methods. The semantic instance segmentation results from the joint optimisation are significantly better than those of the state-of-the-art methods ( $\approx 20-40\%$ ).
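For readers unfamiliar with the metric, Intersection-over-Union for one binary mask can be sketched as below. This is a generic illustration, not the evaluation code used in the paper. The function name, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def intersection_over_union(pred, gt):
    """IoU between two binary masks given as equally sized nested lists."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
iou = intersection_over_union(pred, gt)  # 2 shared pixels / 4 covered pixels = 0.5
```

For instance segmentation the same ratio would be computed per instance label and then averaged. How labels are matched to ground-truth instances is not specified in this section.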

4.3 Motion evaluation

Flow from the joint estimation is evaluated against state-of-the-art methods: (a) dense flow algorithms DCFlow [57] and Deepflow [54]; (b) the scene flow method PRSM [52]; and (c) non-sequential alignment of partial surfaces, 4DMatch [38] (which requires a prior 3D mesh of the object as input for 4D reconstruction). The key-frames of the sequence are coloured and the colour is propagated throughout the sequence using dense flow from the joint optimisation. The red regions in the 2D dense flow in Figure 8 are the regions for which reliable correspondences are not found. This demonstrates improved performance using the proposed method. The colours in the 4D alignment in Figure 9 are not reliably propagated by DCFlow for limbs.

We also compare the silhouette overlap error ( $S_{e}$ ) across frames, key-frames and views to evaluate long-term temporal coherence; results for all datasets are given in Table 6. It is defined as:

$S_{e}=\frac{1}{N_{v}N_{k}N_{f}^{i}}\sum_{i=1}^{N_{k}}\sum_{j=1}^{N_{f}^{i}}\sum_{c=1}^{N_{v}}\frac{\text{Area of intersection}}{\text{Area of semantic segmentation}}$

Dense flow in time is used to obtain the propagated mask for each image. The propagated mask is overlapped with the semantic segmentation at each time instant to evaluate the accuracy of the propagated mask. The lower the $S_{e}$ the better. Our approach gives the lowest error, demonstrating higher accuracy than the state-of-the-art methods.
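The averaged ratio in the definition of $S_{e}$ can be sketched as follows. This is only an illustration under stated assumptions: masks are binary nested lists, the average is taken uniformly over every (key-frame, frame, view) triple, and an empty semantic mask contributes 0. The code implements the ratio exactly as written in the formula. How that ratio is turned into an error for which lower is better is not specified here, so that step is left out. The names `overlap_ratio` and `mean_silhouette_overlap` are hypothetical.

```python
def overlap_ratio(propagated, semantic):
    """Area of intersection divided by the area of the semantic segmentation."""
    inter = 0
    area = 0
    for prop_row, sem_row in zip(propagated, semantic):
        for p, s in zip(prop_row, sem_row):
            if s:
                area += 1
                if p:
                    inter += 1
    # Assumed convention: an empty semantic mask contributes 0.
    return inter / area if area else 0.0


def mean_silhouette_overlap(mask_pairs):
    """Average the ratio over all (propagated, semantic) mask pairs.

    `mask_pairs` is a flat list with one entry per (key-frame, frame, view).
    """
    ratios = [overlap_ratio(p, s) for p, s in mask_pairs]
    return sum(ratios) / len(ratios)


pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [1, 0]]),  # half of the semantic mask is covered
    ([[1, 0], [1, 0]], [[1, 0], [1, 0]]),  # propagated mask matches exactly
]
score = mean_silhouette_overlap(pairs)
```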

4.4 Ablation study on Equation 1

We perform an ablation study on Equation 1 by removing the motion ( $E_{f},E_{r}$ ), pose ( $E_{p}$ ) and semantic ( $E_{sem}$ ) constraints from the equation, defining $P_{M}=E-E_{f}-E_{r},P_{P}=E-E_{p},P_{PM}=E-E_{f}-E_{r}-E_{p},P_{S}=E-E_{sem}$ and $P_{PS}=E-E_{sem}-E_{p}$ . Reconstruction, flow and semantic segmentation are obtained with the constraints removed, and the results are shown in Tables 4, 6 and 5 respectively. The proposed approach gives the best performance with joint pose, motion and semantic constraints.
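The five ablation variants amount to dropping named terms from the total energy, which can be sketched as below. This is a schematic, not the authors' optimisation code: it assumes each term has already been evaluated and weighted, the numeric values are made up, and `E_other` is a hypothetical placeholder for every term of Equation 1 that the ablation leaves untouched.

```python
# Terms removed by each variant, taken from the definitions in the text.
ABLATIONS = {
    "P_M": ("E_f", "E_r"),
    "P_P": ("E_p",),
    "P_PM": ("E_f", "E_r", "E_p"),
    "P_S": ("E_sem",),
    "P_PS": ("E_sem", "E_p"),
}


def ablated_energy(terms, variant=None):
    """Sum the energy terms, skipping those removed by `variant`.

    `terms` maps a term name to its (already weighted) value.
    `variant=None` returns the full energy E.
    """
    dropped = ABLATIONS[variant] if variant is not None else ()
    return sum(value for name, value in terms.items() if name not in dropped)


# Made-up values for illustration only.
terms = {"E_other": 4.0, "E_f": 1.0, "E_r": 0.5, "E_p": 2.0, "E_sem": 1.5}
full = ablated_energy(terms)           # E
no_motion = ablated_energy(terms, "P_M")  # E - E_f - E_r
```

In the paper each variant is re-optimised and then scored on reconstruction, flow and segmentation. The sketch only shows which terms each variant keeps.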

4.5 Limitations

Gross errors in initial semantic instance segmentation and 3D pose estimation lead to degradation in the quality of results (e.g. the cars in Juggler2 - Figure 7). Although 3D human pose helps in robust 4D reconstruction of interacting people in dynamic scenes, current 3D pose estimation is unreliable for highly crowded environments resulting in degradation of the proposed approach.

5 Conclusions

This paper introduced the first method for unsupervised 4D dynamic scene understanding from multi-view video. A novel joint flow, reconstruction and semantic instance segmentation estimation framework is introduced, exploiting 2D/3D human-pose, motion, semantic, shape and appearance information in space and time. An ablation study on the joint optimisation demonstrates the effectiveness of the proposed scene understanding framework for general scenes with multiple interacting people. The semantic, motion and depth information per view is fused spatially across views for 4D semantically and temporally coherent scene understanding. Extensive evaluation against state-of-the-art methods on a variety of complex indoor and outdoor datasets with large non-rigid deformations demonstrates a significant improvement in the accuracy of semantic segmentation, reconstruction, motion estimation and 4D alignment.

Bibliography

[1] 4d repository, http://4drepository.inrialpes.fr/. In Institut national de recherche en informatique et en automatique (INRIA) Rhone Alpes.
[2] Multiview video repository, http://cvssp.org/data/cvssp 3d/. In Centre for Vision Speech and Signal Processing, University of Surrey, UK.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
[4] L. Ballan, G. J. Brostow, J. Puwein, and M. Pollefeys. Unstructured video-based rendering: Interactive exploration of casually captured videos. ACM Trans. Graph., 29(4):1–11, 2010.
[5] T. Basha, Y. Moses, and N. Kiryati. Multi-view scene flow estimation: A view centered variational approach. In CVPR, pages 1506–1513, 2010.
[6] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 26(11):1124–1137, 2004.
[7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 23(11):1222–1239, 2001.
[8] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
