Super-Trajectories: A Compact Yet Rich Video Representation
Ijaz Akhter, Cheong Loong Fah, Richard Hartley

TL;DR
This paper introduces super-trajectories, a novel video representation that combines dense trajectory over-segmentation with constraints to reduce tracking errors, enhancing long-term video analysis.
Contribution
It presents a new compact video representation that maintains long-term pixel tracking information while addressing trajectory tracking errors.
Findings
Provides a more informative video segmentation than traditional superpixels.
Reduces tracking errors through edge constraints and similarity measures.
Enhances trajectory-based video analysis applications.
Abstract
We propose a new video representation in terms of an over-segmentation of dense trajectories covering the whole video. Trajectories are often used to encode long-term temporal information in several computer vision applications. As with temporal superpixels, a temporal slice of a super-trajectory is a superpixel, but super-trajectories contain more information because they also maintain long, dense, pixel-wise tracking information. The main challenge in using trajectories for any application is the accumulation of tracking error during trajectory construction. For our problem, this results in disconnected superpixels. We exploit edge constraints in addition to trajectory-based color and position similarity. Analogous to superpixels as a preprocessing tool for images, the proposed representation has applications for videos, especially in trajectory-based video analysis.
Ijaz Akhter, National University of Singapore
Cheong Loong Fah, National University of Singapore
Richard Hartley, Australian National University
Keywords: Superpixels, Trajectories, Segmentation
1 Introduction
Trajectories, 2D tracks of feature points along time, are often used in computer vision to model long-term temporal information in a video. A majority of these methods involve sparse trajectories [1, 2, 3], though dense trajectories have also been explored [4, 5]. Many inference problems require finding pair-wise similarities or affinities between trajectories [6, 7]. However, finding pair-wise affinities over a large neighbourhood of dense trajectories is often not tractable. Temporal superpixels can be used to find a compact representation of the video, but they do not keep the trajectory information, and by using only the central trajectory, long pixel-level tracking information is lost. In this paper, we propose super-trajectories as an over-segmentation of a video into clusters of dense trajectories. The goal is to have them connected as temporal superpixels. But in contrast to temporal superpixels, where labels are assigned at the pixel level, in our approach each trajectory bears a single label (see Fig 1). With our approach, the affinity between all pairs of trajectories in two super-trajectories can be found. More importantly, the segmentation into clusters allows us to perform intelligent sampling and save computation time. This is especially crucial for motion segmentation approaches based on the hypothesis-and-test paradigm, where one needs to hypothesize homographies or fundamental matrices from a minimal set of matches between all pairs of frames. Leveraging the cluster information will significantly raise the chance of drawing only inliers for a motion hypothesis. Once a robust set of trajectories has been used to form accurate homographies or fundamental matrices, the latter can be fit to the less accurate trajectories.
Finding accurate dense trajectories covering every pixel in the video is a challenging problem. To accommodate newly appearing and occluded scene regions and to avoid accumulation of tracking error, we propose a consistency test for optical flow that exploits the color and the edge boundaries in the image sequence. We find trajectories by composing optical-flow-based warps and use the proposed consistency test to decide whether to break or continue a trajectory. This usually results in a large number of trajectories (up to millions) of both long and short durations (due to occlusion). Clustering these trajectories amounts to grouping large, highly incomplete, high-dimensional data and is challenging for two reasons. First, spectral-clustering-based methods [8] mostly require estimating an affinity matrix over every pair of trajectories and are not tractable for dense trajectories. Second, due to the accumulation of tracking errors, even the tracks of two nearby points that should be grouped together can be far apart in some of the images. Consequently, many neighbouring trajectories are only partially connected, and ensuring that super-trajectories yield connected superpixels throughout the video is very hard. This requires us to redefine the notion of neighbours and connectivity for trajectories. The goal of the proposed method is to minimize disconnections of the super-trajectories, or effectively those of the underlying superpixels.
Apart from the proposed trajectory estimation, the main contribution of the paper is an iterative algorithm in which each iteration reduces the number of disconnections until convergence. Part of each iteration is an adaptation to trajectories of the recently proposed non-iterative superpixel algorithm SNIC [9]. Specifically, we define color- and position-based similarity for trajectories; we also propose edge-based similarity constraints for trajectories for better localization of superpixels. In contrast to superpixel-based methods, the final post-processing step that filters out small isolated label regions is more complicated for trajectories than for pixels, because neighbouring trajectories are often only partially connected. The proposed post-processing step relabels the isolated trajectories to minimize the disconnections. The proposed method is able to track superpixels much longer than previous methods, while also slightly improving the under-segmentation error on the Chen dataset [10].
The design choice of including the whole trajectory in a super-trajectory comes at a cost. Due to accumulation of tracking error and drift, the boundaries of the corresponding superpixels sometimes cannot be accurately localized. As a result, the segmentation accuracy and boundary recall of super-trajectories are slightly worse than those of existing temporal superpixel methods [11, 12]. Nevertheless, this problem affects only a very small number of super-trajectories; the rest are perceptually of similar quality to temporal superpixels while carrying more information in the form of dense pixel-wise tracking. With more accurate optical flow, the accuracy of super-trajectory segmentation should improve further.
2 Related Work
Image segmentation into superpixels is a widely studied problem in computer vision. Here we discuss only some of the prominent works in this area. The normalized cuts algorithm, by Shi and Malik, uses contour and texture cues to recursively partition the image using a pixel graph [13]. Meanshift, proposed by Comaniciu and Meer, is a local mode-seeking algorithm in the color and position space that finds segments of the image [14]. Quickshift, by Vedaldi and Soatto, is also a mode-seeking scheme but is more efficient than meanshift [15]. SEEDS, by Van den Bergh et al., is a coarse-to-fine method that refines superpixel boundaries through energy-driven sampling [16]. SLIC, by Achanta et al., is an optimized K-means clustering algorithm on color and position features [17]. A more comprehensive list of superpixel clustering algorithms and their evaluation is available in [18].
In contrast to single-frame superpixels, temporal superpixels have not been extensively studied. Extending superpixels along time requires enforcing temporal continuity. Several temporal superpixel methods require optical flow, but dealing with inaccuracies in optical flow is not a trivial problem. Van den Bergh et al. proposed an extension of SEEDS that produces a temporal superpixel segmentation of video in an online fashion [19]. Reso et al. used the K-means algorithm in a temporal sliding window to impose temporal consistency [20]. Chang et al. proposed a graphical model to find temporally consistent superpixels [11]. Grundmann et al. proposed a hierarchical video segmentation technique, referred to as GBH, based on appearance and region graphs, and also discussed its streaming and parallelizable version [21]. Xu et al. proposed a streaming video segmentation method based on GBH under a Markov assumption [22]. Veksler et al. proposed a graph-cut-based technique for image and video segmentation [23]. All of these methods assign labels to pixels; they neither exploit dense trajectories nor ensure a single label per trajectory.
A somewhat related problem to ours is trajectory-based motion and video segmentation. The goal of these methods is usually to find generic objects given a set of dense or sparse trajectories. Yan and Pollefeys segmented trajectories into articulated, rigid, non-rigid, degenerate and non-degenerate classes by finding a linear manifold embedding [1]. Rao et al. proposed a subspace clustering scheme to find multiple moving objects in the video [6]. Ochs and Brox proposed a variational approach to obtain dense segmentation from sparse trajectories [2]. Fragkiadaki et al. exploited the discontinuity in a trajectory embedding to segment out the objects [3]. Ochs et al. used long trajectories and the affinities between them to segment different objects in the video [24]. Keuper et al. cast motion segmentation as a minimum cost multicut problem [25]. Wang et al. proposed a semi-supervised method to segment a foreground object from a video by clustering trajectories [5]. They also coined the term super-trajectories for a trajectory cluster. In contrast to their work, we address trajectory clustering with the goal that a temporal slice of a super-trajectory should be a superpixel and that the disconnectivity among the trajectories in a cluster should be minimized.
3 Trajectory Based Video Representation
3.1 Optical Flow to Trajectories
Optical flow provides a dense correspondence of pixels between two images. In order to exploit long-term temporal consistency, we convert the flow into long trajectories. Given forward and backward flow and edge images of frames, each of height and width , the goal of this section is to construct trajectories covering every pixel in the sequence. Each trajectory, , where , is an vector of tuples, each carrying 2D image coordinates and some of them may consist of missing values due to pixels’ occlusion. , because of new scene entering into the field of view, and also for every occlusion, new pixels appear. We stack all the trajectories into a , sparse matrix, , which we estimate as follows.
The forward and the backward flows can be used to find current to previous and previous to current maps of pixel coordinates for the frame , and respectively, where . The composition of these maps gives the required dense trajectories. The key challenge in doing so is first finding out which pixels belong to a new scene, so that existing trajectories can be terminated and new trajectories can be formed. Traditionally, forward-backward optical flow consistency is used to detect new scene regions in the images [26]. This, however, results in breaking up good tracks and generating a large number of trajectories of short durations. We combine optical flow with color and edge boundary information and propose a more robust criterion for finding new regions as follows.
We find , and as distance matrices for optical flow, color and edge boundary for the frame, where is the Euclidean distance between and the inverse of , is the Euclidean distance between the 3-channel previous color image and the warping of the image to the previous image, using and is estimated from the edge boundary image of the previous frame and the warped boundary image as follows,
[Equation 1]
where both and are element-wise functions and is a constant and was set equal to 4. We find the joint distance as follows,
[Equation 2]
where is a constant (we set this equal to 20) and denotes element-wise product. We define the optical flow at pixel as inconsistent if , where is a constant. Smaller gives more trajectories of smaller duration and higher accuracy and vice-versa. We discuss the choice of later in this section.
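For reference, the traditional forward-backward consistency check that the proposed criterion is compared against can be sketched as follows. This is a minimal illustration of the baseline only, not of the joint flow, color and edge criterion. The array layout (`flow[y, x] = (dx, dy)`), the nearest-neighbour lookup, the function names and the threshold name `tau` are assumptions of this sketch.

```python
import numpy as np

def forward_backward_error(flow_fwd, flow_bwd):
    """Per-pixel forward-backward flow error for one frame pair.

    flow_fwd[y, x] = (dx, dy) maps frame t to frame t+1;
    flow_bwd[y, x] = (dx, dy) maps frame t+1 back to frame t.
    Nearest-neighbour lookup is used for brevity (assumption of this sketch).
    """
    h, w = flow_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of frame t lands in frame t+1.
    x1 = xs + flow_fwd[..., 0]
    y1 = ys + flow_fwd[..., 1]
    xi = np.rint(x1).astype(int)
    yi = np.rint(y1).astype(int)
    inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    xi_c = np.clip(xi, 0, w - 1)
    yi_c = np.clip(yi, 0, h - 1)
    # Follow the backward flow from the landing point;
    # a consistent pixel returns to where it started.
    x0 = x1 + flow_bwd[yi_c, xi_c, 0]
    y0 = y1 + flow_bwd[yi_c, xi_c, 1]
    err = np.hypot(x0 - xs, y0 - ys)
    err[~inside] = np.inf  # pixels leaving the frame count as inconsistent
    return err

def inconsistent_mask(flow_fwd, flow_bwd, tau=1.0):
    """True where the round trip misses the starting pixel by more than tau."""
    return forward_backward_error(flow_fwd, flow_bwd) > tau
```

In this baseline a smaller `tau` breaks more tracks, which mirrors the trade-off between trajectory count and accuracy described above.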
The area of inconsistent flow is considered as the occluded region and we set the corresponding flow as undefined. This helps us initialize new trajectories corresponding to the new regions. The goal of trajectory construction is to find, , the mapping of every frame w.r.t the first frame. For the pixels not visible in the first frame, the mapping is defined w.r.t their first occurrence in a later frame. A row-wise stacking of for all , gives the matrix
The first two rows in would simply be and , where represents an enumeration of the 2D coordinates of all the pixels in the first frame. The occluded region in the frame would initialize new trajectories. The coordinates of the trajectory in the frame, can be obtained from and , as follows,
[Equation 3]
where gives the mapped coordinates of the pixel location in . In practice, since the flow would only give the mappings of discrete locations in the frame, bilinear interpolation needs to be done to find the value at floating coordinates given by . Please note that Equation 3 is valid for both existing and new trajectories because we allow to also be a new point starting from the frame 2. Generalizing the above equation, can be obtained as the following recursive composition,
[Equation 4]
For every occluded region in the frame , we initialize new trajectories and concatenate coordinates of the new region as additional columns in . Hence all the pixels in the video are covered. To simplify the notation, we write as just in the rest of the paper.
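The recursive composition described above can be illustrated with a small sketch that chains forward flows with bilinear interpolation and starts new trajectories wherever a frame is left uncovered. It assumes that inconsistent flow has already been set to NaN, and it uses "no live trajectory rounds to this pixel" as a stand-in for the occluded region; both choices, as well as the function names and the `(F, P, 2)` output layout, are assumptions of this sketch rather than the exact procedure of this section.

```python
import numpy as np

def bilinear_sample(field, x, y):
    """Sample field (H, W, C) at float coordinates (x, y).

    Returns NaN for points outside the image, for NaN inputs, and
    whenever one of the four surrounding samples is NaN.
    """
    h, w = field.shape[:2]
    out = np.full(x.shape + (field.shape[2],), np.nan)
    ok = (np.isfinite(x) & np.isfinite(y)
          & (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1))
    x0 = np.floor(x[ok]).astype(int)
    y0 = np.floor(y[ok]).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = (x[ok] - x0)[:, None]
    ay = (y[ok] - y0)[:, None]
    out[ok] = ((1 - ax) * (1 - ay) * field[y0, x0]
               + ax * (1 - ay) * field[y0, x1]
               + (1 - ax) * ay * field[y1, x0]
               + ax * ay * field[y1, x1])
    return out

def build_trajectories(flows):
    """Chain forward flows (list of (H, W, 2) arrays) into trajectories.

    Returns an (F, P, 2) array of (x, y) coordinates, NaN where a
    trajectory is not visible. Requires at least one flow field.
    """
    h, w = flows[0].shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    tracks = [np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)]
    for flow in flows:
        pos = tracks[-1]
        nxt = pos + bilinear_sample(flow, pos[:, 0], pos[:, 1])
        # Trajectories that leave the image (or were already ended) become NaN.
        gone = ~((nxt[:, 0] >= 0) & (nxt[:, 0] <= w - 1)
                 & (nxt[:, 1] >= 0) & (nxt[:, 1] <= h - 1))
        nxt[gone] = np.nan
        live = np.isfinite(nxt[:, 0])
        covered = np.zeros((h, w), dtype=bool)
        covered[np.rint(nxt[live, 1]).astype(int),
                np.rint(nxt[live, 0]).astype(int)] = True
        # Start a new trajectory at every uncovered pixel of the new frame.
        new = np.stack([xs[~covered], ys[~covered]], axis=1).astype(float)
        pad = np.full((len(new), 2), np.nan)
        tracks = [np.vstack([t, pad]) for t in tracks]
        tracks.append(np.vstack([nxt, new]))
    return np.stack(tracks)
```

With this layout, every frame is fully covered by construction, and the number of trajectories grows whenever new content enters the field of view.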
The goal of this paper is to cluster trajectories into classes. Let the matrix denote the labels of the trajectories. By enforcing a single output label for the entire duration of a trajectory, we not only reduce the number of unknowns but also explicitly enforce long-term temporal consistency. Given the trajectory labels in , the pixel labels for the frame , can be found as a matrix using the corresponding trajectory coordinates in . We introduce a function to denote this estimation of as follows,
[Equation 5]
where converts into pixels with values taken from .
Analogous to the position matrix, , a color matrix can also be formed, where its column, represents the 3D color values of the corresponding pixel locations in the trajectory, . This can be done using the following function
[Equation 6]
where is the 3-channel color image and represents the trajectory colors in the frame and plays an inverse role to . Similarly if edge boundaries for all the frames are given, then we can find an matrix , consisting of the edge boundary values of the trajectories. Hence in the trajectory based video representation, certain features like color, edges, and positions can be described and compared at the trajectory level rather than the pixel level.
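The two conversions between the trajectory level and the pixel level can be sketched as below: one scatters trajectory labels into a per-frame label image, the other gathers per-trajectory colors from an image. Rounding to the nearest pixel, using 0 for uncovered pixels, and the function names are assumptions of this sketch.

```python
import numpy as np

def labels_to_frame(coords_t, labels, h, w):
    """Scatter trajectory labels into an (h, w) label image for one frame.

    coords_t: (P, 2) array of (x, y), NaN where the trajectory is not visible.
    Pixels reached by no trajectory keep the value 0. If several trajectories
    round to the same pixel, one of them wins (unspecified which).
    """
    labels = np.asarray(labels)
    frame = np.zeros((h, w), dtype=int)
    vis = np.isfinite(coords_t[:, 0])
    x = np.clip(np.rint(coords_t[vis, 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(coords_t[vis, 1]).astype(int), 0, h - 1)
    frame[y, x] = labels[vis]
    return frame

def sample_colors(coords_t, image):
    """Gather per-trajectory colors from an (h, w, 3) image for one frame.

    Returns a (P, 3) array with NaN rows for trajectories that are not visible.
    """
    h, w = image.shape[:2]
    colors = np.full((len(coords_t), image.shape[2]), np.nan)
    vis = np.isfinite(coords_t[:, 0])
    x = np.clip(np.rint(coords_t[vis, 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(coords_t[vis, 1]).astype(int), 0, h - 1)
    colors[vis] = image[y, x]
    return colors
```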
In Fig 2 we give a qualitative comparison of the trajectories estimation using optical flow based forward-backward consistency check (OF-consistency) and the proposed method against and . gave the same number of trajectories as OF-consistency, whereas gave roughly fewer trajectories. We find the average trajectory colors and then regenerate the frames using the formula, . Better trajectories are expected to generate a sharper image back. The proposed method with generates overall sharper images than OF-consistency with the same number of trajectories, while generates fewer trajectories and still mostly sharp images. In our experiments we set .
The estimated trajectories are then used for clustering. Before describing the proposed method, we need to first define a few primitives of this representation.
3.2 Trajectory Primitives
Two trajectories are neighbours if by rounding the coordinates, the corresponding pixels are neighbours in at least one frame. We denote as the neighbours of the trajectory .
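A direct, unoptimized reading of this neighbour definition is sketched below. It assumes 4-connectivity between rounded pixel positions; the connectivity choice, the function name and the `(F, P, 2)` coordinate layout with NaN for invisible frames are assumptions of this sketch.

```python
from collections import defaultdict
import numpy as np

def trajectory_neighbours(coords):
    """Map each trajectory index to the set of its neighbours.

    coords: (F, P, 2) array of (x, y), NaN where a trajectory is not visible.
    Two trajectories are neighbours if their rounded pixels are 4-adjacent
    in at least one frame.
    """
    nbrs = defaultdict(set)
    for frame in coords:
        # Index the visible trajectories of this frame by rounded pixel.
        at = defaultdict(list)
        for p, (x, y) in enumerate(frame):
            if np.isfinite(x):
                at[(int(round(x)), int(round(y)))].append(p)
        # Looking only right and down visits every adjacent pair once.
        for (x, y), ps in at.items():
            for key in ((x + 1, y), (x, y + 1)):
                for q in at.get(key, ()):
                    for p in ps:
                        nbrs[p].add(q)
                        nbrs[q].add(p)
    return nbrs
```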
For a trajectory , we define its disconnectivity, as a binary vector , with respect to a set of trajectories, with label , such that is 1 at the frame , if is disconnected from and 0 otherwise, i.e.
[Equation 7]
where denotes the 2D coordinates at the frame of the trajectory and , the set of 2D coordinates of at the frame . Similarly, we denote as a binary visibility mask of and define as the visibility mask of at the frame as follows
[Equation 8]
We then define the cost of connectivity of and if , by simply counting the frames in which was visible but disconnected from , i.e.
[Equation 9]
where is the boolean AND operator. Finally we say that the group is fully connected if consists of fully connected trajectories, i.e. , for all , and .
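The connectivity cost can be illustrated by counting, frame by frame, where a trajectory is visible but touches no member of a group. In this sketch "connected" means that the rounded pixel coincides with or is 4-adjacent to the rounded pixel of some visible group member; that reading, together with the function name and the `(F, P, 2)` layout, is an assumption of the sketch.

```python
import numpy as np

def connectivity_cost(coords, p, group):
    """Number of frames in which trajectory p is visible but disconnected.

    coords: (F, P, 2) array of (x, y), NaN where a trajectory is not visible.
    group:  iterable of trajectory indices sharing one label.
    """
    cost = 0
    for frame in coords:
        x, y = frame[p]
        if not np.isfinite(x):
            continue  # invisible frames contribute nothing (visibility mask)
        px, py = int(round(x)), int(round(y))
        touching = False
        for q in group:
            qx, qy = frame[q]
            if not np.isfinite(qx):
                continue
            # Manhattan distance <= 1: same pixel or 4-adjacent pixel.
            if abs(int(round(qx)) - px) + abs(int(round(qy)) - py) <= 1:
                touching = True
                break
        if not touching:
            cost += 1
    return cost
```

A cost of zero for every member against the rest of its group corresponds to the fully connected case described above.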
We also need to define the energy function we want to minimize for super-trajectory clustering. Traditionally, this energy consists of color and position based terms and can be defined for a trajectory , as follows
[Equation 10]
where is proportional to the squared Euclidean distance between the frame-level 2D coordinates of some center trajectory and , averaged over the visible frames and is an analogous energy term for the color. We modify the energy by also considering the already labeled neighbouring trajectories, of as follows
[Equation 11]
where and denote the edge boundary of the trajectory in the frame and is a constant, the summation is over the frames where both the trajectories were visible and is the number of these frames. Since minimizing the sum of the energies for all trajectories is a difficult optimization, we instead approximate it with a greedy optimization by selecting and labeling the trajectory with the least energy as discussed in the next section.
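The color and position part of the energy can be sketched as a distance between a trajectory and a cluster-center trajectory, averaged over the frames in which both are visible. The normalizers `s` and `m` follow the usual superpixel convention of a spatial scale and a color compactness factor and are assumptions here, as is the choice to average over commonly visible frames; the edge-based neighbour term is deliberately left out of this sketch.

```python
import numpy as np

def trajectory_distance(pos, col, centre_pos, centre_col, s=10.0, m=10.0):
    """Position plus color distance between a trajectory and a cluster center.

    pos, centre_pos: (F, 2) arrays of (x, y), NaN rows where not visible.
    col, centre_col: (F, 3) arrays of color values for the same frames.
    s, m: hypothetical normalizers for position and color.
    Returns inf if the two are never visible in the same frame.
    """
    both = np.isfinite(pos[:, 0]) & np.isfinite(centre_pos[:, 0])
    if not both.any():
        return np.inf
    # Mean squared Euclidean distance over the commonly visible frames.
    d_pos = np.mean(np.sum((pos[both] - centre_pos[both]) ** 2, axis=1))
    d_col = np.mean(np.sum((col[both] - centre_col[both]) ** 2, axis=1))
    return d_pos / s ** 2 + d_col / m ** 2
```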
4 Super-Trajectory Representation
We propose an iterative procedure for super-trajectory clustering, where in each iteration we adapt the SNIC algorithm [9] to trajectories and improve the connectivity until convergence. Once the trajectory labels are found, the pixel labels of the frame , are found, using Equation 5. Here we describe the different parts of the algorithm.
4.1 The Core Algorithm
The proposed iterative method is given in Algorithm 1, where each iteration reduces the number of disconnected trajectories and calls TNIC, the adaptation of SNIC to trajectories given in Algorithm 2, as a sub-routine. TNIC starts with a labeling of a subset of fully connected trajectories and uses them to find cluster centers (line 2). It finds the distances of the neighbours of the unlabeled trajectories and pushes them onto a queue (lines 3-7). Then, in the while loop, the trajectory with the smallest distance is popped, receives its label, and updates the cluster centers (lines 12-15). After this, the unlabeled neighbours of the trajectory are added to the queue (lines 16-19). The loop terminates when the queue is empty.
Algorithm 1 starts by finding a few trajectories as the seed cluster centers. stores the labels of fully connected clusters and grows in each iteration. Initially only the seed cluster centers are labeled and the rest are labeled 0 (line 5). Each iteration of the while loop finds an intermediate labeling using TNIC (line 7). For the first iteration, we find the largest subset of fully connected trajectories using connected component labeling and set accordingly (line 9). For the remaining iterations, we simply grow if the corresponding label in makes them fully connected with the existing fully connected trajectories (lines 11-19). The loop terminates when the number of disconnected trajectories cannot be reduced any further. A post-processing step then produces the final labeling (line 25).
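The queue-driven growth that TNIC inherits from SNIC can be sketched on a generic neighbourhood graph. The version below starts from seed items only, uses a plain squared Euclidean distance on feature vectors, and updates each cluster center as a running mean; it does not model the pre-labeled fully connected subset or the edge term, and the names `greedy_cluster`, `features`, `neighbours` and `seeds` are assumptions of this sketch.

```python
import heapq
import numpy as np

def greedy_cluster(features, neighbours, seeds):
    """SNIC-style greedy clustering on a graph.

    features:   (N, D) array, one feature vector per item.
    neighbours: mapping from item index to an iterable of neighbour indices.
    seeds:      item indices used as initial cluster centers.
    Returns a list of labels in 1..len(seeds); 0 marks unreachable items.
    """
    n = len(features)
    labels = [0] * n
    sums, counts, heap = {}, {}, []
    for k, s in enumerate(seeds, start=1):
        sums[k] = np.zeros_like(features[s], dtype=float)
        counts[k] = 0
        heapq.heappush(heap, (0.0, s, k))
    while heap:
        _, i, k = heapq.heappop(heap)
        if labels[i]:
            continue  # already claimed through a cheaper queue entry
        labels[i] = k
        # Online update of the cluster center (running mean).
        sums[k] += features[i]
        counts[k] += 1
        centre = sums[k] / counts[k]
        for j in neighbours[i]:
            if not labels[j]:
                dist = float(np.sum((features[j] - centre) ** 2))
                heapq.heappush(heap, (dist, j, k))
    return labels
```

Because an item is labeled the first time it is popped, a single pass over the queue suffices, which is what makes the scheme non-iterative within one call.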
The initialization and the post-processing are discussed in the following sections.
4.2 Initialization
Ideally the seed cluster centers should be evenly separated, ensuring that all the trajectories lie within a circle of diameter roughly equal to the required average diameter of the superpixels . For pixels as input, this can be trivially done by initializing seeds on a grid. For trajectories as input, each covering multiple pixels, this is nontrivial and is done as follows.
The seed cluster centers for the trajectories corresponding to the pixels in the frame are simply initialized along a grid, with the spacing equal to . This leaves the trajectories starting from the frame or later uncovered (i.e., not dealt with). Then we find the seeds for the uncovered trajectories ending at the last frame. After this we recursively find the middle frame among the previously considered frames, so that all the evenly spaced frames in time are considered or there are not enough uncovered trajectories left in a spatial window of size .
Let the binary vector, represent the uncovered trajectories after the selection of seeds from a frame. After the frame, the uncovered trajectories would be , where is a matrix of 1s (the function , previously defined for color vectors, is here applied to binary vectors). To find the new seeds in a frame , we first find the uncovered pixels in the corresponding frame as
[Equation 12]
We convolve matrix, with an template of ones. The pixel location at the maximum in the convolution, if greater than a threshold , gives the corresponding trajectory as the next seed. If the maximum is less than then we do not look for any more seeds in the frame. Otherwise, we set the window centered at the selected pixel equal to 0 in and again find the location corresponding to the maximum sum in a window and repeat. Once done with a frame , the remaining uncovered trajectories are found as
[Equation 13]
Repeated application of Equations 12 and 13 for the selected frames ensures coverage of all the frames, and the selected seeds are used to initialize the cluster centers.
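The per-frame seed search described above (window sum over the uncovered mask, take the maximum, clear the window, repeat) can be sketched as follows. The window size `s` is assumed odd, the threshold `thresh` is assumed non-negative, zero padding is used at the image border, and the function names are hypothetical; mapping the chosen pixel back to its trajectory is omitted.

```python
import numpy as np

def box_sum(mask, s):
    """Sum of mask over an s x s window centered on each pixel (zero padding)."""
    r = s // 2
    h, w = mask.shape
    padded = np.pad(mask.astype(int), r)
    # Integral image with a leading row and column of zeros.
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    return (ii[s:s + h, s:s + w] - ii[:h, s:s + w]
            - ii[s:s + h, :w] + ii[:h, :w])

def pick_seeds(uncovered, s, thresh):
    """Greedily pick seed pixels (x, y) from a boolean uncovered mask."""
    uncovered = uncovered.copy()
    r = s // 2
    seeds = []
    while True:
        score = box_sum(uncovered, s)
        y, x = np.unravel_index(np.argmax(score), score.shape)
        if score[y, x] <= thresh:
            break  # not enough uncovered pixels left in any window
        seeds.append((int(x), int(y)))
        # Mark the window around the new seed as covered.
        uncovered[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1] = False
    return seeds
```

Each accepted seed clears at least one uncovered pixel, so the loop terminates for any non-negative threshold.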
4.3 Post-Processing
The post-processing (given in Algorithm 3) starts with a loop over all the frames and converts the trajectory labels into a matrix, consisting of pixel labels for the frame (lines 1-2). Then it filters out the small isolated label regions to get , using the standard procedure based on connected component labeling (line 3). is converted back into a trajectory labeling to get the row, in the matrix (line 4). The trajectories that do not get multiple labels in are taken as clean and the corresponding labels are assigned to (line 6). Each remaining trajectory gets the label of its neighbour that gives the least connectivity cost (lines 7-12).
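The per-frame filtering step can be illustrated with a simple connected-component pass that reassigns small regions to the label they border most. This sketch covers only the pixel-level filtering of a single frame, not the subsequent relabeling of trajectories; the 4-connectivity, the `min_size` parameter and the function name are assumptions.

```python
from collections import Counter, deque
import numpy as np

def remove_small_regions(labels, min_size):
    """Relabel 4-connected regions smaller than min_size.

    labels: (h, w) integer label image. A small region takes the label that
    occurs most often along its border. Returns a new array.
    """
    h, w = labels.shape
    out = labels.copy()
    seen = np.zeros((h, w), dtype=bool)
    for sy in range(h):
        for sx in range(w):
            if seen[sy, sx]:
                continue
            lab = out[sy, sx]
            region, border = [], Counter()
            queue = deque([(sy, sx)])
            seen[sy, sx] = True
            # Breadth-first flood fill of the region containing (sy, sx).
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    if out[ny, nx] == lab:
                        if not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                    else:
                        border[out[ny, nx]] += 1
            if len(region) < min_size and border:
                new_label = border.most_common(1)[0][0]
                for y, x in region:
                    out[y, x] = new_label
    return out
```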
5 Experiments
As discussed, our main contribution is a new video representation, which should be useful for applications involving trajectory-based video analysis, but here we also compare the segmentation accuracy of super-trajectories (ST) against two state-of-the-art temporal superpixel methods: temporal superpixels (TSP) [11] and contour-constrained superpixels (CCS) [12]. We use Full-flow [27] for optical flow estimation, convert the flow to trajectories using the method discussed in Section 3.1, and then cluster them using Algorithm 1. Given the trajectory labels, we estimate the pixel labels at a frame using Equation 5. The estimated pixel labels are compared against the temporal superpixels obtained from TSP and CCS. To ensure fairness, we also use the optical flow produced by Full-flow for TSP and CCS. We use the LIBSVX 3.0 benchmark [28] to evaluate the accuracy of temporal superpixels. We use the seven evaluation metrics in [28]: 2D undersegmentation error (UE2D), 2D segmentation accuracy (SA2D), 2D boundary recall (BR2D), 3D undersegmentation error (UE3D), 3D segmentation accuracy (SA3D), 3D boundary recall (BR3D), and the mean duration against the desired average superpixel size, . In addition we also count the number of supervoxels for each algorithm.
Fig 3 compares the quantitative results on the Chen dataset [10] against the number of supervoxels for each algorithm. This dataset consists of eight videos, each roughly 80 frames long, with 5-10 ground-truth objects per video. The figure shows that the proposed method gives better UE2D and UE3D than the previous methods. In addition, the mean duration of our method is longer and it gives fewer supervoxels than the previous methods, but on the other evaluation metrics we are slightly worse off. To further investigate the source of error, in Fig 4 we give a qualitative comparison of all three methods on a few frames of the Garden and Ice sequences from the Chen dataset. All three methods give accurate boundaries for most of the superpixels, and the places where ST makes mistakes are very few (see red arrows in the figure). The source of these mistakes is error in optical flow. Most optical flow methods exhibit a shrinking bias: they tend to be inaccurate for small or elongated objects (such as the skater’s legs or thin tree branches). CCS and TSP perform better because they track at the superpixel level, whereas we maintain pixel-level tracking in the form of trajectories and suffer from the accumulation of error. However, the proposed method is able to track superpixels for longer durations. For a better qualitative comparison, please refer to the supplementary videos. In addition, the main benefit of our approach is that super-trajectories contain more information than temporal superpixels in the form of dense pixel-level tracking. With more accurate trajectories, results could be improved.
6 Conclusion
We propose super-trajectories, an over-segmentation of dense trajectories, as a new representation for videos. The representation in terms of trajectories rather than pixels imposes long-term temporal consistency in a global manner. The proposed algorithms can be used to find a clustering of trajectories with a minimum number of disconnections among them. Super-trajectories have applications in trajectory-based video analysis. We hope that the vision community will find the proposed representation useful in a number of applications.
References
[1] Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In European Conference on Computer Vision, pages 94–106. Springer, 2006.
[2] Peter Ochs and Thomas Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1583–1590. IEEE, 2011.
[3] Katerina Fragkiadaki, Geng Zhang, and Jianbo Shi. Video segmentation by tracing discontinuities in a trajectory embedding. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1846–1853. IEEE, 2012.
[4] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.
[5] Wenguan Wang, Jianbing Shen, Jianwen Xie, and Fatih Porikli. Super-trajectory for video segmentation. IEEE International Conference on Computer Vision, 2017.
[6] Shankar R Rao, Roberto Tron, René Vidal, and Yi Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[7] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. arXiv preprint arXiv:1010.2955, 2010.
[8] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
