A multimodal lossless coding method for skeletons in videos

Mingzhou Liu; Xiaoyi He; Weiyao Lin; Xintong Han; Yanmin Zhu; Hongtao Lu; Hongkai Xiong

arXiv:1905.01790·cs.MM·May 14, 2019

TL;DR

This paper introduces a multimodal lossless coding method for skeleton data in videos, combining spatial and temporal redundancy reduction techniques to improve compression efficiency significantly.

Contribution

It presents the first multimodal skeleton coding tool with three schemes that adaptively switch for better compression of video skeleton data.

Findings

01

Achieves 74.4% size reduction on surveillance sequences

02

Achieves 54.7% size reduction on overall test sequences

03

Demonstrates effective lossless skeleton data compression

Tables

Table 1: Experimental results of different coding schemes. Sequences 0, 1, and 2 come from the PoseTrack dataset [11]. Sizes are in KB; percentages in parentheses are size changes relative to direct coding.

Seq.	Frames	Resolution	Skeletons/Frame	Frame Skip	Skeleton Source	Direct-coding (KB)	CM1 (KB)	CM2 (KB)	CM3 (KB)	CM4 (KB)
0	31	1280x720	3	0	GT	3.61	3.36 (-6.7%)	0.86 (-76.2%)	0.80 (-77.8%)	0.78 (-78.4%)
0	31	1280x720	3	0	ES	3.65	3.26 (-10.6%)	1.42 (-61.2%)	1.71 (-53.1%)	1.42 (-60.9%)
0	31	1280x720	3	1	GT	1.86	1.74 (-6.7%)	0.58 (-68.8%)	0.55 (-70.5%)	0.53 (-71.5%)
0	31	1280x720	3	1	ES	1.90	1.69 (-11.3%)	0.90 (-52.8%)	0.99 (-47.9%)	0.86 (-54.6%)
1	31	1280x720	2	0	GT	2.42	2.26 (-6.8%)	1.14 (-52.9%)	1.23 (-49.2%)	1.11 (-53.9%)
1	31	1280x720	2	0	ES	2.42	2.22 (-8.1%)	1.00 (-58.6%)	1.25 (-48.2%)	1.01 (-58.4%)
1	31	1280x720	2	1	GT	1.25	1.16 (-7.3%)	0.67 (-46.5%)	0.82 (-34.3%)	0.67 (-46.3%)
1	31	1280x720	2	1	ES	1.25	1.15 (-8.3%)	0.61 (-50.9%)	0.86 (-31.0%)	0.61 (-50.8%)
2	31	1280x720	2	0	GT	2.42	2.87 (18.6%)	1.44 (-40.4%)	1.28 (-47.2%)	1.24 (-48.6%)
2	31	1280x720	2	0	ES	2.74	3.21 (17.3%)	2.15 (-21.2%)	2.35 (-14.0%)	2.19 (-20.1%)
2	31	1280x720	2	1	GT	1.25	1.48 (18.6%)	0.89 (-28.8%)	1.08 (-13.9%)	0.88 (-29.2%)
2	31	1280x720	2	1	ES	1.33	1.58 (18.6%)	1.13 (-15.1%)	1.36 (-2.2%)	1.15 (-13.4%)
3	50	1008x672	8-10	0	GT	18.15	14.30 (-21.2%)	5.61 (-69.1%)	9.51 (-47.6%)	5.60 (-69.1%)
3	50	1008x672	8-10	0	ES	20.93	16.56 (-20.9%)	11.64 (-44.4%)	14.01 (-33.0%)	11.37 (-45.7%)
3	50	1008x672	8-10	1	GT	9.83	7.16 (-27.1%)	4.38 (-55.5%)	5.63 (-42.7%)	4.45 (-54.8%)
3	50	1008x672	8-10	1	ES	10.43	8.24 (-21.0%)	6.56 (-37.1%)	7.77 (-25.5%)	6.44 (-38.2%)
4	86	800x608	18-22	0	GT	65.28	46.03 (-29.5%)	15.09 (-76.9%)	24.97 (-61.8%)	14.16 (-78.3%)
4	86	800x608	18-22	0	ES	76.29	52.69 (-30.9%)	36.90 (-51.6%)	46.00 (-39.7%)	33.48 (-56.1%)
4	86	800x608	18-22	1	GT	32.66	23.02 (-29.5%)	10.68 (-67.3%)	15.76 (-51.7%)	10.17 (-68.8%)
4	86	800x608	18-22	1	ES	38.20	26.38 (-30.9%)	20.42 (-46.6%)	24.81 (-35.1%)	18.81 (-50.8%)
5	80	1280x720	23-33	0	GT	86.27	61.69 (-28.5%)	14.36 (-83.4%)	29.96 (-65.3%)	21.27 (-75.3%)
5	80	1280x720	23-33	0	ES	86.28	58.86 (-31.8%)	61.65 (-28.5%)	89.65 (-3.9%)	56.46 (-34.6%)
5	80	1280x720	23-33	1	GT	43.31	30.94 (-28.6%)	10.90 (-74.8%)	19.21 (-55.6%)	13.98 (-67.7%)
5	80	1280x720	23-33	1	ES	41.47	27.87 (-32.8%)	29.56 (-28.7%)	43.89 (-5.8%)	27.76 (-33.1%)
6	100	1920x1080	34-35	0	GT	149.02	118.11 (-20.7%)	7.43 (-95.0%)	16.95 (-88.6%)	11.66 (-92.2%)
6	100	1920x1080	34-35	0	ES	146.37	116.71 (-20.3%)	85.42 (-41.6%)	112.72 (-23.0%)	77.44 (-47.1%)
6	100	1920x1080	34-35	1	GT	86.24	52.08 (-39.6%)	6.09 (-92.9%)	14.06 (-83.7%)	9.67 (-88.8%)
6	100	1920x1080	34-35	1	ES	71.25	56.52 (-20.7%)	43.80 (-38.5%)	58.05 (-18.5%)	39.75 (-44.2%)
Average on our surveillance seq.					GT	-	-28.1%	-76.9%	-62.1%	-74.4%
Average on our surveillance seq.					ES	-	-26.2%	-39.6%	-20.6%	-43.7%
Average						-	-15.3%	-53.8%	-41.0%	-54.7%

Equations

$$\mathbf{SK_{i}} = \{l_{i}, (x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), \dots, (x_{i,14}, y_{i,14})\}$$

$$\mathbf{SK_{i}} - \mathbf{SK_{k}} = \{(x_{i,j} - x_{k,j}, y_{i,j} - y_{k,j}) \mid j = 1, 2, \dots, 14\}$$

$$MV(\mathbf{SK_{i}}, \mathbf{SK_{k}}) = (MV_{x}, MV_{y}) = (x_{i,2} - x_{k,2}, y_{i,2} - y_{k,2})$$

$$MC(\mathbf{SK_{i}}) = \{(x_{i,j} + MV_{x}, y_{i,j} + MV_{y}) \mid j = 1, 3, 4, \dots, 14\}$$

$$EP(\mathbf{SK_{i}}) = \mathbf{SK_{i}} - ME(\mathbf{SK_{i}})$$

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Video Analysis and Summarization

Full text

A Multimodal Lossless Coding Method for Skeletons in Videos

Abstract

Skeleton information in videos plays an important role in human-centric video analysis, but the effective coding of such massive skeleton information has not been addressed in previous work. In this paper, we make the first attempt to solve this problem by proposing a multimodal skeleton coding tool containing three different coding schemes, namely a spatial differential-coding scheme, a motion-vector-based differential-coding scheme, and an inter prediction scheme, thus utilizing both spatial and temporal redundancy to losslessly compress skeleton data. More importantly, the tool switches among these schemes according to the type of each skeleton in a video frame, which further improves the compression rate. Experimental results show that our approach leads to 74.4% and 54.7% size reduction on our surveillance sequences and on the overall test sequences, respectively, which demonstrates the effectiveness of our skeleton coding tool.

**Index Terms:** feature coding, skeleton coding

1 Introduction and Related work

Skeleton information in videos has recently become increasingly important in many applications such as event detection and video recognition. For example, previous works have shown how action recognition can benefit from skeleton-based video modeling [1, 2, 3, 4]. A person’s pose is described by multiple skeleton key joints, and the skeleton information in videos represents the dynamic characteristics of body postures, which makes it widely used in human action recognition and other video analysis tasks.

Since video analysis is performed directly on extracted features, shifting feature extraction into a camera-integrated module can reduce the load on the analysis server and is highly desirable. Therefore, several feature coding methods that compress and transmit different kinds of extracted video features have been proposed recently. Duan et al. [5] describe compact descriptors for video analysis, where handcrafted and deep features are compressed and transmitted in a standardized bitstream. Chen et al. [6] introduce a Region-of-Interest (ROI) location coding tool in which the ROI location information itself is coded in the video bitstream.

Recently, reliable human skeletons can be obtained from depth sensors using real-time skeleton estimation algorithms. However, transmitting these skeletons directly back to the analysis server is too expensive. Although skeleton information plays an important role in video analysis, existing approaches have overlooked the coding of this massive skeleton information. Therefore, it is necessary to develop new algorithms to encode skeleton data efficiently. To the best of our knowledge, this paper is the first to study coding skeleton information into a bitstream.

In our case, the skeletons in many video frames need to be compressed and transmitted. We represent a human skeleton by fourteen key joints, as shown in Fig. 1a. For example, the $1^{st}$ joint is located at the nose and the $11^{th}$ joint represents the right ankle. Our task is to encode and transmit the size and location of each key point of these skeletons to the decoder. One straightforward way to do this is to directly transmit the $(x,y)$ coordinates of every key joint. This simple method works well when there are only a few people in the video. However, when the number of skeletons becomes large (for example, the video frame shown in Fig. 1b), the skeleton location data becomes large and non-negligible. According to our experiments, the skeleton data takes about 42% of the total bits for a video like Fig. 1b with about 35 skeletons in each frame. Therefore, new algorithms are required to compress this massive skeleton data efficiently.

To this end, we propose a novel approach that losslessly compresses the skeleton information with a skeleton encoder operating alongside a video codec; the framework is shown in Fig. 1c. In the encoder, the input video frame is encoded by a video encoder such as H.265. Meanwhile, the skeletons of this video frame are encoded by our skeleton encoding module, which also takes the skeletons of previous frames from the local skeleton decoder as input. These previous skeletons are used as references to reduce the redundancy of the skeletons in the current frame. The resulting skeleton bitstream is then combined with the bitstream of the frame to form the final output bitstream. Since the decoding process can be easily derived from the encoding process, we focus only on skeleton encoding in this paper.

The proposed multimodal skeleton coding tool contains three coding schemes: (1) a spatial differential-coding scheme, (2) a motion-vector-based (MV-based) differential-coding scheme, and (3) an inter prediction scheme, which are switched dynamically to encode different types of skeletons. In summary, our contributions are twofold:

1. This is the first work to study coding skeleton information itself into a bitstream. The skeleton coding tool developed in this paper reduces the size of skeleton data in videos by 54.7% on average.

2. We introduce three different schemes for skeleton coding. Furthermore, we propose a multimodal scheme that integrates these schemes and achieves more robust skeleton encoding results.

The rest of the paper is organized as follows: Section 2 describes the framework of our skeleton information coding tool. Section 3 details the coding tool and its three sub-schemes. Section 4 presents the experimental settings and results. Section 5 concludes the paper.

2 Overview of our method

Fig. 2 shows the framework of our multimodal skeleton coding algorithm. Each skeleton is routed to one of three coding schemes so as to achieve a higher lossless compression rate. The spatial differential-coding scheme utilizes spatial redundancy to compress skeleton data, while the MV-based differential-coding scheme and the inter prediction scheme are mainly based on temporal redundancy. Thus, our multimodal skeleton coding tool can efficiently compress complex skeleton trajectories in crowded scenes.

3 The skeleton information coding tool

In this section, we will first detail the definition of skeletons in video and then describe the three proposed skeleton coding schemes. Finally, a multimodal skeleton coding method is introduced.

3.1 Definitions

As mentioned above, the skeleton of a human can be described and coded by fourteen key points. Accordingly, we define the skeleton information as:

$$\mathbf{SK_{i}} = \{l_{i}, (x_{i,1}, y_{i,1}), (x_{i,2}, y_{i,2}), \dots, (x_{i,14}, y_{i,14})\}$$

where $l_{i}$ is the ID of the $i^{th}$ human $\mathbf{SK_{i}}$ in one frame and $(x_{i,j},y_{i,j})$ are the horizontal and vertical coordinates of the $j^{th}$ key point of $\mathbf{SK_{i}}$ ( $j\in\{1,2,\dots,14\}$ ). Note that each person has a unique ID over the whole video, which is assigned according to the time of the person's first appearance in the video. The index $i$ of a skeleton within a frame is determined by its label. With these 29 elements (one label and fourteen coordinate pairs), one human skeleton in one video frame is determined uniquely.

The difference between two skeletons is defined as the set of differences between corresponding key joints:

$$\mathbf{SK_{i}} - \mathbf{SK_{k}} = \{(x_{i,j} - x_{k,j}, y_{i,j} - y_{k,j}) \mid j = 1, 2, \dots, 14\}$$
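As a minimal sketch of these definitions, the skeleton record and the joint-wise difference could be represented as follows. The class name `Skeleton`, the helper `skeleton_difference`, and the use of integer pixel coordinates are assumptions made for illustration; the paper does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_JOINTS = 14  # each skeleton is described by fourteen key joints

Point = Tuple[int, int]  # assumed integer pixel coordinates (x, y)


@dataclass(frozen=True)
class Skeleton:
    """Hypothetical container for one skeleton SK_i in one frame."""
    label: int                  # person ID l_i, unique over the whole video
    joints: Tuple[Point, ...]   # joints 1..14 stored at indices 0..13

    def __post_init__(self):
        if len(self.joints) != NUM_JOINTS:
            raise ValueError(f"expected {NUM_JOINTS} joints, got {len(self.joints)}")

    def flatten(self) -> List[int]:
        """Return the 29 elements: the label followed by 14 (x, y) pairs."""
        out = [self.label]
        for x, y in self.joints:
            out.extend((x, y))
        return out


def skeleton_difference(a: Skeleton, b: Skeleton) -> List[Point]:
    """Joint-wise difference SK_a - SK_b over the 14 corresponding joints."""
    return [(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a.joints, b.joints)]
```

Storing the joints as a fixed-length tuple keeps the joint index implicit, so the difference of two skeletons is a plain element-wise subtraction.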

3.2 Skeleton Coding Schemes

Three coding schemes are introduced in our skeleton information coding tool:

Spatial differential-coding scheme. Considering the spatial correlation of joints within a skeleton, we developed a spatial differential-coding scheme that utilizes spatial redundancy to compress the skeleton data. As shown in Fig. 3, only the absolute coordinates of the $1^{st}$ joint and the difference vectors between neighboring joints (see the red joint and the vectors between joints) are encoded for each skeleton.

The procedure is as follows: for each skeleton in a frame, the coordinates of the $1^{st}$ joint are encoded first, and a set $\mathbf{E}=\{1\}$ of already-encoded joints is initialized. Then, for each encoded joint in $\mathbf{E}$ , the difference between it and each of its not-yet-encoded neighbors is encoded, and those neighbors are added to $\mathbf{E}$ . This process is repeated until all joints of the skeleton are encoded.
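The traversal described above can be sketched as a breadth-first walk over the joint graph. The adjacency table `NEIGHBORS` below is hypothetical: the actual joint topology is given only in the paper's Fig. 3, so a plausible tree rooted at joint 1 is assumed here, and entropy coding of the resulting symbols is left out.

```python
from collections import deque

# Hypothetical joint adjacency (1-based joint ids). The real topology is
# defined by the paper's Fig. 3 and is not reproduced in the text.
NEIGHBORS = {
    1: [2],
    2: [1, 3, 6, 9, 12],
    3: [2, 4], 4: [3, 5], 5: [4],
    6: [2, 7], 7: [6, 8], 8: [7],
    9: [2, 10], 10: [9, 11], 11: [10],
    12: [2, 13], 13: [12, 14], 14: [13],
}


def traversal_order(root=1):
    """Breadth-first (parent, child) pairs, starting with E = {root}."""
    encoded, queue, edges = {root}, deque([root]), []
    while queue:
        parent = queue.popleft()
        for child in NEIGHBORS[parent]:
            if child not in encoded:
                encoded.add(child)
                queue.append(child)
                edges.append((parent, child))
    return edges


def spatial_encode(joints):
    """joints: dict joint_id -> (x, y). Returns root coords + 13 difference vectors."""
    symbols = [joints[1]]  # absolute coordinates of the 1st joint
    for parent, child in traversal_order():
        px, py = joints[parent]
        cx, cy = joints[child]
        symbols.append((cx - px, cy - py))
    return symbols


def spatial_decode(symbols):
    """Invert spatial_encode by replaying the same traversal."""
    joints = {1: symbols[0]}
    for (parent, child), (dx, dy) in zip(traversal_order(), symbols[1:]):
        px, py = joints[parent]
        joints[child] = (px + dx, py + dy)
    return joints
```

Because encoder and decoder share the same fixed traversal, no joint indices need to be transmitted; the decoder recovers every joint exactly, which keeps the scheme lossless.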

MV-based differential-coding scheme. When many skeletons need to be encoded, as in a dense crowd scene, a compression algorithm that handles this large amount of skeleton data efficiently is needed. Therefore, we developed an MV-based differential-coding scheme that mainly utilizes the temporal redundancy of skeletons (the skeletons of the same person in different frames are highly correlated). As shown in Fig. 4, the skeleton with lighter yellow joints and dashed lines in the $t^{th}$ frame is co-located with the one in the $(t-1)^{th}$ frame. A predicted skeleton is then obtained using the motion vector calculated from the $2^{nd}$ joint (which corresponds to the center of the human body) of the co-located and original skeletons. Finally, the differences between the MV-predicted skeleton and the original one are encoded.

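Formally, for a frame at $T=t$ , the $(t-1)^{th}$ frame is chosen as the reference frame. Then, for each skeleton $\mathbf{SK_{i}^{t}}$ , the difference between it and its corresponding skeleton in the selected reference frame is encoded. More specifically, the motion vector (MV) of the $2^{nd}$ joint is first calculated:

$$MV(\mathbf{SK_{i}}, \mathbf{SK_{k}}) = (MV_{x}, MV_{y}) = (x_{i,2} - x_{k,2}, y_{i,2} - y_{k,2})$$

Then the motion compensation (MC) of the other joints of $\mathbf{SK_{i}}$ is obtained using the MV of the $2^{nd}$ joint:

$$MC(\mathbf{SK_{i}}) = \{(x_{i,j} + MV_{x}, y_{i,j} + MV_{y}) \mid j = 1, 3, 4, \dots, 14\}$$

Finally, the encoded parameters are defined as:

$$EP(\mathbf{SK_{i}}) = \mathbf{SK_{i}} - ME(\mathbf{SK_{i}})$$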
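One way to read these three steps is sketched below, assuming that the prediction is formed by shifting the reference skeleton's joints by the MV of the 2nd joint and that the encoded parameters are the MV plus the 13 per-joint residuals. The function names `mv_encode` and `mv_decode` and the dictionary layout are hypothetical, and entropy coding is omitted.

```python
CENTER = 2  # the 2nd joint (body center) is used to compute the MV


def mv_encode(current, reference):
    """current/reference: dict joint_id (1..14) -> (x, y) for the same person.

    Returns the motion vector and the residuals of the other 13 joints.
    """
    mv = (current[CENTER][0] - reference[CENTER][0],
          current[CENTER][1] - reference[CENTER][1])
    residuals = {}
    for j in current:
        if j == CENTER:
            continue  # joint 2 is recovered exactly from the MV
        # Assumed motion compensation: reference joint shifted by the MV.
        pred = (reference[j][0] + mv[0], reference[j][1] + mv[1])
        residuals[j] = (current[j][0] - pred[0], current[j][1] - pred[1])
    return mv, residuals


def mv_decode(mv, residuals, reference):
    """Rebuild the current skeleton from the reference, the MV, and the residuals."""
    current = {CENTER: (reference[CENTER][0] + mv[0],
                        reference[CENTER][1] + mv[1])}
    for j, (rx, ry) in residuals.items():
        current[j] = (reference[j][0] + mv[0] + rx,
                      reference[j][1] + mv[1] + ry)
    return current
```

Under this reading, a skeleton that is purely translated between frames produces all-zero residuals, so only the motion vector carries information.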

Inter prediction scheme. In the MV-based differential-coding scheme, the motion vector of the $2^{nd}$ joint is used to predict all joints. This is optimal when the skeleton is almost purely translated from the previous frame to the current one (i.e., every joint of the body moves in the same direction and over the same distance, without any rotation or reflection). However, human bodies are non-rigid objects, so this assumption rarely holds in practice. We therefore argue that more accurate joint predictions lead to smaller residuals and thus to a higher compression rate.

For the inter prediction scheme, the corresponding skeletons in the $(t-1)^{th}$ and $(t-2)^{th}$ frames are used to predict the skeleton in the $t^{th}$ frame (light yellow joints and dashed lines), as shown in Fig. 5. Then the differences between the original skeleton and the predicted skeleton are encoded.

Trajectory prediction. Trajectory prediction has been studied extensively [7, 8, 9, 10]. In our method, the trajectory prediction method proposed in [10] is used. More specifically, every key joint of a skeleton in the $t^{th}$ frame is predicted individually from the corresponding joint in the $(t-1)^{th}$ and $(t-2)^{th}$ frames (i.e., the $(t-1)^{th}$ and $(t-2)^{th}$ frames are chosen as the reference frames).
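The predictor of [10] is not described in the text, so the sketch below swaps in a simple constant-velocity extrapolation purely as a stand-in to show how per-joint prediction from two reference frames yields residuals. The function names are hypothetical and this is not the paper's predictor.

```python
def predict_linear(prev1, prev2):
    """Stand-in predictor: p(t) ~ 2 * p(t-1) - p(t-2), applied to each joint.

    prev1 / prev2: dict joint_id -> (x, y) for frames t-1 and t-2.
    """
    return {j: (2 * prev1[j][0] - prev2[j][0],
                2 * prev1[j][1] - prev2[j][1]) for j in prev1}


def inter_encode(current, prev1, prev2):
    """Residual of every joint against its individually predicted position."""
    pred = predict_linear(prev1, prev2)
    return {j: (current[j][0] - pred[j][0],
                current[j][1] - pred[j][1]) for j in current}


def inter_decode(residuals, prev1, prev2):
    """Rebuild the current skeleton from the two references and the residuals."""
    pred = predict_linear(prev1, prev2)
    return {j: (pred[j][0] + residuals[j][0],
                pred[j][1] + residuals[j][1]) for j in residuals}
```

Unlike the MV-based scheme, each joint gets its own prediction here, so limbs that move differently from the body center can still produce small residuals.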

3.3 Multimodal skeleton coding

Because labeling skeleton data is expensive, the skeletons in videos are often estimated by existing skeleton estimation methods. However, these methods may introduce unexpected skeleton trajectories (for example, missing key joints or inaccurate matching and tracking), which makes the correlations between skeletons more complex and calls for a more robust and efficient algorithm. To this end, we propose a multimodal skeleton coding method in which the three schemes are switched for encoding skeletons.

The framework of our multimodal skeleton coding scheme is shown in Fig. 2. The switching rules are defined as follows:

1. For a skeleton that newly appears in the current frame, the spatial differential-coding scheme is used. The spatial differential-coding scheme is also used for the first frame.

2. When both the MV-based differential-coding scheme and the inter prediction scheme can be used for a skeleton, the one with the shorter encoded bit length is chosen. A flag indicating the chosen scheme is allocated and transmitted.

3. For other skeletons that exist in the $(t-1)^{th}$ and $t^{th}$ frames but cannot be found in the $(t-2)^{th}$ frame, the MV-based differential-coding scheme is used.

Furthermore, two details should be noted: (1) For a skeleton that exists in the previous frame but disappears in the current frame, a disappear flag is allocated in the bitstream. (2) For a skeleton that is exactly the same as its corresponding skeleton in the reference frame, a skip flag is allocated to indicate this condition instead of encoding fourteen zeros.

Fig. 6 shows an example of coding the skeletons in a frame using our proposed multimodal coding method. $S_{1}$ exists in all three frames, so both the MV-based scheme and the inter prediction scheme can be used. The MV-based scheme, which leads to a shorter bit length for this skeleton, is chosen and a flag is transmitted. Because $S_{2}$ exists only in the previous and current frames, the MV-based scheme is chosen. $S_{3}$ disappears in the current frame, so a disappear flag is allocated. As for $S_{4}$ , which newly appears in the current frame, the spatial differential-coding scheme is applied. The resulting bitstream of the $t^{th}$ frame is also shown in Fig. 6.
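The switching rules can be summarized as a small decision function. This is a sketch under stated assumptions: the name `select_mode`, the string mode labels, and the tie-break toward the MV-based scheme are all hypothetical, the bit lengths are supplied by the caller, and the skip flag for identical skeletons is left out for brevity.

```python
def select_mode(in_t, in_t1, in_t2, mv_bits=None, inter_bits=None):
    """Pick the coding mode for one person ID.

    in_t, in_t1, in_t2: whether the skeleton exists in frames t, t-1, t-2.
    mv_bits, inter_bits: encoded bit lengths of the two temporal schemes;
    both are required when the skeleton exists in all three frames.
    """
    if not in_t:
        # Present before but gone now: signal disappearance, else nothing to code.
        return "disappear" if in_t1 else None
    if not in_t1:
        return "spatial"  # newly appearing skeleton, or the first frame
    if in_t2:
        # Both temporal schemes apply: keep the shorter one (ties go to MV here).
        return "mv" if mv_bits <= inter_bits else "inter"
    return "mv"  # exists in t-1 and t only


# The four situations described for the Fig. 6 example, with made-up bit lengths.
print(select_mode(True, True, True, mv_bits=40, inter_bits=52))  # mv
print(select_mode(True, True, False))                            # mv
print(select_mode(False, True, True))                            # disappear
print(select_mode(True, False, False))                           # spatial
```

In a full encoder the returned mode would also determine which flag is written to the bitstream; only the second rule needs an explicit scheme flag, since the other cases can be inferred by the decoder from which IDs were present in earlier frames.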

4 Experimental results

4.1 Settings

In our experiments, the aforementioned four schemes (three single-modal schemes and one multimodal scheme) are evaluated and compared.

The test set comprises 7 videos with different resolutions and scenes. Three of them come from the PoseTrack dataset [11] and the others were collected and labeled by ourselves. Some examples are shown in Fig. 7. To evaluate the performance of our methods under different degrees of motion, the test sequences are re-sampled with different sample rates before being encoded. Apart from encoding the ground-truth skeletons (GT), we also evaluate our methods with skeletons estimated by [12] (ES). Note that only the compression rate is used to evaluate our proposed lossless compression method.

4.2 Results of different coding schemes

Table 1 compares the performance of the different coding methods. In Table 1, CM1 denotes the spatial differential-coding scheme; CM2 the MV-based differential-coding scheme; CM3 the inter prediction scheme; and CM4 our full multimodal coding method. Note that for a skeleton for which the MV-based scheme (inter prediction scheme) cannot be used, the spatial differential-coding scheme is used in CM2 (CM3). From Table 1, we make the following observations:

1. The full version of our approach, the multimodal coding method (CM4), achieves the best performance on average. Specifically, it reduces the size of the encoded skeleton data by 54.7% on average.

2. More importantly, our multimodal scheme outperforms the MV-based scheme (an extra 4.1% size reduction) when compressing estimated skeletons of surveillance sequences (i.e., the most practical situation). This demonstrates that our multimodal coding method is more robust than the other compared methods when the skeleton trajectories in videos are complex and noisy, and is therefore especially useful in real applications.

3. When encoding annotated skeletons of our collected surveillance sequences, our MV-based differential-coding scheme and multimodal coding method reduce the encoded size by 76.9% and 74.4%, respectively. This clearly indicates the effectiveness of our designed skeleton coding schemes.

4. Our MV-based scheme achieves a 53.7% size reduction across all test sequences, which is slightly worse than our multimodal scheme. This indicates that the MV-based scheme can also provide satisfactory results in different kinds of applications.
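The percentages in Table 1 appear to be relative size changes with respect to direct coding. A short check of that reading, using two rows of sequence 6 (frame skip 0), is sketched below; the helper name `size_change_percent` is hypothetical. Because the table lists sizes rounded to two decimals, recomputed values for some rows may differ from the printed ones in the last digit.

```python
def size_change_percent(direct_kb: float, coded_kb: float) -> float:
    """Relative size change versus direct coding, in percent (negative = smaller)."""
    return round((coded_kb - direct_kb) / direct_kb * 100.0, 1)


# Sequence 6, frame skip 0, multimodal scheme (CM4), sizes taken from Table 1.
print(size_change_percent(149.02, 11.66))  # GT row: -92.2
print(size_change_percent(146.37, 77.44))  # ES row: -47.1
```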

5 Conclusion

This paper presents a new skeleton coding tool for encoding skeletons in videos. We introduce a multimodal scheme that switches among three encoding sub-schemes, which exploit both spatial and temporal redundancy to compress skeleton data, hence achieving higher coding efficiency. Experimental results show that the size of skeleton data can be reduced efficiently using our multimodal coding tool.

Acknowledgement

This paper is supported in part by the Shanghai “The Belt and Road” Young Scholar Exchange Grant (17510740100) and the PKU-NTU Joint Research Institute (JRI), sponsored by a donation from the Ng Teng Fong Charitable Foundation.

Bibliography

[1] Hongsong Wang and Liang Wang, “Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, and Farid Boussaid, “A new representation of skeleton sequences for 3D action recognition,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 4570–4579.
[3] Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, and Jie Zhou, “Deep progressive reinforcement learning for skeleton-based action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5323–5332.
[4] Girum G Demisse, Konstantinos Papadopoulos, Djamila Aouada, and Bjorn Ottersten, “Pose encoding for robust skeleton-based action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 188–194.
[5] Ling-Yu Duan, Vijay Chandrasekhar, Shiqi Wang, Yihang Lou, Jie Lin, Yan Bai, Tiejun Huang, Alex Chichung Kot, and Wen Gao, “Compact descriptors for video analysis: The emerging MPEG standard,” IEEE MultiMedia, 2018.
[6] Mingliang Chen, Weiyao Lin, and Xiaozhen Zheng, “An efficient coding method for coding region-of-interest locations in AVS 2,” in Multimedia and Expo Workshops (ICMEW), 2014 IEEE International Conference on. IEEE, 2014, pp. 1–5.
[7] Gianluca Antonini, Michel Bierlaire, and Mats Weber, “Discrete choice models of pedestrian walking behavior,” Transportation Research Part B: Methodological, vol. 40, no. 8, pp. 667–687, 2006.
[8] Kota Yamaguchi, Alexander C Berg, Luis E Ortiz, and Tamara L Berg, “Who are you with and where are you going?,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1345–1352.