Unsupervised Action Localization Crop in Video Retargeting for 3D ConvNets
Prithwish Jana, Swarnabja Bhaumik, Partha Pratim Mohanta

TL;DR
This paper introduces an unsupervised method that crops videos around their subjects before they are fed to 3D CNNs. The crop keeps the subject in view, a shape-fitting step reduces flickering between frames, and classification accuracy improves as a result.
Contribution
It presents a new unsupervised video cropping technique that combines per-frame action localization with 3D shape fitting to improve 3D CNN performance on untrimmed videos.
Findings
Outperforms random cropping in classification accuracy on benchmark datasets.
Maintains subject focus and reduces flickering in cropped videos.
Achieves higher accuracy with smaller or same-sized crops compared to larger random crops.
Abstract
Untrimmed videos on social media or those captured by robots and surveillance cameras are of varied aspect ratios. However, 3D CNNs usually require as input a square-shaped video, whose spatial dimension is smaller than the original. Random- or center-cropping may leave out the video's subject altogether. To address this, we propose an unsupervised video cropping approach by shaping this as a retargeting and video-to-video synthesis problem. The synthesized video maintains a 1:1 aspect ratio, is smaller in size and is targeted at video-subject(s) throughout the entire duration. First, action localization is performed on each frame by identifying patches with homogeneous motion patterns. Thus, a single salient patch is pinpointed per frame. But to avoid viewpoint jitters and flickering, any inter-frame scale or position changes among the patches should be performed gradually over time.…
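The requirement that scale and position change gradually can be illustrated with a small sketch. It assumes per-frame salient patches are already available as hypothetical (center x, center y, side length) triples, and it swaps in a plain moving average for temporal smoothing. This is not the shape-fitting procedure the paper proposes; it only shows the kind of jitter that step is meant to remove. The function name and the `window` parameter are assumptions made for illustration.

```python
import numpy as np

def smooth_square_crops(boxes, frame_w, frame_h, window=5):
    """Smooth per-frame square crops over time (illustrative stand-in only).

    boxes: array of shape (T, 3) holding hypothetical per-frame
           (center_x, center_y, side) values from some localization step.
    Returns an array of shape (T, 3) holding (x0, y0, side) per frame.
    """
    assert window % 2 == 1, "use an odd window so the output keeps T frames"
    boxes = np.asarray(boxes, dtype=float)
    pad = window // 2
    # Repeat the first and last boxes so the average is defined at the ends.
    padded = np.pad(boxes, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid") for k in range(3)],
        axis=1,
    )
    # Keep each crop square (1:1) and inside the frame boundaries.
    side = np.minimum(smoothed[:, 2], min(frame_w, frame_h))
    x0 = np.clip(smoothed[:, 0] - side / 2, 0, frame_w - side)
    y0 = np.clip(smoothed[:, 1] - side / 2, 0, frame_h - side)
    return np.stack([x0, y0, side], axis=1)

# A subject that appears to jump 100 px between two consecutive frames.
raw = np.array([[100, 120, 64]] * 5 + [[200, 120, 64]] * 5, dtype=float)
crops = smooth_square_crops(raw, frame_w=320, frame_h=240, window=5)
```

With the moving average, the abrupt jump in the raw patch centers is spread over several frames, so the crop window drifts toward the new position instead of snapping to it. A wider window gives a steadier viewpoint at the cost of lagging behind fast-moving subjects.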
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications
Methods: 3 Dimensional Convolutional Neural Network
