Adaptive Intermediate Representations for Video Understanding
Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael Ryoo, Anelia Angelova

TL;DR
This paper introduces a novel framework that leverages semantic segmentation as an intermediate representation for video understanding, jointly learned with the task, improving performance without extra inference data.
Contribution
It proposes a general framework that learns intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task, and it searches for the best loss weighting via evolution to improve video understanding performance.
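One common way to train intermediate representations jointly with an end task is to minimize a weighted sum of the individual losses. The following is a minimal sketch under that assumption; the summary does not give the paper's exact objective, and the names `LossWeights` and `joint_loss` as well as the default weight values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical per-loss weights; names and defaults are illustrative only."""
    task: float = 1.0          # final video understanding loss
    flow: float = 0.5          # optical flow (intermediate) loss
    segmentation: float = 0.5  # semantic segmentation (intermediate) loss


def joint_loss(task_loss: float, flow_loss: float, seg_loss: float,
               weights: LossWeights) -> float:
    """Weighted sum of the end-task loss and the two intermediate losses.

    This assumes a simple linear combination, which may differ from the
    formulation used in the paper.
    """
    return (weights.task * task_loss
            + weights.flow * flow_loss
            + weights.segmentation * seg_loss)
```

In a real training loop the three loss values would come from the network's task head and its intermediate flow and segmentation outputs; the weights are the quantities a search procedure would tune.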
Findings
Achieves state-of-the-art performance on video understanding benchmarks.
No additional data needed during inference beyond RGB sequences.
Optimized loss weighting improves the effectiveness of intermediate representations.
Abstract
A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow. In this work, we first introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding and use it in a way that requires no additional labeling. Second, we propose a general framework that learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task and allows the adaptation of the representations to the end goal. Despite the use of intermediate representations within the network, during inference, no additional data beyond RGB sequences is needed, enabling efficient recognition with a single network. Finally, we present a way to find the optimal learning configuration by searching for the best loss weighting via evolution. We…
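The idea of searching for a loss weighting via evolution can be illustrated with a generic evolutionary loop. This is a sketch, not the authors' algorithm: the mutation scheme, the population size, and the function names (`mutate`, `evolve_loss_weights`, `toy_fitness`) are all assumptions. In practice the fitness of a candidate would be the validation performance of a network trained with those weights, which is far more expensive than the toy function used here.

```python
import random
from typing import Callable, Tuple

# (flow weight, segmentation weight); a hypothetical two-weight search space.
Weights = Tuple[float, float]


def mutate(weights: Weights, rng: random.Random, scale: float = 0.1) -> Weights:
    """Perturb each weight with Gaussian noise, clamped to stay non-negative."""
    flow_w, seg_w = weights
    return (max(0.0, flow_w + rng.gauss(0.0, scale)),
            max(0.0, seg_w + rng.gauss(0.0, scale)))


def evolve_loss_weights(fitness: Callable[[Weights], float],
                        population_size: int = 8,
                        generations: int = 20,
                        seed: int = 0) -> Weights:
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [(rng.random(), rng.random()) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]  # survivors are kept unchanged
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(weights: Weights) -> float:
    """Stand-in for validation accuracy; peaks at an arbitrary (0.3, 0.7)."""
    flow_w, seg_w = weights
    return -((flow_w - 0.3) ** 2 + (seg_w - 0.7) ** 2)


best_weights = evolve_loss_weights(toy_fitness)
```

Because the surviving parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.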
