ECO: Efficient Convolutional Network for Online Video Understanding
Mohammadreza Zolfaghari, Kamaljeet Singh, Thomas Brox

TL;DR
This paper introduces ECO, a convolutional network architecture that efficiently captures long-term video content for online understanding, enabling fast processing and high-quality classification and captioning.
Contribution
ECO merges long-term content inside the network rather than in a post-hoc fusion step, which enables fast per-video processing while keeping classification and captioning quality high.
Findings
Processes up to 230 videos per second.
Runs 10x to 80x faster than state-of-the-art methods.
Maintains competitive accuracy across multiple datasets.
Abstract
The state of the art in video understanding suffers from two problems: (1) The major part of reasoning is performed locally in the video, therefore, it misses important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy, which exploits that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video…
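The sampling idea in the abstract can be illustrated with a small sketch. It assumes one common way to exploit frame redundancy: split the video into equal-length segments and draw a single random frame from each, so the network sees a fixed number of frames that span the whole video. The function name `sample_frame_indices` and its parameters are hypothetical and are not taken from the authors' code.

```python
import random


def sample_frame_indices(num_frames, num_segments, rng=None):
    """Return one frame index per segment (illustrative sketch only).

    The video is treated as num_segments consecutive chunks of nearly
    equal length, and one index is drawn uniformly from each chunk.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()

    indices = []
    for s in range(num_segments):
        # Integer chunk boundaries: [start, end)
        start = (s * num_frames) // num_segments
        end = ((s + 1) * num_frames) // num_segments
        if end <= start:
            # Video shorter than num_segments: chunk is empty, reuse a valid frame.
            indices.append(min(start, num_frames - 1))
        else:
            indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    # Example: a 300-frame video reduced to 16 sampled frames.
    print(sample_frame_indices(300, 16, random.Random(0)))
```

Because the number of sampled frames is fixed, the per-video cost under this scheme does not grow with video length, which is consistent with the per-video throughput figures reported above.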
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
