Long-Term Feature Banks for Detailed Video Understanding
Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross Girshick

TL;DR
This paper introduces a long-term feature bank that gives video models context from the entire video rather than only a short clip, yielding state-of-the-art results on AVA, EPIC-Kitchens, and Charades.
Contribution
The paper proposes a long-term feature bank, supportive information extracted over the entire span of a video, to augment existing video models that would otherwise view only short clips of 2-5 seconds.
Findings
State-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.
These results come from augmenting 3D convolutional networks with the long-term feature bank.
Context drawn from the entire video helps models that would otherwise view only short clips of 2-5 seconds.
Abstract
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank, supportive information extracted over the entire span of a video, to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades.
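To make the idea in the abstract concrete, here is a minimal Python sketch of how a short-clip feature might be augmented with context from a bank of features covering the whole video. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_feature_bank` and `attend_to_bank`, the `window` parameter, and the use of softmax attention to pool the bank are all hypothetical choices that this summary does not confirm.

```python
import math


def build_feature_bank(clip_features):
    """Store one feature vector per clip, in temporal order, for the whole video."""
    return [list(f) for f in clip_features]


def attend_to_bank(short_term, bank, center, window):
    """Pool bank entries near `center` and append the result to `short_term`.

    Assumption: softmax attention is used for pooling; this is one plausible
    choice, not necessarily the operator used in the paper.
    """
    lo = max(0, center - window)
    hi = min(len(bank), center + window + 1)
    context = bank[lo:hi]
    dim = len(short_term)

    # Scaled dot-product score between the current clip and each bank entry.
    scores = [
        sum(q * k for q, k in zip(short_term, entry)) / math.sqrt(dim)
        for entry in context
    ]

    # Numerically stable softmax over the scores.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]

    # The weighted sum of bank entries is the long-term context vector.
    long_term = [
        sum(w * entry[i] for w, entry in zip(weights, context))
        for i in range(dim)
    ]

    # Concatenate short-term and long-term features for a downstream classifier.
    return list(short_term) + long_term


bank = build_feature_bank([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
augmented = attend_to_bank(bank[1], bank, center=1, window=1)
print(len(augmented))  # 4
```

In this sketch the bank is computed once per video, so each short clip can look up context far outside its own 2-5 second span without reprocessing the whole video.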
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Advanced Vision and Imaging
