TL;DR
This paper introduces Class Feature Pyramids, a versatile method for generating human-understandable explanations of 3D convolutional networks in video action recognition, applicable across various architectures and datasets.
Contribution
We propose Class Feature Pyramids, a novel approach that traverses network structures to identify class-informative kernels at multiple depths, enhancing interpretability of 3D CNNs.
Findings
Effective on six state-of-the-art 3D CNNs
Applicable across diverse architectures and convolution types
Provides insights into class-specific network features
Abstract
Deep convolutional networks are widely used in video action recognition. 3D convolutions are one prominent approach to deal with the additional time dimension. While 3D convolutions typically lead to higher accuracy, the inner workings of the trained models are more difficult to interpret. We focus on creating human-understandable visual explanations that represent the hierarchical parts of spatio-temporal networks. We introduce Class Feature Pyramids, a method that traverses the entire network structure and incrementally discovers kernels at different network depths that are informative for a specific class. Our method does not depend on the network's architecture or the type of 3D convolutions, supporting grouped and depth-wise convolutions, convolutions in fibers, and convolutions in branches. We demonstrate the method on six state-of-the-art 3D convolutional neural networks (CNNs)…
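The traversal described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a plain feed-forward stack in which each layer is summarized by one pooled activation per kernel, and layer-to-layer connection strengths are given as matrices. The names `select_informative`, `class_kernel_pyramid`, `acts`, `links`, and `ratio`, and the mean-based threshold, are all hypothetical choices made for illustration.

```python
import numpy as np

def select_informative(contrib, ratio=1.0):
    """Indices whose contribution exceeds ratio * mean of the positive contributions.

    The thresholding rule is an assumption for this sketch, not taken from the paper.
    """
    positive = contrib[contrib > 0]
    if positive.size == 0:
        return np.array([], dtype=int)
    return np.flatnonzero(contrib > ratio * positive.mean())

def class_kernel_pyramid(acts, links, class_weights, ratio=1.0):
    """Walk from the classifier back toward the input, keeping informative kernels.

    acts[l]       : 1-D array, pooled activation of each kernel at layer l
    links[l]      : array of shape (n_{l+1}, n_l), hypothetical connection
                    strength from kernels at layer l to kernels at layer l+1
    class_weights : classifier weights of one class over the last layer's kernels
    Returns one index array per layer, ordered from input side to output side.
    """
    top = len(acts) - 1
    selected = [None] * len(acts)
    # Start at the deepest layer: weight activations by the class weights.
    selected[top] = select_informative(acts[top] * class_weights, ratio)
    # Step backwards, following only the kernels kept one layer above.
    for l in range(top - 1, -1, -1):
        chosen = set()
        for k in selected[l + 1]:
            chosen.update(select_informative(acts[l] * links[l][k], ratio).tolist())
        selected[l] = np.array(sorted(chosen), dtype=int)
    return selected

# Toy two-layer example with made-up numbers.
acts = [np.array([1.0, 0.2, 0.5]), np.array([0.9, 0.1])]
links = [np.array([[1.0, 0.1, 0.1],
                   [0.1, 1.0, 1.0]])]
class_weights = np.array([1.0, 0.5])
pyramid = class_kernel_pyramid(acts, links, class_weights)
```

In this toy setup only kernel 0 survives at each depth, because its contribution dominates the mean of the positive contributions. A real 3D CNN would need per-architecture handling of grouped, depth-wise, fiber, and branch convolutions, which this sketch does not attempt.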
Taxonomy
Methods: 3D Convolution · Convolution
