TL;DR
This paper introduces PatchNet, a weakly supervised patch-level model, and VSAD, a hybrid global image representation, achieving state-of-the-art scene recognition results by combining CNN features with descriptor encoding.
Contribution
It presents a novel weakly supervised patch-level network and a hybrid representation for scene recognition, improving accuracy over existing methods.
Findings
Achieved 86.2% accuracy on MIT Indoor67
Achieved 73.0% accuracy on SUN397
Outperformed previous state-of-the-art methods
Abstract
Traditional feature encoding schemes (e.g., Fisher vector) with local descriptors (e.g., SIFT) and recent convolutional neural networks (CNNs) are two classes of successful methods for image recognition. In this paper, we propose a hybrid representation, which leverages the discriminative capacity of CNNs and the simplicity of descriptor encoding schemes for image recognition, with a focus on scene recognition. To this end, we make three main contributions. First, we propose a patch-level, end-to-end architecture to model the appearance of local patches, called PatchNet. PatchNet is essentially a customized network trained in a weakly supervised manner, which uses image-level supervision to guide patch-level feature extraction. Second, we present a hybrid visual representation, called VSAD, by utilizing the robust feature representations…
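To make the weak supervision idea concrete, here is a minimal sketch of one common way image-level labels can supervise patches: every patch cropped from an image simply inherits that image's scene label. The grid sampling, the patch size and stride, and the function names `extract_patches` and `weakly_labeled_patches` are all assumptions for illustration. The excerpt above does not specify how PatchNet samples patches or defines its training targets.

```python
import numpy as np


def extract_patches(image, patch_size, stride):
    """Slide a square window over an H x W x C image and stack the crops."""
    h, w = image.shape[:2]
    patches = [
        image[y:y + patch_size, x:x + patch_size]
        for y in range(0, h - patch_size + 1, stride)
        for x in range(0, w - patch_size + 1, stride)
    ]
    return np.stack(patches)


def weakly_labeled_patches(image, image_label, patch_size=4, stride=4):
    """Assumed scheme: each patch inherits the image-level label.

    No patch-level annotation is used; the only supervision is the
    scene label of the whole image.
    """
    patches = extract_patches(image, patch_size, stride)
    labels = np.full(len(patches), image_label, dtype=np.int64)
    return patches, labels


# Toy example: an 8x8 RGB image with hypothetical scene label 5.
image = np.zeros((8, 8, 3), dtype=np.float32)
patches, labels = weakly_labeled_patches(image, image_label=5)
```

The resulting `(patches, labels)` pairs could then be fed to any patch-level classifier. The point of the sketch is only that the training signal comes from the image label, not from per-patch annotation.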
