TL;DR
This paper introduces a novel fine-grained visual-textual representation learning method that uses GANs to automatically discover discriminative parts by jointly modeling visual and textual data, enhancing categorization accuracy.
Contribution
It proposes a new approach that leverages textual attention to guide visual part discovery and combines visual and textual features for improved fine-grained categorization.
Findings
Automatically discovers discriminative parts using GANs
Joint visual-textual modeling improves categorization accuracy
Enhances fine-grained recognition performance
Abstract
Fine-grained visual categorization is to recognize hundreds of subcategories belonging to the same basic-level category, which is a highly challenging task due to the quite subtle and local visual distinctions among similar subcategories. Most existing methods generally learn part detectors to discover discriminative regions for better categorization performance. However, not all parts are beneficial and indispensable for visual categorization, and the setting of part detector number heavily relies on prior knowledge as well as experimental validation. As is known to all, when we describe the object of an image via textual descriptions, we mainly focus on the pivotal characteristics, and rarely pay attention to common characteristics as well as the background areas. This is an involuntary transfer from human visual attention to textual attention, which leads to the fact that textual…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
