Learning Fixation Point Strategy for Object Detection and Classification
Jie Lyu, Zejian Yuan, Dapeng Chen

TL;DR
This paper introduces a recurrent attentional model that localizes and classifies objects by sequentially extracting local observations, improving detection accuracy and speed, especially for small objects, without relying on traditional sliding windows or convolutions.
Contribution
The paper presents a novel recurrent attentional framework with a hybrid loss and a stochastic object-aware strategy, enabling end-to-end training for joint detection and classification.
Findings
High detection precision achieved on a new real-world dataset.
Model predicts accurate bounding boxes without pooling operations.
Speed and accuracy can be adjusted by changing recurrent steps.
Abstract
We propose a novel recurrent attentional structure to localize and recognize objects jointly. The network learns to extract a sequence of local observations with detailed appearance and rough context, instead of applying sliding windows or convolutions to the entire image. These observations are then fused to complete the detection and classification tasks. For training, we present a hybrid loss function to learn the parameters of the multi-task network end-to-end. In particular, the combination of stochastic and object-aware strategies, named SA, can select richer context and ensure that the last fixation is close to the object. In addition, we build a real-world dataset to verify the capacity of our method to detect objects of interest, including small ones. Our method can predict a precise bounding box on an image, and achieve high speed on large images without pooling…
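The abstract's idea of fusing a sequence of local observations can be illustrated with a generic recurrent attention loop. The sketch below is not the authors' architecture: the function names (`extract_glimpse`, `run_fixations`), the two-scale glimpse, the layer sizes, and the random untrained weights are all assumptions made for illustration. The fixation update here is deterministic, whereas the paper describes a stochastic, object-aware strategy.

```python
import numpy as np

def extract_glimpse(image, center, size=8, scales=2):
    """Crop `scales` square patches centered at `center` (row, col).

    Patch k covers size * 2**k pixels and is subsampled back to
    size x size, so patch 0 keeps detail and later patches give
    coarser, wider context.
    """
    patches = []
    for k in range(scales):
        half = (size * 2 ** k) // 2
        # Zero-pad so that crops near the border stay in bounds.
        padded = np.pad(image, half, mode="constant")
        r, c = int(center[0]) + half, int(center[1]) + half
        crop = padded[r - half:r + half, c - half:c + half]
        step = 2 ** k
        patches.append(crop[::step, ::step])
    return np.stack(patches)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_fixations(image, n_steps=4, hidden=16, n_classes=3, seed=0):
    """Run n_steps glimpses, then read out a box and class scores."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    size, scales = 8, 2
    in_dim = scales * size * size + 2  # glimpse pixels + (row, col)

    # Random placeholders standing in for learned parameters (assumption).
    W_in = rng.normal(0, 0.1, (hidden, in_dim))
    W_h = rng.normal(0, 0.1, (hidden, hidden))
    W_loc = rng.normal(0, 0.1, (2, hidden))
    W_box = rng.normal(0, 0.1, (4, hidden))
    W_cls = rng.normal(0, 0.1, (n_classes, hidden))

    state = np.zeros(hidden)
    loc = np.array([0.5, 0.5])  # normalized (row, col); start at the center
    fixations = []
    for _ in range(n_steps):
        center = loc * np.array([h - 1, w - 1])
        glimpse = extract_glimpse(image, center, size, scales).ravel()
        x = np.concatenate([glimpse, loc])
        # Fuse the new observation with what has been seen so far.
        state = np.tanh(W_in @ x + W_h @ state)
        fixations.append(loc.copy())
        # Propose the next fixation; sigmoid keeps it inside the image.
        loc = sigmoid(W_loc @ state)

    box = sigmoid(W_box @ state)  # normalized box parameters
    logits = W_cls @ state
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.array(fixations), box, probs
```

In this sketch the work per image grows with `n_steps` and the glimpse size rather than with the image resolution, which is one way to read the claim that speed and accuracy can be traded off by changing the number of recurrent steps.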
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
