A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension
Yue Liao, Si Liu, Guanbin Li, Fei Wang, Yanjie Chen, Chen Qian, Bo Li

TL;DR
This paper introduces RCCF, a real-time cross-modality correlation filtering approach for referring expression comprehension, achieving high speed and improved accuracy by reformulating the task as a correlation filtering process.
Contribution
The paper proposes a one-stage correlation filtering method that runs in real time without sacrificing accuracy, in contrast to earlier two-stage pipelines that generate proposals and then rank them.
Findings
Runs at 40 FPS, outperforming existing methods in speed.
Almost doubles the state-of-the-art performance on the RefClef dataset.
Achieves leading results on multiple benchmarks including RefCOCO and RefCOCO+.
Abstract
Referring expression comprehension aims to localize the object instance described by a natural language expression. Current referring expression methods have achieved good performance. However, none of them is able to achieve real-time inference without accuracy drop. The reason for the relatively slow inference speed is that these methods artificially split the referring expression comprehension into two sequential stages including proposal generation and proposal ranking. It does not exactly conform to the habit of human cognition. To this end, we propose a novel Realtime Cross-modality Correlation Filtering method (RCCF). RCCF reformulates the referring expression comprehension as a correlation filtering process. The expression is first mapped from the language domain to the visual domain and then treated as a template (kernel) to perform correlation filtering on the image feature…
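The correlation filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the expression has already been mapped to a single C-dimensional kernel (applied as a 1 x 1 filter), uses made-up feature values, and assumes the strongest response marks the referred location. The names `correlation_heatmap` and `peak_location` are hypothetical.

```python
def correlation_heatmap(feature_map, kernel):
    """Correlate a C-dimensional kernel with every cell of an H x W x C feature map."""
    return [
        [sum(f * k for f, k in zip(cell, kernel)) for cell in row]
        for row in feature_map
    ]


def peak_location(heatmap):
    """Return (row, col) of the largest response in the heatmap."""
    best = max(
        ((value, r, c) for r, row in enumerate(heatmap) for c, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]


# Toy 2 x 3 feature map with 2 channels per location (made-up values).
feature_map = [
    [[0.1, 0.0], [0.2, 0.1], [0.0, 0.3]],
    [[0.9, 0.8], [0.1, 0.2], [0.3, 0.0]],
]
# Hypothetical kernel standing in for the expression after mapping
# from the language domain to the visual domain.
kernel = [1.0, 1.0]

heatmap = correlation_heatmap(feature_map, kernel)
print(peak_location(heatmap))
```

Because the whole image is scored in a single pass over the feature map, no separate proposal generation or ranking stage is needed, which is the property the abstract credits for the speed-up.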
Videos
A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Heatmap
