ICAR: Image-based Complementary Auto Reasoning
Xijun Wang, Anqi Liang, Junbang Liang, Ming Lin, Yu Lou, Shan Yang

TL;DR
This paper introduces ICAR, a novel framework for scene-aware complementary item retrieval that leverages a flexible bidirectional transformer to understand visual compatibility and generate compatible items across domains.
Contribution
It proposes a category-aware transformer model that learns inter-object compatibility from large scene datasets in a self-supervised manner, improving retrieval performance.
Findings
Improves FITB (fill-in-the-blank) scores over state-of-the-art methods by up to 5.3% on fashion and 9.6% on furniture.
Improves SFID over state-of-the-art methods by 22.3% on fashion and 31.8% on furniture.
Introduces a generalizable cross-domain visual similarity embedding approach.
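For readers unfamiliar with the FITB metric cited above, here is a minimal sketch of how a fill-in-the-blank accuracy is typically computed: one item is removed from a compatible set, and the model must pick it out of a list of candidates. The scoring function here (mean cosine similarity to the remaining items) is a simple stand-in chosen for illustration, not ICAR's transformer-based scoring, and the names `fitb_accuracy`, `mean_cosine_score`, and `toy_questions` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# One FITB question: (context item embeddings, candidate embeddings, index of the true item)
Question = Tuple[List[Vector], List[Vector], int]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_cosine_score(context: List[Vector], candidate: Vector) -> float:
    """Stand-in compatibility score (an assumption, not the paper's model)."""
    return sum(cosine(c, candidate) for c in context) / len(context)


def fitb_accuracy(
    questions: List[Question],
    score: Callable[[List[Vector], Vector], float] = mean_cosine_score,
) -> float:
    """Fraction of questions where the top-scoring candidate is the true item."""
    correct = 0
    for context, candidates, answer_idx in questions:
        scores = [score(context, cand) for cand in candidates]
        predicted = max(range(len(candidates)), key=scores.__getitem__)
        correct += int(predicted == answer_idx)
    return correct / len(questions)


# Toy 2-D embeddings, invented purely for illustration.
toy_questions: List[Question] = [
    ([(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (1.0, 0.05)], 1),
    ([(0.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)], 0),
    ([(1.0, 0.0)], [(1.0, 0.0), (0.0, 1.0)], 1),
]

if __name__ == "__main__":
    print(f"FITB accuracy: {fitb_accuracy(toy_questions):.3f}")
```

Note that a purely similarity-based scorer like this stand-in fails the third toy question, where the correct item is dissimilar to its context. That is the kind of case the paper's complementarity notion (different items completing a group) is meant to address.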
Abstract
Scene-aware Complementary Item Retrieval (CIR) is a challenging task that requires generating a set of compatible items across domains. Because compatibility is subjective, it is difficult to set up a rigorous standard for both data collection and learning objectives. To address this challenging task, we propose a visual compatibility concept, composed of similarity (resemblance in color, geometry, texture, etc.) and complementarity (different items, such as a table and a chair, completing a group). Based on this notion, we propose a compatibility learning framework, a category-aware Flexible Bidirectional Transformer (FBT), for visual "scene-based set compatibility reasoning" with cross-domain visual similarity input and auto-regressive complementary item generation. We introduce a "Flexible Bidirectional Transformer (FBT)" consisting of an encoder with flexible masking, a category prediction arm,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Absolute Position Encodings · Residual Connection
