TL;DR
This paper introduces a sketch-based image retrieval method that encodes entire scene compositions containing multiple objects. A CNN trained under triplet loss produces a metric embedding of compositional similarity, and product quantization makes search over that embedding efficient across image collections.
Contribution
The work presents a new approach for compositional scene search using sketches, combining CNN encoding of object appearances and spatial relationships with efficient metric search techniques.
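To make the idea of pooling per-object features into a spatial descriptor concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `spatial_descriptor`, the `grid` parameter, the coverage-weighted sum, and the final L2 normalization are all assumptions chosen for illustration. It assumes each sketched object already has a feature vector (for example from a CNN) and a binary mask marking where it sits on the canvas.

```python
import numpy as np

def spatial_descriptor(features, masks, grid=3):
    """Pool per-object features into a grid-shaped layout descriptor.

    features: (N, D) array, one feature vector per object (assumed given).
    masks:    (N, H, W) binary array, one occupancy mask per object.
    grid:     hypothetical number of cells per side; assumes H, W >= grid.
    Returns a flattened, L2-normalized vector of length grid * grid * D.
    """
    features = np.asarray(features, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    n, d = features.shape
    _, h, w = masks.shape
    desc = np.zeros((grid, grid, d))
    # Pixel boundaries of the grid cells along each axis.
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    for i in range(grid):
        for j in range(grid):
            cell = masks[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            # Fraction of this cell covered by each object, shape (N,).
            cover = cell.reshape(n, -1).mean(axis=1)
            # Coverage-weighted sum of object features in this cell.
            desc[i, j] = cover @ features
    v = desc.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Because each cell only receives features from the objects that overlap it, two scenes with the same objects in different positions yield different descriptors, which is the property a compositional search needs.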
Findings
Effective encoding of multi-object scenes from sketches
Improved retrieval accuracy for complex compositions
Efficient search via product quantization
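The product quantization mentioned in the last finding can be sketched as follows. This is a generic, textbook-style illustration in NumPy, not the paper's pipeline: the helper names (`kmeans`, `pq_train`, `pq_encode`, `pq_search`), the number of subspaces `m`, and the codebook size `k` are assumptions, and a production system would more likely rely on a dedicated library. The sketch assumes the embedding dimension is divisible by `m` and that `k` is at most 256 so codes fit in one byte.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = x[assign == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(0)
    return centroids

def pq_train(x, m, k):
    """Learn one k-entry codebook for each of the m sub-vectors."""
    return [kmeans(sub, k) for sub in np.split(x, m, axis=1)]

def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    subs = np.split(x, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None]) ** 2).sum(-1).argmin(1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_search(query, codes, codebooks, topk=5):
    """Asymmetric distance search: the query itself is not quantized."""
    m = len(codebooks)
    q_subs = np.split(query, m)
    # table[j, c] = squared distance from query sub-vector j to centroid c.
    table = np.stack([((cb - q) ** 2).sum(-1)
                      for q, cb in zip(q_subs, codebooks)])
    # Sum the looked-up partial distances for every database code.
    dists = table[np.arange(m), codes].sum(1)
    return np.argsort(dists)[:topk], dists
```

The design point is that each database item is stored as `m` small integers, and a query needs only one `m × k` lookup table before distances to all items reduce to table lookups and additions.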
Abstract
We present an algorithm for searching image collections using free-hand sketches that describe the appearance and relative positions of multiple objects. Sketch-based image retrieval (SBIR) methods predominantly match queries containing a single, dominant object invariant to its position within an image. Our work exploits drawings as a concise and intuitive representation for specifying entire scene compositions. We train a convolutional neural network (CNN) to encode masked visual features from sketched objects, pooling these into a spatial descriptor encoding the spatial relationships and appearances of objects in the composition. Training the CNN backbone as a Siamese network under triplet loss yields a metric search embedding for measuring compositional similarity, which may be efficiently leveraged for visual search by applying product quantization.
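For readers unfamiliar with triplet loss, the standard formulation can be written in a few lines. This is a sketch of the common definition, max(0, d(a, p) − d(a, n) + margin), and not necessarily the exact variant used in the paper: the squared Euclidean distance and the default `margin` of 0.2 are assumptions. Here the anchor would be a sketch embedding, the positive an image with a matching composition, and the negative an image with a different one.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean triplet loss over a batch of (B, D) embeddings.

    The loss is zero once each positive is closer to its anchor than
    the negative by at least `margin` (an assumed, illustrative value).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

Minimizing this quantity pulls matching sketch and image embeddings together and pushes non-matching ones apart, so plain distance in the embedding space can serve as the similarity measure at search time.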
Taxonomy
Methods: Siamese Network · Triplet Loss
