Boon: A Neural Search Engine for Cross-Modal Information Retrieval

Yan Gong; Georgina Cosma

arXiv:2307.14240·cs.MM·November 2, 2023

Open Access

TL;DR

Boon is a cross-modal search engine that combines a large language model with a visual-semantic embedding network to support image-text retrieval and multilingual conversational interaction, making visual content easier to access and understand.

Contribution

The paper introduces Boon, a novel search engine that combines the GPT-3.5-turbo large language model with the VITR visual-semantic embedding network for improved cross-modal retrieval and multilingual conversational interaction.

Findings

01 Effective image-to-text and text-to-image retrieval demonstrated.

02 Multilingual conversational capabilities enabled.

03 Accessibility features for visually impaired users implemented.

Abstract

Visual-Semantic Embedding (VSE) networks can help search engines better understand the meaning behind visual content and associate it with relevant textual information, leading to more accurate search results. VSE networks can be used in cross-modal search engines to embed images and textual descriptions in a shared space, enabling image-to-text and text-to-image retrieval tasks. However, the potential of VSE networks for search engines has yet to be fully explored. This paper presents Boon, a novel cross-modal search engine that combines two state-of-the-art networks: the GPT-3.5-turbo large language model, and the VSE network VITR (VIsion Transformers with Relation-focused learning) to enhance the engine's capabilities in extracting and reasoning with regional relationships in images. VITR employs encoders from CLIP that were trained with 400 million image-description pairs and it…
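To make the shared-space idea concrete, the following is a minimal sketch of retrieval by cosine similarity, which is a common way to rank items once images and descriptions live in the same embedding space. It is not Boon's or VITR's implementation: the function names `cosine` and `rank` are hypothetical, and the three-dimensional vectors are made-up toy values rather than outputs of any real encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, candidates):
    """Return candidate keys ordered from most to least similar to the query."""
    scored = [(cosine(query, vec), key) for key, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored]


# Toy embeddings (assumed values for illustration only, not encoder outputs).
image_embeddings = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.0],
    "city_at_night.jpg": [0.0, 0.2, 0.9],
}
text_embeddings = {
    "a dog running on sand": [0.8, 0.2, 0.1],
    "skyscrapers lit up after dark": [0.1, 0.1, 0.95],
}

# Text-to-image retrieval: a text embedding is the query, images are ranked.
text_to_image = rank(text_embeddings["a dog running on sand"], image_embeddings)

# Image-to-text retrieval: an image embedding is the query, texts are ranked.
image_to_text = rank(image_embeddings["city_at_night.jpg"], text_embeddings)

print(text_to_image)
print(image_to_text)
```

Because both modalities share one space, the same `rank` function serves both retrieval directions; only the roles of query and candidates swap.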

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning