Boon: A Neural Search Engine for Cross-Modal Information Retrieval
Yan Gong, Georgina Cosma

TL;DR
Boon is a cross-modal search engine that integrates advanced neural networks to enable accurate image-text retrieval and multilingual conversational interactions, enhancing accessibility and understanding of visual content.
Contribution
The paper introduces Boon, a novel search engine combining GPT-3.5-turbo and VITR networks for improved cross-modal retrieval and multilingual conversational AI.
Findings
Effective image-to-text and text-to-image retrieval demonstrated.
Multilingual conversational capabilities enabled.
Accessible features for visually impaired users implemented.
Abstract
Visual-Semantic Embedding (VSE) networks can help search engines better understand the meaning behind visual content and associate it with relevant textual information, leading to more accurate search results. In cross-modal search engines, VSE networks embed images and textual descriptions in a shared space, enabling image-to-text and text-to-image retrieval. However, the potential of VSE networks for search engines has yet to be fully explored. This paper presents Boon, a novel cross-modal search engine that combines two state-of-the-art networks: the GPT-3.5-turbo large language model and the VSE network VITR (VIsion Transformers with Relation-focused learning), to enhance the engine's capabilities in extracting and reasoning with regional relationships in images. VITR employs encoders from CLIP that were trained with 400 million image-description pairs and it…
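The shared-space idea in the abstract can be illustrated with a minimal sketch. This is not Boon's or VITR's implementation: the embeddings below are hypothetical hand-written toy vectors standing in for encoder outputs, the file names and captions are invented, and cosine similarity is assumed as the ranking score. The helper names (`l2_normalize`, `cosine_similarity`, `retrieve`) are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query_embedding, candidate_embeddings, top_k=1):
    """Return the ids of the top_k candidates most similar to the query."""
    scored = [
        (cid, cosine_similarity(query_embedding, emb))
        for cid, emb in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scored[:top_k]]


# Hypothetical toy embeddings; a real system would obtain these from
# trained image and text encoders that project into the same space.
image_embeddings = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.0],
    "city_at_night.jpg": [0.0, 0.2, 0.9],
}
text_embeddings = {
    "a dog running on sand": [0.8, 0.2, 0.1],
    "skyscrapers lit up after dark": [0.1, 0.1, 0.95],
}

# Text-to-image retrieval: a caption queries the image collection.
text_to_image = retrieve(text_embeddings["a dog running on sand"], image_embeddings)

# Image-to-text retrieval: an image queries the caption collection.
image_to_text = retrieve(image_embeddings["city_at_night.jpg"], text_embeddings)

print(text_to_image)
print(image_to_text)
```

Because both modalities live in one space, the same `retrieve` function serves both directions; only the roles of query and candidates are swapped.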
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
