Needle: A Generative AI-Powered Multi-modal Database for Answering Complex Natural Language Queries
Mahdi Erfanian, Mohsen Dehghankar, Abolfazl Asudeh

TL;DR
Needle introduces a generative AI-powered multi-modal database that enhances complex natural language query answering in image retrieval by generating synthetic samples, outperforming existing methods and supporting easy integration of foundation models.
Contribution
The paper presents Needle, a novel generative-based approach for multi-modal data retrieval that leverages foundation models to generate synthetic samples, improving complex query answering.
Findings
Outperforms state-of-the-art text-to-image retrieval methods
Utilizes generative models for capturing query complexity
Supports easy integration of various foundation models
Abstract
Multi-modal datasets, like those involving images, often miss the detailed descriptions that properly capture the rich information encoded in each item. This makes answering complex natural language queries a major challenge in this domain. In particular, unlike the traditional nearest neighbor search, where the tuples and the query are represented as points in a single metric space, these settings involve queries and tuples embedded in fundamentally different spaces, making the traditional query answering methods inapplicable. Existing literature addresses this challenge for image datasets through vector representations jointly trained on natural language and images. This technique, however, underperforms for complex queries due to various reasons. This paper takes a step towards addressing this challenge by introducing a Generative-based Monte Carlo method that utilizes foundation…
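The generate-then-retrieve idea in the abstract can be illustrated with a minimal sketch. It assumes the synthetic ("guide") images for a query have already been generated and embedded with the same image embedder as the dataset, so that both live in one vector space. The function name `rank_by_guides`, the toy vectors, and the choice to aggregate by mean cosine similarity are all illustrative assumptions; the truncated abstract does not state how Needle actually aggregates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_guides(guide_embs, item_embs, k=3):
    """Rank dataset items by their mean cosine similarity to the guide embeddings.

    guide_embs: embeddings of synthetic images generated for the query (assumed given).
    item_embs:  embeddings of the dataset images, in the same space.
    Returns (indices of the top-k items, their scores).
    """
    scores = [
        sum(cosine(item, g) for g in guide_embs) / len(guide_embs)
        for item in item_embs
    ]
    order = sorted(range(len(item_embs)), key=lambda i: -scores[i])[:k]
    return order, [scores[i] for i in order]

# Toy 2-D example: two guide embeddings pointing roughly along the x-axis.
guides = [[1.0, 0.1], [0.9, 0.0]]
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top, top_scores = rank_by_guides(guides, items, k=3)
print(top, top_scores)
```

Averaging over several guides is one simple way to get a sampling-style estimate of how well an item matches the query; other aggregations (max, weighted mean per embedder) would fit the same interface.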
Peer Reviews
Decision: Submitted to ICLR 2026
The paper is logically well structured. It progresses smoothly from problem analysis and motivation to the proposed solution, and then to optimization strategies. The approach is coherent and reasonable. From a system perspective, this paper addresses and optimizes practical issues faced by the generate-and-retrieve paradigm, tackling problems of notable significance. The experimental analysis is sufficiently thorough, making the results convincing. The code framework is well-opened and compl
Using image generation models for cross-modal retrieval is not novel, as many related works [1,2,3] have conducted similar studies. Although the paper emphasizes the use of foundation models, it only experiments with image generation models and does not include VLMs, which intuitively might be more suitable for text-image retrieval tasks. As Figure 5(d) shows, using an image generation model is significantly less efficient than general retrieval methods. Although the paper proposes several optimiz
- The idea of translating text queries into synthetic multimodal representations before retrieval is conceptually interesting. It reframes the retrieval task as a generative sampling problem, bridging generation and search.
- The paper goes beyond an algorithmic prototype and implements a deployable system with modular embedders, anomaly filtering, and caching. The inclusion of practical optimizations shows strong engineering effort.
- Comprehensive experiments are provided across multiple data
- The paper only compares against 2023-era contrastive models (CLIP, ALIGN, FLAVA, CoCa, BLIP). Recent MLLM-based multimodal embeddings such as UniIR [1], E5-V [2], and VLM2Vec [3] are now standard baselines for multimodal retrieval. Without evaluating against these stronger MLLM representations, it is unclear whether NEEDLE remains competitive in the modern multimodal landscape.
- The paper reports a total inference time of 0.203 s, comparable to CLIP’s 0.184 s, despite including computationally
Very solid baselines to compare against! Baselines include (in L239): CLIP [58], ALIGN [28]; FLAVA [65] and CoCa [85]; BLIP and MiniLM [39, 77]; and PlugIR [34]. A wide range of datasets for evaluating retrieval performance (L249)! Very solid ablations that include the choice of hyperparameters (e.g. number of guide images and embedders in Figure 4), choice of image generation models for creating guide images (Figure 5), and the separate effect of using generated guide images (Table 5). The
Using Local Outlier Factor (LOF) to detect poor-quality generated guide images sounds like it would also eliminate any image that's "novel" or "unique" or "creative", because these would also cause a guide image to have a high LOF. If the user is an artist who draws distinct images (compared to the average internet images), and wants to search for one of their own creations among other more "normal-looking" images, I believe the LOF rule would accidentally discard what may be valid guide images.
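To make the reviewer's concern concrete, here is a minimal sketch of the textbook LOF score applied to guide-image embeddings. It is not Needle's implementation: the function names, the toy 2-D "embeddings", the neighbourhood size `k`, and the cut-off `threshold` are all hypothetical, and ties between equidistant neighbours are broken by index for simplicity.

```python
import math

def lof_scores(points, k=2):
    """Textbook Local Outlier Factor for a small list of points (k < len(points))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point, excluding the point itself.
    neigh = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [dist[i][neigh[i][-1]] for i in range(n)]

    def local_reach_density(i):
        # Reachability distance from i to o is max(k-distance(o), d(i, o)).
        reach = [max(k_dist[o], dist[i][o]) for o in neigh[i]]
        return k / max(sum(reach), 1e-12)  # guard against duplicate points

    lrd = [local_reach_density(i) for i in range(n)]
    # LOF: average neighbour density divided by the point's own density.
    return [sum(lrd[o] for o in neigh[i]) / (k * lrd[i]) for i in range(n)]

def filter_guides(embeddings, k=2, threshold=1.5):
    """Return indices of guide embeddings whose LOF is at most `threshold` (assumed cut-off)."""
    scores = lof_scores(embeddings, k)
    return [i for i, s in enumerate(scores) if s <= threshold]

# Four tightly clustered guides plus one that is far away from the rest.
guides = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(lof_scores(guides))     # the last point scores well above 1
print(filter_guides(guides))  # the distant guide is dropped
```

The score only measures how isolated a guide is relative to its neighbours, so a guide that is distant because it is a faithful rendering of an unusual query is treated the same as one that is distant because generation failed, which is the failure mode the review describes.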
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Contrastive Learning
