Multi-Field Adaptive Retrieval

Millicent Li; Tongfei Chen; Benjamin Van Durme; Patrick Xia

arXiv:2410.20056·cs.IR·April 18, 2025

TL;DR

Multi-Field Adaptive Retrieval (MFAR) is a flexible framework that improves document retrieval by decomposing documents into fields, independently indexing them, and adaptively weighting fields based on queries, leading to state-of-the-art results.

Contribution

The paper introduces MFAR, a novel framework that handles multi-field structured data by decomposing documents, independently indexing fields, and adaptively predicting field importance for improved retrieval.

Findings

01. Significantly improves document ranking over existing retrievers.

02. Achieves state-of-the-art performance on multi-field structured data.

03. Effectively combines dense and lexical representations across fields.

Abstract

Document retrieval for tasks such as search and retrieval-augmented generation typically involves datasets that are unstructured: free-form text without explicit internal structure in each document. However, documents can have a structured form, consisting of fields such as an article title, message body, or HTML header. To address this gap, we introduce Multi-Field Adaptive Retrieval (MFAR), a flexible framework that accommodates any number of and any type of document indices on structured data. Our framework consists of two main steps: (1) the decomposition of an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model which adaptively predicts the importance of a field by conditioning on the document query, allowing on-the-fly weighting of the most likely field(s). We find that our approach allows for the optimized use of…
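The two steps in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract only says that each field is indexed with dense and lexical methods and that a learned model weights fields conditioned on the query. The softmax gate over a linear function of the query embedding, the toy scorers, and every name below (`lexical_score`, `embed`, `field_weights`, `gate_params`, `score_document`) are assumptions made for illustration.

```python
import math
import zlib
from collections import Counter

DIM = 16  # size of the toy embedding (assumption)

def tokenize(text):
    return text.lower().split()

def lexical_score(query, text):
    """Toy lexical scorer: query-term occurrences in the field (stand-in for BM25)."""
    counts = Counter(tokenize(text))
    return float(sum(counts[t] for t in tokenize(query)))

def embed(text):
    """Toy dense encoder: hashed bag of words, L2-normalised (stand-in for a neural encoder)."""
    vec = [0.0] * DIM
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dense_score(query, text):
    """Cosine similarity between the toy embeddings of query and field text."""
    return sum(a * b for a, b in zip(embed(query), embed(text)))

SCORERS = {"lexical": lexical_score, "dense": dense_score}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def field_weights(query, gate_params):
    """Query-conditioned weights: one logit per (field, scorer) pair (assumed gating form)."""
    q = embed(query)
    keys = list(gate_params)
    logits = [sum(a * b for a, b in zip(q, gate_params[k])) for k in keys]
    return dict(zip(keys, softmax(logits)))

def score_document(query, doc, gate_params):
    """Weighted sum of per-field, per-scorer scores."""
    weights = field_weights(query, gate_params)
    return sum(
        w * SCORERS[scorer](query, doc.get(field, ""))
        for (field, scorer), w in weights.items()
    )

# Usage: two documents decomposed into "title" and "body" fields.
docs = [
    {"title": "adaptive retrieval", "body": "weights over fields"},
    {"title": "cooking pasta", "body": "boil water"},
]
# Untrained gate: all-zero parameters give uniform weights. In a real system
# these vectors would be learned from query-document relevance labels.
gate_params = {(f, s): [0.0] * DIM for f in ("title", "body") for s in SCORERS}

query = "adaptive retrieval"
weights = field_weights(query, gate_params)
ranked = sorted(docs, key=lambda d: score_document(query, d, gate_params), reverse=True)
```

With the zero-initialised gate, every (field, scorer) pair receives the same weight, so the ranking is driven entirely by the raw scores. Training would move weight toward the fields that matter for a given query, which is the adaptive part the abstract describes.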

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The paper tackles a relevant problem, although the proposal has some limitations discussed below. The paper is well written and easy to understand. Originality is rather limited, since semi-structured retrieval has been researched for a long time. The significance of the results is also limited because the approach is rather simple, and the experimental evaluation focuses on a recently published benchmark. The paper quality is good for its purposes, despite some ad hoc design decisions.

Weaknesses

It is not clear to me whether the fields may be considered independent, so that the summation of the field-related scores suffices for determining the overall score. That is, it seems intuitive that there are correlations among the field instances and they may bias the result, as has been extensively researched in the information retrieval area. Another field-related issue concerns the process of selecting the fields that will be considered in the whole process. Overall, the proposal is simple and

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Research on structured document retrieval is highly relevant, especially for RAG approaches. The retrieval is well designed, using a hybrid and adaptive query scoring mechanism that combines dense and lexical methods with a ranking strategy. The evaluation is thorough, and the paper is well-structured and generally well-written.

Weaknesses

The fine-tuning makes the approach specific to a set of fields from a dataset. Information overlap in fields (see lines 416-424) might introduce some redundancy to the retrieval process.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper is well motivated and well written. Many documents naturally have structure. Exploiting it should lead to better retrieval quality. However, doing so depends on the query, corpus and kind of retrieval. This paper comes up with an elegant and intuitive formulation to learn all of these weights during training. The baselines are well chosen, ablation experiments are extensive and the references are comprehensive.

Weaknesses

The main weakness, and kudos to the authors for discussing this, is what they mention in Section 4 "Multi-field vs Single-field": "A side-by-side comparison of the single field models against their multi-field counterparts shows mixed results". The difference in averages between mFAR_2 and mFAR_all doesn't appear statistically significant. The primary gains (looking at mFAR_2 vs mFAR_all) seem to be in the prime data set which has a lot of fields. Should the paper instead focus on adaptive hybr

Videos

Multi-Field Adaptive Retrieval · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Machine Learning and Algorithms · Information Retrieval and Search Behavior