Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation

Wenyu Huang; Pavlos Vougiouklis; Mirella Lapata; Jeff Z. Pan

arXiv:2505.11754 · cs.CL · May 20, 2025

TL;DR

This paper analyzes how different language models perform on multi-hop question answering when the order of context documents is permuted, revealing model differences, the impact of document order, and improvements via attention modifications.

Contribution

It provides a comparative analysis of encoder-decoder and decoder-only models in multi-hop QA, explores the effects of document permutation, and proposes attention-based enhancements.

Findings

1. Encoder-decoder models (e.g., Flan-T5) generally outperform decoder-only models in MHQA, despite being much smaller.
2. Document order aligned with the reasoning chain improves performance.
3. Adding bidirectional attention improves causal decoder-only models.
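The contrast between causal and bidirectional attention behind the findings above can be pictured as a boolean attention mask. The sketch below is an illustrative assumption rather than the paper's implementation: it treats the retrieved documents as a prefix of `prefix_len` tokens that may attend to each other in both directions (a prefix-LM style mask), while tokens after the prefix stay causal. The function names are hypothetical.

```python
from typing import List

Mask = List[List[bool]]

def causal_mask(n: int) -> Mask:
    # mask[i][j] is True when token i may attend to token j; causal means j <= i
    return [[j <= i for j in range(n)] for i in range(n)]

def prefix_bidirectional_mask(n: int, prefix_len: int) -> Mask:
    # Hypothetical helper: start from a causal mask, then let every token in
    # the first `prefix_len` positions (assumed to be the retrieved documents)
    # attend to every other token in that prefix, including later ones.
    if not 0 <= prefix_len <= n:
        raise ValueError("prefix_len must be between 0 and n")
    mask = causal_mask(n)
    for i in range(prefix_len):
        for j in range(prefix_len):
            mask[i][j] = True
    return mask

def render(mask: Mask) -> str:
    # One row per query token; "1" marks an allowed key position
    return "\n".join("".join("1" if ok else "0" for ok in row) for row in mask)

if __name__ == "__main__":
    print(render(causal_mask(5)))
    print()
    print(render(prefix_bidirectional_mask(5, prefix_len=3)))
```

With 5 tokens and a 3-token prefix, the first three rows change from `10000`, `11000`, `11100` to `11100` each: early document tokens can now read later document tokens, which a causal mask forbids, while the last two rows are unchanged.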

Abstract

Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also with employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals the following findings: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller; 2) altering the order of gold documents reveals distinct trends in both…
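The permutation setup described in the abstract can be sketched as building several orderings of one question's context. The configuration names, the two-hop example and the prompt template below are hypothetical illustrations, not the paper's exact configurations or prompt; the one assumption carried over is that gold documents are listed in reasoning order (the document needed for the first hop comes first).

```python
from itertools import permutations
from typing import Dict, List, Sequence

def gold_order_configs(gold: Sequence[str], distractors: Sequence[str]) -> Dict[str, List[str]]:
    # Hypothetical configurations; `gold` is assumed to be in reasoning order.
    gold = list(gold)
    distractors = list(distractors)
    return {
        # gold documents first, in the order the reasoning chain uses them
        "gold_forward_first": gold + distractors,
        # gold documents first, but in reverse reasoning order
        "gold_reversed_first": gold[::-1] + distractors,
        # gold documents last (nearest the question), in reasoning order
        "gold_forward_last": distractors + gold,
    }

def all_gold_orders(gold: Sequence[str]) -> List[List[str]]:
    # Every ordering of the gold documents alone (k! orders for k documents)
    return [list(p) for p in permutations(gold)]

def build_prompt(question: str, docs: Sequence[str]) -> str:
    # Hypothetical template: numbered documents, then the question
    lines = [f"Document [{i}]: {d}" for i, d in enumerate(docs, start=1)]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)

if __name__ == "__main__":
    gold = ["Doc A (needed for hop 1)", "Doc B (needed for hop 2)"]
    distractors = ["Distractor 1", "Distractor 2"]
    for name, docs in gold_order_configs(gold, distractors).items():
        print(f"== {name} ==")
        print(build_prompt("Who ...?", docs))
```

Keeping the distractors in a fixed order while only the gold documents move is a simple way to isolate the effect of gold-document order, which is the variable the findings relate to the reasoning chain.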

Code & Models

Repositories

hwy9855/multihopqa-reasoning (PyTorch · Official)

Videos

Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation (Underline)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior

Methods: Attention Is All You Need · Byte Pair Encoding · Softmax · Attention Dropout · SentencePiece · Residual Connection · Linear Layer · Dropout · Inverse Square Root Schedule