What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

Zhi Chen; Qiguang Chen; Libo Qin; Qipeng Guo; Haijun Lv; Yicheng Zou; Wanxiang Che; Hang Yan; Kai Chen; Dahua Lin

arXiv:2409.01893·cs.CL·May 20, 2025

TL;DR

This paper introduces the MIMG framework to generate high-quality, multi-hop instruction data for long context tasks, significantly improving model performance over existing synthetic data methods.

Contribution

The paper presents the Multi-agent Interactive Multi-hop Generation (MIMG) framework, enhancing synthetic data quality for long context multi-hop tasks and systematically analyzing data generation strategies.

Findings

1. More than 85% of the data generated by the proposed framework is high-quality and multi-hop.

2. Models trained on the synthetic data can outperform models trained on larger human-annotated datasets.

3. The MIMG framework improves long context understanding in language models.

Abstract

Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question…
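
The components named in the abstract and in the reviews below (single-hop question generation, quality verification, multiple question sampling, multi-hop question merging) can be pictured as a small pipeline. The following is only a toy sketch under stated assumptions: every function name, the stage order, and the rule-based stubs standing in for LLM-backed agents are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QA:
    question: str
    answer: str


def generate_single_hop(doc: str) -> List[QA]:
    """Hypothetical stand-in for a single-hop question generation agent.

    A real agent would prompt an LLM; this stub makes one QA pair per sentence.
    """
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    return [QA(f"What does the text say here: '{s[:30]}'?", s) for s in sentences]


def verify(qa: QA) -> bool:
    """Hypothetical stand-in for a quality verification agent.

    Assumption: answers shorter than three words count as low quality.
    """
    return len(qa.answer.split()) >= 3


def sample_questions(pool: List[QA], k: int) -> List[QA]:
    """Hypothetical sampling strategy: take the first k verified questions."""
    return pool[:k]


def merge(qas: List[QA]) -> QA:
    """Hypothetical stand-in for a multi-hop question merging agent."""
    question = " Then: ".join(q.question for q in qas)
    answer = " ".join(q.answer + "." for q in qas)
    return QA(question, answer)


def run_pipeline(docs: List[str], hops: int = 2) -> Optional[QA]:
    """Generate, verify, sample, and merge; return None if too few questions survive."""
    pool = [qa for d in docs for qa in generate_single_hop(d) if verify(qa)]
    selected = sample_questions(pool, hops)
    if len(selected) < hops:
        return None
    merged = merge(selected)
    return merged if verify(merged) else None


if __name__ == "__main__":
    docs = ["Paris is the capital of France. Hi.", "Berlin is the capital of Germany."]
    print(run_pipeline(docs))
```

In this sketch the verification step runs twice, once on single-hop candidates and once on the merged question. That is a design choice of the example, not a claim about how MIMG schedules its agents.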

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The motivation of this paper is clear.
2. The exploration of methods within each agent module of the framework is thorough.

Weaknesses

1. The paper contains some errors; for example, Figure 10 shows only one image but is labeled (a).
2. While the authors have explored methods within each agent module of the proposed framework to enhance data generation quality, there is a lack of ablation studies between the agents, making it unclear which agent contributes the most.
3. The experiments are not sufficiently generalized, as they were only evaluated on InternLM. I believe validation on widely used models like the LLaMA series is n

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Compared to previous multi-hop data generation methods like Self-Instruct, the MIMG framework significantly enhances the proportion of multi-hop data, as well as the diversity and quality of the data.
2. The authors conduct a thorough analysis of various potentially impactful strategies, such as document selection strategies and the impact of question merging methods. This provides practical references for future research endeavors.
3. The synthesized long context dataset (LongMIT) effectivel

Weaknesses

1. Although the author provides a detailed analysis of the impact of different strategies on the multi-hop data ratio, quality, or diversity in various components, they do not analyze **the impact of these components on the final performance**. Specifically, the roles of the Quality Verification Agent, Single-hop Question Generation Agent, Multiple Question Sampling, and Multi-hop Question Merger Agent in the final framework are not discussed. Analyzing these would help demonstrate the independe

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The main strengths of this paper include:
1. Innovative Multi-agent Generation Framework: The proposed Multi-agent Interactive Multi-hop Generation (MIMG) framework incorporates multiple agents (Quality Verification Agent, Single-hop Question Generation Agent, Multiple Question Sampling Strategy, and Multi-hop Question Merging Agent), significantly improving the quality and diversity of generated data.
2. Extensive Experimental Validation: The paper systematically investigates various d

Weaknesses

The main limitations of this paper are:
1. The primary weakness of this paper lies in its limited novelty. The contributions primarily emphasize engineering implementations and optimizations rather than presenting groundbreaking theoretical or methodological advancements. While the proposed framework demonstrates effective improvements in long-context, multi-hop instruction datasets, it largely builds upon existing concepts and technologies in a structured engineering fashion.
2. Limited

Code & Models

Repositories

wowcz/longmit
Official

Datasets

donmaclean/LongMIT-128K
dataset · 265 downloads

Taxonomy

Topics: Innovative Teaching Methods