TL;DR
This paper presents StructTuning, a novel approach that efficiently transforms large language models into domain experts by leveraging structured knowledge, reducing training data needs while maintaining high performance.
Contribution
The paper introduces StructTuning, a two-stage, structure-aware method for domain knowledge injection into LLMs, significantly improving data efficiency and model performance.
Findings
Achieves full knowledge injection performance with only 5% of training data.
Outperforms existing knowledge injection methods on LongBench and MMedBench datasets.
Demonstrates scalability across different model sizes and training corpus scales.
Abstract
This paper introduces a pioneering methodology, termed StructTuning, to efficiently transform foundation Large Language Models (LLMs) into domain specialists. It significantly reduces the training corpus needs to a mere 5% while achieving an impressive 100% of traditional knowledge injection performance. Motivated by structured human education, we propose a novel two-stage strategy for knowledge injection and alignment: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). In the SCPT phase, we automatically extract the domain knowledge taxonomy and reorganize the training corpora, enabling LLMs to effectively link textual segments to targeted knowledge points within the taxonomy. In the SSFT phase, we explicitly prompt models to elucidate the underlying knowledge structure in their outputs, leveraging the structured domain insight to address…
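The SCPT step described in the abstract, linking text segments to knowledge points in an extracted taxonomy, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Chunk` type, the `render_mindmap` and `build_training_sample` helpers, the prompt layout, and the example taxonomy are all hypothetical, and the sketch assumes the taxonomy path for each chunk has already been extracted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    text: str
    # Hypothetical taxonomy path from root to knowledge point,
    # e.g. ("Cardiology", "Arrhythmia"). Assumed to be given.
    path: Tuple[str, ...]


def render_mindmap(chunks: List[Chunk]) -> str:
    """Render the taxonomy implied by the chunk paths as an indented outline."""
    lines, seen = [], set()
    for chunk in sorted(chunks, key=lambda c: c.path):
        for depth in range(1, len(chunk.path) + 1):
            prefix = chunk.path[:depth]
            if prefix not in seen:  # emit each taxonomy node once
                seen.add(prefix)
                lines.append("  " * (depth - 1) + "- " + prefix[-1])
    return "\n".join(lines)


def build_training_sample(chunk: Chunk, mindmap: str) -> str:
    """Pair a text chunk with its location in the taxonomy (illustrative format)."""
    location = " > ".join(chunk.path)
    return (
        f"Knowledge structure:\n{mindmap}\n\n"
        f"Knowledge point: {location}\n\n"
        f"Content:\n{chunk.text}"
    )


chunks = [
    Chunk("Statins lower LDL cholesterol.", ("Cardiology", "Lipids")),
    Chunk("Atrial fibrillation is an irregular rhythm.", ("Cardiology", "Arrhythmia")),
    Chunk("Asthma narrows the airways.", ("Pulmonology",)),
]
mindmap = render_mindmap(chunks)
samples = [build_training_sample(c, mindmap) for c in chunks]
```

Each sample places the chunk under an explicit knowledge point, so the model sees where the text sits in the domain structure rather than an unordered stream of segments.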
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
- This paper proposes a novel method that substantially improves the data efficiency and effectiveness of knowledge injection from domain-specific corpora into pretrained LLMs. The method is intuitive and has the nice feature of exploiting the structure in domain knowledge, organizing text chunks as a mindmap before letting the model learn the contents based on the mindmap. - The paper is well-written, easy to follow, and detailed. - The main experimental results are strong and the ablation demon
The Questions below need to be addressed. Q4 about corpus size is important and needs clarification. Otherwise the paper is in good shape overall. After Author Rebuttal: While I appreciate the author responses and my clarification questions have been mostly addressed, Q4 remains concerning. A main result in the paper is that the proposed LLM+Ours that uses 76M tokens can achieve 50% of the improvement over the baseline LLM compared to LLM+MMed that uses 25.5B tokens. However, "Ours" uses
- Efficiency: The paper demonstrates impressive efficiency gains, achieving comparable performance with significantly less training data (0.3% compared to existing methods). The proposed scaling law hints at even greater potential for efficiency with larger datasets. - Comprehensive Evaluation: The experiments cover different model architectures and scales, and utilize diverse datasets and tasks, including both recall-based and reasoning-based evaluations. The ablation study provides valuable i
Commonality in Code Training: While the paper claims originality in its structure-aware approach, this technique is already prevalent in contexts where training data consists of code or hierarchical data structures. In code training, for instance, it’s common to replace long sequences with structured, layered representations that summarize relationships across hierarchies. This may limit the novelty of StructTuning’s proposed structure-aware methodology. Scaling Law Verification: The proposed s
1. Drawing on the educational processes of human students, the authors propose a novel two-stage strategy for knowledge injection and alignment: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). This skillfully introduces structured data into continual learning and uses the characteristics of structured data to improve the learning effect. 2. This paper first introduces the current challenges faced by LLM in learning domain-specif
The analysis of experimental results is too arbitrary. For example, the authors fit the curve at line 468 directly after measuring the performance of the large language model under only three experimental conditions: 0.1, 0.3, and 0.5. In the absence of sufficient experimental data, it seems impossible to draw such a universal conclusion, and this may cast doubt on the reliability of the paper's conclusions.
1. Significance of the Research Problem: This paper addresses the current challenges in enabling LLMs to learn specific domain knowledge. By designing the pre-training and fine-tuning strategy, it aims to improve the LLMs’ understanding and application of domain-specific knowledge, offering an approach to meet practical needs. 2. Interesting Methodology: The proposed two-stage StructTuning approach ingeniously emulates human learning processes, gradually injecting structured domain knowledge in st
1. Unclear Experimental Setup: (1) In Open-ended Question Answering: For the LONGBENCH dataset, the division between the training and test sets is unclear. For example, as described in Section A.1, the authors generate a knowledge structure for each passage, but the sum of individual documents—reported as 200+200+150+200+200+200=1150—does not align with the 1350 entries stated on line 781 of this paper, raising concerns about the fairness and consistency of comparisons. Additionally, the referen
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies
