Enhancing LLMs via High-Knowledge Data Selection

Feiyu Duan; Xuemiao Zhang; Sirui Wang; Haoran Que; Yuqi Liu; Wenge Rong; Xunliang Cai

arXiv:2505.14070·cs.CL·June 3, 2025

TL;DR

This paper introduces a gradient-free High-Knowledge Scorer (HKS) that selects knowledge-rich training data, improving LLM performance on knowledge-intensive and domain-specific tasks.

Contribution

The paper proposes a knowledge-based data selection method that scores text by its knowledge richness, with the aim of alleviating knowledge scarcity in pre-training corpora.

Findings

01 Improves model performance on knowledge-intensive tasks

02 Enhances domain-specific capabilities of LLMs

03 Effective in selecting high-knowledge training data

Abstract

The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they do not consider the importance of knowledge richness in text corpora. In this paper, we propose a novel, gradient-free High-Knowledge Scorer (HKS) that selects high-quality data along the dimension of knowledge, to alleviate the problem of knowledge scarcity in the pre-training corpus. We construct a comprehensive multi-domain knowledge element pool and introduce knowledge density and coverage as metrics to assess the knowledge content of a text. Based on these metrics, we build a comprehensive knowledge scorer to select knowledge-intensive data; the scorer can also be used for domain-specific high-knowledge data selection by restricting the knowledge elements to the target domain. We train models on a…
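
The abstract names knowledge density and coverage as metrics but does not define them in the text shown here. The following is a minimal sketch under stated assumptions, not the authors' implementation: density is taken to be the share of tokens that match a knowledge element, and coverage the share of pool elements that appear at least once. The pool contents, the `alpha` weighting, and every function name are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical knowledge element pool grouped by domain (illustrative only;
# the paper's actual pool is not reproduced here).
KNOWLEDGE_POOL = {
    "science": {"photosynthesis", "chlorophyll", "mitochondria"},
    "math": {"eigenvalue", "derivative"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def knowledge_metrics(text, pool):
    """Return (density, coverage) for a text against a knowledge pool.

    Assumed definitions:
      density  = matched tokens / total tokens
      coverage = distinct matched elements / pool size
    """
    tokens = tokenize(text)
    elements = set().union(*pool.values())
    hits = [t for t in tokens if t in elements]
    density = len(hits) / len(tokens) if tokens else 0.0
    coverage = len(set(hits)) / len(elements) if elements else 0.0
    return density, coverage


def knowledge_score(text, pool, alpha=0.5):
    """Combine the two metrics with an assumed linear weighting."""
    density, coverage = knowledge_metrics(text, pool)
    return alpha * density + (1 - alpha) * coverage


def select_top_k(docs, pool, k):
    """Keep the k documents with the highest knowledge score."""
    return sorted(docs, key=lambda d: knowledge_score(d, pool), reverse=True)[:k]
```

Domain-specific selection, as described in the abstract, corresponds in this sketch to passing a restricted pool, for example `{"math": KNOWLEDGE_POOL["math"]}`, so that only elements from the target domain contribute to the score. No gradients or model forward passes are involved, which is consistent with the "gradient-free" description.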

Taxonomy

Topics: Data Mining Algorithms and Applications · Mineral Processing and Grinding · Semantic Web and Ontologies