Selecting Auxiliary Data via Neural Tangent Kernels for Low-Resource Domains
Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

TL;DR
This paper introduces NTK-Selector, a novel method utilizing neural tangent kernels to select auxiliary data, significantly improving low-resource domain performance in large language models by effectively leveraging general-domain data.
Contribution
The paper proposes NTK-Selector, a new framework for auxiliary data selection using neural tangent kernels, addressing computational challenges and demonstrating substantial performance gains in low-resource domains.
Findings
NTK-Selector improves domain-specific performance significantly.
Enriching domain data with selected auxiliary data yields gains reported as 5-11x those of domain-only fine-tuning.
Empirical evidence shows stable NTK-like behavior in LLMs during fine-tuning.
Abstract
Large language models (LLMs) have achieved remarkable success across widespread tasks, yet their application in low-resource domains remains a significant challenge due to data scarcity and the high risk of overfitting. While in-domain data is limited, there exist vast amounts of similar general-domain data, and our initial findings reveal that they could potentially serve as auxiliary supervision for domain enhancement. This observation leads us to our central research question: how to effectively select the most valuable auxiliary data to maximize domain-specific performance, particularly when traditional methods are inapplicable due to a lack of large in-domain data pools or validation sets. To address this, we propose NTK-Selector, a principled and efficient framework for selecting general-domain auxiliary data to enhance domain-specific performance via…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Significance: The paper addresses a highly important and practical problem: "how to effectively select the most valuable auxiliary data to maximize domain-specific performance". This is particularly relevant for scenarios lacking large in-domain data pools or validation sets, where traditional methods often fail. 2. Computational Efficiency: The combination of a Jacobian-free approximation, calculating gradients only for LoRA modules, and applying random projection substantially reduces the
1. Generalisability of the Core Assumption: The paper's central claim of "NTK-like" stability relies heavily on empirical results from two specific models (LLAMA3-8B-INSTRUCT and QWEN3-8B) under LoRA fine-tuning. The study does not provide evidence that this stability holds for other parameter-efficient fine-tuning methods or for full model fine-tuning. Its applicability to different model architectures also remains unverified. 2. Empirical Basis of the NTK Approximation: The proposed Jacobian
The paper introduces an NTK-inspired approach to auxiliary data selection, creatively adapting kernel theory to LLM fine-tuning through LoRA gradients and random projections. This work targets a concrete and important challenge: improving model performance in data-limited domains. The experimental study evaluates the method across multiple domains (medical, financial, legal, psychological) and two strong LLMs, showing consistent positive trends. The paper is well-organized, easy to follow, an
The main comparison to the “Domain-Only” baseline is not compute-matched: NTK-Selector trains on roughly 10× more data and includes an additional LoRA warm-up stage, inflating perceived gains. The embedding-based warm-up uses the same domain data later used for fine-tuning, effectively giving the method a privileged view of the target domain compared to baselines. Simple yet critical controls are missing, such as (a) embedding-only pre-selection and (b) gradient-dot or Fisher-similarity baselines.
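To make the "gradient-dot" control mentioned in this review concrete, here is a minimal sketch of scoring auxiliary examples by the inner product of randomly projected per-example gradients. It is not the paper's NTK-Selector. The function name `select_auxiliary`, the parameters `proj_dim` and `top_k`, and the assumption that per-example gradients (for instance, of LoRA parameters) are already available as NumPy arrays are all hypothetical and introduced only for illustration.

```python
import numpy as np

def select_auxiliary(aux_grads, dom_grads, proj_dim=64, top_k=2, seed=0):
    """Rank auxiliary examples by projected gradient similarity to the domain set.

    aux_grads: (n_aux, d) array of per-example gradients (assumed precomputed).
    dom_grads: (n_dom, d) array of per-example gradients for the domain data.
    Returns the indices of the top_k auxiliary examples and all scores.
    """
    rng = np.random.default_rng(seed)
    d = aux_grads.shape[1]
    # Gaussian random projection; the 1/sqrt(proj_dim) scaling keeps inner
    # products approximately unchanged in expectation.
    projection = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    aux_proj = aux_grads @ projection
    dom_proj = dom_grads @ projection
    # Kernel-style similarity: inner products between projected gradients.
    similarity = aux_proj @ dom_proj.T      # shape (n_aux, n_dom)
    scores = similarity.mean(axis=1)        # average over domain examples
    order = np.argsort(-scores)             # highest score first
    return order[:top_k], scores

# Toy usage with synthetic "gradients" (hypothetical data, not from the paper).
rng = np.random.default_rng(1)
direction = rng.standard_normal(512)
dom = direction[None, :]
aux = np.stack([-direction, direction, np.zeros(512)])
chosen, scores = select_auxiliary(aux, dom)
print(chosen, scores)
```

In the toy data, the auxiliary example whose gradient matches the domain gradient is ranked first and the one pointing the opposite way is ranked last. A control of this kind would help separate the contribution of the kernel formulation from plain gradient alignment.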
The paper addresses a highly relevant and critical problem: specializing LLMs in data-scarce environments. The core contribution is novel and theoretically motivated, successfully bridging the gap between NTK theory and the practical realities of fine-tuning massive LLMs. The experimental validation is rigorous and convincing, employing multiple modern models, diverse domains, and strong baselines. The proposed two-stage selection process, combining efficient embedding-based filtering with a mor
- Overstated improvement claims: The abstract claims "a 10.9x" improvement. This is calculated by dividing the absolute performance gain of NTK-Selector by the gain from Domain-Only fine-tuning. Since the Domain-Only gain is minimal and likely close to noise, this ratio exaggerates the method's effectiveness. Presenting the substantial absolute gains (e.g., +8.7 points) would be more direct and less sensational.
- Justification of approximations is empirical: The method's validity rests on two k
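The ratio-of-gains critique in this review can be illustrated with a short calculation. The sketch below uses hypothetical scores, not numbers from the paper: the base score of 50.0 and the three domain-only scores are assumptions, and only the +8.7 point gain echoes the example figure quoted by the reviewer. The helper `gain_ratio` is likewise introduced here for illustration.

```python
def gain_ratio(base, domain_only, method):
    """Ratio of the method's absolute gain to the domain-only gain."""
    return (method - base) / (domain_only - base)

# Hypothetical scores (assumptions for illustration only).
base = 50.0
method = 58.7  # a +8.7 point absolute gain over the base model

# Vary the domain-only score slightly, as noise in a small gain might.
for domain_only in (50.4, 50.8, 51.6):
    ratio = gain_ratio(base, domain_only, method)
    print(f"domain-only gain {domain_only - base:.1f} -> ratio {ratio:.2f}x")
```

With these assumed numbers, moving the domain-only score by about a point swings the ratio from roughly 5x to roughly 22x while the absolute gain stays fixed at +8.7 points, which is why the reviewer prefers reporting absolute gains.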
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education
