TAIA: Large Language Models are Out-of-Distribution Data Learners
Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang

TL;DR
This paper introduces TAIA, a method that improves large language models' performance in data-scarce, domain-mismatched scenarios by fine-tuning all parameters during training but inferring with only the fine-tuned attention parameters.
Contribution
The paper shows that, when the training distribution does not match the test distribution, only the fine-tuned attention parameters are particularly beneficial. It proposes TAIA, which trains all parameters but infers with only the fine-tuned attention parameters.
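The "train all, infer with attention only" idea can be pictured as a checkpoint merge. The following is a minimal sketch, not the authors' implementation. It assumes that non-attention parameters (for example the feed-forward weights) fall back to their pre-fine-tuning values at inference, and that attention parameters can be recognised by a substring in their name. The function names, the `self_attn` / `attention` markers, and the toy parameter dictionaries are all hypothetical.

```python
from typing import Dict, Iterable, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def is_attention_param(name: str, markers: Iterable[str]) -> bool:
    """Return True if the parameter name looks like an attention weight.

    The substring markers are an assumption; real checkpoints use
    model-specific naming schemes.
    """
    return any(marker in name for marker in markers)


def taia_merge(
    base: Params,
    tuned: Params,
    markers: Iterable[str] = ("self_attn", "attention"),
) -> Params:
    """Build inference-time parameters in the spirit of TAIA.

    Attention parameters are taken from the fully fine-tuned checkpoint;
    every other parameter is taken from the base checkpoint (assumed here).
    """
    if base.keys() != tuned.keys():
        raise ValueError("base and tuned checkpoints must share parameter names")
    markers = tuple(markers)
    merged: Params = {}
    for name in base:
        source = tuned if is_attention_param(name, markers) else base
        merged[name] = list(source[name])  # copy so inputs stay untouched
    return merged


# Hypothetical two-parameter "model": one attention weight, one FFN weight.
base = {
    "layers.0.self_attn.q_proj.weight": [0.1, 0.2],
    "layers.0.mlp.up_proj.weight": [0.3, 0.4],
}
tuned = {
    "layers.0.self_attn.q_proj.weight": [0.15, 0.25],
    "layers.0.mlp.up_proj.weight": [0.9, 0.8],
}
merged = taia_merge(base, tuned)
```

In this sketch `merged` carries the fine-tuned attention weight and the base feed-forward weight, so the feed-forward updates learned from mismatched data are not used at inference.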
Findings
TAIA outperforms fully fine-tuned and base models across multiple tasks.
Keeping only the fine-tuned attention parameters at inference improves robustness to data mismatch.
TAIA enhances task-specific performance and resists jailbreaking tuning.
Abstract
Fine-tuning on task-specific question-answer pairs is a predominant method for enhancing the performance of instruction-tuned large language models (LLMs) on downstream tasks. However, in certain specialized domains, such as healthcare or harmless content generation, it is nearly impossible to obtain a large volume of high-quality data that matches the downstream distribution. To improve the performance of LLMs in data-scarce domains with domain-mismatched data, we re-evaluated the Transformer architecture and discovered that not all parameter updates during fine-tuning contribute positively to downstream performance. Our analysis reveals that within the self-attention and feed-forward networks, only the fine-tuned attention parameters are particularly beneficial when the training set's distribution does not fully align with the test set. Based on this insight, we propose an effective…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
