Don't Pay Attention, PLANT It: Pretraining Attention via Learning-to-Rank

Debjyoti Saha Roy; Byron C. Wallace; Javed A. Aslam

arXiv:2410.23066·cs.CL·December 29, 2025

Open Access · 3 Reviews

TL;DR

PLANT is a novel attention initialization method for extreme multi-label text classification that leverages a pretrained Learning-to-Rank model, significantly improving performance especially in few-shot and rare label scenarios.

Contribution

Introduces PLANT, a plug-and-play attention initialization technique using Learning-to-Rank guidance, compatible with various large language models, enhancing classification accuracy.

Findings

1. Outperforms state-of-the-art methods across multiple tasks.
2. Significant improvements in few-shot and rare label settings.
3. Attention initialization is a key factor in performance gains.

Abstract

State-of-the-art Extreme Multi-Label Text Classification models rely on multi-label attention to focus on key tokens in input text, but learning good attention weights is challenging. We introduce PLANT - Pretrained and Leveraged Attention - a plug-and-play strategy for initializing attention. PLANT works by planting label-specific attention using a pretrained Learning-to-Rank model guided by mutual information gain. This architecture-agnostic approach integrates seamlessly with large language model backbones such as Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3. PLANT outperforms state-of-the-art methods across tasks including ICD coding, legal topic classification, and content recommendation. Gains are especially pronounced in few-shot settings, with substantial improvements on rare labels. Ablation studies confirm that attention initialization is a key driver of these gains. For code…
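The abstract does not spell out how mutual information gain is computed, so the following is only a minimal sketch of the general idea: scoring how informative each token is about a label and ranking tokens accordingly. It uses the textbook mutual information between two binary events (token present, label present). The function names, the toy documents, and the tie-breaking rule are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def mutual_information(docs, labels, token, label):
    """Mutual information (in nats) between the binary events
    'token occurs in the document' and 'label is assigned to the document'.
    Textbook definition; NOT necessarily the paper's exact formulation."""
    n = len(docs)
    joint = Counter()
    for toks, labs in zip(docs, labels):
        joint[(token in toks, label in labs)] += 1
    mi = 0.0
    for (t, l), count in joint.items():
        p_tl = count / n
        # Marginals obtained by summing the joint counts.
        p_t = sum(v for (tt, _), v in joint.items() if tt == t) / n
        p_l = sum(v for (_, ll), v in joint.items() if ll == l) / n
        mi += p_tl * math.log(p_tl / (p_t * p_l))
    return mi

def rank_tokens_for_label(docs, labels, label):
    """Return the vocabulary sorted by decreasing MI with `label`
    (ties broken alphabetically, an arbitrary choice for this sketch)."""
    vocab = sorted(set().union(*docs))
    scored = [(mutual_information(docs, labels, tok, label), tok) for tok in vocab]
    return [tok for _, tok in sorted(scored, key=lambda x: (-x[0], x[1]))]

# Hypothetical toy corpus: each document is a set of tokens with a set of labels.
docs = [
    {"chest", "pain", "acute"},
    {"chest", "xray", "normal"},
    {"fracture", "arm", "acute"},
    {"fracture", "leg", "xray"},
]
labels = [{"cardiac"}, {"cardiac"}, {"ortho"}, {"ortho"}]

ranking = rank_tokens_for_label(docs, labels, "cardiac")
print(ranking[:3])
```

A ranking like this could serve as a target for a learning-to-rank objective over label-specific attention scores, though how PLANT actually does so is not described in the text above. Note that plain mutual information is symmetric in direction: a token whose absence predicts the label scores as high as one whose presence does, so a real pipeline would likely need to handle that case.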

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. The paper is well written and easy to understand. The goals, problem formulation, etc. are well defined.
2. Table 2 demonstrates that PLANT improves over the baselines without PLANT initialization. The results are consistent across multiple LLM backbones, i.e., Mistral-7B, LLaMA3-8B, DeepSeek-V3, and Phi-3.
3. Table 3 compares PLANT against existing methods on MIMIC-IV-full. It outperforms existing methods (GPT-4 Zero-Shot, GKI-ICD, PLM-CA, and CoRelation) on AUC, F1 and pr

Weaknesses

1. Although the paper presents good results on public datasets with up to 31K labels, it would be good to see results on larger datasets, for example product-to-product recommendation (LF-AmazonTitles-131K or LF-AmazonTitles-1.3M [1]) or tag prediction (LF-Wikipedia-500K). Note that the Wikipedia dataset uses full-text documents, which aligns with the paper's setup.
2. The paper lacks a discussion of scalability. It would be good to report training and inference times.
3. Which backbone is u

Reviewer 02 · Rating 6 · Confidence 4

Strengths

This paper is clearly written and easy to follow.

Weaknesses

1. PLANT includes additional MIG pre-computation and L2R stages, but the paper does not compare the time, memory, or parameter counts of these stages, nor does it explain their incremental contribution to the total training cost.
2. The attention matrix learned by PLANT in Stage 1 is used directly in Stage 2, but without any maintenance or freezing strategy. If Stage 2 training runs too long, the Stage 1 signals may be overwritten, causing the "planted attention" to fail.
3. Alth

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- S1: The proposed method introduces a simple yet effective task-specific pretraining objective for attention initialization.
- S2: The authors conduct comprehensive experiments showing consistent improvements across multiple datasets and various LLM backbones.
- S3: The method achieves substantial gains on rare labels and under few-shot settings, indicating better generalization to low-data regimes.

Weaknesses

- W1: The learning-to-rank objective based on mutual information gain relies on sufficient co-occurrence statistics, which may not generalize well to low-resource domains or languages with limited training data.
- W2: The MIG computation and learning-to-rank pretraining may be computationally expensive for large label spaces; the paper does not discuss practical scalability or runtime considerations.
- W3: The paper lacks qualitative analyses or visualizations of the learned attention patterns

Taxonomy

Topics: Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining

Methods: Softmax · Attention Is All You Need · Focus