How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Yang Chen, Xiaotong Lin, Wuliang Huang, Ziyi Gao, Xing Fu, Yu Cheng, Weiqiang Wang

TL;DR
This paper systematically studies how different attention masking strategies affect user representation learning in decoder-only LLMs, proposing a gradient-guided soft masking method to improve training stability and embedding quality.
Contribution
The paper introduces Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler, to smooth the transition from causal to bidirectional attention in large-scale user modeling.
Findings
Gradient-Guided Soft Masking improves training stability.
Bidirectional attention yields higher-quality user embeddings.
The proposed method outperforms baseline masking strategies across 9 industrial user cognition benchmarks.
Abstract
Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality…
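The idea of a linear scheduler that gradually opens future attention can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `warmup_steps` and `total_steps` parameters, and the choice of adding log(alpha) to future attention logits are all assumptions made for illustration. The gradient-based pre-warmup itself is not reproduced here, since the summary does not specify it; `warmup_steps` only marks where such a phase might sit, during which attention stays fully causal.

```python
import math

def future_openness(step, warmup_steps, total_steps):
    """Hypothetical linear schedule: 0.0 (causal) through warmup, then ramps to 1.0 (bidirectional)."""
    if step <= warmup_steps:
        return 0.0
    span = max(total_steps - warmup_steps, 1)
    return min((step - warmup_steps) / span, 1.0)

def soft_mask_bias(seq_len, alpha):
    """Additive attention bias: 0 for past/current positions, log(alpha) for future positions."""
    # alpha = 0 reproduces a hard causal mask (-inf on future positions);
    # alpha = 1 adds 0 everywhere, i.e. fully bidirectional attention.
    future = -math.inf if alpha <= 0.0 else math.log(alpha)
    return [[0.0 if j <= i else future for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax(scores, bias):
    """Row-wise softmax over scores + bias (plain lists, for illustration only)."""
    out = []
    for s_row, b_row in zip(scores, bias):
        logits = [s + b for s, b in zip(s_row, b_row)]
        peak = max(logits)  # finite, because the diagonal bias is always 0
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Usage: attention weights for uniform scores at three points in training.
scores = [[0.0] * 3 for _ in range(3)]
for step in (0, 60, 110):
    alpha = future_openness(step, warmup_steps=10, total_steps=110)
    weights = masked_softmax(scores, soft_mask_bias(3, alpha))
    print(step, alpha, weights[0])
```

Under this assumed formulation, a future token's unnormalized attention weight is scaled by alpha, so the first token attends only to itself at the start of training and spreads attention evenly over the sequence once alpha reaches 1.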
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
