Learn Before Represent: Bridging Generative and Contrastive Learning for Domain-Specific LLM Embeddings
Xiaoyu Liang, Yuchen Peng, Jiale Luo, Wenhao Wang, Haoji Hu, Xincheng Zhou

TL;DR
This paper introduces LBR, a two-stage framework that first injects domain knowledge through generative learning and then aligns representations through contrastive learning, improving LLM embeddings for retrieval in specialized domains such as medicine, chemistry, and code.
Contribution
The paper proposes a two-stage learning paradigm that integrates knowledge injection with contrastive learning, addressing a limitation of contrastive-only adaptation for domain-specific LLM embeddings: it aligns semantics but does not acquire domain knowledge.
Findings
LBR outperforms strong baselines on medical, chemistry, and code retrieval tasks.
The approach effectively preserves domain knowledge while maintaining semantic alignment.
LBR establishes a new paradigm for vertical domain representation learning.
Abstract
Large Language Models (LLMs) adapted via contrastive learning excel in general representation learning but struggle in vertical domains like chemistry and law, primarily due to a lack of domain-specific knowledge. This work identifies a core bottleneck: the prevailing "LLM+CL" paradigm focuses on semantic alignment but cannot perform knowledge acquisition, leading to failures on specialized terminology. To bridge this gap, we propose Learn Before Represent (LBR), a novel two-stage framework. LBR first injects domain knowledge via an Information Bottleneck-Constrained Generative Learning stage, preserving the LLM's causal attention to maximize knowledge acquisition while compressing semantics. It then performs Generative-Refined Contrastive Learning on the compressed representations for alignment. This approach maintains architectural consistency and resolves the objective conflict…
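For readers unfamiliar with the contrastive alignment step mentioned in the abstract, the sketch below shows a generic in-batch InfoNCE loss in plain Python. It is not the authors' implementation: the abstract does not specify the loss, so InfoNCE, the function name `info_nce_loss`, and the `temperature` value are all assumptions chosen for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, docs, temperature=0.05):
    """Generic in-batch InfoNCE (an assumed objective, not taken from the paper).

    queries[i] is treated as the positive match for docs[i]; every other
    document in the batch serves as a negative.
    """
    q = [_normalize(v) for v in queries]
    d = [_normalize(v) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity of query i to every document.
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching document.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings: each query is identical to its paired document.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rotating the documents by one position breaks every pairing.
shuffled = aligned[1:] + aligned[:1]

loss_aligned = info_nce_loss(aligned, aligned)
loss_shuffled = info_nce_loss(aligned, shuffled)
```

In this toy setting the loss is close to zero when each query matches its paired document and large when the pairing is broken. This is the sense in which a contrastive objective performs alignment; it says nothing by itself about how domain knowledge enters the model.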
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Machine Learning in Healthcare
