Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
Xinze Li, Zhenghao Liu, Chenyan Xiong, Shi Yu, Yu Gu, Zhiyuan Liu, Ge Yu

TL;DR
This paper introduces SANTA, a structure-aware pretraining method for language models that improves dense retrieval of structured data by aligning structured and unstructured data and predicting masked entities, achieving state-of-the-art results.
Contribution
The paper proposes two novel pretraining techniques, Structured Data Alignment and Masked Entity Prediction, to enhance language models' ability to handle structured data for retrieval tasks.
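As a rough illustration of what an entity-oriented mask could look like, the sketch below replaces entity tokens with placeholders and records what the model would be asked to fill in. The function name `mask_entities`, the `<mask_i>` placeholder format, and the choice to reuse one placeholder for every occurrence of the same entity are all assumptions made for this example; the summary above does not specify these details.

```python
def mask_entities(tokens, entities):
    """Replace entity tokens with placeholders (hypothetical format <mask_i>).

    tokens:   list of string tokens from a structured record (code, product text)
    entities: set of tokens treated as entities (assumed to be given)
    Returns the masked token list and the (placeholder, entity) pairs
    that a model would be trained to recover.
    """
    placeholder_of = {}  # entity -> placeholder, in order of first appearance
    masked = []
    for tok in tokens:
        if tok in entities:
            if tok not in placeholder_of:
                # Assumption: repeated mentions share a single placeholder.
                placeholder_of[tok] = f"<mask_{len(placeholder_of)}>"
            masked.append(placeholder_of[tok])
        else:
            masked.append(tok)
    targets = [(ph, ent) for ent, ph in placeholder_of.items()]
    return masked, targets


# Toy example: identifiers in a code snippet play the role of entities.
tokens = "def add ( a , b ) : return a + b".split()
masked, targets = mask_entities(tokens, {"add", "a", "b"})
```

In this toy case the structure of the snippet (keywords, punctuation, operators) stays visible while the identifiers are hidden, which is the intuition behind asking the model to fill in entities rather than random tokens.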
Findings
SANTA achieves state-of-the-art results on code and product search.
It performs well in zero-shot retrieval settings.
The two pretraining methods help language models learn the structural semantics of structured data.
Abstract
This paper presents the Structure Aware Dense Retrieval (SANTA) model, which encodes user queries and structured data in one universal embedding space for retrieving structured data. SANTA proposes two pretraining methods to make language models structure-aware and learn effective representations for structured data: 1) Structured Data Alignment, which utilizes the natural alignment relations between structured data and unstructured data for structure-aware pretraining. It contrastively trains language models to represent multi-modal text data and teaches models to distinguish matched structured data for unstructured texts. 2) Masked Entity Prediction, which designs an entity-oriented mask strategy and asks language models to fill in the masked entities. Our experiments show that SANTA achieves state-of-the-art results on code search and product search and produces convincing results in the…
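The contrastive idea behind Structured Data Alignment can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes dot-product similarity, in-batch negatives, and a softmax cross-entropy objective, none of which are stated in the abstract. The names `in_batch_contrastive_loss`, `text_vecs`, `struct_vecs`, and `temperature` are hypothetical, and the embeddings are plain lists of floats standing in for encoder outputs.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(text_vecs, struct_vecs, temperature=1.0):
    """Mean contrastive loss where text_vecs[i] is paired with struct_vecs[i].

    Every other structured item in the batch serves as a negative
    (an assumption for this sketch).
    """
    assert text_vecs and len(text_vecs) == len(struct_vecs)
    total = 0.0
    for i, t in enumerate(text_vecs):
        scores = [dot(t, s) / temperature for s in struct_vecs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(x - m) for x in scores))
        # Negative log-probability of the matched structured item.
        total += log_denom - scores[i]
    return total / len(text_vecs)


# Toy embeddings: matched pairs point in the same direction.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(text, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss(text, [[0.0, 1.0], [1.0, 0.0]])
```

With these toy vectors, `aligned` comes out lower than `swapped`, which reflects the training signal described above: the loss falls when an unstructured text scores its matched structured item above the other candidates.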
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
Methods: Attentive Walk-Aggregating Graph Neural Network
