Skim-Aware Contrastive Learning for Efficient Document Representation
Waheed Ahmed Abro, Zied Bouraoui

TL;DR
This paper introduces a novel self-supervised contrastive learning framework inspired by human skimming to improve the efficiency and accuracy of long document representations, especially in legal and biomedical domains.
Contribution
It proposes a new contrastive learning method that masks document sections and aligns relevant parts using NLI-based objectives, enhancing long document understanding.
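As a rough illustration of the idea, the sketch below holds out one section of a document at random and scores it with a generic InfoNCE-style contrastive loss. It is not the authors' implementation. The function names, the temperature value, and the use of cosine similarity are assumptions made for this example, and the NLI step that decides which sections count as relevant is assumed to have already produced the positive and negative embeddings.

```python
import math
import random


def mask_random_section(sections, rng):
    """Hold out one section at random; return (index, held-out section, remaining sections)."""
    idx = rng.randrange(len(sections))
    return idx, sections[idx], sections[:idx] + sections[idx + 1:]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: low when the anchor is closer to the positive than to the negatives.

    anchor    -- embedding of the masked section (hypothetical encoder output)
    positive  -- embedding of a section assumed relevant to it
    negatives -- embeddings of sections assumed unrelated
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    rng = random.Random(0)
    sections = ["facts", "arguments", "ruling", "appendix"]
    idx, masked, context = mask_random_section(sections, rng)
    print("masked:", masked, "| context:", context)

    # Toy 2-d embeddings standing in for encoder outputs.
    print("loss:", info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
```

Minimizing this loss pulls the masked section's embedding toward the section treated as relevant and pushes it away from the others, which is the general behavior a contrastive objective of this kind is meant to produce.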
Findings
Significant improvements in accuracy on legal and biomedical datasets
Enhanced computational efficiency in document processing
Better capture of document context through skimming-inspired training
Abstract
Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from…
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Text Readability and Simplification
