Skim-Aware Contrastive Learning for Efficient Document Representation

Waheed Ahmed Abro; Zied Bouraoui

arXiv:2512.24373 · cs.CL · January 1, 2026

Open Access

TL;DR

This paper introduces a novel self-supervised contrastive learning framework inspired by human skimming to improve the efficiency and accuracy of long document representations, especially in legal and biomedical domains.

Contribution

The paper proposes a contrastive learning method that randomly masks a section of a document and uses an NLI-based contrastive objective to align it with the relevant parts of that document, improving long-document understanding.

Findings

01 Significant improvements in accuracy on legal and biomedical datasets
02 Enhanced computational efficiency in document processing
03 Better capture of document context through skimming-inspired training

Abstract

Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from…
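
The mask-then-contrast idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps the NLI-based objective for a plain InfoNCE loss over cosine similarities, and the function names (`mask_random_section`, `info_nce`), the temperature value, the averaged "context" vector, and the toy embeddings are all assumptions made for illustration. The abstract is cut off before it states what the masked section is distanced from, so the negatives below are placeholders only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax score of the positive candidate."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def mask_random_section(sections, rng):
    """Hold out one section at random; return (held_out_index, remaining_indices)."""
    idx = rng.randrange(len(sections))
    rest = [i for i in range(len(sections)) if i != idx]
    return idx, rest


# Toy 2-d "section embeddings" (hypothetical values, not real encoder output).
rng = random.Random(0)
doc = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0]]
placeholder_negatives = [[0.0, 1.0], [-0.5, 0.8]]

held_out, rest = mask_random_section(doc, rng)
# Assumption: average the remaining sections as a stand-in for the document context.
context = [sum(doc[i][d] for i in rest) / len(rest) for d in range(2)]
loss = info_nce(doc[held_out], context, placeholder_negatives)
print(f"held-out section {held_out}, loss {loss:.4f}")
```

The loss is small when the held-out section is closer to its own document's context than to the negatives, and grows when a negative is the closer match. A real system would replace the toy vectors with encoder outputs and the cosine score with whatever the NLI-based objective prescribes.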

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Text Readability and Simplification