Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection
Fariz Ikhwantri, Dusica Marijan

TL;DR
This paper investigates data selection methods to improve cross-domain transfer in automated compliance detection, demonstrating that targeted data selection reduces negative transfer and enhances scalability.
Contribution
The paper systematically evaluates four data selection approaches for mitigating negative transfer in compliance detection, framed as a natural language inference (NLI) task, advancing cross-domain adaptation techniques.
Findings
Targeted data selection significantly reduces negative transfer.
Embedding-based retrieval improves cross-domain adaptation.
Varying data proportions affects transfer effectiveness.
Abstract
Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and…
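As a rough illustration of one of the four selection strategies named in the abstract, the sketch below ranks source-domain sentences by the Moore-Lewis cross-entropy difference and keeps a chosen proportion of them. It is not the authors' implementation: the add-one-smoothed unigram language models, the whitespace tokenisation, and the function names (`train_unigram`, `cross_entropy`, `moore_lewis_select`) are assumptions made to keep the example self-contained.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab_size):
    """Return an add-one-smoothed unigram probability function (illustrative LM)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return lambda tok: (counts.get(tok, 0) + 1) / (total + vocab_size)


def cross_entropy(sentence, prob):
    """Per-token cross-entropy (in nats) of a sentence under a unigram model."""
    toks = sentence.split()
    return -sum(math.log(prob(t)) for t in toks) / len(toks)


def moore_lewis_select(source, target, proportion):
    """Keep the given proportion of source sentences that look most target-like.

    Score = H_target(s) - H_source(s); lower means closer to the target domain.
    """
    vocab = {tok for s in source + target for tok in s.split()}
    p_tgt = train_unigram(target, len(vocab))
    p_src = train_unigram(source, len(vocab))

    def score(s):
        if not s.split():  # empty sentences are ranked last
            return float("inf")
        return cross_entropy(s, p_tgt) - cross_entropy(s, p_src)

    ranked = sorted(source, key=score)
    k = max(1, round(proportion * len(source)))
    return ranked[:k]
```

Calling `moore_lewis_select(source, target, proportion)` with several values of `proportion` mirrors, in spirit, the abstract's idea of varying how much source data is added. A real setup would likely replace the unigram models with stronger language models trained on each domain.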
