SPaR.txt, a cheap Shallow Parsing approach for Regulatory texts
Ruben Kruiper, Ioannis Konstas, Alasdair Gray, Farhad Sadeghineko, Richard Watson, Bimal Kumar

TL;DR
This paper presents a cost-effective shallow parsing method for regulatory texts, using a small annotated dataset to identify key terms and multi-word expressions, aiding automated compliance checking.
Contribution
Introduces a shallow parsing task for building regulation texts, together with SPaR.txt, a small annotated dataset of 200 sentences that is relatively cheap to create, with the aim of learning a lexicon for Automated Compliance Checking (ACC) systems.
Findings
Achieved 79.93% F1-score on test set
Identified 89.84% of defined terms in regulation documents
Discovered multi-word expressions with 70.3% accuracy
Abstract
Automated Compliance Checking (ACC) systems aim to semantically parse building regulations to a set of rules. However, semantic parsing is known to be hard and requires large amounts of training data. The complexity of creating such training data has led to research that focuses on small sub-tasks, such as shallow parsing or the extraction of a limited subset of rules. This study introduces a shallow parsing task for which training data is relatively cheap to create, with the aim of learning a lexicon for ACC. We annotate a small domain-specific dataset of 200 sentences, SPaR.txt, and train a sequence tagger that achieves 79.93 F1-score on the test set. We then show through manual evaluation that the model identifies most (89.84%) defined terms in a set of building regulation documents, and that both contiguous and discontiguous Multi-Word Expressions (MWE) are discovered with…
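As an illustration of how an F1-score for a sequence tagger is commonly computed, the sketch below converts per-token tags into labelled spans and scores exact span matches. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the BIO tagging scheme, the label names `OBJ` and `ACT`, and the helper names `bio_to_spans` and `span_f1` are all hypothetical, and the sketch only handles contiguous spans (it does not model discontiguous MWEs).

```python
from typing import List, Set, Tuple

# A span is (start_index, end_index_exclusive, label).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence (assumed scheme) into a set of labelled spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            # Lenient choice: a stray "I-" tag opens a new span.
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = bio_to_spans(gold_tags)
        pred_spans = bio_to_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical one-sentence example: the prediction misses the last span.
gold_example = [["B-OBJ", "I-OBJ", "O", "B-ACT", "O", "B-OBJ"]]
pred_example = [["B-OBJ", "I-OBJ", "O", "B-ACT", "O", "O"]]
precision, recall, f1 = span_f1(gold_example, pred_example)
```

In this toy case the prediction recovers two of the three gold spans and proposes no spurious ones, so precision is 1.0, recall is 2/3, and F1 is 0.8. Exact-match scoring is only one convention; token-level or partial-match scoring would give different numbers.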
Taxonomy
Topics: Natural Language Processing Techniques · Software Engineering Research · Artificial Intelligence in Law
Methods: Test
