Semantic Outlier Removal with Embedding Models and LLMs

Eren Akbiyik; João Almeida; Rik Melis; Ritu Sriram; Viviana Petrescu; Vilhjálmur Vilhjálmsson

arXiv:2506.16644·cs.LG·June 23, 2025

TL;DR

SORE is a cost-effective, multilingual semantic outlier removal method that uses embeddings and approximate search to efficiently clean text data, outperforming traditional structural approaches.

Contribution

Introduces SORE, a novel semantic outlier removal technique leveraging embeddings and approximate search, achieving high accuracy with lower computational costs than LLMs.

Findings

01 Outperforms structural methods on HTML datasets

02 Achieves near-LLM extraction precision

03 Deployed in production, processing millions of documents

Abstract

Modern text processing pipelines demand robust methods to remove extraneous content while preserving a document's core message. Traditional approaches such as HTML boilerplate extraction or keyword filters often fail in multilingual settings and struggle with context-sensitive nuances, whereas Large Language Models (LLMs) offer improved quality at high computational cost. We introduce SORE (Semantic Outlier Removal), a cost-effective, transparent method that leverages multilingual sentence embeddings and approximate nearest-neighbor search to identify and excise unwanted text segments. By first identifying core content via metadata embedding and then flagging segments that either closely match predefined outlier groups or deviate significantly from the core, SORE achieves near-LLM extraction precision at a fraction of the cost. Experiments on HTML datasets demonstrate that SORE…
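The two flagging rules described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in hand-written toy vectors for multilingual sentence embeddings and brute-force cosine similarity for approximate nearest-neighbor search, and the function name `flag_outliers` and both thresholds are hypothetical values chosen for illustration.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def flag_outliers(segments, core, outlier_groups, outlier_sim=0.8, core_sim=0.2):
    """Return a boolean mask marking segments to remove.

    A segment is flagged if it closely matches any predefined outlier group
    or if it is too dissimilar from the core content embedding.
    Both thresholds are assumptions, not values from the paper.
    """
    seg = normalize(segments)            # shape (n, d)
    core_vec = normalize(core)           # shape (d,)
    groups = normalize(outlier_groups)   # shape (k, d)

    sim_to_core = seg @ core_vec                     # shape (n,)
    sim_to_outliers = (seg @ groups.T).max(axis=1)   # best match per segment

    return (sim_to_outliers >= outlier_sim) | (sim_to_core <= core_sim)

# Toy 3-d "embeddings"; a real pipeline would embed text with a sentence encoder.
core = [1.0, 0.0, 0.0]                # stands in for the metadata embedding
outlier_groups = [[0.0, 1.0, 0.0]]    # stands in for one predefined outlier group
segments = [
    [0.9, 0.1, 0.0],   # close to the core: kept
    [0.5, 0.8, 0.0],   # close to the outlier group: flagged
    [0.0, 0.0, 1.0],   # unrelated to the core: flagged
]

flags = flag_outliers(segments, core, outlier_groups)
kept = [i for i, flagged in enumerate(flags) if not flagged]
```

With a large set of outlier groups, the brute-force matrix product would presumably be replaced by an approximate nearest-neighbor index, which is where the cost savings mentioned in the abstract would come from.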

Taxonomy

Topics: Web Data Mining and Analysis · Handwritten Text Recognition Techniques · Topic Modeling