# Information Extraction in Illicit Domains

**Authors:** Mayank Kejriwal, Pedro Szekely

arXiv: 1703.03097 · 2017-03-10

## TL;DR

This paper introduces a lightweight, feature-agnostic information extraction (IE) method for illicit domains such as human trafficking. It learns from raw, unlabeled text plus a few (12-120) seed annotations per domain-specific attribute, and it is robust to concept drift.

## Contribution

The authors propose an IE paradigm that needs only 12-120 seed annotations per domain-specific attribute, is robust to concept drift, and outperforms feature-centric Conditional Random Field (CRF) baselines on real-world human trafficking datasets.

## Key findings

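- Outperforms feature-centric CRF baselines by over 18% F-Measure on five annotated real-world human trafficking datasets, in both low-supervision and high-supervision settings
- Effective with as few as 12 seed annotations per domain-specific attribute
- Robust to concept drift, and efficient to bootstrap even in a serial computing environment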
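
For readers unfamiliar with the metric behind the headline number, the following is a minimal sketch of how an F-Measure is typically computed for extracted attribute values. It assumes the balanced F1 variant (the harmonic mean of precision and recall) and set-based matching of extractions. The function name `f_measure` and the example values are hypothetical and do not come from the paper.

```python
def f_measure(predicted: set, gold: set) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: extractions are compared as exact-match sets; the paper's
    own matching criteria may differ.
    """
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers an empty predicted or gold set
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not data from the paper).
gold = {"seattle", "portland", "tacoma", "spokane"}
predicted = {"seattle", "portland", "boise"}
score = f_measure(predicted, gold)  # precision 2/3, recall 1/2
```

In this toy case two of three predictions are correct and two of four gold values are recovered, so the score sits between the precision and the recall, closer to the lower of the two.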

## Abstract

Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.03097/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1703.03097/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/1703.03097/full.md

---
Source: https://tomesphere.com/paper/1703.03097