A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition
Jiang Xiaobo, Dinghong Lai, Song Qiu, Yadong Deng, Xinkai Zhan

TL;DR
This paper investigates how low information density in user-generated content degrades NER performance, introduces a new analysis method, and proposes a model-agnostic optimization framework that improves NER accuracy on noisy datasets.
Contribution
The paper identifies information density as a key factor affecting NER in UGC, introduces Attention Spectrum Analysis (ASA), and proposes the Window-Aware Optimization Module (WOM) to enhance semantic density and performance.
Findings
WOM improves NER F1 scores by up to 4.5% on UGC datasets.
WOM achieves new state-of-the-art results on WNUT2017.
ASA quantifies how reduced information density causes attention blunting.
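For readers unfamiliar with how NER F1 is typically scored, here is a minimal sketch of exact-match, entity-level F1 over BIO tag sequences. It is an illustration only: the function names (`extract_spans`, `entity_f1`) and the toy tags are hypothetical, and the paper's exact evaluation protocol is not stated in this summary.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    Assumption: a stray I- tag that does not continue an open span
    of the same type is ignored (treated like O).
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1: a prediction counts only if
    its boundaries and type both match a gold span."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical tags, not data from the paper)
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this scoring, missing a single rare entity lowers recall directly, which is one reason sparse UGC text tends to be hard on F1.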
Abstract
Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation -- employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA)…
