Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution
Julia Matela, Frank Krüger

TL;DR
This paper presents a hybrid system that combines semantic embeddings, knowledge base lookup, and density-based clustering for cross-document software coreference resolution, reaching CoNLL F1 scores between 0.96 and 0.98 across the shared task's subtasks.
Contribution
The paper introduces a hybrid framework that integrates semantic embeddings, KB lookup, and density-based clustering to cluster software mentions across documents.
Findings
Achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on different subtasks.
Effectively handled large-scale coreference resolution with blocking strategies.
Improved canonical name matching through surface-form normalization.
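As an illustration of the last two findings, here is a minimal sketch of surface-form normalization followed by blocking on entity type. The summary does not give the authors' actual rules, so everything here is an assumption: the function names `normalize_surface_form` and `block_by_type`, the specific rules (lowercasing, dropping a trailing version string, collapsing punctuation), and the entity-type labels are all hypothetical.

```python
import re
from collections import defaultdict


def normalize_surface_form(mention: str) -> str:
    """Map a raw software mention to a simplified matching key (assumed rules)."""
    text = mention.strip().lower()
    # Assumed rule: drop a trailing version token such as "v22.0" or "0.24.1".
    text = re.sub(r"\s+v?\d+(\.\d+)*$", "", text)
    # Assumed rule: treat punctuation (except '+' and '#') as a separator.
    text = re.sub(r"[^a-z0-9+#]+", " ", text)
    return " ".join(text.split())


def block_by_type(mentions):
    """Group normalized mentions by entity type so that only mentions
    within the same block need to be compared with each other."""
    blocks = defaultdict(list)
    for surface, entity_type in mentions:
        blocks[entity_type].append(normalize_surface_form(surface))
    return dict(blocks)


if __name__ == "__main__":
    # Hypothetical mentions and type labels, for illustration only.
    sample = [
        ("ImageJ", "Application"),
        ("imagej 1.53", "Application"),
        ("R 4.0", "ProgrammingEnvironment"),
    ]
    print(block_by_type(sample))
```

Blocking reduces the number of pairwise comparisons because mentions are only compared within their own block, which is what makes a large-scale setting tractable.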
Abstract
This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, a Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large-scale setting of Subtask 3, the pipeline was adapted by utilising a blocking strategy based on entity types and…
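The centroid-based KB lookup described in the abstract can be sketched roughly as follows. This is not the authors' implementation. It substitutes toy 2-D vectors for Sentence-BERT embeddings and a brute-force cosine search for the FAISS index, and it only flags unassigned mentions instead of running HDBSCAN on them. The names `build_kb` and `assign` and the similarity threshold of 0.8 are assumptions made for illustration.

```python
import math


def unit(v):
    """Scale a non-zero vector to unit length so dot product = cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Mean of the given vectors, re-normalized to unit length."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return unit(mean)


def build_kb(training_clusters):
    """One centroid per training-set cluster: {cluster_id: unit centroid}."""
    return {
        cid: centroid([unit(v) for v in vecs])
        for cid, vecs in training_clusters.items()
    }


def assign(embedding, kb, threshold=0.8):
    """Return (cluster_id, similarity) for the nearest centroid.

    cluster_id is None when the best similarity is below the (assumed)
    threshold; such mentions would be handed to density-based clustering.
    """
    query = unit(embedding)
    best_id, best_sim = None, -1.0
    for cid, center in kb.items():
        sim = sum(a * b for a, b in zip(query, center))
        if sim > best_sim:
            best_id, best_sim = cid, sim
    if best_sim < threshold:
        return None, best_sim
    return best_id, best_sim


if __name__ == "__main__":
    # Toy embeddings standing in for sentence-encoder output.
    kb = build_kb({"spss": [[1.0, 0.0], [0.9, 0.1]], "r": [[0.0, 1.0]]})
    print(assign([1.0, 0.05], kb))  # close to the "spss" centroid
    print(assign([1.0, 1.0], kb))   # ambiguous, left for clustering
```

The threshold controls the trade-off: a higher value sends more mentions to the clustering fallback, while a lower value risks attaching new software names to existing clusters.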
Taxonomy
Topics: Scientific Computing and Data Management · Biomedical Text Mining and Ontologies · Topic Modeling
