# Semantic Matching of Documents from Heterogeneous Collections: A Simple and Transparent Method for Practical Applications

**Authors:** Mark-Christoph Müller

arXiv: 1904.12550 · 2019-04-30

## TL;DR

This paper introduces a simple, unsupervised, and transparent method for the pairwise matching of documents from heterogeneous collections. On the Concept-Project matching task, a binary classification task, it outperforms the more complex system of the task's original authors.

## Contribution

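The author proposes a straightforward approach that relies only on standard resources, without domain- or task-specific modifications. The method is transparent because it reports explicitly how each similarity score was computed, and efficient because it aggregates pre-computable word-level similarities.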

## Key findings

- Outperforms the more complex system of the original authors on the Concept-Project matching task
- Uses only standard, pre-computable word similarities
- Provides explicit, interpretable similarity scores

## Abstract

We present a very simple, unsupervised method for the pairwise matching of documents from heterogeneous collections. We demonstrate our method with the Concept-Project matching task, which is a binary classification task involving pairs of documents from heterogeneous collections. Although our method only employs standard resources without any domain- or task-specific modifications, it clearly outperforms the more complex system of the original authors. In addition, our method is transparent, because it provides explicit information about how a similarity score was computed, and efficient, because it is based on the aggregation of (pre-computable) word-level similarities.
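
The abstract describes the score as an aggregation of pre-computable word-level similarities, but this summary does not state the exact aggregation rule. The following is a minimal Python sketch of that general idea, not the paper's implementation. It assumes one plausible scheme: each word in the first document is aligned to its most similar word in the second, and the aligned similarities are averaged. The function names, the `threshold` value, and the two-dimensional `TOY_VECTORS` are all hypothetical and chosen only for illustration.

```python
import math
from itertools import product

# Hypothetical toy word vectors; a real setup would load standard
# pre-trained embeddings instead.
TOY_VECTORS = {
    "plant": (1.0, 0.0),
    "photosynthesis": (1.0, 0.0),
    "light": (0.6, 0.8),
    "circuit": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarities(vocab_a, vocab_b, vectors):
    """Word-pair similarity table; this part can be computed ahead of time."""
    return {(a, b): cosine(vectors[a], vectors[b])
            for a, b in product(vocab_a, vocab_b)}


def match(doc_a, doc_b, vectors, threshold=0.5):
    """Return (score, is_match, alignments) for two tokenized documents.

    Assumed aggregation: mean over words in doc_a of the best similarity
    to any word in doc_b. The alignments list records which word pairs
    produced the score, which is what makes the result inspectable.
    """
    words_a = [w for w in doc_a if w in vectors]
    words_b = [w for w in doc_b if w in vectors]
    if not words_a or not words_b:
        return 0.0, False, []
    sims = word_similarities(set(words_a), set(words_b), vectors)
    alignments = []
    for a in words_a:
        best = max(words_b, key=lambda b: sims[(a, b)])
        alignments.append((a, best, sims[(a, best)]))
    score = sum(s for _, _, s in alignments) / len(alignments)
    return score, score >= threshold, alignments


if __name__ == "__main__":
    score, is_match, alignments = match(
        ["plant", "light"], ["photosynthesis"], TOY_VECTORS)
    print(score, is_match)
    for a, b, s in alignments:
        print(f"{a} -> {b}: {s:.2f}")
```

Returning the alignments alongside the score is a design choice that mirrors the transparency claim: a reader can see which word pairs contributed to a decision. Because the word-pair table depends only on the vocabulary, it can be built once and reused across document pairs.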

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.12550/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1904.12550/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/1904.12550/full.md

---
Source: https://tomesphere.com/paper/1904.12550