TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language
Quan Duong, Mika Hämäläinen, Khalid Alnajjar

TL;DR
This paper introduces TFW2V, a simple method for measuring semantic similarity between Finnish texts that efficiently handles both long documents and limited amounts of data. It also proposes an objective evaluation method that can serve as a framework for benchmarking text similarity approaches.
Contribution
The paper proposes TFW2V, an improved document similarity method tailored for Finnish, and develops an objective benchmarking framework for text similarity approaches.
Findings
TFW2V outperforms existing methods on Finnish text similarity tasks.
The evaluation framework enables consistent benchmarking of different approaches.
TFW2V is effective for both long texts and small datasets.
Abstract
Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language. At the same time, we propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data. Furthermore, we design an objective evaluation method which can be used as a framework for benchmarking text similarity approaches.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
