TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language

Quan Duong; Mika Hämäläinen; Khalid Alnajjar

arXiv:2112.12489 · cs.CL · December 24, 2021 · 1 citation

TL;DR

This paper introduces TFW2V, a simple method for measuring semantic similarity between Finnish texts that handles both long documents and limited amounts of data efficiently. It also proposes an objective evaluation framework for benchmarking text similarity approaches.

Contribution

The paper proposes TFW2V, an improved document similarity method tailored for Finnish, and develops an objective benchmarking framework for text similarity approaches.

Findings

01 TFW2V outperforms existing methods on Finnish text similarity tasks.

02 The evaluation framework enables consistent benchmarking of different approaches.

03 TFW2V is effective for both long texts and small datasets.

Abstract

Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language. At the same time, we propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data. Furthermore, we design an objective evaluation method which can be used as a framework for benchmarking text similarity approaches.

Code & Models

Repositories

ruathudo/tfw2v
tf · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques