Domain-Specific Word Embeddings with Structure Prediction

Stephanie Brandl; David Lassner; Anne Baillot; Shinichi Nakajima

arXiv:2210.04962·cs.CL·October 12, 2022

TL;DR

This paper introduces W2VPred, a novel word embedding method that simultaneously captures general, domain-specific, and structural information, enabling dynamic and aligned embeddings across different corpora and domains.

Contribution

The paper presents a new embedding approach that models structure between sub-corpora and domains, outperforming baselines in analogy and structure prediction tasks.

Findings

01 W2VPred outperforms the baselines on general and domain-specific analogy tests.

02 It predicts the structure between sub-corpora when no structure information is given in advance.

03 The authors demonstrate its usefulness for Digital Humanities research.

Abstract

Complementary to finding good general word embeddings, an important question for representation learning is to find dynamic word embeddings, e.g., across time or domain. Current methods do not offer a way to use or predict information on structure between sub-corpora, time or domain and dynamic embeddings can only be compared after post-alignment. We propose novel word embedding methods that provide general word representations for the whole corpus, domain-specific representations for each sub-corpus, sub-corpus structure, and embedding alignment simultaneously. We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy. Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines in terms of the general analogy tests, domain-specific analogy tests, and multiple…
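
The abstract notes that, with current methods, dynamic embeddings can only be compared after post-alignment. As a rough illustration of what such a post-alignment step can look like, the sketch below uses orthogonal Procrustes, a common choice for rotating one embedding space onto another. This is not W2VPred, which according to the abstract provides alignment simultaneously with the embeddings. The function name `procrustes_align` and the synthetic matrices are assumptions made for this example and do not come from the paper or its repository.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` onto `target` (hypothetical helper, not from the paper).

    Finds the orthogonal matrix R minimizing ||source @ R - target||_F,
    using the SVD of source.T @ target.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation, rotation


# Synthetic stand-in for two separately trained embedding matrices:
# the same 50 "words" in 8 dimensions, differing only by a random rotation.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
rotated = base @ q

# After alignment, the rotated space matches the base space again.
aligned, rotation = procrustes_align(rotated, base)
print(np.allclose(aligned, base))
```

With real embeddings the two spaces differ by more than a rotation, so the aligned vectors only approximate the target, which is one reason a method that learns aligned embeddings jointly can be attractive.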

Code & Models

Repositories

stephaniebrandl/domain-word-embeddings · PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Wikis in Education and Collaboration