ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

Khang T. Huynh; Dung H. Nguyen; Binh T. Nguyen

arXiv:2511.12249 · cs.CL · November 18, 2025

Open Access · 2 Models · 1 Dataset

TL;DR

ViConBERT is a Vietnamese contextualized embedding model that combines contrastive learning with gloss-based distillation. It outperforms strong baselines on WSD and performs competitively on word similarity benchmarks.

Contribution

The paper introduces ViConBERT, the first Vietnamese contextualized embedding model combining contrastive learning and gloss-based distillation for better semantic representations.

Findings

01  Outperforms baselines on WSD with F1 = 0.87

02  Achieves AP = 0.88 on the ViCon similarity task

03  Reaches Spearman's rho = 0.60 on ViSim-400
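
For readers unfamiliar with the ViSim-400 metric, the following is a minimal sketch of how Spearman's rho is computed between human similarity ratings and model similarity scores. It is not the paper's evaluation script. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: human ratings for five word pairs and the
# cosine similarities a model assigned to the same pairs.
human_ratings = [5.8, 1.2, 4.1, 3.3, 0.5]
model_scores = [0.91, 0.20, 0.55, 0.60, 0.10]
rho = spearman_rho(human_ratings, model_scores)
```

Because the metric depends only on ranks, a model is rewarded for ordering word pairs the way annotators do, regardless of the absolute scale of its similarity scores.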

Abstract

Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman's…
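
The abstract names two training signals: a SimCLR-style contrastive objective and gloss-based distillation. The sketch below illustrates the general shape of such a combination on toy vectors. It is an assumption-laden illustration, not the authors' implementation: the loss forms (NT-Xent for the contrastive part, cosine distance to a gloss embedding for the distillation part), the weight `alpha`, the temperature, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(anchor, positive, negatives, temperature=0.5):
    """SimCLR-style NT-Xent loss for one anchor.

    Pulls the anchor toward the positive (same sense, different context)
    and pushes it away from the negatives (other senses).
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def gloss_alignment(context_vec, gloss_vec):
    """Assumed distillation term: cosine distance to the gloss embedding."""
    return 1.0 - cosine(context_vec, gloss_vec)

def combined_loss(anchor, positive, negatives, gloss_vec, alpha=0.5):
    """Hypothetical weighted sum of the two terms."""
    contrastive = nt_xent(anchor, positive, negatives)
    distill = gloss_alignment(anchor, gloss_vec)
    return alpha * contrastive + (1.0 - alpha) * distill

# Toy 2-d embeddings for the Vietnamese word "đường", which can mean
# "sugar" or "road". All vectors are made up for illustration.
sugar_ctx_a = [1.0, 0.0]   # "đường" in a cooking sentence
sugar_ctx_b = [0.9, 0.1]   # "đường" in another cooking sentence
road_ctx = [0.0, 1.0]      # "đường" in a traffic sentence
sugar_gloss = [1.0, 0.05]  # embedding of the dictionary gloss for "sugar"

loss = combined_loss(sugar_ctx_a, sugar_ctx_b, [road_ctx], sugar_gloss)
```

In this toy setup the loss is small when two contexts of the same sense sit close together and near their gloss, and it grows when a context of a different sense is treated as the positive.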

Code & Models

Models

Datasets

tkhangg0910/ViConWSD · dataset · 16 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications