# Using transformer-based models for Vietnamese language detection

**Authors:** Son Tran, Phuoc Tran

PMC · DOI: 10.1371/journal.pone.0342898 · PLOS One · 2026-02-13

## TL;DR

This paper presents a method using Transformer-based models to detect Vietnamese text by leveraging its unique orthographic and contextual features.

## Contribution

The paper proposes a Transformer-based approach to Vietnamese text detection that targets the language's diacritic-heavy orthography and word formation.

## Key findings

- Transformer-based models outperform conventional methods in Vietnamese text detection.
- The proposed approach achieves high accuracy and robustness on a benchmark dataset.

## Abstract

This paper addresses the problem of detecting whether a sequence of text is Vietnamese based on its orthography and contextual features. Vietnamese is written in Latin characters with diacritics, and many of its words rely on accent marks for semantic distinction, which can make texts hard to interpret for readers unfamiliar with the language. We provide insight into how these characteristics influence Transformer-based natural language processing models and propose an approach to address the resulting difficulties. Transformer-based models are selected for their superior performance compared to earlier architectures such as RNNs and LSTMs, and for their widespread use in state-of-the-art NLP systems (GPT, BERT, T5). We examine the specific challenges posed by Vietnamese orthography and word formation, and propose a solution that enhances the model’s ability to distinguish Vietnamese text. Evaluated on a benchmark dataset, our approach demonstrates high accuracy and robustness in Vietnamese text detection and outperforms conventional methods. The results confirm that Transformer-based models can effectively learn orthographic and contextual patterns in Vietnamese, contributing to improved language identification and multilingual NLP.
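To make the orthographic point concrete, the following is a minimal sketch of a diacritic-based heuristic. It is not the authors' method: the function names `strip_diacritics` and `diacritic_ratio` are hypothetical and chosen only for illustration. It shows how several distinct Vietnamese words collapse to one string once accent marks are removed, and how a crude score can be computed from the share of tokens that carry Vietnamese marks.

```python
import unicodedata

# Vietnamese tone marks as Unicode combining characters:
# grave, acute, hook above, tilde, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0309", "\u0303", "\u0323"}
# Vowel-quality marks: circumflex (â, ê, ô), breve (ă), horn (ơ, ư).
VOWEL_MARKS = {"\u0302", "\u0306", "\u031b"}
VIETNAMESE_MARKS = TONE_MARKS | VOWEL_MARKS


def strip_diacritics(text: str) -> str:
    """Remove combining marks and map đ/Đ to d/D (illustrative helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # đ/Đ have no canonical decomposition, so they are handled explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")


def diacritic_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens with a Vietnamese mark or đ."""
    tokens = text.split()
    if not tokens:
        return 0.0

    def is_marked(token: str) -> bool:
        decomposed = unicodedata.normalize("NFD", token.lower())
        return "đ" in decomposed or any(
            ch in VIETNAMESE_MARKS for ch in decomposed
        )

    return sum(is_marked(t) for t in tokens) / len(tokens)


# Six different words that differ only in their accent marks.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
collapsed = {strip_diacritics(w) for w in words}
```

A heuristic like this is weak on its own: other Latin-script languages share some of these marks (French uses the acute and grave accents, for example), and Vietnamese is often typed without diacritics. That limitation is consistent with the abstract's motivation for models that also learn contextual patterns.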

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12904414/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12904414/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/PMC12904414/full.md

---
Source: https://tomesphere.com/paper/PMC12904414