A Large-Scale Benchmark for Vietnamese Sentence Paraphrases
Sang Quang Nguyen, Kiet Van Nguyen

TL;DR
This paper introduces ViSP, a large-scale Vietnamese sentence paraphrase dataset of 1.2 million pairs, created through a hybrid automatic and manual process, and evaluates various models including LLMs for paraphrasing tasks.
Contribution
The paper presents the first large-scale Vietnamese paraphrase dataset and provides comprehensive experiments with multiple models, including state-of-the-art LLMs, for Vietnamese paraphrasing.
Findings
ViSP contains 1.2M original-paraphrase pairs from various domains, built by combining automatic paraphrase generation with manual evaluation.
Baseline models and LLMs show promising results on the dataset.
The study establishes a foundation for future Vietnamese NLP research.
Abstract
This paper presents ViSP, a high-quality Vietnamese dataset for sentence paraphrasing, consisting of 1.2M original-paraphrase pairs collected from various domains. The dataset was constructed using a hybrid approach that combines automatic paraphrase generation with manual evaluation to ensure high quality. We conducted experiments using methods such as back-translation, EDA, and baseline models like BART and T5, as well as large language models (LLMs), including GPT-4o, Gemini-1.5, Aya, Qwen-2.5, and Meta-Llama-3.1 variants. To the best of our knowledge, this is the first large-scale study on Vietnamese paraphrasing. We hope that our dataset and findings will serve as a valuable foundation for future research and applications in Vietnamese paraphrase tasks.
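To make the EDA baseline mentioned above more concrete, here is a minimal sketch of two of the four EDA (Easy Data Augmentation) operations, random swap and random deletion. It is not the authors' implementation. The function names, default parameters, and whitespace tokenization are assumptions made for illustration; in particular, splitting Vietnamese on whitespace yields syllables rather than words, so a real pipeline would likely use a word segmenter. The other two EDA operations, synonym replacement and random insertion, need a thesaurus and are omitted.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def eda_candidates(sentence, n_swaps=1, p_delete=0.1, seed=0):
    """Return one swapped and one deleted variant of the sentence.

    Hypothetical helper: whitespace tokenization is an assumption,
    not something the paper specifies.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    return {
        "swap": " ".join(random_swap(tokens, n_swaps, rng)),
        "delete": " ".join(random_deletion(tokens, p_delete, rng)),
    }


if __name__ == "__main__":
    print(eda_candidates("tôi thích học tiếng Việt", seed=1))
```

Because these operations only perturb the surface form, they tend to produce noisy variants rather than true paraphrases, which is presumably why such methods serve as baselines against learned models like BART and T5.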
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling
Methods: Gated Linear Unit · Attention Is All You Need · Byte Pair Encoding · Inverse Square Root Schedule · Layer Normalization · Residual Connection · Dense Connections · Linear Layer · Multi-Head Attention
