# ViClickbait-2025: A comprehensive dataset for Vietnamese clickbait detection

**Authors:** Dai Phuoc Nguyen, Thien Khai Tran, Y Minh Nguyen, Bay Vo

PMC · DOI: 10.1016/j.dib.2025.112164 · 2025-10-10

## TL;DR

ViClickbait-2025 is a Vietnamese dataset for identifying clickbait headlines, containing 3414 annotated samples collected from eight major Vietnamese online news platforms.

## Contribution

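The paper introduces ViClickbait-2025, a standardized resource for Vietnamese clickbait detection in which each headline is labeled clickbait or non-clickbait by three independent reviewers following a shared guideline, with high inter-annotator agreement.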

## Key findings

- 31.2% of the headlines in the dataset are labeled as clickbait.
- The dataset includes nine attributes such as headline text, metadata, and simulated engagement metrics.
- Inter-annotator agreement reached a Cohen’s Kappa of 0.822, indicating strong reliability.
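
To make the agreement figure concrete, here is a minimal sketch of how Cohen's Kappa is computed for one pair of annotators. The summary does not say how the three reviewers' labels were combined into the single value of 0.822, so the pairwise setup, the function name `cohens_kappa`, and the toy labels below are illustrative assumptions, not the authors' procedure or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates,
    # summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (1 = clickbait, 0 = non-clickbait); hypothetical, not from the dataset.
annotator_1 = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the annotators agree on 9 of 10 items, but chance agreement is already 0.62 because both label most items as non-clickbait, so Kappa is noticeably lower than the raw agreement rate.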

## Abstract

ViClickbait-2025 is a curated Vietnamese-language dataset developed to facilitate research on automatic clickbait detection. It comprises 3414 headline samples collected through web scraping from eight major Vietnamese online news platforms between 2023 and 2025. Each headline is annotated as either clickbait or non-clickbait, with 31.2% labeled as clickbait. The dataset includes nine key attributes, covering headline text, metadata, article summaries, and simulated engagement indicators. A preprocessing pipeline was applied to remove HTML noise, eliminate duplicates, and normalize the data. Annotation was carried out by three independent reviewers using a standardized guideline, with inter-annotator agreement reaching a Cohen’s Kappa of 0.822. Disagreements were resolved by a fourth annotator, and inconclusive cases were excluded. The final dataset spans 13 news categories and is released in JSONL and CSV formats under a CC BY 4.0 license.
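
The three preprocessing steps named in the abstract (removing HTML noise, eliminating duplicates, normalizing) could look roughly like the sketch below. The abstract gives no implementation details, so the field name `headline`, the regex-based tag stripping, the choice of Unicode NFC normalization, and the case-insensitive duplicate key are all assumptions made for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher, enough for headlines
WS_RE = re.compile(r"\s+")        # runs of whitespace, including non-breaking spaces


def clean_headline(raw):
    """Strip HTML residue and normalize a single headline string."""
    text = html.unescape(raw)                  # decode entities such as &nbsp;
    text = TAG_RE.sub(" ", text)               # drop tags, keep their inner text
    text = unicodedata.normalize("NFC", text)  # compose Vietnamese diacritics consistently
    return WS_RE.sub(" ", text).strip()


def preprocess(records, field="headline"):
    """Clean each record and keep only the first copy of each headline."""
    seen = set()
    cleaned = []
    for rec in records:
        text = clean_headline(rec.get(field, ""))
        key = text.casefold()                  # case-insensitive duplicate key
        if not text or key in seen:
            continue
        seen.add(key)
        cleaned.append({**rec, field: text})
    return cleaned


# Hypothetical records; "headline" is an assumed field name.
sample = [
    {"headline": "<b>Giá vàng</b> hôm nay&nbsp;tăng mạnh"},
    {"headline": "Giá vàng hôm nay tăng mạnh"},
    {"headline": "  "},
]
result = preprocess(sample)
print(result)
```

NFC normalization matters for Vietnamese text because the same accented letter can be stored either as one precomposed code point or as a base letter plus combining marks; without it, visually identical headlines may not be detected as duplicates.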

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12557553/full.md

---
Source: https://tomesphere.com/paper/PMC12557553