BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok

Minh Duc Chu; Kshitij Pawar; Zihao He; Roxanna Sharifi; Ross Sonnenblick; Magdalayna Curry; Laura D'Adamo; Lindsay Young; Stuart B Murray; Kristina Lerman

arXiv:2508.06515·cs.CV·January 27, 2026

BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok

Minh Duc Chu, Kshitij Pawar, Zihao He, Roxanna Sharifi, Ross Sonnenblick, Magdalayna Curry, Laura D'Adamo, Lindsay Young, Stuart B Murray, Kristina Lerman

PDF

1 Video

TL;DR

BigTokDetect is a clinically-informed vision-language framework that effectively detects pro-bigorexia videos on TikTok, leveraging a new expert-annotated multimodal dataset and demonstrating improved performance through multimodal fusion and fine-tuning.

Contribution

The paper introduces BigTokDetect, the first clinically-annotated multimodal benchmark dataset for pro-bigorexia content and evaluates vision-language models for automated detection with insights on model performance.

Findings

01

Supervised fine-tuning improves detection accuracy on fine-grained categories.

02

Multimodal fusion enhances model performance by 5-15%.

03

Video features are the most discriminative signals for detection.

Abstract

Social media platforms face escalating challenges in detecting harmful content that promotes muscle dysmorphic behaviors and cognitions (bigorexia). This content can evade moderation by camouflaging as legitimate fitness advice and disproportionately affects adolescent males. We address this challenge with BigTokDetect, a clinically informed framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal benchmark dataset of over 2,200 TikTok videos labeled by clinical psychiatrists across five categories and eighteen fine-grained subcategories. Comprehensive evaluation of state-of-the-art vision-language models reveals that while commercial zero-shot models achieve the highest accuracy on broad primary categories, supervised fine-tuning enables smaller open-source models to perform better on fine-grained subcategory detection.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

BigTokDetect: A Clinically-Informed Vision-Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok· underline