THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science

Zhenke Duan; Xin Li

arXiv:2603.05972 · cs.CY · April 15, 2026

TL;DR

THETA is a hybrid embedding framework paired with an AI agent system for scalable, domain-aware topic analysis in social science research. It is reported to outperform traditional topic models (LDA, ETM, and CTM) in domain-specific interpretability.

Contribution

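The paper introduces Domain-Adaptive Fine-tuning (DAFT), which applies LoRA to foundation embedding models, and an AI Scientist Agent framework made up of Data Steward, Modeling Analyst, and Domain Expert agents. Together they aim to improve semantic understanding and interpretability in large-scale social data analysis.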

Findings

01. THETA outperforms LDA, ETM, and CTM in domain-specific interpretability.

02. The framework maintains high topic coherence across six social science domains.

03. Open-source code is available for reproducibility and further research.

Abstract

The explosion of big social data has created a scalability trap for traditional qualitative research, as manual coding remains labor-intensive and conventional topic models often suffer from semantic thinning and a lack of domain awareness. This paper introduces Textual Hybrid Embedding-based Topic Analysis (THETA), a novel computational paradigm and open-source tool designed to bridge the gap between massive data scale and rich theoretical depth. THETA moves beyond frequency-based statistics by implementing Domain-Adaptive Fine-tuning (DAFT) via LoRA on foundation embedding models, which effectively optimizes semantic vector structures within specific social contexts to capture latent meanings. To ensure epistemological rigor, we encapsulate this process into an AI Scientist Agent framework, comprising Data Steward, Modeling Analyst, and Domain Expert agents, to simulate the…
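
The abstract names LoRA as the mechanism behind DAFT but does not spell it out. As a rough illustration only, the sketch below shows the generic LoRA idea: a frozen weight matrix W is adjusted by a trainable low-rank product, giving W + (alpha / r) * B A. This is not THETA's implementation; the function names (`matmul`, `init_lora`, `lora_weight`, `lora_param_count`), the matrix sizes, and the values of `r` and `alpha` are all assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, r, seed=0):
    """Create LoRA factors: A (r x d_in) small random, B (d_out x r) all zeros.

    Starting B at zero means the adapted weight equals the frozen weight
    before any training step (hypothetical helper, standard LoRA init).
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_weight(w, a, b, alpha):
    """Return the effective weight W + (alpha / r) * B @ A."""
    r = len(a)  # rank of the update = number of rows in A
    scale = alpha / r
    delta = matmul(b, a)  # d_out x d_in low-rank update
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_count(d_out, d_in, r):
    """Number of trainable parameters in the two LoRA factors."""
    return r * (d_in + d_out)
```

Under these assumptions, a hypothetical 768 x 768 projection with r = 8 would train 12,288 adapter parameters instead of the 589,824 in the full matrix, which is the usual reason LoRA is chosen for adapting large embedding models to a specific domain.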

Code & Models

Repositories

CodeSoul-co/THETA (GitHub)
