TL;DR
THETA is a hybrid embedding framework paired with an AI agent system for scalable, domain-aware topic analysis in social science research. It is reported to outperform LDA, ETM, and CTM in domain-specific interpretability.
Contribution
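THETA introduces Domain-Adaptive Fine-tuning (DAFT), applied via LoRA to foundation embedding models, and an AI Scientist Agent framework (Data Steward, Modeling Analyst, and Domain Expert agents) to improve semantic understanding and interpretability in large-scale social data analysis.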
Findings
THETA outperforms LDA, ETM, and CTM in domain-specific interpretability.
The framework maintains high topic coherence across six social science domains.
Open-source code is available for reproducibility and further research.
Abstract
The explosion of big social data has created a scalability trap for traditional qualitative research, as manual coding remains labor-intensive and conventional topic models often suffer from semantic thinning and a lack of domain awareness. This paper introduces Textual Hybrid Embedding based Topic Analysis (THETA), a novel computational paradigm and open-source tool designed to bridge the gap between massive data scale and rich theoretical depth. THETA moves beyond frequency-based statistics by implementing Domain-Adaptive Fine-tuning (DAFT) via LoRA on foundation embedding models, which effectively optimizes semantic vector structures within specific social contexts to capture latent meanings. To ensure epistemological rigor, we encapsulate this process into an AI Scientist Agent framework, comprising Data Steward, Modeling Analyst, and Domain Expert agents, to simulate the…
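For readers unfamiliar with LoRA, the mechanism the abstract names for DAFT, the following is a minimal sketch of the general idea in plain Python: a frozen weight matrix is left untouched and only a scaled low-rank update is trained. It is not THETA's implementation. The function names, dimensions, rank, and scaling value are all hypothetical and chosen only for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def lora_forward(x, w, a, b, alpha):
    """Apply the frozen weight w plus the scaled low-rank update (alpha / rank) * b @ a."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x d_in) update built from two thin matrices
    w_eff = [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
             for w_row, d_row in zip(w, delta)]
    return matmul(x, transpose(w_eff))  # x is (n x d_in), result is (n x d_out)

# Hypothetical sizes for illustration only; real embedding models are far larger.
rng = random.Random(0)
d_in, d_out, rank, alpha = 16, 16, 2, 4.0

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen base weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, small random init
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init
x = [[rng.gauss(0, 1) for _ in range(d_in)]]                          # one input vector

base_out = matmul(x, transpose(w))
lora_out_init = lora_forward(x, w, a, b, alpha)

# With b initialised to zero, the adapted layer starts out identical to the base layer.
print(lora_out_init == base_out)

# Only a and b are trained, so the trainable parameter count drops sharply.
full_params = d_in * d_out
lora_params = rank * (d_in + d_out)
print(full_params, lora_params)  # 256 64
```

The zero initialisation of `b` means fine-tuning begins from the unmodified base model, and the rank controls how many parameters the domain adaptation is allowed to change.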
