Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic
Thomas Compton

TL;DR
This paper introduces a transparent, hybrid framework for quantitative discourse analysis that combines lexical and semantic methods using Python tools and BERTopic, enhancing reproducibility and interpretability.
Contribution
The paper presents a novel, reproducible approach that integrates lexical and semantic techniques in quantitative discourse analysis (QDA), with detailed Python pipelines and a tuned topic modeling process.
Findings
Improved topic coherence through parameter tuning.
Enhanced interpretability with combined lexical and semantic analysis.
Demonstrated reproducibility with open-source code.
Abstract
Quantitative Discourse Analysis (QDA) has seen growing adoption with the rise of Large Language Models and computational tools. However, reliance on black-box software such as MAXQDA and NVivo risks undermining methodological transparency and alignment with research goals. This paper presents a hybrid, transparent framework for QDA that combines lexical and semantic methods to enable triangulation, reproducibility, and interpretability. Drawing on a case study in historical political discourse, we demonstrate how custom Python pipelines using NLTK, spaCy, and Sentence Transformers allow fine-grained control over preprocessing, lemmatisation, and embedding generation. We further detail our iterative BERTopic modelling process, incorporating UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised through parameter tuning and multiple runs to enhance topic…
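To make the c-TF-IDF keyword extraction step named in the abstract more concrete, here is a minimal pure-Python sketch. It is not the paper's code and not BERTopic's implementation. It assumes the documents are already tokenised and assigned to topics, and it follows the class-based weighting described in the BERTopic documentation: term frequency normalised within each topic, multiplied by log(1 + A / f_t), where A is the average number of words per topic and f_t is the frequency of term t across all topics. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Score terms per topic with a class-based TF-IDF weighting.

    docs_by_topic: dict mapping a topic id to a list of token lists
    (hypothetical input format, chosen for illustration).
    """
    # Merge each topic's documents into one "class document" and count terms.
    class_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }

    # f_t: frequency of each term across all topics.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        class_size = sum(counts.values())
        scores[topic] = {
            # L1-normalised term frequency times the class-based IDF term.
            term: (tf / class_size) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores


def top_terms(scores, topic, n=3):
    """Return the n highest-scoring terms for a topic (ties broken alphabetically)."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy corpus: two topics, two short tokenised documents each (made-up data).
toy_topics = {
    0: [["tax", "reform", "budget"], ["tax", "budget", "deficit"]],
    1: [["war", "treaty", "peace"], ["peace", "treaty", "budget"]],
}
scores = c_tf_idf(toy_topics)
print(top_terms(scores, 0, n=2))
print(top_terms(scores, 1, n=2))
```

In this toy corpus, "budget" appears in both topics, so its weight is lower than that of terms confined to a single topic. That is how c-TF-IDF surfaces topic-specific keywords.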
