The Influence of Domain-Based Preprocessing on Subject-Specific Clustering
Alexandra Gkolia, Nikhil Fernandes, Nicolas Pizzo, James Davenport, and Akshar Nair

TL;DR
This paper investigates how domain-specific preprocessing, such as tagging code excerpts, improves the accuracy of subject-specific clustering of student queries in an online university setting.
Contribution
It introduces a domain-based preprocessing technique that tags technical terms, particularly code snippets, to enhance clustering effectiveness in educational data.
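As a rough illustration of what tagging code excerpts can look like before clustering, the sketch below replaces code-like spans in a query with a single placeholder token, so that identifiers such as `print` or `range` are not counted as ordinary vocabulary. The regular expressions, the `CODETOKEN` placeholder, and the `tag_code` function are assumptions made for this example; the summary does not specify the paper's actual tagging rules.

```python
import re

# Hypothetical placeholder that stands in for any detected code excerpt.
CODE_TAG = "CODETOKEN"

# Assumed heuristics for spotting code inside a student query.
# These are illustrative only, not the rules used in the paper.
CODE_PATTERNS = [
    re.compile(r"`[^`]+`"),                    # inline spans wrapped in backticks
    re.compile(r"\b\w+(?:\.\w+)*\([^()]*\)"),  # call-like text such as print(x)
]


def tag_code(text: str) -> str:
    """Replace code-like spans with CODE_TAG and normalise whitespace."""
    for pattern in CODE_PATTERNS:
        text = pattern.sub(f" {CODE_TAG} ", text)
    # Collapse the extra spaces introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    queries = [
        "Why does `for i in range(10)` loop forever?",
        "What does print(x) return?",
        "When is the coursework deadline?",
    ]
    for query in queries:
        print(tag_code(query))
```

The tagged text can then be passed to whatever tokeniser and topic model is in use. Heuristics like these will miss some code and occasionally flag prose, so a real pipeline would need patterns tuned to the course's programming language.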
Findings
Tagging code excerpts improves clustering accuracy.
Domain-specific preprocessing reduces misclassification of technical terms.
Empirical results support the effectiveness of the proposed method.
Abstract
The sudden move of most university teaching online due to the global Covid-19 pandemic has increased the workload for academics. One of the contributing factors is answering a high volume of queries from students. As these queries are not limited to the synchronous time frame of a lecture, there is a high chance that many of them are related or even equivalent. One way to deal with this problem is to cluster these questions by topic. In our previous work, we aimed to find an improved clustering method that would give us high efficiency, using a recurring LDA model. Our data set contained questions posted online from a Computer Science course at the University of Bath. A significant number of these questions contained code excerpts, which we found caused a problem in clustering, as certain terms were being considered as…
Taxonomy
Topics: Natural Language Processing Techniques · Bayesian Methods and Mixture Models · Algorithms and Data Compression
Methods: Latent Dirichlet Allocation
