Evaluating NLP Embedding Models for Handling Science-Specific Symbolic Expressions in Student Texts
Tom Bleckmann, Paul Tschisgale

TL;DR
This paper evaluates how well various NLP embedding models interpret science-specific symbolic expressions in student texts, highlighting the importance of model selection for educational data mining involving scientific language.
Contribution
It provides a comparative analysis of embedding models' ability to process symbolic expressions in science-related student texts, emphasizing the performance of GPT-text-embedding-3-large.
Findings
GPT-text-embedding-3-large outperforms other models
Significant differences exist among models in handling symbolic expressions
Model choice impacts analysis quality in educational NLP applications
Abstract
In recent years, natural language processing (NLP) has become integral to educational data mining, particularly in the analysis of student-generated language products. For research and assessment purposes, so-called embedding models are typically employed to generate numeric representations of text that capture its semantic content for use in subsequent quantitative analyses. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing research studies and practical applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased research findings and diminished performance of practical applications. This study therefore explores how contemporary embedding models differ in their capability to process and…
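As a rough illustration of the embedding-based comparison the abstract refers to, the sketch below scores how close a sentence containing a symbolic expression is to a verbal paraphrase of it, using cosine similarity between vector representations. It is a minimal sketch under stated assumptions: `toy_embed` is a hypothetical character-trigram stand-in, not any of the models evaluated in the paper, and the excerpt does not say whether the study uses this exact measure. In practice, `toy_embed` would be swapped for a call to a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text: str, n: int = 3) -> Counter:
    """Hypothetical stand-in for an embedding model: character n-gram counts.

    A real study would call an actual embedding model here; this toy
    version only makes the example self-contained and runnable.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Illustrative sentences (assumptions, not data from the paper).
symbolic = "The net force is F = m * a."
verbal = "The net force is mass times acceleration."
unrelated = "Photosynthesis occurs in chloroplasts."

sim_paraphrase = cosine_similarity(toy_embed(symbolic), toy_embed(verbal))
sim_unrelated = cosine_similarity(toy_embed(symbolic), toy_embed(unrelated))

print(f"symbolic vs. verbal paraphrase: {sim_paraphrase:.3f}")
print(f"symbolic vs. unrelated text:    {sim_unrelated:.3f}")
```

A model that handles symbolic expressions well should place the symbolic sentence closer to its verbal paraphrase than to unrelated text. The toy embedding achieves this only through surface overlap in the shared words, which is the kind of limitation that motivates comparing real models.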
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics · Topic Modeling
