ChatGPT Answers the 110-Question Laboratory Enzymology Student Exam: Pass or Fail?
Berina Hasanefendic, Aleksandra Pasic, Selvedina Duskan, Emir Sehercehajic, Altaira Jazic Durmisevic

TL;DR
This study tested ChatGPT's knowledge on a lab enzymology exam and found it scored lower than students, suggesting it could help in education but not replace humans.
Contribution
The study compares ChatGPT versions and students on a lab enzymology exam, revealing AI's strengths and weaknesses in this domain.
Findings
ChatGPT-4.0 scored higher than ChatGPT-3.5 but still lower than students.
Chatbots performed best on metabolism-related questions and worst on enzyme lab analysis.
ChatGPT scored below the passing threshold of 60%.
Abstract
Introduction Chatbots like ChatGPT have attracted a lot of interest lately due to their ability to generate human-like responses. Their reliability and accuracy are still questionable, and they are the topic of many studies in different fields. Therefore, the aim of this study was to examine the knowledge of two versions of chatbots regarding laboratory enzymology and to compare it with the average knowledge of students for the purpose of considering the use of ChatGPT in providing answers in this field. Material and methods An exam with 110 questions covering four topics was answered by students and ChatGPT-3.5 and ChatGPT-4.0. The accuracy of the answers of 52 students and ChatGPT was evaluated. The accuracy of answers between students and artificial intelligence was compared, and the percentage of passing the exam was 60%. All responses were reviewed by two authors with full…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1| Fields of questions | Number of questions (N) | Percentage of questions in the test (%) |
| Enzymes and enzyme reactions | 48 | 43.64 |
| Enzymes in metabolism | 14 | 12.73 |
| Clinical enzymology | 40 | 36.36 |
| Laboratory analysis of enzymes | 8 | 7.27 |
| Students | GPT-3.5 | GPT-4.0 | Students vs. GPT-3.5 | Students vs. GPT-4.0 | GPT-3.5 vs. GPT-4.0 | |
| Enzymes and enzyme reactions | 40 (85.41) | 26 (54.17) | 35 (72.91) | 0.192 | 0.001 | 0.236 |
| Enzymes in metabolism | 12 (85.71) | 8 (57.14) | 11 (78.57) | 0.046 | 0.005 | 0.269 |
| Clinical enzymology | 35 (87.5) | 21 (52.5) | 31 (77.5) | 0.002 | 0.173 | 0.323 |
| Laboratory analysis of enzymes | 7 (87.5) | 3 (37.5) | 5 (62.5) | 0.618 | 0.015 | 0.108 |
| Overall | 94 (85.46) | 58 (52.73) | 82 (74.55) | 0.003 | 0.001 | 0.001 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsArtificial Intelligence in Healthcare and Education · Biomedical and Engineering Education · Radiomics and Machine Learning in Medical Imaging
Introduction
At the end of 2022, the development of artificial intelligence led to the emergence of ChatGPT based on the Large Language Model (LLM). Chatbots (interactive software applications) based on LLMs have the ability to process text questions and provide text answers based on neural networks and deep learning of millions of data. Generative Pretrained Transformer (GPT) currently provides two versions, GPT-3.5 and GPT-4, which use a generative model to create output text in the form of a human-like answer from an input question [1]. The application of a new form of virtual tutor-ChatGPT-and its application has been investigated in various fields such as economics, industry, natural sciences, medicine, and education. Studies conducted so far show relatively good results and advise the use of these tools in learning and solving many tasks and problems [2,3].
Many studies have been conducted in recent years to analyze the accuracy and reliability of the information provided by ChatGPT in medicine and healthcare [4,5]. Given the demonstrated good results and easy accessibility, many students of medicine and health and related sciences have experienced the opportunities offered. The answers obtained from artificial intelligence are used mainly for writing essays, answering exam questions, and helping or clarifying doubts in acquiring new knowledge and skills. With the aim of implementing artificial intelligence in medical education, many professional associations, the academic community, and individual experts were used as references and credible evaluators of the validity of the level of capabilities of ChatGPT [6,7].
Laboratory enzymology as a subfield of clinical biochemistry represents an important segment in the education of laboratory professionals, focusing on the structure, role, and clinical significance of enzymes in the human body with a focus on laboratory analysis of enzymes. By studying this subfield, graduates acquire expanded knowledge about enzymes and enzymatic reactions, as well as competencies for their clinical and experimental testing [8]. Therefore, the aim of this study is to compare the average student's knowledge base against artificial intelligence on laboratory enzymology, using a student exam to examine the reliability and accuracy of ChatGPT answers. Based on the correct answers, the usefulness of the two versions of ChatGPT in providing accurate information in the field of laboratory enzymology will be evaluated.
Materials and methods
Study design
The Laboratory Enzymology exam is an integral part of the graduate study of laboratory technology, after which students acquire the knowledge and competencies necessary for medical, clinical, and experimental enzyme testing. The exam consisted of 110 (60 for the first and 50 for the second midterm) multiple-choice questions, where only one answer is correct, or an explanation is required (Appendix). The questions were created by the first author (associate professor). A preliminary list of questions was sent to other co-authors for review, after which grammatical or semantic corrections were suggested, which were accepted. The corrected list of questions represented the final student exam. The following areas were covered: introduction to enzymes and enzymatic reactions, enzymes in metabolism, clinical enzymology, and laboratory analysis of enzymes.
The exam was taken by 52 students in two terms-the first and second midterms, each lasting 45 minutes. To pass the exam, it was necessary to achieve more than 60% correct answers.
ChatGPT
On October 20, 2024, 110 test questions were created in text form for ChatGPT versions 3.5 and 4.0. Each question included instructions, such as "Which of the given answers is correct?" The answers from both versions were recorded and reviewed by two authors. After evaluating the answers, there was no noticeable difference in the test scores between the two versions. Both versions were considered as 53rd and 54th students.
Statistical analysis
After testing for normality of data distribution, non-normal distribution was shown, and non-parametric tests were used for analysis. The Kruskal-Wallis test was used to compare the achieved test scores for three types of participants. All analyses were performed in IBM Corp. Released 2017. IBM SPSS Statistics for Windows, Version 25.0. Armonk, NY: IBM Corp. P<0.05 was considered significant.
Results
Students and both versions of ChatGPT completed a laboratory enzymology test consisting of 110 questions, mostly from the fields of enzymes and enzyme reactions and clinical enzymology, as shown in Table 1.
Total scores for students, ChatGPT-3.5, and ChatGPT-4.0 are indicators that students and ChatGPT-4.0 had more than 60% correct answers and thus passed the exam, while ChatGPT-3.5 scored lower, as shown in Table 2. When comparing the total scores between students and ChatGPT-3.5, students and ChatGPT-4.0, and ChatGPT-3.5 and ChatGPT-4.0, a significant difference was shown, p = 0.003, p = 0.001, and p = 0.001, respectively. Significant differences were also shown in the differences in scores achieved in the subfields: enzymes and enzymatic reactions-students and ChatGPT-4.0 (p = 0.001); enzymes in metabolism-students and ChatGPT-3.5 (p = 0.046) and students and ChatGPT-4.0 (p = 0.005); clinical enzymology-students and ChatGPT-3.5 (p = 0.002); and laboratory analysis of enzymes-students and ChatGPT-4.0 (p = 0.015).
Table 2: Scores achieved on the laboratory enzymology test by students, ChatGPT-3.5, and ChatGPT-4.0Scores are represented as the number of correct answers, N, and the percentage of correct answers, %, out of the total number of questions. The Kruskal-Wallis test was used to compare the achieved test scores. Statistical significance was set at level p < 0.05.
Students achieved the highest score of correct answers in Clinical Enzymology 35/40 (87.5%), ChatGPT-3.5 in Enzymes and Enzyme Reactions 35/48 (54.17%), and ChatGPT-4.0 in Enzymes in Metabolism 11/14 (78.57%), as shown in Figure 1.
Total scores and subfield scores for students, ChatGPT-3.5, and ChatGPT-4.0
Discussion
In our conducted study, we examined and evaluated the performance of both versions of ChatGPT and students’ knowledge of basic and clinical knowledge in laboratory enzymology. The condition for passing the exam is to obtain more than 60% points, and according to this, students (85.46%) and ChatGPT-4.0 (74.55%) passed, while ChatGPT-3.5 (52.73%) did not achieve a sufficient score to pass the test. These results provide a better and closer insight into the possibility of applying chatbots in medical education-enzymology as part of biochemistry in this case.
Optimizing the learning process and exam preparation undoubtedly keeps pace with the development of new technologies. Newer technologies, such as artificial intelligence, have so far shown their potential and wide use among students due to their easy accessibility via web browsers, even without the necessary creation of a user account. Therefore, researchers from many medical fields have aimed to test the credibility and accuracy of the information obtained, often with reference sources [9]. Previously conducted studies carried out so far have shown high scores on tests of different purposes and different medical fields [10-12].
To the best of our knowledge, previous research did not examine the knowledge of the subdisciplines of medical biochemistry, such as laboratory enzymology. Previous studies have shown that students achieved average scores in medical biochemistry exams: 58% in the study of Surapaneni KM et al. [13], 59.3% in the study of Luke WANV et al. [14], and 70% in the study of Ghosh A et al. [15]. Although ChatGPT proved in the mentioned studies that it can be a useful tool for solving student exams in medical biochemistry, in a study by the same author, Surapaneni KM, several shortcomings (decreasing precision with the increasing complexity of questions, need to validate answers, differences in answers in two independent sessions) and poorer performance in solving clinical cases from this field was shown [16]. Additionally, our study suggests that chatbots can be a useful platform for obtaining information, but with variations between versions in the case of laboratory enzymology.
Limitations
The conducted study has several design and performance limitations. Firstly, the exam for ChatGPT was conducted only once. The accuracy of the answers was not tested more than once. The test questions did not involve recognizing images or solving any mathematical operations, which would be good indicators of the capabilities of artificial intelligence. As a research instrument, a self-administered test prepared by the author based on his knowledge and many years of experience was used instead of a validated instrument. The reason for this is the unavailability of some form of instrument for assessing students' knowledge of laboratory enzymology. At the same time, we believe that our test form could be used as a basis for a validated test form.
Conclusions
In our study, we examined and investigated the reliability of ChatGPT in answering questions from different aspects of laboratory enzymology. The results showed that ChatGPT has the ability to answer basic and clinically based-questions in this field. In this study, ChatGPT-4.0 scored better than ChatGPT-3.5.
Furthermore, we conclude that this study can be used to assess the possibility of implementing chatbots in higher and/or medical education. Our results showed that ChatGPT-4.0 is more reliable and accurate and has greater potential in the field of laboratory enzymology compared to ChatGPT-3.5. Based on the average score achieved by both versions, we believe that ChatGPT can be a good learning tool with additional software optimizations, but it still cannot replace a human.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Large language model, AI and scientific research: why Chat GPT is only the beginning J Neurosurg Sci Zangrossi P Martini M Guerrini F 2162246820243826130710.23736/S 0390-5616.23.06171-4 · doi ↗ · pubmed ↗
- 2Chat GPT in action: Harnessing artificial intelligence potential and addressing ethical challenges in medicine, education, and scientific research World J Methodol Jeyaraman M Ramasubramanian S Balaji S 1701781320233777186710.5662/wjm.v 13.i 4.170PMC 10523250 · doi ↗ · pubmed ↗
- 3Chat GPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology Eye (Lond) Kedia N Sanjeev S Ong J Chhablani J 125212613820243817258110.1038/s 41433-023-02915-z PMC 11076576 · doi ↗ · pubmed ↗
- 4Evaluating the reliability of Chat GPT for health-related questions: a systematic review Informatics Beheshti M Toubal IE Alaboud K 9122025
- 5Reliability of medical information provided by Chat GPT: assessment against clinical guidelines and patient information quality instrument J Med Internet Res Walker HL Ghani S Kuemmerli C 025202310.2196/47479 PMC 1036557837389908 · doi ↗ · pubmed ↗
- 6Utilization of Chat GPT in medical education: applications and implications for curriculum enhancement Acta Inform Med Ahmed Y 3003053120233837969010.5455/aim.2023.31.300-305PMC 10875960 · doi ↗ · pubmed ↗
- 7Chat GPT in medicine: A cross-disciplinary systematic review of Chat GPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction Medicine (Baltimore) Fatima A Shafique MA Alam K 0103202410.1097/MD.0000000000039250 PMC 1131554939121303 · doi ↗ · pubmed ↗
- 8Enzymes: principles and biotechnological applications Essays Biochem Robinson PK 1415920152650424910.1042/bse 0590001 PMC 4692135 · doi ↗ · pubmed ↗
