# Revolutionizing Educational Assessment: The Role of Bing Chat GPT-4 Chatbot in Increasing Efficiency in Grading Open-Ended Questions

**Authors:** Diane Abadam-Eremeev, Rachel P Seng, Janani Srinivasan, Kyle Knouna, Joseph R Bogaudo, Paulo A Rodriguez De La Nuez, Alexey Podcheko

PMC · DOI: 10.7759/cureus.100508 · Cureus · 2025-12-31

## TL;DR

This study compares how consistently Bing Chat and university faculty grade open-ended medical questions, and finds higher inter-rater reliability among Bing Chat accounts than among faculty.

## Contribution

This paper evaluates Bing Chat as an AI tool for grading open-ended questions in medical education and reports higher inter-rater reliability for it than for faculty graders.

## Key findings

- Bing Chat showed higher consistency than faculty in grading open-ended questions.
- Bing Chat's grading closely paralleled faculty grading on most question types, but differed significantly on a combined recall-and-application question (p = 0.010).
- Faculty grading had more variability compared to Bing Chat's scores.

## Abstract

Background

Bing Chat (Microsoft Corporation, Redmond, WA) is an artificial intelligence (AI) program that can respond to typed text. Many educational institutions are exploring the incorporation of AI into their materials, given its rapid advancements and potential to streamline various processes. One area that still requires investigation is the ability of AI models to grade open-ended questions. The purpose of this study is to determine whether Bing Chat or university faculty exhibit greater consistency in grading such questions.

Methods

The authors recruited 21 medical students from the American University of the Caribbean (AUC) to answer five open-ended questions on United States Medical Licensing Examination (USMLE) Step 1 topics. The volunteer participants were first-year and second-year medical students. Each student's responses to the five questions were graded by six different Bing Chat accounts and six faculty members. Differences in scores between Bing Chat and the faculty members were compared, and inter-rater reliability estimates were calculated.

Results

Both Bing Chat and faculty graded the same responses consistently; some variability appeared in both groups, but it was more pronounced in faculty grading. For analysis questions, problem-solving questions with elements of application, explanatory questions, and recall-type questions, Bing Chat's grading closely paralleled that of the faculty. However, on a combined recall-and-application question, Bing Chat and faculty scores differed significantly (p = 0.010). Overall, Bing Chat demonstrated higher inter-rater reliability than faculty, as evidenced by both percent agreement and Gwet's agreement coefficient 1 (AC1).
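
The two reliability measures named in the Results can be illustrated with a minimal sketch. It assumes scores are treated as unordered categories and that every rater scores every response. The function name `agreement_stats` and the sample ratings are hypothetical, and the abstract does not state the paper's exact computation (for example, whether any weighting was used), so this may differ from the authors' method.

```python
from collections import Counter


def agreement_stats(ratings, categories):
    """Return (percent_agreement, gwet_ac1) for multiple raters.

    ratings    -- list of rows, one row per graded response; each row holds
                  the score given by each rater (at least two raters per row)
    categories -- list of all possible scores (at least two distinct values)
    """
    n = len(ratings)
    q = len(categories)
    pa_sum = 0.0
    # pi[k] accumulates the share of raters who chose category k
    pi = {k: 0.0 for k in categories}

    for row in ratings:
        r = len(row)
        counts = Counter(row)
        # Fraction of rater pairs that agree on this response
        pa_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        for k, c in counts.items():
            pi[k] += c / r

    pa = pa_sum / n  # observed (percent) agreement
    pi = {k: v / n for k, v in pi.items()}
    # Chance agreement as defined for Gwet's AC1
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    ac1 = (pa - pe) / (1 - pe)
    return pa, ac1


if __name__ == "__main__":
    # Hypothetical data: 4 responses, each scored 0-2 by 3 raters
    sample = [[2, 2, 2], [1, 2, 1], [0, 0, 0], [1, 1, 2]]
    pa, ac1 = agreement_stats(sample, categories=[0, 1, 2])
    print(f"percent agreement = {pa:.3f}, AC1 = {ac1:.3f}")
```

Running the same function once on the six Bing Chat accounts' scores and once on the six faculty members' scores would give the kind of side-by-side comparison the Results describe, under the assumptions stated above.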

Conclusion

Bing Chat demonstrated promising results in evaluating written answers to open-ended questions and shows potential as a supportive grading tool. As educational leaders seek more dependable, faster, and economical methods for assessment, Bing Chat may offer a notable contribution to education. Large language models (LLMs), such as Bing Chat, can be beneficial to both students and educators.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12857850/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12857850/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/PMC12857850/full.md

---
Source: https://tomesphere.com/paper/PMC12857850