The Performance of AI on the Dentistry Specialization Exam

https://doi.org/10.1016/j.identj.2024.07.921
Open access under a Creative Commons license

AIM or PURPOSE

This study aims to compare the performance of four large language models (LLMs), ChatGPT-3.5 (Generative Pre-trained Transformer), ChatGPT-4, Google Gemini, and Microsoft Copilot, on restorative dentistry-related questions from the Dentistry Specialization Exam (DUS).

MATERIALS and METHOD

A total of 100 multiple-choice questions were used. The LLMs were asked questions drawn from two sources: official DUS questions downloaded from the database of ÖSYM, the Turkish state institution established by the Grand National Assembly that evaluates and places candidates applying to higher education programs through central examinations, and a DUS-style question bank provided by university educators. Image-based questions were excluded, as ChatGPT accepts only text-based input. The chi-square test was applied for statistical analysis.

RESULTS

The overall correct-response rates were 50% for Gemini, 49% for ChatGPT-4, 44% for Copilot, and 42% for ChatGPT-3.5. For case-based questions, Gemini and Copilot each achieved a correct-response rate of 46%, ChatGPT-4 attained 42%, and ChatGPT-3.5 attained 38%. For knowledge-based questions, the correct-response rates were 56% for ChatGPT-4, 54% for Gemini, 46% for ChatGPT-3.5, and 42% for Copilot. The differences among the models were not statistically significant.
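
The following minimal sketch illustrates the kind of chi-square comparison described in the methods, applied to the overall correct-response counts reported above (50, 49, 44, and 42 correct out of 100 questions per model). How the authors actually structured their contingency table is an assumption; this is one plausible analysis (a test of homogeneity of correct/incorrect counts across the four models), not the authors' exact procedure.

    # Chi-square test of homogeneity across the four models,
    # using the overall correct-response counts from the abstract.
    from scipy.stats import chi2_contingency

    models = ["Gemini", "ChatGPT-4", "Copilot", "ChatGPT-3.5"]
    correct = [50, 49, 44, 42]                 # correct answers out of 100 each
    incorrect = [100 - c for c in correct]     # remaining answers counted as incorrect

    table = list(zip(correct, incorrect))      # 4 x 2 contingency table
    chi2, p, dof, expected = chi2_contingency(table)

    print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")
    # A p-value well above 0.05 would be consistent with the reported finding
    # that the differences among the models were not statistically significant.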

CONCLUSION(S)

The research underscores the differing effectiveness of various LLMs across a range of question types. Although the models do not yet perform as intended and false-positive responses remain a problem, with further research the capabilities of LLMs in educational contexts are likely to evolve and improve.
