AI’s best-known “chatbot” (a nickname for AI that produces language), ChatGPT, is making waves in many industries, including health care. However, it has not yet mastered the art of test-taking—at least not in ophthalmology, gastroenterology, or urology.
ChatGPT Passed Medical Licensing Exam
In an early 2023 study, ChatGPT (just barely) passed the United States Medical Licensing Examination (USMLE), a mandatory requirement for medical licensure. The USMLE consists of three tests. The first is taken by second-year medical students, who usually dedicate 300 to 400 hours to exam preparation; the second by fourth-year medical students; and the third by physicians who have typically completed six months to a year of postgraduate training. Passing all three grants a license to practice medicine in the United States without supervision.
ChatGPT was given no special preparation for the exam.
“ChatGPT performed at >50 percent accuracy across all examinations, exceeding 60 percent in some analyses,” according to the study. The pass threshold varies yearly but is usually near 60 percent.
ChatGPT’s scores improved as new chatbot versions were tested, and the researchers even suggested that, in the future, the chatbot may help create the USMLE.
However, it has since failed exams in three medical specialties.
ChatGPT Fails an Ophthalmology Test—Twice
Researchers at St. Michael’s Hospital in Toronto, Canada, measured ChatGPT’s test-taking skills in ophthalmology. On a widely used practice exam for ophthalmology board certification, the chatbot answered only 46 percent of the questions correctly on its first attempt. When retested a month later, it improved to 58 percent.
It may be that ChatGPT exhibits the human trait of test-taking anxiety.
When given real-world ophthalmologic scenarios, however, the chatbot excelled. In one study, researchers gave ChatGPT 10 ophthalmology case studies to analyze, and it provided the correct diagnosis in nine of them. Like the USMLE researchers, the authors expect AI to keep improving, stating: “Conversational AI models such as ChatGPT have potential value in the diagnosis of ophthalmic conditions, particularly for primary care providers.”
The Chatbot Fails in Gastroenterology
In a recent study published in The American Journal of Gastroenterology, ChatGPT-3 and ChatGPT-4 were given the American College of Gastroenterology self-assessment test. Both versions fell short of the 70 percent passing mark: on 455 questions, ChatGPT-3 scored 65.1 percent, while ChatGPT-4 scored slightly lower at 62.4 percent. In other words, the newer version showed no improvement over its predecessor.
One must wonder, what about the “chat” in ChatGPT? Can the chatbot answer patient questions about gastrointestinal health?
Researchers asked ChatGPT 110 “real-life” questions, and experienced gastroenterologists evaluated the answers for accuracy, clarity, and efficacy. The results were not promising. The researchers concluded: “While ChatGPT has potential as a source of information, further development is needed” because the quality of that information is contingent upon the quality of the training data.
ChatGPT Flunks Urology Exam
In a recent experiment, researchers tested ChatGPT on 135 questions from the American Urological Association Self-Assessment Study Program. The purpose was to gauge the chatbot’s usefulness to students and physicians preparing for board exams.
ChatGPT accurately answered only 26.7 percent of the open-ended questions and 28.2 percent of the multiple-choice questions. It declined to answer 15 of the multiple-choice questions, suggesting instead that the user consult a doctor.
Interestingly, ChatGPT defended its incorrect answers and, according to study authors, “continuously reiterated the original explanation despite its being inaccurate.”
The authors concluded that if ChatGPT’s use in medicine is not monitored or regulated, it could contribute to the spread of inaccurate medical information.
Perhaps Test-Taking Is Not the Best Way to Test AI
While AI failed to make the grade on these medical exams, it has demonstrated success in other domains. OpenAI keeps a list of tests its chatbot has passed, some with flying colors.
Are any of these tests the best way to measure intelligence? They certainly cannot measure genius, which—so far, anyway—is a distinctly human talent. Nor can they measure kindness or compassion, qualities patients rank highly when choosing a doctor.
Maybe the best way to measure the efficacy of AI is with time. Though research into machine learning dates back to the mid-20th century, ChatGPT has been widely accessible for less than six months. AI has been used in medical research since the 1970s and will likely change in ways we cannot yet imagine over the next few years.
Let us hope the change is for the better.