The capabilities, limitations and risks of generative AI are currently a topic of major interest, with widely varying predictions about the roles that language models may one day be able to fill. An area that's frequently brought up in this regard is healthcare, and this study is certainly food for thought.
Researchers from several institutions, including the School of Clinical Medicine at the University of Cambridge, tasked several language models with answering a large number of ophthalmology multiple-choice questions taken from a medical textbook. The same questions were put to eye doctors, junior eye doctors, and junior doctors who haven't yet picked a specialty, with the latter group intended to roughly correspond to the level of ophthalmology knowledge that might be expected of a GP. The full results are published here.
To summarise, performance varied widely across the language models, but the best results came from GPT-4, which answered 69% of the questions correctly. That was significantly better than the unspecialised doctors (43% on average) and slightly better than the ophthalmology trainees (59%), but worse than the ophthalmologists (76% on average, with the highest mark being 90%).
As Dr Arun Thirunavukarasu, who led the study, suggests in the pull quote below, one could speculate from this that language models won't replace specialists, but may one day be suited to a role in triage – determining, as a first port of call, whether a case is serious enough to be referred to a specialist for an expert opinion, much as a GP does currently.
On the other hand, it strikes me that multiple-choice textbook questions are likely to play to GPT-4's strengths. The researchers note that the questions weren't part of the language models' training data, but it wouldn't be surprising if other textbooks contained similar questions that the models had encountered during training. Perhaps more obviously, receiving a written summary of a patient's condition is very different from being confronted with a real-life patient and examining their eyes yourself, something a language model seems to stand little chance of doing. Textbook questions are usually written in a way that leads the reader towards the correct answer, whereas in reality everyone's eyes are different, and human judgment would seem an essential ingredient in a correct diagnosis. As the authors note, “Examination performance is an unvalidated indicator of clinical aptitude”.
Still, this may be a hint at what could one day be possible as AI continues to develop.