Limitations of large language model-generated multiple-choice questions in ophthalmology
As artificial intelligence (AI) becomes increasingly embedded in medical education, organizations are exploring the use of large language models (LLMs) to automate labor-intensive tasks. In a recent JAMA Ophthalmology study, Gholami et al. reported a pilot study that used OpenAI’s ChatGPT-4o to generate ophthalmology board-style multiple-choice questions (MCQs) (1). Ten ophthalmologists rated 121 questions (64 LLM-generated and 57 human-written) across five domains (appropriateness, clarity and specificity, suitability for trainees, discriminative power, and relevance). Median scores for LLM- and human-generated MCQs were essentially indistinguishable across all assessment criteria, including readability, implying that GPT-4o produced questions that resemble human-written questions. Furthermore, the LLM-generated items were largely unique: a similarity analysis against the American Academy of Ophthalmology (AAO) question bank used as input found that nearly 95% of LLM-generated items had low similarity scores (1). While overall interrater agreement was moderate, domain-level agreement was near chance. The study offers an important proof-of-concept: LLMs can rapidly produce new ophthalmology MCQs that, at a surface level, appear much like traditionally vetted human-authored questions, indicating a novel method of leveraging AI in medical assessment. However, it raises concerns about whether LLM-generated MCQs can truly assess clinical competence, and whether, by bypassing established validation pipelines, we are simply scaling a fundamentally flawed assessment format faster than we can govern it.
LLMs as content experts?
The Ophthalmic Knowledge Assessment Program (OKAP) and ophthalmology board certification exams consist of MCQs, positioning this format as a major assessment tool in the specialty. LLMs are a natural candidate for generating such questions, as they have been useful for writing, editing, and summarizing medical content. They have also shown strong performance in answering human-generated MCQs, with models achieving passing scores on the United States Medical Licensing Examination (USMLE) and the Family Medicine In-Training exam (a surrogate for first-time board certification pass rates) (2,3). These accomplishments, however, underscore a deeper concern: the extent to which LLMs can create and answer MCQs that truly capture clinically meaningful knowledge and reasoning. Conventional question development for high-stakes evaluations is intentionally slow and resource-intensive, using repeated expert review to safeguard validity and reliability and to impel examinees to think critically to reach an answer (1). LLMs, in contrast, generate and answer questions by predicting the most probable next token in a sequence, and thus may not be able to handle more difficult questions requiring critical thinking. For instance, when Bedi et al. [2025] introduced a “None of the Above” option via perturbation testing, performance of all AI models on medical reasoning MCQs dropped substantially (4). This finding suggests that LLMs rely heavily on surface pattern recognition rather than robust clinical reasoning, and that speed and volume of item generation do not equate to educational quality. Absent rigorous validation, accelerated AI-assisted item creation risks weakening rather than strengthening assessment standards: essentially, garbage in, garbage out. The drive to scale LLM-generated content also exposes a profound absence of AI governance in medical education.
Reviews of LLM use in medical education acknowledge potential reductions in educator workload but consistently warn of hallucinations, bias, and ethical concerns (5,6). Fundamental questions remain unanswered. Who owns AI-generated assessment content? Who audits errors? Who is accountable for flawed content?
Furthermore, the authors rightly note that GPT-4o’s knowledge is bounded by its training cutoff and that the model cannot incorporate real-time updates without explicit retraining or external tools. Studies of ChatGPT’s performance on the USMLE have similarly highlighted the limitations imposed by training cutoffs, as well as the challenges introduced by model updates and version drift (7). A model that performs well on a given exam today may behave differently after a parameter update, complicating longitudinal validity and reproducibility. In a rapidly evolving field like ophthalmology, training tomorrow’s physicians on yesterday’s knowledge is not innovation—it is regression at scale. While it is true that medical education content (including textbooks and question banks) often lags behind evidence-based medicine, this does not absolve LLM-based systems from the responsibility to keep pace with practice-changing data and guideline updates.
Automation without governance does not modernize education. It industrializes uncertainty. In clinical informatics, clinical decision support (CDS) tools are governed through version control, continuous monitoring, audit mechanisms, and safety oversight. If CDS requires safety oversight because it shapes clinical decisions, assessments that evaluate licensure and competence deserve the same rigor.
The illusion of equivalence
Gholami et al. [2025] surveyed 24 ophthalmologists as domain experts (13 had helped develop the OKAP and 11 were members of the Resident Assessment Committee). Ten ophthalmologists responded; their clinical ophthalmology experience ranged from 1 to 28 years, with a median of 6 years. Despite this breadth of experience, the expert validators’ ratings showed no significant differences between human- and LLM-generated questions in any domain (1). It is both striking and concerning that reviewer agreement on the detailed strengths and weaknesses of individual questions was so low. This pattern resonates with long-standing feedback that current MCQ exams, whether for the OKAP or board certification, sometimes feel unfair, overly subtle, or idiosyncratic. If experts cannot consistently identify ambiguity, poor discriminative power, or misalignment with practice, it is unsurprising that residents experience variability in question quality as noise rather than as a signal of their medical knowledge. Crucially, this lack of agreement was similar for LLM-generated and human-written questions. The instability in ratings does not appear to arise from the generator (human versus model) but from the assessment framework and rating process itself. Before we anoint LLMs as equivalent to human experts, we must acknowledge that the yardstick we are using to make that comparison is itself poorly calibrated.
The authors also report a moderate intraclass correlation coefficient (ICC ~0.63) for overall ratings across questions, but Krippendorff’s alpha values for individual domains clustered around zero for both human- and LLM-generated items. Low alpha values suggest that the measured dimensions are not being interpreted in a stable, shared way by experts. In practical terms, this means expert raters were only marginally more consistent than chance when judging the strengths and weaknesses of specific questions. Only moderate agreement on aggregate scores, coupled with near-random agreement at the domain level, carries an important implication: even when raters appear to “agree” on a question’s overall quality, they may be converging on the wrong aspects. For example, a stem may be clinically sound but ambiguous, or discriminative but misaligned with actual practice.
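To make the domain-level agreement statistic concrete, the following minimal sketch computes Krippendorff’s alpha for a single rating domain. It assumes an interval distance metric (squared differences between ratings); which metric Gholami et al. used is not stated here. The function name and the example ratings are hypothetical and are not data from the study.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with an interval (squared-difference) metric.

    `units` is a list of questions; each question is a list of numeric
    ratings, one per rater. Questions with fewer than two ratings are
    skipped because they contain no pairable values.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences between ratings of the
    # same question, weighted by 1 / (number of ratings - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences between all pairs of
    # ratings, ignoring which question they belong to.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))

    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


# Hypothetical 1-5 ratings from three raters on four questions.
consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5], [2, 2, 2]]
scattered = [[1, 3, 5], [5, 1, 3], [3, 5, 1], [1, 5, 3]]

print(krippendorff_alpha_interval(consistent))
print(krippendorff_alpha_interval(scattered))
```

In this toy data the first set yields an alpha of 1.0 (perfect agreement), while the second yields a negative value: raters disagree within each question at least as much as ratings differ across the whole set. Values near zero, as reported for the individual domains, sit between these extremes and indicate agreement no better than chance.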
Novelty without validity
The authors found that 95.31% of LLM-generated questions had a maximum similarity score <60 against the AAO question bank, and only a single exact duplicate was identified. This is an important contribution: it shows that when constrained by appropriate guardrails, LLMs can produce content that is not merely regurgitated from existing items. However, novelty does not guarantee validity, clinical accuracy, or educational value. Questions can be entirely “new” in wording yet still misrepresent current standards of care, hinge on an irrelevant subtlety, or test trivia rather than clinically meaningful reasoning. The danger in focusing on string similarity metrics alone is that we risk equating textual uniqueness with psychometric or pedagogical quality.
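A string-similarity screen of the kind described above can be sketched as follows. This is an illustrative stand-in, not the authors’ method: it assumes a 0 to 100 score derived from Python’s `difflib.SequenceMatcher`, whereas the metric used in the study is not specified here. The function names, the example stems, and the reuse of 60 as a cutoff are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 character-level similarity score (case-insensitive)."""
    matcher = SequenceMatcher(None, a.lower(), b.lower(), autojunk=False)
    return 100.0 * matcher.ratio()


def max_similarity(candidate, bank):
    """Highest similarity between one candidate item and any bank item."""
    return max((similarity(candidate, item) for item in bank), default=0.0)


def flag_near_duplicates(candidates, bank, threshold=60.0):
    """Return candidates whose best match in the bank meets the threshold."""
    return [c for c in candidates if max_similarity(c, bank) >= threshold]


# Hypothetical question stems; not taken from the AAO bank or the study.
bank = [
    "A 65-year-old man presents with sudden painless vision loss in one eye. "
    "What is the most likely diagnosis?",
]
candidates = [
    bank[0],  # exact duplicate, scores 100 against the bank
    "Which imaging modality best quantifies retinal nerve fiber layer thickness?",
]

for stem in candidates:
    print(round(max_similarity(stem, bank), 1), stem[:50])
print(len(flag_near_duplicates(candidates, bank)), "item(s) flagged")
```

A low score from such a screen shows only that the wording is new. It carries no information about whether the item is clinically accurate or measures the intended competency, which is the limitation discussed here.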
Assessment systems should therefore be optimized for construct fidelity, not textual uniqueness. For MCQs, that means anchoring item generation and review to explicit models of the underlying competencies being tested, tying each question to observable performance and patient-centered outcomes wherever possible. Future evaluations should also assess whether LLM-generated questions adequately capture the breadth and depth of ophthalmology. One approach is to map questions to the American Board of Ophthalmology’s (ABO) content outline, ensuring that common conditions as well as ophthalmologic emergencies are tested. Similarity scores may be a useful guardrail for reducing copyright risk or overt duplication, but they should remain secondary to scrutiny of what each question actually measures.
Hallucinations are a well-documented limitation of LLMs. The authors provide limited opportunity for independent verification (only two LLM-generated questions with answer choices are available in Tables 1 and 2; a list of LLM-generated questions without answer choices is available in Table 3), and an informal review by our team suggests the provided questions are factually reasonable. However, a systematic evaluation of clinical accuracy across the full question set was not performed and represents an important gap.
At the same time, we must recognize the broader ecosystem implications. We must embrace AI because it is already accessible to both exam designers and examinees; this study is an early exemplar of the use case we have been promised. But if a credentialing body can use LLMs to generate questions conditioned on copyrighted materials and proprietary templates, examinees can do something similar, even without sanctioned access. This raises uncomfortable but necessary questions. If anyone can produce exam-like questions using LLMs, is this still the best way to assess readiness for boards and, ultimately, for autonomous clinical practice?
If the bottleneck shifts from generating stems to guarding item banks and controlling exposure, we may find ourselves locked into a fragile equilibrium: high-stakes decisions grounded in formats that technology has rendered increasingly easy to imitate and increasingly hard to secure.
Special considerations in ophthalmology
Ophthalmology is inherently visual. Many specialties in medicine rely on image-based questions to assess visual pattern recognition and diagnostic skill. Clinical competence in retina or glaucoma, for example, depends heavily on interpreting multimodal imaging, integrating subtle visual cues with patient history, and managing uncertainty over time. The current study evaluates text-based MCQs but does not examine image-based or multimodal questions. This is understandable given the state of the technology: contemporary LLMs still have clear limitations in generating and reasoning over high-fidelity clinical images, particularly when subtle anatomic or pathologic features determine the correct answer. Yet from an educational design perspective, this is a missed opportunity. By focusing solely on text, we reinforce a narrow slice of competence—verbal recall and verbal reasoning—while underemphasizing the visual and integrative skills that define ophthalmic practice. Even the most carefully engineered text MCQs may systematically under-assess key capabilities, such as rapid recognition of sight-threatening findings or the ability to synthesize cross-modality data (visual fields, optical coherence tomography, fundus photography) into a coherent plan.
Similarly, MCQs, even when generated by LLMs, are poorly suited to assessing surgical skill because they can only probe what a learner knows or recognizes, not what they can actually do. Surgical competence depends on psychomotor abilities, visuospatial judgment, intraoperative decision-making under uncertainty, and team communication—capacities that emerge in real time and in context. An examinee might select the correct answer about the next step in cataract surgery yet be unable to perform that maneuver safely, efficiently, or adaptively in the operating room. Conversely, a technically excellent surgeon may miss a subtly worded distractor without any implication for their hands-on performance. As a result, MCQs can support the assessment of procedural knowledge and preparation, but they cannot validly capture the embodied, performance-based aspects of surgical practice that matter most for patient outcomes.
The problem is not just how we generate questions—it is what we choose to measure. Scaling text-based MCQs with LLMs risks perfecting an assessment modality that only partially reflects the real work of ophthalmologists.
The missed opportunity: beyond efficiency and better precision in education
The study under discussion is an important proof-of-concept for LLM-based question generation, but it also highlights what is missing: a true learning health system for medical education. Such a system would treat assessment not as a static product but as a continuously improving service, tightly coupled to real performance and patient outcomes. This means prioritizing actionable knowledge over throughput. Trainee interactions should result in data that closes measurable learning gaps rather than simply accelerating question generation. Efficiency should be a downstream consequence of a well-functioning educational infrastructure rather than its primary objective.
Two areas where LLMs can improve medical education outside of question generation are Objective Structured Clinical Examinations (OSCEs) and the grading of medical documentation (8,9). With sufficient prompt engineering, educators can develop AI-generated standardized patients based on real, deidentified clinical presentations for learners to interact with virtually. In clinical settings, LLMs can be used to automatically assess and improve resident physician clinical reasoning documentation, an area where resident physicians often receive limited structured feedback (9).
A proposed, more ambitious architecture for LLM use in ophthalmology education is shown in Figure 1. This kind of iterative, data-driven pipeline parallels how health systems with advanced analytics approach high-stakes decisions: not as a one-time implementation but as an ongoing endeavor requiring governance, monitoring, and adjustment. Reviews of LLM deployment in medical education and healthcare consistently advocate for continuous monitoring with human-in-the-loop oversight, institutional governance, and feedback loops analogous to those used for CDS systems. Robustness-focused evaluation frameworks stress the need for ongoing reassessment of model behavior, not just one-time validation studies.
Without informatics infrastructure—data pipelines, governance mechanisms, performance analytics—LLMs accelerate noise, not learning. They can increase the volume and variety of questions, but they cannot guarantee that those questions are meaningful, fair, or aligned with patient-centered outcomes.
Conclusions
The study by Gholami et al. demonstrates that LLMs can indeed write ophthalmology MCQs whose surface quality approximates that of expert-authored items (1). This is a genuine technical achievement, and it opens promising avenues for expanding and diversifying educational resources. But the hard problem in assessment is no longer writing questions. The real challenge is deciding which questions deserve to exist and proving that they actually improve education and, ultimately, patient care. To move from proof-of-concept to responsible deployment, several shifts are necessary. We need clear, operationalized definitions of item quality and construct validity, accompanied by rater training and calibration. Reliance on ad hoc Likert scales with near-zero interrater agreement is insufficient for benchmarking human versus machine performance. Organizations must establish formal governance structures for LLM-generated educational content, including policies for prompt design, model selection, data sources, human oversight, and decommissioning of problematic items. Item performance should be tied to meaningful outcomes: not only exam scores, but also clinical reasoning in simulations, patient safety metrics, and longitudinal practice patterns. Questions that do not predict clinical competence should be revised or retired, regardless of how polished they appear. LLM-generated assessments must be routinely audited for bias across gender, race, language background, and training environment. Finally, educational programs and certifying bodies need dedicated informatics leadership—professionals who understand both AI systems and assessment science. Without this expertise, we risk treating LLMs as black boxes that magically produce “more questions”, rather than as tools that must be embedded in a carefully engineered ecosystem.
Acknowledgments
Generative AI (OpenAI’s ChatGPT 5.2) was used to help develop the outline, refine sentence phrasing, and identify typographical errors.
Footnote
Provenance and Peer Review: This article was commissioned by the editorial office, Annals of Translational Medicine. The article has undergone external peer review.
Peer Review File: Available at https://atm.amegroups.com/article/view/10.21037/atm-2026-0043/prf
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://atm.amegroups.com/article/view/10.21037/atm-2026-0043/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Gholami S, Mummert DB, Wilson B, et al. Leveraging Large Language Models to Generate Multiple-Choice Questions for Ophthalmology Education. JAMA Ophthalmol 2025;143:955-61.
- Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023;9:e45312.
- Hanna RE, Smith LR, Mhaskar R, et al. Performance of Language Models on the Family Medicine In-Training Exam. Fam Med 2024;56:555-60.
- Bedi S, Jiang Y, Chung P, et al. Fidelity of Medical Reasoning in Large Language Models. JAMA Netw Open 2025;8:e2526021.
- Klang E, Portugez S, Gross R, et al. Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4. BMC Med Educ 2023;23:772.
- Vrdoljak J, Boban Z, Vilović M, et al. A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare (Basel) 2025;13:603.
- Bicknell BT, Butler D, Whalen S, et al. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis. JMIR Med Educ 2024;10:e63430.
- Malhotra K, Elkin Z. Enhancing ophthalmology OSCEs with generative artificial intelligence patient simulation and integration of the EHR. Presented at: NYU Grossman School of Medicine Tenth Annual Medical Education Innovations & Scholarship Conference; October 2025; New York, NY.
- Schaye V, Guzman B, Burk-Rafel J, et al. Development and Validation of a Machine Learning Model for Automated Assessment of Resident Clinical Reasoning Documentation. J Gen Intern Med 2022;37:2230-8.

