Frontier Large Language Models on the 2026 Spanish MIR Examination: A Multimodal Cross-Sectional Evaluation.
Abstract
Introduction: Large language model-based AI systems are increasingly used in medical education, but their educational value in multilingual, high-stakes examination settings remains insufficiently defined.
Objective: To assess the accuracy of latest-generation AI systems on the official 2026 Spanish Médico Interno Residente (MIR) examination and compare their performance with the Healthcademia chatbot, AMIR faculty experts, and average candidates.
Materials and Methods: A cross-sectional quantitative evaluation analyzed all 200 valid questions from the official MIR 2026 Version 0 booklet. Performance was assessed using automated pipelines for text-only items and manual multimodal input for the 24 image-associated questions. Two instruction conditions, neutral and expert-role, were tested, with additional effort-level stratification for GPT-5.2. Accuracy was measured against the final official answer key. Human comparison groups included AMIR faculty experts and the average MIR candidate.
Results: Several latest-generation AI systems showed very high accuracy on the MIR 2026 examination. The highest-performing configuration, GPT-5.2 high effort without expert-role instruction, achieved 199/200 correct answers (99.5%). Gemini 3 Flash with expert-role instruction achieved 198/200 (99.0%). AMIR faculty experts achieved 194/200 (97.0%), whereas the average candidate achieved 131/200 (65.7%). On image-associated questions, several multimodal configurations achieved 24/24 correct answers when the corresponding image was provided.
Conclusions: Under the conditions evaluated in this study, several AI configurations achieved near-perfect accuracy on a complete, high-stakes national medical licensing examination. These findings support licensing-style examinations as benchmarks for educational AI and suggest potential use in supervised feedback and self-assessment. Performance on multiple-choice items should not be interpreted as evidence of autonomous clinical reasoning.
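To make the accuracy measure concrete, the following is a minimal sketch of scoring a set of responses against an answer key and converting the count to a percentage. It is not the authors' pipeline: the names `Score` and `score_answers`, the dictionary layout, and the four-item toy key are hypothetical, and unanswered items are assumed to count as incorrect. Only the integer counts reported above for the AI configurations and the faculty experts are reproduced; the average-candidate figure is left out because it is presumably a mean over candidates rather than a single count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """Number of correct answers out of the number of scored questions."""
    correct: int
    total: int

    @property
    def accuracy_pct(self) -> float:
        # Plain accuracy as a percentage, rounded to one decimal place.
        return round(100 * self.correct / self.total, 1)


def score_answers(responses: dict[int, int], answer_key: dict[int, int]) -> Score:
    """Compare responses with the key, question by question.

    Assumption: a question with no response counts as incorrect.
    """
    correct = sum(
        1 for question, answer in answer_key.items()
        if responses.get(question) == answer
    )
    return Score(correct, len(answer_key))


# Hypothetical toy data: question number -> selected option.
toy_key = {1: 3, 2: 1, 3: 4, 4: 2}
toy_responses = {1: 3, 2: 1, 3: 2, 4: 2}
toy = score_answers(toy_responses, toy_key)
print(f"toy example: {toy.correct}/{toy.total} = {toy.accuracy_pct}%")

# Integer counts taken from the Results paragraph of the abstract.
reported = {
    "GPT-5.2 high effort, neutral instruction": Score(199, 200),
    "Gemini 3 Flash, expert-role instruction": Score(198, 200),
    "AMIR faculty experts": Score(194, 200),
}
for name, score in reported.items():
    print(f"{name}: {score.correct}/{score.total} = {score.accuracy_pct}%")
```

The same function would apply unchanged to the 24 image-associated questions by passing only that subset of the key.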
Copyright (c) 2026 Servicio de Publicaciones de la Universidad de Murcia

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
The works published in this journal are subject to the following terms:
1. The Publications Service of the University of Murcia (the publisher) retains the economic rights (copyright) to the published works and encourages and permits their reuse under the license indicated in point 2.
2. The works are published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 license.
3. Self-archiving conditions. Authors are permitted and encouraged to disseminate electronically the pre-print version (the version before evaluation and submission to the journal) and/or the post-print version (the version evaluated and accepted for publication) of their works before publication, since this favors their circulation and earlier dissemination and may thereby increase their citation and reach within the academic community.

This is a Diamond Journal 