Frontier Large Language Models on the 2026 Spanish MIR Examination: A Multimodal Cross-Sectional Evaluation.

Manuel Carpio Salmerón; Carlos Carazo-Casas; Pau Benito; Clemente Garcia; Jesús Alonso-Carrillo; Beatriz Carratalá; Georgios Kyriakos; Pablo González-Castro

doi:10.6018/edumed.716401

Authors

Manuel Carpio Salmerón, Department of Endocrinology and Nutrition, Santa Lucía University General Hospital, Cartagena, Spain. https://orcid.org/0009-0009-8772-4565
Carlos Carazo-Casas, Department of Otolaryngology, Ramón y Cajal Hospital, Madrid, Spain. https://orcid.org/0000-0001-7568-7140
Pau Benito, Department of Preventive Medicine and Epidemiology, Clinical Institute of Medicine and Dermatology (ICMiD), Hospital Clínic de Barcelona, Barcelona, Spain. https://orcid.org/0000-0002-2480-9133
Clemente Garcia, Department of Radiology, Hospital Morales Meseguer, Murcia, Spain. https://orcid.org/0009-0001-6672-2714
Jesús Alonso-Carrillo, Department of Internal Medicine, Hospital 12 de Octubre, Madrid, Spain. https://orcid.org/0000-0003-4910-1497
Beatriz Carratalá, Innovation and Digital Projects Academic Department, Healthcademia, Madrid, Spain.
Georgios Kyriakos, Department of Endocrinology and Nutrition, Santa Lucía University General Hospital, Cartagena, Spain. https://orcid.org/0000-0002-2459-8655
Pablo González-Castro, Department of Plastic and Reconstructive Surgery, Virgen del Rocío University Hospital, Sevilla, Spain. https://orcid.org/0009-0003-0077-126X

Keywords: Academic Research, Medical Education, Medical Residents, MIR Examination, Artificial Intelligence

Abstract

Introduction: Large language model-based AI systems are increasingly used in medical education, but their educational value in multilingual, high-stakes examination settings remains insufficiently defined.

Objective: To assess the accuracy of latest-generation AI systems on the official 2026 Spanish Médico Interno Residente (MIR) examination and compare their performance with the Healthcademia chatbot, AMIR faculty experts, and average candidates.

Materials and Methods: A cross-sectional quantitative evaluation analyzed all 200 valid questions from the official MIR 2026 Version 0 booklet. Performance was assessed using automated pipelines for text-only items and manual multimodal input for the 24 image-associated questions. Two instruction conditions, neutral and expert-role, were tested, with additional effort-level stratification for GPT-5.2. Accuracy was measured against the final official answer key. Human comparison groups included AMIR faculty experts and the average MIR candidate.

Results: Several latest-generation AI systems showed very high accuracy on the MIR 2026 examination. The highest-performing configuration, GPT-5.2 high effort without expert-role instruction, achieved 199/200 correct answers (99.5%). Gemini 3 Flash with expert-role instruction achieved 198/200 (99.0%). AMIR faculty experts achieved 194/200 (97.0%), whereas the average candidate achieved 131/200 (65.7%). On image-associated questions, several multimodal configurations achieved 24/24 correct answers when the corresponding image was provided.

Conclusions: Under the conditions evaluated in this study, several AI configurations achieved near-perfect accuracy on a complete, high-stakes national medical licensing examination. These findings support licensing-style examinations as benchmarks for educational AI and suggest potential use in supervised feedback and self-assessment. Performance on multiple-choice items should not be interpreted as evidence of autonomous clinical reasoning.
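
The accuracy figures above are simple proportions of correct answers against the official answer key. The following minimal Python sketch illustrates that arithmetic for three of the reported counts. It is not the authors' scoring pipeline: the function names `count_correct` and `accuracy_percent` are hypothetical, and the three-question answer key in the toy example is invented purely for illustration.

```python
def count_correct(responses: dict[int, int], answer_key: dict[int, int]) -> int:
    """Count questions whose selected option matches the answer key.

    Unanswered questions (missing from `responses`) count as incorrect.
    """
    return sum(1 for q, key in answer_key.items() if responses.get(q) == key)


def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return round(100 * correct / total, 1)


# Counts taken from the abstract (correct answers, valid questions).
reported = {
    "GPT-5.2 high effort, neutral instruction": (199, 200),
    "Gemini 3 Flash, expert-role instruction": (198, 200),
    "AMIR faculty experts": (194, 200),
}

for label, (correct, total) in reported.items():
    print(f"{label}: {correct}/{total} = {accuracy_percent(correct, total):.1f}%")

# Toy example with an invented 3-question key (assumption, not study data).
toy_key = {1: 3, 2: 1, 3: 4}
toy_responses = {1: 3, 2: 2, 3: 4}
toy_correct = count_correct(toy_responses, toy_key)
print(f"Toy example: {toy_correct}/{len(toy_key)} correct")
```

Keeping the count and the percentage as separate steps means the same helper can be applied to any subset of items, for example the 24 image-associated questions.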
