Frontier Large Language Models on the 2026 Spanish MIR Examination: A Multimodal Cross-Sectional Evaluation.

Manuel Carpio Salmerón; Carlos Carazo-Casas; Pau Benito; Clemente Garcia; Jesús Alonso-Carrillo; Beatriz Carratalá; Georgios Kyriakos; Pablo González-Castro

doi:10.6018/edumed.716401

Authors

Manuel Carpio Salmerón Medical graduate https://orcid.org/0009-0009-8772-4565
Carlos Carazo-Casas Department of Otolaryngology, Ramón y Cajal Hospital, Madrid, Spain https://orcid.org/0000-0001-7568-7140
Pau Benito Department of Preventive Medicine and Epidemiology, Clinical Institute of Medicine and Dermatology (ICMiD), Hospital Clínic de Barcelona, Barcelona, Spain https://orcid.org/0000-0002-2480-9133
Clemente Garcia Department of Radiology, Hospital Morales Meseguer, Murcia, Spain https://orcid.org/0009-0001-6672-2714
Jesús Alonso-Carrillo Department of Internal Medicine, Hospital 12 de Octubre, Madrid, Spain https://orcid.org/0000-0003-4910-1497
Beatriz Carratalá Innovation and Digital Projects Academic Department, Healthcademia, Madrid, Spain
Georgios Kyriakos Department of Endocrinology and Nutrition, Santa Lucía University General Hospital, Cartagena, Spain https://orcid.org/0000-0002-2459-8655
Pablo González-Castro Department of Plastic and Reconstructive Surgery, Virgen del Rocío University Hospital, Sevilla, Spain https://orcid.org/0009-0003-0077-126X

Keywords: Academic Research, Medical Education, Medical Residents, MIR Examination, Artificial Intelligence

Abstract

Introduction: Artificial intelligence (AI) tools based on language models are increasingly used in medical education, but their educational value in multilingual, high-stakes examinations remains insufficiently defined. Objective: To evaluate the accuracy of state-of-the-art AI tools on the official 2026 Spanish Médico Interno Residente (MIR) examination and to compare their performance with the Healthcademia educational chatbot, MIR academy teachers, and the average candidate. Materials and methods: A cross-sectional quantitative evaluation analyzed the 200 valid questions of the official Version 0 booklet of the 2026 MIR. Performance was assessed with automated pipelines for text-only items and manual multimodal input for the 24 image-associated questions. Two prompting conditions were tested, neutral and expert-role, with additional stratification by effort level for GPT-5.2. Accuracy was measured against the final official answer key. The human comparison groups were MIR academy teachers and the average candidate. Results: Several state-of-the-art AI tools showed very high accuracy on the 2026 MIR examination. The best-performing configuration, GPT-5.2 at high effort without an expert-role prompt, obtained 199/200 correct answers (99.5%). Gemini 3 Flash with an expert-role prompt obtained 198/200 (99.0%). MIR academy teachers obtained 194/200 (97.0%), whereas the average candidate obtained 131/200 (65.7%). On the image-associated questions, several multimodal configurations obtained 24/24 correct answers when the corresponding image was provided. Conclusions: Under the conditions evaluated, several state-of-the-art AI tools achieved near-perfect accuracy on a national residency entrance examination. These findings support the value of licensing examinations as benchmarks for evaluating AI tools applied to medical education and suggest usefulness for supervised feedback and self-assessment. Performance on multiple-choice questions should not be interpreted as evidence of autonomous clinical reasoning.
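
Scoring against an answer key, as described in the Materials and methods, reduces to counting matching items. The following is a minimal sketch in Python, assuming that responses and the key are stored as mappings from item number to chosen option. The function name `score`, the data layout, and the four-item example are hypothetical and are not taken from the study's pipeline or data.

```python
def score(responses: dict[int, int], answer_key: dict[int, int]) -> tuple[int, int, float]:
    """Return (correct, total, percent) for responses scored against a key.

    Assumption: every item in the key is scored, and an unanswered item
    counts as incorrect.
    """
    correct = sum(
        1 for item, right_option in answer_key.items()
        if responses.get(item) == right_option
    )
    total = len(answer_key)
    return correct, total, round(100 * correct / total, 1)


# Hypothetical four-item example (not data from the study).
key = {1: 3, 2: 1, 3: 4, 4: 2}
model_answers = {1: 3, 2: 1, 3: 2, 4: 2}
print(score(model_answers, key))  # (3, 4, 75.0)
```

With a 200-item key, 199 matching answers would yield 99.5 under this definition.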
