Large language model-assisted structured reporting in Radiology residents: an implementation pilot study in Emergency Radiology.

Authors

  • Clemente García Hidalgo Hospital Morales Meseguer, Murcia https://orcid.org/0009-0001-6672-2714
  • José Antonio Consentino Hernández Radiology Department, Hospital Universitario Morales Meseguer, Murcia, Spain. https://orcid.org/0009-0000-3350-3278
  • José Vicente Cayuela Espí Radiology Department, Hospital Universitario Morales Meseguer, Murcia, Spain. https://orcid.org/0009-0007-4055-6491
  • Gonzalo Pagán Vicente Radiology Department, Hospital Universitario Morales Meseguer, Murcia, Spain.
  • Juana María Plasencia Martínez Radiology Department, Hospital Universitario Morales Meseguer, Murcia, Spain.
  • Ana Blanco Barrio Radiology Department, Hospital Universitario Morales Meseguer, Murcia, Spain. https://orcid.org/0000-0001-6448-1972
  • Gloria Pérez Hernández Radiology Department, Hospital Universitario Morales Meseguer, Murcia, Spain. https://orcid.org/0000-0002-0974-8942
  • Ana Moreno Pastor Radiology Department, Hospital Universitario Morales Meseguer, Murcia, Spain. https://orcid.org/0000-0003-2498-3489
DOI: https://doi.org/10.6018/edumed.695571
Keywords: AI, Artificial Intelligence, LLM, Structured reporting, Medical Education, Radiology, Residents, Emergency Radiology, Pilot study

Abstract

Objective: To evaluate whether a structured reporting system assisted by large language models (LLMs) can be practically integrated into the work of radiology residents during on-call shifts. Secondary objectives were to describe format preferences through blinded evaluation, to characterize linguistic differences between manual and LLM-assisted reports, and to identify perceived risks ahead of a confirmatory study.

Methods: A two-component pilot study was conducted. In the prospective component, four residents generated 480 reports, alternating between manual and LLM-assisted writing (custom GPT-4o). In parallel, 200 anonymized reports from attending physicians were analyzed to contextualize the metrics. An ad hoc Likert-type survey (six dimensions) was administered, and classification and perplexity metrics were calculated as descriptive indicators.

Results: The tool was well received: median Likert scores ranged from 4.75 to 4.90 out of 5. Residents accurately distinguished which reports had been LLM-assisted (F1 = 0.92), suggesting a recognizable formal signature. A self-attribution bias was observed in blinded preferences. Perplexity differed between residents and attending physicians (p = 0.03), suggesting greater regularity among experienced professionals.

Conclusions: The findings support the initial integration of the assistant into the on-call workflow. Its value lies in its scaffolding function, standardizing communication between residents and requesting physicians, rather than in automating diagnostic reasoning.

Published
19-01-2026
How to Cite
García Hidalgo, C., Consentino Hernández, J. A., Cayuela Espí, J. V., Pagán Vicente, G., Plasencia Martínez, J. M., Blanco Barrio, A., … Moreno Pastor, A. (2026). Large language model-assisted structured reporting in Radiology residents: an implementation pilot study in Emergency Radiology. Spanish Journal of Medical Education, 7(1). https://doi.org/10.6018/edumed.695571