Machine Against Machine: Large Language Models (LLMs) in High-Stakes Applied Machine Learning Exams with Notes
Abstract
There is a significant gap in Computing Education Research (CER) on the impact of Large Language Models (LLMs) at advanced stages of undergraduate study. This paper addresses this gap by investigating how effectively LLMs answer final-year undergraduate Applied Machine Learning exam questions.
The study examines LLM performance on a range of exam questions, including exam formats designed with and without notes, across several levels of Bloom's Taxonomy. Question formats include open-ended, table-based, and figure-based questions.
To achieve this goal, the study has the following objectives:
Comparative Analysis: Compare LLM-generated and student-written answers in order to assess LLM performance.
Detector Evaluation: Evaluate the effectiveness of different LLM detectors. In addition, assess how well the detectors perform on text altered by students with the aim of evading detection.
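The detector-evaluation objective above can be sketched as a simple accuracy comparison on labeled text, before and after student alteration. This is a minimal illustrative sketch only: the `detect` function and the toy corpora are hypothetical stand-ins, not the detectors or data used in the study.

```python
def detect(text: str) -> bool:
    """Hypothetical stand-in for an LLM detector: flags text containing a marker."""
    return "llm" in text.lower()

def accuracy(samples: list[tuple[str, bool]]) -> float:
    """Fraction of samples where the detector's verdict matches the true label."""
    hits = sum(detect(text) == is_llm for text, is_llm in samples)
    return hits / len(samples)

# Toy corpora: (text, written_by_llm) pairs. The "altered" set simulates a
# student paraphrasing LLM output to evade detection.
unaltered = [("answer drafted by an LLM", True), ("student answer", False)]
altered = [("paraphrased l-l-m answer", True), ("student answer", False)]

print(accuracy(unaltered))  # 1.0: detector catches unaltered LLM text
print(accuracy(altered))    # 0.5: alteration defeats the naive detector
```

In a real evaluation one would report precision and recall separately, since falsely accusing a student (a false positive) is far more costly than missing LLM text.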
The research method of this paper incorporates a partnership between six students and eight academics. The students play an integral role in setting the direction of the project, especially in areas less familiar to faculty, such as the use of LLM detection tools.
This study contributes to understanding the role of LLMs in higher education, with implications for the design of future curricula and assessment techniques.
Copyright 2024 Revista de Educación a Distancia (RED)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.