Machine vs Machine: Large Language Models (LLMs) in Applied Machine Learning High-Stakes Open-Book Exams

Keith Quille; Csanad Alattyanyi; Brett A. Becker; Roisin Faherty; Damien Gordon; Miriam Harte; Svetlana Hensman; Markus Hofmann; Jorge Jiménez García; Anthony Kuznetsov; Conrad Marais; Keith Nolan; Ciaran O'Leary; Cianan Nicolai; Andrzej Zero

doi:10.6018/red.603001

Autores/as

Keith Quille TU Dublin, Ireland https://orcid.org/0000-0002-1414-5142
Csanad Alattyanyi TU Dublin, Ireland https://orcid.org/0009-0007-8307-7765
Brett A. Becker UCD, Ireland https://orcid.org/0000-0003-1446-647X
Roisin Faherty TU Dublin, Ireland https://orcid.org/0000-0002-3906-1500
Damien Gordon TU Dublin, Ireland https://orcid.org/0000-0002-3875-4065
Miriam Harte TU Dublin, Ireland https://orcid.org/0009-0007-9334-0416
Svetlana Hensman TU Dublin, Ireland https://orcid.org/0000-0002-1804-2925
Markus Hofmann TU Dublin, Ireland https://orcid.org/0000-0002-9215-8862
Jorge Jiménez García TU Dublin, Ireland https://orcid.org/0009-0003-4855-7705
Anthony Kuznetsov TU Dublin, Ireland https://orcid.org/0009-0004-3310-4380
Conrad Marais TU Dublin, Ireland https://orcid.org/0009-0005-9963-9844
Keith Nolan TU Dublin, Ireland https://orcid.org/0000-0001-7974-4253
Ciaran O'Leary TU Dublin, Ireland https://orcid.org/0000-0002-0070-6420
Cianan Nicolai TU Dublin, Ireland https://orcid.org/0009-0008-8245-8870
Andrzej Zero TU Dublin, Ireland https://orcid.org/0009-0003-0326-989X

Palabras clave: Machine Learning aplicado, IA, LLM, ChatGPT, Programas Transformers, Detección, Performance educativa, Aprendizaje Automático Aplicado, Rendimiento

Resumen

Existe un importante vacío en la Investigación de Educación en Computación (CER) sobre el impacto de Modelos de Lenguaje de Gran Escala (LLM) en etapas avanzadas de estudios de grado. Este artículo trata de cubrir este vacío investigando la efectividad de las LLM respondiendo preguntas de examen de Aprendizaje Automático Aplicado en último curso de Grado.

El estudio examina el desempeño de las LLM al responder a una variedad de preguntas de examen, que incluyen modelos de examen diseñados con y sin apuntes, a varios niveles de la Taxonomía de Bloom. Los formatos de pregunta incluyen de respuesta abierta, basadas en tablas, o en figuras.

Para conseguir esta meta, este estudio tiene los siguientes objetivos:

Análisis Comparativo: Comparar respuestas generadas por LLM y por estudiantes para juzgar el desempeño de las LLM.

Evaluación de Detectores: Evaluar la eficacia de diferentes detectores de LLM. Además, juzgar la eficacia de los detectores sobre texto alterado por alumnos con el objetivo de engañar a los detectores.

El método investigador de este artículo incorpora una relación entre seis alumnos y ocho profesores. Los estudiantes juegan un rol integral para determinar la dirección del proyecto, en especial en áreas poco conocidas para el profesorado, como el uso de herramientas de detección de LLM.

Este estudio contribuye a entender el rol de las LLM en el ámbito de la educación universitaria, con implicaciones para el diseño de futuros curriculums y técnicas de evaluación.

Descargas

Citas

Becker, B. A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., & Santos, E. A. (2023). Programming is hard - or at least it used to be: Educational opportunities and challenges of AI code generation. Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 500–506. https://doi.org/10.1145/3545945.35697599

Becker, B. A., Denny, P., Pettit, R., Bouchard, D., Bouvier, D. J., Harrington, B., Kamil, A., Karkare, A., McDonald, C., Osera, P.-M., Pearce, J. L., & Prather, J. (2019). Compiler error messages considered unhelpful: The landscape of text-based programming error message research. Proceedings of the Working Group Reports on Innovation and Technology in Computer Science Education, 177–210. https://doi.org/10.1145/3344429.33725088

Bernabei, M., Colabianchi, S., Falegnami, A., & Costantino, F. (2023). Students’ use of large language models in engineering education: A case study on technology acceptance, perceptions, efficacy, and detection chances. Computers and Education: Artificial Intelligence, 5, 100172.

Biderman, S., & Raff, E. (2022). Fooling moss detection with pretrained language models. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2933–2943. https://doi.org/10.1145/3511808.35570799

Deloatch, R., Bailey, B. P., & Kirlik, A. (2016). Measuring effects of modality on perceived test anxiety for computer programming exams. SIGCSE ’16 Proceedings of the 47th ACM Technical Symposium on Computing Science Education, 291–296. https://doi.org/10.1145/2839509.28446044

Denny, P., Prather, J., Becker, B. A., Finnie-Ansley, J., Hellas, A., Leinonen, J., Luxton-Reilly, A., Reeves, B. N., Santos, E. A., & Sarsa, S. (2024). Computing education in the era of generative AI. Communications of the ACM, 67(2), 56–67. https://doi.org/10.1145/36247200

de Raadt, M. (2012). Student created cheat-sheets in examinations: Impact on student outcomes. Proceedings of the Fourteenth Australasian Computing Education Conference, 71–76. http://dl.acm.org/citation.cfm?id=2483716.24837255

Dooley, B., O’Connor Cliodhna, C., Fitzgerald, A., & O'Reilly, A. (2019). My world survey 2: The national study of youth mental health in Ireland.

Eilertsen, T. V., & Valdermo, O. (2000). Open-book assessment: A contribution to improved learning? Studies in Educational Evaluation, 26(2), 91–103. https://doi.org/10.1016/S0191-491X(00)00010-9

Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly, A., & Prather, J. (2022). The robots are coming: Exploring the implications of OpenAI Codex on introductory programming. Proceedings of the 24th Australasian Computing Education Conference, 10–19. https://doi.org/10.1145/3511861.35118633

Finnie-Ansley, J., Denny, P., Luxton-Reilly, A., Santos, E. A., Prather, J., & Becker, B. A. (2023). My AI wants to know if this will be on the exam: Testing OpenAI’s Codex on CS2 programming exercises. Proceedings of the 25th Australasian Computing Education Conference, 97–104. https://doi.org/10.1145/3576123.35761344

Harrington, K., Flint, A., Healey, M., et al. (2014). Engagement through partnership: Students as partners in learning and teaching in higher education. Higher Education Academy

Karvelas, I., Li, A., & Becker, B. A. (2020). The effects of compilation mechanisms and error message presentation on novice programmer behavior. Proceedings of the 51st ACM Technical Symposium on Computer Science Education, 759–765. https://doi.org/10.1145/3328778.33668822

Kazemitabaar, M., Hou, X., Henley, A., Ericson, B. J., Weintrop, D., & Grossman, T. (2023). How novices use LLM-based code generators to solve CS1 coding tasks in a self-paced learning environment. Proceedings of the 23rd Koli Calling Conference on Computing Education Research.

Leinonen, J., Denny, P., MacNeil, S., Sarsa, S., Bernstein, S., Kim, J., Tran, A., & Hellas, A. (2023). Comparing code explanations created by students and large language models. Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1

Leinonen, J., Hellas, A., Sarsa, S., Reeves, B., Denny, P., Prather, J., & Becker, B. A. (2023). Using large language models to enhance programming error messages. Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 563–569. https://doi.org/10.1145/3545945.35697700

Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Lago, A. D., Hubert, T., Choy, P., de Masson d’Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., ... Vinyals, O. (2022). Competition-level code generation with AlphaCode. Science, 378(6624), 1092–1097. https://doi.org/10.1126/science.abq11588

Lortie-Forgues, H., & Inglis, M. (2019). Rigorous large-scale educational RCTs are often uninformative: Should we be concerned? Educational Researcher, 48(3), 158–166.

MacNeil, S., Tran, A., Hellas, A., Kim, J., Sarsa, S., Denny, P., Bernstein, S., & Leinonen, J. (2023). Experiences from using code explanations generated by large language models in a web software development e-book. Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 931–937. https://doi.org/10.1145/3545945.35697855

Nicks, C., Mitchell, E., Rafailov, R., Sharma, A., Manning, C. D., Finn, C., & Ermon, S. (2023, October). Language model detectors are easily optimized against. In The Twelfth International Conference on Learning Representations.

Nolan, K., & Bergin, S. (2016). The role of anxiety when learning to program: A systematic review of the literature. Proceedings of the 16th Koli Calling International Conference on Computing Education Research, 61–70. https://doi.org/10.1145/2999541.29995577

Nolan, K., Bergin, S., & Mooney, A. (2019). An insight into the relationship between confidence, self-efficacy, anxiety and physiological responses in a CS1 exam-like scenario. Proceedings of the 1st UK & Ireland Computing Education Research Conference, 8(1-8), 1–7. https://doi.org/10.1145/3351287.33512966

Nolan, K., Mooney, A., & Bergin, S. (2015). Facilitating student learning in computer science: Large class sizes and interventions. International Conference on Engaging Pedagogy.

Nolan, K., Mooney, A., & Bergin, S. (2019a). An investigation of gender differences in computer science using physiological, psychological and behavioural metrics. Proceedings of the Twenty-First Australasian Computing Education Conference, 47–55. https://doi.org/10.1145/3286960.32869666

Nolan, K., Mooney, A., & Bergin, S. (2019b). A picture of mental health in first year computer science. Proceedings of the 10th International Conference on Computer Science Education: Innovation and Technology.

Prather, J., Denny, P., Leinonen, J., Becker, B. A., Albluwi, I., Caspersen, M. E., Craig, M., Keuning, H., Kiesler, N., Kohn, T., Luxton-Reilly, A., MacNeil, S., Petersen, A., Pettit, R., Reeves, B. N., & Savelka, J. (2023). Transformed by transformers: Navigating the AI coding revolution for computing education: An ITiCSE Working Group conducted by humans. Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 2, 561–562. https://doi.org/10.1145/3587103.35942066

Prather, J., Denny, P., Leinonen, J., Becker, B. A., Albluwi, I., Craig, M., Keuning, H., Kiesler, N., Kohn, T., Luxton-Reilly, A., MacNeil, S., Petersen, A., Pettit, R., Reeves, B. N., & Savelka, J. (2023). The robots are here: Navigating the generative AI revolution in computing education. Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education, 108–159.

Prather, J., Reeves, B. N., Denny, P., Becker, B. A., Leinonen, J., Luxton-Reilly, A., Powell, G., Finnie-Ansley, J., & Santos, E. A. (2023). “It’s weird that it knows what I want”: Usability and interactions with copilot for novice programmers. ACM Transactions on Computer-Human Interaction, 31(1). https://doi.org/10.1145/3617367

Quille, K., & Bergin, S. (2015). Programming: Factors that influence success revisited and expanded. International Conference on Engaging Pedagogy (ICEP), 3rd and 4th December, College of Computing Technology, Dublin, Ireland.

Quille, K. (2019). Predicting and improving performance on introductory programming courses (CS1) (Doctoral dissertation). National University of Ireland Maynooth.

Quille, K., Bergin, S., & Quille, K. (2019). CS1: how will they do? How can we help? A decade of research and practice research and practice. Computer Science Education, 29(2-3), 254–282. https://doi.org/10.1080/08993408.2019.16126799

Quille, K., Nolan, K., Becker, B. A., & McHugh, S. (2021). Developing an open-book online exam for final year students. Proceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education V. 1, 338–344. https://doi.org/10.1145/3430665.34563733

Quille, K., Nam Liao, S., Costelloe, E., Nolan, K., Mooney, A., & Shah, K. (2022). PreSS: Predicting Student Success Early in CS1. A Pilot International Replication and Generalization Study. Proceedings of the 27th ACM Conference on on Innovation and Technology in Computer Science Education Vol. 1 (ITiCSE '22). Association for Computing Machinery, New York, NY, USA, 54–60. https://doi.org/10.1145/3502718.3524755

Quille, K., Nolan, K., McHugh, S., & Becker, B. A. (2020). Associated exam papers and module descriptors. Available at: http://tiny.cc/ITiCSE21OpenBook

Ribeiro, F., de Macedo, J. N. C., Tsushima, K., Abreu, R., & Saraiva, J. (2023). GPT-3-Powered type error debugging: Investigating the use of large language models for code repair. Proceedings of the 16th ACM SIGPLAN International Conference on Software Language Engineering (pp. 111–124). https://doi.org/10.1145/3623476.3623522

Santos, E. A., Prasad, P., & Becker, B. A. (2023). Always provide context: The effects of code context on programming error message enhancement. Proceedings of the ACM Conference on Global Computing Education Vol 1 (pp. 147–153).

Savelka, J., Agarwal, A., An, M., Bogart, C., & Sakr, M. (2023). Thrilled by your progress! Large language models (GPT-4) no longer struggle to pass assessments in higher education programming courses. Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1 (pp. 78–92).

Schrier, J. (2024). Comment on "Comparing the Performance of College Chemistry Students with ChatGPT for Calculations Involving Acids and Bases". Journal of Chemical Education.

Shields, C. (2023). ChatGPT for teachers and students. Ingram Content Goup UK

Sonkar, S., Chen, X., Le, M., Liu, N., Basu Mallick, D., & Baraniuk, R. (2024). Code soliloquies for accurate calculations in large language models. In Proceedings of the 14th Learning Analytics and Knowledge Conference (pp. 828-835).

Trochim, W. M. (2006). Types of reliability. research methods knowledge base. Web Center for Social Research Methods. Retrieved from: http://www.socialresearchmethods.net/kb/reltypes.php

Wermelinger, M. (2023). Using GitHub Copilot to solve simple programming problems. Proceedings of the 54th ACM Technical Symposium on Computer Science Education (pp. 172–178). https://doi.org/10.1145/3545945.3569830