Machine vs Machine: Large Language Models (LLMs) in Applied Machine Learning High-Stakes Open-Book Exams
Abstract
There is a significant gap in Computing Education Research (CER) concerning the impact of Large Language Models (LLMs) in advanced stages of degree programmes. This study aims to address this gap by investigating the effectiveness of LLMs in answering exam questions within an applied machine learning final-year undergraduate course.
The research examines LLM performance across a range of exam questions, including proctored closed-book and open-book questions spanning multiple levels of Bloom’s Taxonomy. Question formats included open-ended questions as well as questions based on tabular data and figures.
To achieve this aim, the study has the following objectives:
Comparative Analysis: To compare LLM-generated exam answers with actual student submissions to assess LLM performance.
Detector Evaluation: To evaluate the efficacy of LLM detectors by inputting LLM-generated responses directly into these detectors, and additionally to assess detector performance on tampered LLM outputs designed to conceal their AI-generated origin.
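The detector-evaluation objective above amounts to scoring a detector on labelled texts, then repeating the scoring on tampered variants of the AI-generated texts. A minimal sketch of that comparison is below; `toy_detector` is a hypothetical stand-in for a real AI-text detector's probability output, and the sample texts are illustrative only, not data from the study.

```python
from typing import Callable, List, Tuple

def evaluate_detector(
    detector: Callable[[str], float],
    samples: List[Tuple[str, bool]],
    threshold: float = 0.5,
) -> float:
    """Return the fraction of samples the detector labels correctly.

    `samples` pairs each text with True if it is AI-generated.
    A detector score >= threshold counts as an "AI-generated" verdict.
    """
    correct = 0
    for text, is_ai in samples:
        predicted_ai = detector(text) >= threshold
        correct += predicted_ai == is_ai
    return correct / len(samples)

# Hypothetical stand-in detector: scores text by average word length,
# mimicking the [0, 1] probability a real detector would emit.
def toy_detector(text: str) -> float:
    words = text.split()
    return min(1.0, sum(len(w) for w in words) / (len(words) * 8))

samples = [
    ("The quick brown fox jumps over the lazy dog", False),
    ("Consequently, comprehensive methodological considerations predominate", True),
]
accuracy = evaluate_detector(toy_detector, samples)
```

In the study's design, the same routine would be run twice: once on raw LLM answers and once on tampered versions, with the drop in accuracy quantifying how easily the detectors are evaded.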
The research methodology used for this paper incorporates a staff-student partnership model involving eight academic staff and six students. Students play integral roles in shaping the project’s direction, particularly in areas unfamiliar to academic staff, such as specific tools to avoid LLM detection.
This study contributes to the understanding of LLMs' role in advanced education settings, with implications for future curriculum design and assessment methodologies.
Copyright (c) 2024 Distance Education Journal
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Works published in this journal are subject to the following terms:
1. The Publications Service of the Universidad de Murcia (the publisher) retains the economic rights (copyright) to published works, and favours and permits their reuse under the licence stated in point 2.
2. Works are published in the electronic edition of the journal under a Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 Spain licence (legal text). They may be copied, used, disseminated, transmitted and publicly displayed, provided that: i) the authorship and the original source of publication are cited (journal, publisher and URL of the work); ii) they are not used for commercial purposes; iii) the existence and terms of this licence are mentioned.
3. Self-archiving conditions. Authors are permitted and encouraged to disseminate electronically the pre-print (version before peer review) and/or post-print (version reviewed and accepted for publication) versions of their works before publication, as this favours their earlier circulation and dissemination and thereby a possible increase in citations and reach within the academic community. RoMEO colour: green.