Machine vs Machine: Large Language Models (LLMs) in Applied Machine Learning High-Stakes Open-Book Exams

DOI: https://doi.org/10.6018/red.603001
Keywords: Applied Machine Learning, AI, LLMs, ChatGPT, Transformers, Detection, Performance

Abstract

There is a significant gap in Computing Education Research (CER) concerning the impact of Large Language Models (LLMs) in advanced stages of degree programmes. This study aims to address this gap by investigating the effectiveness of LLMs in answering exam questions within an applied machine learning final-year undergraduate course.

The research examines the performance of LLMs in responding to a range of exam questions, including proctored closed-book and open-book questions spanning various levels of Bloom’s Taxonomy. Question formats included open-ended, tabular data-based, and figure-based questions.

To achieve this aim, the study has the following objectives:

Comparative Analysis: To compare LLM-generated exam answers with actual student submissions to assess LLM performance.

Detector Evaluation: To evaluate the efficacy of LLM detectors by directly inputting LLM-generated responses into these detectors, and additionally to assess detector performance on tampered LLM outputs designed to conceal their AI-generated origin.

The research methodology incorporates a staff–student partnership model involving eight academic staff and six students. Students play an integral role in shaping the project’s direction, particularly in areas unfamiliar to academic staff, such as the specific tools used to avoid LLM detection.

This study contributes to the understanding of LLMs' role in advanced education settings, with implications for future curriculum design and assessment methodologies.


References

Becker, B. A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., & Santos, E. A. (2023). Programming is hard - or at least it used to be: Educational opportunities and challenges of AI code generation. Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 500–506. https://doi.org/10.1145/3545945.3569759

Becker, B. A., Denny, P., Pettit, R., Bouchard, D., Bouvier, D. J., Harrington, B., Kamil, A., Karkare, A., McDonald, C., Osera, P.-M., Pearce, J. L., & Prather, J. (2019). Compiler error messages considered unhelpful: The landscape of text-based programming error message research. Proceedings of the Working Group Reports on Innovation and Technology in Computer Science Education, 177–210. https://doi.org/10.1145/3344429.3372508

Bernabei, M., Colabianchi, S., Falegnami, A., & Costantino, F. (2023). Students’ use of large language models in engineering education: A case study on technology acceptance, perceptions, efficacy, and detection chances. Computers and Education: Artificial Intelligence, 5, 100172.

Biderman, S., & Raff, E. (2022). Fooling MOSS detection with pretrained language models. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2933–2943. https://doi.org/10.1145/3511808.3557079

Deloatch, R., Bailey, B. P., & Kirlik, A. (2016). Measuring effects of modality on perceived test anxiety for computer programming exams. SIGCSE ’16 Proceedings of the 47th ACM Technical Symposium on Computing Science Education, 291–296. https://doi.org/10.1145/2839509.2844604

Denny, P., Prather, J., Becker, B. A., Finnie-Ansley, J., Hellas, A., Leinonen, J., Luxton-Reilly, A., Reeves, B. N., Santos, E. A., & Sarsa, S. (2024). Computing education in the era of generative AI. Communications of the ACM, 67(2), 56–67. https://doi.org/10.1145/3624720

de Raadt, M. (2012). Student created cheat-sheets in examinations: Impact on student outcomes. Proceedings of the Fourteenth Australasian Computing Education Conference, 71–76. http://dl.acm.org/citation.cfm?id=2483716.2483725

Dooley, B., O’Connor, C., Fitzgerald, A., & O’Reilly, A. (2019). My World Survey 2: The national study of youth mental health in Ireland.

Eilertsen, T. V., & Valdermo, O. (2000). Open-book assessment: A contribution to improved learning? Studies in Educational Evaluation, 26(2), 91–103. https://doi.org/10.1016/S0191-491X(00)00010-9

Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly, A., & Prather, J. (2022). The robots are coming: Exploring the implications of OpenAI Codex on introductory programming. Proceedings of the 24th Australasian Computing Education Conference, 10–19. https://doi.org/10.1145/3511861.3511863

Finnie-Ansley, J., Denny, P., Luxton-Reilly, A., Santos, E. A., Prather, J., & Becker, B. A. (2023). My AI wants to know if this will be on the exam: Testing OpenAI’s Codex on CS2 programming exercises. Proceedings of the 25th Australasian Computing Education Conference, 97–104. https://doi.org/10.1145/3576123.3576134

Harrington, K., Flint, A., Healey, M., et al. (2014). Engagement through partnership: Students as partners in learning and teaching in higher education. Higher Education Academy.

Karvelas, I., Li, A., & Becker, B. A. (2020). The effects of compilation mechanisms and error message presentation on novice programmer behavior. Proceedings of the 51st ACM Technical Symposium on Computer Science Education, 759–765. https://doi.org/10.1145/3328778.3366882

Kazemitabaar, M., Hou, X., Henley, A., Ericson, B. J., Weintrop, D., & Grossman, T. (2023). How novices use LLM-based code generators to solve CS1 coding tasks in a self-paced learning environment. Proceedings of the 23rd Koli Calling Conference on Computing Education Research.

Leinonen, J., Denny, P., MacNeil, S., Sarsa, S., Bernstein, S., Kim, J., Tran, A., & Hellas, A. (2023). Comparing code explanations created by students and large language models. Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1.

Leinonen, J., Hellas, A., Sarsa, S., Reeves, B., Denny, P., Prather, J., & Becker, B. A. (2023). Using large language models to enhance programming error messages. Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 563–569. https://doi.org/10.1145/3545945.3569770

Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Lago, A. D., Hubert, T., Choy, P., de Masson d’Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., ... Vinyals, O. (2022). Competition-level code generation with AlphaCode. Science, 378(6624), 1092–1097. https://doi.org/10.1126/science.abq1158

Lortie-Forgues, H., & Inglis, M. (2019). Rigorous large-scale educational RCTs are often uninformative: Should we be concerned? Educational Researcher, 48(3), 158–166.

MacNeil, S., Tran, A., Hellas, A., Kim, J., Sarsa, S., Denny, P., Bernstein, S., & Leinonen, J. (2023). Experiences from using code explanations generated by large language models in a web software development e-book. Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 931–937. https://doi.org/10.1145/3545945.3569785

Nicks, C., Mitchell, E., Rafailov, R., Sharma, A., Manning, C. D., Finn, C., & Ermon, S. (2023, October). Language model detectors are easily optimized against. In The Twelfth International Conference on Learning Representations.

Nolan, K., & Bergin, S. (2016). The role of anxiety when learning to program: A systematic review of the literature. Proceedings of the 16th Koli Calling International Conference on Computing Education Research, 61–70. https://doi.org/10.1145/2999541.2999557

Nolan, K., Bergin, S., & Mooney, A. (2019). An insight into the relationship between confidence, self-efficacy, anxiety and physiological responses in a CS1 exam-like scenario. Proceedings of the 1st UK & Ireland Computing Education Research Conference, Article 8, 1–7. https://doi.org/10.1145/3351287.3351296

Nolan, K., Mooney, A., & Bergin, S. (2015). Facilitating student learning in computer science: Large class sizes and interventions. International Conference on Engaging Pedagogy.

Nolan, K., Mooney, A., & Bergin, S. (2019a). An investigation of gender differences in computer science using physiological, psychological and behavioural metrics. Proceedings of the Twenty-First Australasian Computing Education Conference, 47–55. https://doi.org/10.1145/3286960.3286966

Nolan, K., Mooney, A., & Bergin, S. (2019b). A picture of mental health in first year computer science. Proceedings of the 10th International Conference on Computer Science Education: Innovation and Technology.

Prather, J., Denny, P., Leinonen, J., Becker, B. A., Albluwi, I., Caspersen, M. E., Craig, M., Keuning, H., Kiesler, N., Kohn, T., Luxton-Reilly, A., MacNeil, S., Petersen, A., Pettit, R., Reeves, B. N., & Savelka, J. (2023). Transformed by transformers: Navigating the AI coding revolution for computing education: An ITiCSE Working Group conducted by humans. Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 2, 561–562. https://doi.org/10.1145/3587103.3594206

Prather, J., Denny, P., Leinonen, J., Becker, B. A., Albluwi, I., Craig, M., Keuning, H., Kiesler, N., Kohn, T., Luxton-Reilly, A., MacNeil, S., Petersen, A., Pettit, R., Reeves, B. N., & Savelka, J. (2023). The robots are here: Navigating the generative AI revolution in computing education. Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education, 108–159.

Prather, J., Reeves, B. N., Denny, P., Becker, B. A., Leinonen, J., Luxton-Reilly, A., Powell, G., Finnie-Ansley, J., & Santos, E. A. (2023). “It’s weird that it knows what I want”: Usability and interactions with copilot for novice programmers. ACM Transactions on Computer-Human Interaction, 31(1). https://doi.org/10.1145/3617367

Quille, K., & Bergin, S. (2015). Programming: Factors that influence success revisited and expanded. International Conference on Engaging Pedagogy (ICEP), 3rd and 4th December, College of Computing Technology, Dublin, Ireland.

Quille, K. (2019). Predicting and improving performance on introductory programming courses (CS1) (Doctoral dissertation). National University of Ireland Maynooth.

Quille, K., & Bergin, S. (2019). CS1: How will they do? How can we help? A decade of research and practice. Computer Science Education, 29(2-3), 254–282. https://doi.org/10.1080/08993408.2019.1612679

Quille, K., Nolan, K., Becker, B. A., & McHugh, S. (2021). Developing an open-book online exam for final year students. Proceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education V. 1, 338–344. https://doi.org/10.1145/3430665.3456373

Quille, K., Nam Liao, S., Costelloe, E., Nolan, K., Mooney, A., & Shah, K. (2022). PreSS: Predicting student success early in CS1. A pilot international replication and generalization study. Proceedings of the 27th ACM Conference on Innovation and Technology in Computer Science Education V. 1 (ITiCSE '22), 54–60. https://doi.org/10.1145/3502718.3524755

Quille, K., Nolan, K., McHugh, S., & Becker, B. A. (2020). Associated exam papers and module descriptors. Available at: http://tiny.cc/ITiCSE21OpenBook

Ribeiro, F., de Macedo, J. N. C., Tsushima, K., Abreu, R., & Saraiva, J. (2023). GPT-3-Powered type error debugging: Investigating the use of large language models for code repair. Proceedings of the 16th ACM SIGPLAN International Conference on Software Language Engineering (pp. 111–124). https://doi.org/10.1145/3623476.3623522

Santos, E. A., Prasad, P., & Becker, B. A. (2023). Always provide context: The effects of code context on programming error message enhancement. Proceedings of the ACM Conference on Global Computing Education Vol 1 (pp. 147–153).

Savelka, J., Agarwal, A., An, M., Bogart, C., & Sakr, M. (2023). Thrilled by your progress! Large language models (GPT-4) no longer struggle to pass assessments in higher education programming courses. Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1 (pp. 78–92).

Schrier, J. (2024). Comment on "Comparing the Performance of College Chemistry Students with ChatGPT for Calculations Involving Acids and Bases". Journal of Chemical Education.

Shields, C. (2023). ChatGPT for teachers and students. Ingram Content Group UK.

Sonkar, S., Chen, X., Le, M., Liu, N., Basu Mallick, D., & Baraniuk, R. (2024). Code soliloquies for accurate calculations in large language models. In Proceedings of the 14th Learning Analytics and Knowledge Conference (pp. 828-835).

Trochim, W. M. (2006). Types of reliability. Research Methods Knowledge Base. Web Center for Social Research Methods. Retrieved from: http://www.socialresearchmethods.net/kb/reltypes.php

Wermelinger, M. (2023). Using GitHub Copilot to solve simple programming problems. Proceedings of the 54th ACM Technical Symposium on Computer Science Education (pp. 172–178). https://doi.org/10.1145/3545945.3569830

Published
30-05-2024
How to Cite
Quille, K., Alattyanyi, C., Becker, B. A., Faherty, R., Gordon, D., Harte, M., … Zero, A. (2024). Machine vs Machine: Large Language Models (LLMs) in Applied Machine Learning High-Stakes Open-Book Exams. Revista de Educación a Distancia (RED), 24(78). https://doi.org/10.6018/red.603001