Evaluating Large Language Models for Educational Measurement: Insights from Automated and Human Scoring of Language Exams
DOI: https://doi.org/10.37965/jait.2026.0949

Keywords: Artificial intelligence in education, automated assessment, educational technology, human–AI comparison, language teaching, large language models (LLMs)

Abstract
This study investigates the use of large language models (LLMs)—ChatGPT-5, Claude Opus 4.1, Gemini Advanced 2.5 Pro, DeepSeek Pro, Qwen-3 Max, and Mistral Le Chat Pro—and a locally fine-tuned LLaMA 3.3 70B Instruct model for automating assessment tasks in language education. Specifically, the study examines LLM capabilities in automating the assessment of authentic midterm exam sheets from a "German as a Foreign Language" (GFL) course under three scenarios: (1) scoring by general-purpose LLMs prompted with pre-corrected sample answers, (2) localized grading using a fine-tuned LLaMA model and reference answer keys, and (3) manual grading with and without a visual overlay technique. Human grading, supported by a structured scoring process, remained nearly perfect in accuracy and reliability, whereas the local model failed because its OCR and visual input pipelines did not produce usable outputs. These findings reinforce the necessity of domain-specific adaptation, more robust OCR and multimodal workflows, and explainable scoring mechanisms before local AI solutions can reliably contribute to the assessment of language learning tasks in applied educational settings.
License
Copyright (c) 2026 Author

This work is licensed under a Creative Commons Attribution 4.0 International License.
