Evaluating Large Language Models for Educational Measurement: Insights from Automated and Human Scoring of Language Exams
DOI: https://doi.org/10.37965/jait.2026.0949

Keywords: Artificial intelligence in education, automated assessment, educational technology, human–AI comparison, language teaching, large language models (LLMs)

Abstract
This study investigates the use of large language models (LLMs)—ChatGPT-5, Claude Opus 4.1, Gemini Advanced 2.5 Pro, DeepSeek Pro, Qwen-3 Max, and Mistral Le Chat Pro—and a locally fine-tuned LLaMA 3.3 70B Instruct model for automating assessment tasks in language education. Specifically, the study examines LLM capabilities in automating the assessment of authentic midterm exam sheets from a "German as a Foreign Language" (GFL) course under three scenarios: (1) scoring by general-purpose LLMs prompted with pre-corrected sample answers, (2) localized grading using a fine-tuned LLaMA model and reference answer keys, and (3) manual grading with and without a visual overlay technique. Human grading, supported by a structured scoring process, remained nearly perfect in accuracy and reliability, whereas the local model failed because its OCR and visual input pipelines did not produce usable outputs. These findings reinforce the necessity of domain-specific adaptation, more robust OCR and multimodal workflows, and explainable scoring mechanisms before local AI solutions can reliably contribute to the assessment of language learning tasks in applied educational settings.
License
Copyright (c) 2026 Author

This work is licensed under a Creative Commons Attribution 4.0 International License.
