

By Prarthana Bhattacharyya and Joshua Mitton
At Eedi, we focus on identifying student misconceptions in mathematics. There are recurring patterns of misunderstanding that lead to predictable mistakes. If we can anticipate these misconceptions before a topic is taught, teachers can adapt their instruction to improve student outcomes. Learning platforms like ours can also deliver more targeted support. This challenge can be framed as a prediction problem: based on a student’s past answers, we want to forecast their future responses. These predictions might be binary (student will answer correctly vs incorrectly) or multi-class (which answer option the student will select).
To address this prediction problem, we have developed a compact, domain-specific Knowledge Tracing (KT) model. This model is trained on student interaction data and optimised for high accuracy, fast inference, and scalable deployment. Its purpose is to make real-time predictions about how students will respond to upcoming questions, enabling early detection of misconceptions and better learning systems.
Large Language Models (LLMs) have been shown to perform strongly on a range of tasks including reasoning, maths and coding (https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/). In this context, 'reasoning' refers to the model's ability to produce logically structured outputs by leveraging statistical patterns in training data. It does not imply any form of cognition or understanding. This includes tasks such as chain-of-thought prompting for arithmetic, multi-step question answering, and logical deduction, where the model generates plausible intermediate steps without necessarily comprehending the underlying concepts. The enormous progress in LLMs motivated us to explore how well they perform at predicting students' future responses to questions. Predicting student performance on multiple-choice mathematics problems is a task that should require good reasoning ability as well as maths understanding. Our goal is to investigate whether specialised KT models maintain their advantage over general-purpose LLMs in predicting student responses.
In this post, we benchmark a set of KT models and LLMs on the same dataset and the same binary classification task: predicting whether a student will answer a question correctly. We compare them across three critical dimensions: predictive performance, latency, and cost. We find that, despite advances in LLMs, specialised KT models consistently outperform them in all three areas when applied to student response prediction in EdTech settings.

We compare the predictive performance of various models on the binary classification task, i.e., predicting whether a student will answer a question correctly or incorrectly. All models were evaluated on the same dataset using two key metrics: accuracy and F1 score.
Eedi’s flagship model, LLM KT, is an encoder-decoder temporal transformer which uses Qwen3-0.6B embeddings for the question text. Importantly, we only use Qwen as a feature extractor to obtain fixed vector representations of questions. These question embeddings are computed once and cached offline. At inference time, no LLM is involved in the actual student performance prediction. Temporal dynamics and the modelling of student understanding are handled by our small, custom temporal transformer.
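The sketch below illustrates this two-stage design. It is a hypothetical simplification rather than our production code: the Hugging Face model ID, the mean pooling, the hidden sizes, and the use of a plain encoder (instead of the full encoder-decoder, and without causal masking) are all illustrative assumptions.

```python
# Stage 1 (offline, run once): embed question text with a small Qwen embedding model
# and cache the vectors. Stage 2 (online): only a tiny temporal transformer runs.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

def cache_question_embeddings(question_texts, model_name="Qwen/Qwen3-Embedding-0.6B"):
    """Compute one fixed vector per question; in practice these are saved to disk."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)
    encoder.eval()
    with torch.no_grad():
        batch = tokenizer(question_texts, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state   # (num_questions, seq_len, dim)
        return hidden.mean(dim=1)                     # simple mean pooling (ignores padding)

class TinyTemporalKT(nn.Module):
    """A small temporal transformer over (question embedding, past correctness) pairs."""
    def __init__(self, text_dim=1024, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.project = nn.Linear(text_dim + 1, d_model)   # +1 for the correctness flag
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                 # P(next answer is correct)

    def forward(self, question_embs, correctness):
        # question_embs: (batch, seq, text_dim); correctness: (batch, seq, 1)
        x = self.project(torch.cat([question_embs, correctness], dim=-1))
        h = self.encoder(x)
        return torch.sigmoid(self.head(h[:, -1]))         # prediction for the latest step
```

At inference time only the small temporal model runs; the cached question embeddings are simply looked up, which is what keeps per-student latency and model size low.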
Accuracy measures the proportion of correct predictions. LLM KT achieves the highest accuracy at 72.8%, narrowly outperforming both SAKT (72.7%) and DKT (71.8%), two well-established knowledge tracing (KT) models. Interestingly, all domain-specific KT models outperform general-purpose LLMs such as GPT-4o-mini (58.6%), Qwen2.5-7B-Instruct (64.6%), and Gemini-2.5-flash-lite (66.5%). We also benchmark two Llama-1B variants: a LoRA fine-tuned Llama-1B reaches 71.0%, while Llama-1B zero-shot performs substantially worse at 33.5%. This reinforces the value of specialised models trained on structured educational data for this task.
For context, we also include a dataset bias baseline that simply predicts the average correctness rate for each question (i.e. how often it's answered correctly in the dataset). While it doesn't personalise predictions, it achieves 66.5% accuracy. Notably, several LLMs in our benchmark, despite their billions of parameters, fail to surpass this naive baseline. This makes it a critical baseline any viable model must beat.
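A minimal sketch of this baseline, assuming a tabular dataset with hypothetical question_id and is_correct columns:

```python
# Dataset bias baseline: predict each question's average correctness rate,
# ignoring the individual student entirely. Column names are illustrative.
import pandas as pd

def question_bias_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    per_question_rate = train.groupby("question_id")["is_correct"].mean()
    global_rate = train["is_correct"].mean()  # fallback for unseen questions
    predicted_rate = test["question_id"].map(per_question_rate).fillna(global_rate)
    # Predict "correct" whenever the question is answered correctly more often than not.
    return (predicted_rate >= 0.5).astype(int)
```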
Precision tells us how often the model is right when it predicts a student will answer correctly. Recall tells us how many of the actually correct answers the model managed to identify. Improving one typically comes at the cost of the other. The F1 score captures this trade-off by combining both into a single measure, rewarding models that perform well on both rather than excelling at one while neglecting the other. The F1 score is therefore more informative when dealing with class imbalance. Once again, LLM KT leads with an F1 score of 0.674, followed closely by SAKT (0.669) and DKT (0.650). The LLMs score lower on this metric as well, with GPT-4o-mini at 0.579, Qwen2.5-7B-Instruct at 0.533 and Gemini-2.5-flash-lite at 0.527. Among the Llama baselines, the Llama-1B LoRA fine-tune improves to 0.592, whereas Llama-1B zero-shot remains low at 0.251. These results affirm that general-purpose LLMs struggle with student-level predictions, likely due to the lack of fine-grained educational supervision.
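For reference, with TP, FP and FN denoting true positives, false positives and false negatives, the standard definitions are:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```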
Our results show that specialised Knowledge Tracing (KT) models outperform general-purpose Large Language Models (LLMs) when it comes to predicting student responses. Across both metrics (accuracy and F1 score), domain-specific models like our LLM KT, SAKT, and DKT consistently deliver stronger performance than smaller LLMs such as GPT-4o-mini, Gemini-2.5-flash-lite, Qwen2.5-7B-Instruct and Llama-1B. They remain robust even against a fine-tuned small LLM baseline (Llama-1B LoRA). This highlights the importance of task-specific architectures in educational prediction tasks. While LLMs excel at general reasoning, purpose-built KT models remain the most effective and reliable solution for identifying student misconceptions.

We also compare models on latency (inference time per student) and model size (number of parameters). These are critical factors for real-time applications and large-scale deployment in educational learning platforms. All latency measurements were recorded on a standard CPU (Intel(R) Xeon(R) Platinum 8171M @ 2.60GHz). We note that the closed-source LLMs run on different hardware, since we access them via an API.
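As a rough illustration of how per-student latency can be measured for a locally hosted model (the predict_fn interface and the use of the median are assumptions, not a description of our exact harness):

```python
# Time one prediction per student on CPU and report the median latency.
import time
import statistics

def median_latency_per_student(predict_fn, student_histories):
    timings = []
    for history in student_histories:
        start = time.perf_counter()
        predict_fn(history)                      # one forward pass / API call per student
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)            # seconds per student
```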
Traditional KT models like DKT, SAKT, and our LLM KT deliver fast predictions, with latencies under 0.25 seconds per student. In contrast, LLMs are orders of magnitude slower. GPT-4o-mini takes 3.1 seconds, Gemini-2.5-flash-lite takes 128 seconds, and Qwen2.5-7B-Instruct requires 3,299 seconds per student. The Llama-1B variants take 1,598.8 seconds per student. Despite these far higher latencies, none of the LLMs outperform the KT models in accuracy.
Specialised KT models are also extremely compact, with model sizes ranging from 0.58M to 0.85M parameters. Our LLM KT model has only 0.73M parameters while offering strong performance. In contrast, general-purpose LLMs are vastly larger: GPT-4o-mini has 8B parameters, Gemini-2.5-flash-lite has 4B, Qwen2.5-7B-Instruct has 7B and the Llama-1B variants have 1B parameters.
This huge efficiency gap in terms of latency and model size further reinforces that specialised KT models are far better suited for scalable, real-time prediction tasks. This is especially important for deployment in resource-constrained environments, where lightweight prediction models can enable educational tools to reach learners at a global scale.

Cost is a major consideration for real-world deployment at scale. This is especially true for education platforms aiming to support thousands of students. The chart below shows the annual inference cost for 100,000 students, each receiving 40 predictions per year. For most LLMs, we benchmarked 1,600 students on 40 questions each (64,000 predictions) and extrapolated to 100,000 students. For Gemini-2.5-flash-lite, a Tier-1 rate limit of 10,000 requests per day restricted us to 200 students, from which we extrapolated accordingly.
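The extrapolation itself is simple proportional scaling. The sketch below back-solves an illustrative benchmark spend from the reported GPT-4o-mini annual figure purely to show the arithmetic; the $37.15 is implied, not a directly measured value:

```python
# Scale the measured benchmark cost up to 100,000 students x 40 predictions/year.
predictions_per_student = 40
benchmark_predictions = 1_600 * predictions_per_student      # 64,000
target_predictions = 100_000 * predictions_per_student       # 4,000,000

benchmark_cost_usd = 37.15                                   # illustrative, implied by ~$2,322/year
cost_per_prediction = benchmark_cost_usd / benchmark_predictions
annual_cost = cost_per_prediction * target_predictions
print(f"${annual_cost:,.0f}/year")                           # -> $2,322/year
```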
Specialised KT models like DKT, SAKT, and our LLM KT cost less than $2/year to serve this workload. In contrast, general-purpose LLMs are vastly more expensive. GPT-4o-mini costs approximately $2,322/year, Gemini-2.5-flash-lite costs $1,230/year, the Llama-1B variants $11,991/year and Qwen2.5-7B-Instruct reaches $24,741/year. This means KT models are roughly 615 to 12,400 times cheaper than LLMs for the same task, while also offering higher accuracy.
For any large-scale EdTech deployment, this dramatic cost gap makes specialised KT models the better option today.

To choose a model for real-world use, we want to see how latency, cost, and accuracy trade off at scale. The chart below compares models on latency and deployment cost at 100K students/year. The x-axis shows latency per student (seconds). The y-axis shows annual cost for 100K students (USD). Both axes are on a log scale, so each step represents an order-of-magnitude change. Colour indicates accuracy using the scale on the right: greener points are more accurate, while yellow/orange/red points are less accurate. The best-performing region is the bottom-left (fast and cheap), with green colouring (high accuracy).
Overall, the chart highlights an efficient frontier where specialised KT models achieve the best balance of speed, cost, and accuracy for real EdTech deployment.

Despite the rapid rise of general-purpose LLMs, our findings show that specialised Knowledge Tracing (KT) models are still the best choice for predicting student responses in an EdTech setting. They outperform LLMs (GPT-4o-mini, Gemini-2.5-flash-lite, Llama-1B, Llama-1B-LoRA-finetune) on accuracy, are over 600 times cheaper to deploy at scale, and offer millisecond-level latency without requiring GPUs. While LLMs excel at broad reasoning tasks, they fall short when applied to student interaction data and are significantly slower. This is not to say LLMs don’t have a place in a pedagogical setting, but they should be used where they reduce costs or improve learning outcomes, not as a default choice. Uncovering misconceptions for as many learners as possible requires a scalable, real-time system, and purpose-built KT models remain the most effective and practical solution today.