Fonteles, Joyce Horn; Sivakumaran, Nithin; Cohn, Clayton; Coursey, Austin; Yu, Shoubin; Stengel-Eskin, Elias; Ashwin, T. S.; Bansal, Mohit; Biswas, Gautam. (2026). 16th International Learning Analytics and Knowledge Conference, LAK 2026, 536–546.
This paper describes a method that uses large language models, or LLMs, to combine information from several sources and infer students’ metacognitive behaviors, meaning the ways they plan, monitor, reflect on, and adjust their own learning. The study analyzes multimodal classroom data, including students’ movements, gaze, gestures, and speech, collected during a mixed-reality simulation shown on a classroom screen. Instead of processing each type of data separately, the researchers use the LLM at a late-fusion stage, meaning after the different signals have already been analyzed, to bring them together through prompting strategies such as zero-shot prompting, self-consistency reasoning, and carefully designed prompts. They also test whether an “LLM-as-a-Judge,” an LLM used to evaluate model outputs, can reliably assess these behavior labels at scale and reduce the need for manual review. Using a balanced set of human-checked examples and control cases, they compare text-based LLMs, such as GPT-5, with visual-language models, or VLMs, such as Qwen2.5-VL, which can directly process images or video. The results show that text-based LLMs used in this late-fusion way can outperform VLMs even without raw video, and that prompt design can shift the model toward being more precise or more sensitive when the behavior is subtle or brief. Overall, the findings suggest that LLMs can be effective tools for combining multimodal learning data and that LLM-as-a-Judge can support scalable, human-in-the-loop evaluation.
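
As a rough illustration of the late-fusion idea with self-consistency described above, the sketch below folds per-modality summaries into one zero-shot prompt and takes a majority vote over several sampled answers. It is not the authors' implementation: the function names, the prompt wording, the yes/no answer format, and the `llm` callable (any function mapping a prompt string to a reply string) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict


def build_prompt(features: Dict[str, str], behavior: str) -> str:
    """Fuse per-modality summaries (produced upstream) into one zero-shot prompt.

    The wording is hypothetical; the paper's actual prompts are not reproduced here.
    """
    lines = [f"- {modality}: {summary}" for modality, summary in sorted(features.items())]
    return (
        "Per-modality observations for one student segment:\n"
        + "\n".join(lines)
        + f"\nDoes the student show the behavior '{behavior}'? Answer yes or no."
    )


def self_consistent_label(llm: Callable[[str], str], prompt: str, n_samples: int = 5) -> str:
    """Query the model n_samples times and return the most frequent normalized answer."""
    votes = [llm(prompt).strip().lower() for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Stand-in for a real model call: replays canned answers (assumption for the demo).
    canned = iter(["yes", "no", "Yes ", "yes", "no"])
    prompt = build_prompt(
        {"gaze": "looks at the screen, then at a peer", "speech": "asks where CO2 goes"},
        behavior="monitoring",
    )
    print(self_consistent_label(lambda _p: next(canned), prompt))
```

Using an odd number of samples avoids ties in a two-way vote. Changing the prompt wording or the vote threshold is one way such a pipeline could be tilted toward precision or sensitivity.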

Figure 1:
On-screen, each tracked student is represented as a molecule. As they move through the classroom space, their corresponding molecule moves in real time, allowing them to explore the photosynthesis cycle through embodied interaction.