# Zifeng Liu - Human–AI Collaboration in Educational Assessment: Evaluating AI-Generated Distractors

## Metadata

- **Channel:** Cohere
- **YouTube:** https://www.youtube.com/watch?v=X84gmuT9RI8
- **Date:** 13.04.2026
- **Duration:** 38:30
- **Views:** 59
- **Source:** https://ekstraktznaniy.ru/video/45957

## Description

In this talk, Zifeng will discuss the emerging role of generative AI in educational assessment, with a focus on the automatic generation and evaluation of multiple-choice distractors and feedback in computing and AI education. While large language models show strong potential for producing instructional content, important questions remain regarding the quality, pedagogical validity, and alignment of AI-generated materials with human expectations and learning goals. To address these challenges, this line of work examines how students, experts, and AI systems evaluate and co-create assessment components such as distractors and feedback. Through human–AI collaborative evaluation and experimental comparisons, the research investigates how AI-generated distractors are perceived, how their quality can be systematically assessed, and how automated generation can be integrated into authentic educational contexts. The findings highlight both the opportunities and limitations of current models.

## Transcript

### Segment 1 (00:00 - 05:00)

So hi everyone, thanks for being here. As Rafin introduced, my name is Zifeng, and I'm a PhD candidate in educational technology at the University of Florida. It's great to be here and connect with you all. In today's talk I'll focus on a central question: can we actually trust AI-generated assessment content, such as distractors and feedback for multiple-choice questions? This talk brings together two connected studies, one presented at ICCE 2025 and the other at Tribal AI this year. I'll share how we evaluated AI-generated distractors and feedback by involving students, experts, and AI systems together, and what this reveals about where AI aligns with human judgments and where it falls short.

Okay, so now let's begin with the first one: a human–AI collaborative assessment of AI-generated versus human-created multiple-choice distractors. As AI becomes increasingly integrated into educational tools, for example in automating assessments, it opens up many new opportunities. One major opportunity is using AI to support the creation of assessment content, and multiple-choice questions are widely used in instruction and assessment for diagnosing students' understanding and misconceptions. So a typical multiple-choice question has... uh, is your screen sharing on? Because I cannot see your screen. — You can now see my screen, I think. Uh, if you share it again. — Yeah, um, let me try again here. Oh. — Oh, sorry about that. — No worries. — Can you all see my screen now? — Yes, we can. — Okay, perfect.

So a typical multiple-choice question here has several parts: a question stem, a key (or, as we say, a correct answer), and a few distractors. A distractor is defined as an incorrect option presented alongside the correct answer and the question stem, and it plays a critical role in the diagnostic function of multiple-choice questions. While generative AI can produce full multiple-choice questions, generating high-quality distractors is actually the more challenging and pedagogically critical part. The question stem and correct answer are usually grounded in the curriculum, standards, and learning objectives, so teachers often prefer to keep those under human control. In contrast, distractors require a deeper understanding of students' misconceptions: they need to be plausible but still incorrect. Designing high-quality distractors is resource-intensive and time-consuming, and manual authoring is often subjective, so it can be difficult for instructors to accurately capture students' actual cognitive biases and misconceptions. And in large-scale online learning environments, it is difficult to identify students' misunderstandings in real time and to continuously optimize question quality at scale.

This table is from a study published in 2023. From these works we can observe a clear trend: researchers are increasingly using large language models to generate distractors, and the evaluation process is becoming more automated, often complemented by manual review. And looking more closely at the studies published after 2020, we find that many of them focus on subjects such as reading comprehension. But the evaluation

### Segment 2 (05:00 - 10:00)

approaches used in these studies are not always well suited to assessing distractor quality. For example, some metrics were originally developed for tasks such as machine translation rather than for evaluating the pedagogical effectiveness of distractors. These five studies are from the past two or three years, and we can see a growing trend of using tools like ChatGPT for automatic distractor generation, with increasing attention to domains such as mathematics and computer science. But even with the help of large language models, generating high-quality distractors remains challenging, especially in subjects like math, because compared to areas such as reading comprehension, relatively few approaches have been developed for domains that require more complex reasoning, such as mathematics and computing. In addition, the evaluation in these studies is still largely conducted manually.

Before introducing our approach, let me briefly walk through how distractors are traditionally evaluated. From a technical perspective, most approaches are grounded in psychometric methods. One of the most commonly used frameworks is classical test theory, or CTT. Under CTT, distractors are typically evaluated using metrics such as item difficulty and item discrimination. Item difficulty reflects how many students answer the question correctly, and item discrimination measures how well a question differentiates between high- and low-performing students (a short code sketch of both metrics appears below). In addition to CTT, more advanced models such as item response theory and cognitive diagnostic models are also used; these approaches provide a way to analyze students' responses and their underlying skills. But these methods rely heavily on large-scale student response data and are typically applied after the assessment has been administered, which makes them less suitable for evaluating AI-generated distractors at the design stage.

Here are some key takeaways from the existing literature. First, there are very few publicly available datasets, especially for math and CS education, and most of them focus on reading comprehension tasks; this is one of the reasons why we focus on computer science. Second, most existing studies on distractor generation rely on models such as ChatGPT or other open-source large language models. And finally, there is a gap in how these distractors are evaluated: current approaches often rely on similarity-based metrics, and many studies still depend heavily on manual evaluation. So the overarching question is: how can we use generative AI to generate distractors that are pedagogically sound and effective, especially for subjects like K-12 CS and math?

There are a few key reasons why we focused on K-12 CS. First, CS education often involves complex reasoning and problem solving, which makes it particularly challenging to design effective distractors compared to more text-based subjects like reading comprehension. Second, K-12 CS is a rapidly growing area, but it still lacks sufficient high-quality assessment resources. And finally, from the educational perspective, poorly designed distractors can easily mislead students or fail to capture their misconceptions.
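As a concrete reference for the classical test theory metrics mentioned above, here is a minimal sketch in Python on hypothetical 0/1 response data; the upper-lower 27% split is one common convention for the discrimination index, not necessarily what any cited study used.

```python
import numpy as np

def item_difficulty(correct: np.ndarray) -> float:
    """CTT item difficulty: proportion of students answering the item correctly."""
    return correct.mean()

def item_discrimination(correct: np.ndarray, totals: np.ndarray) -> float:
    """Upper-lower discrimination index: proportion correct in the top 27%
    of students (by total score) minus the proportion in the bottom 27%."""
    k = max(1, round(0.27 * len(correct)))
    order = np.argsort(totals)
    return correct[order[-k:]].mean() - correct[order[:k]].mean()

# Hypothetical data: one item's 0/1 responses and each student's total score.
correct = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
totals  = np.array([55, 30, 80, 75, 25, 90, 40, 85, 70, 35])
print(item_difficulty(correct))              # 0.6 -> moderately easy item
print(item_discrimination(correct, totals))  # 1.0 -> discriminates well
```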
So to address this question, we collected data from two high school courses, programming and statistics, offered by the Florida Virtual

### Segment 3 (10:00 - 15:00)

School. Florida Virtual School is a large online education platform that serves middle and high school students and provides fully remote learning opportunities. In total, our dataset includes 925 multiple-choice questions. This is an example question from the dataset: it has a question stem, an answer, and distractors, often with a table or a graph. To investigate GPT's ability to generate distractors across different types of questions, we categorized the multiple-choice questions from both courses based on Bloom's taxonomy. Specifically, all questions were classified into six cognitive levels, from the bottom up: remember, understand, apply, analyze, evaluate, and create. We used Bloom's taxonomy because it provides a structured way to capture different levels of cognitive complexity in assessment questions; not all questions are the same, right? Some require simple recall, while others involve higher-order reasoning such as analysis or evaluation. To do this, two researchers independently labeled each MCQ, and in cases where the classifications differed (about 39 questions), a third researcher reviewed the discrepancies and determined the final level. The final distribution of the MCQs across these categories is shown in the table here. We can see that most questions are concentrated in the lower and middle categories, while fewer questions fall into higher-order levels such as analyze, evaluate, and create.

We then generated the distractors using a three-step process. First, we used the GPT-4 API to generate the distractors; we selected GPT-4 because it was the state-of-the-art model at the time and had been widely used in prior research. Second, we designed a prompt based on existing studies; the prompt includes the question stem, the correct answer, the Bloom's taxonomy level, and specific instructional constraints to guide the model in generating high-quality distractors. Third, after the generation, we manually reviewed the outputs for any missing or incomplete responses and regenerated distractors to ensure completeness. This is the prompt we used: each prompt includes, as I mentioned, the question stem, the correct answer, the Bloom's taxonomy level of the question, and the instructional constraints needed to guide the model, and the output is the required number of distractors for the MCQ.

For the evaluation process, we also followed a three-step procedure. First, we shuffled all the distractors generated by both teachers and the GPT model; this ensures that the evaluators could not distinguish between human- and AI-generated options. Second, we had both human evaluators and the AI independently select their top-k distractors from the entire pool. Third, we aggregated these selections: distractors with the highest number of votes were automatically included, and we then prioritized those selected by both humans and the AI. Finally, the top-k ranked options from this combined list were chosen as the final distractors for each MCQ.
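A minimal sketch of the selection-aggregation step just described, on hypothetical vote records; the function name and the tie-breaking detail are illustrative, not necessarily the authors' exact implementation.

```python
from collections import Counter

def select_final_distractors(human_picks, ai_picks, k=3):
    """Combine top-k selections from human and AI evaluators.

    human_picks / ai_picks: lists of distractor IDs, one entry per vote.
    Options with the most combined votes win; among ties, options picked
    by BOTH humans and the AI are prioritized, mirroring the procedure
    described in the talk."""
    votes = Counter(human_picks) + Counter(ai_picks)
    both = set(human_picks) & set(ai_picks)
    ranked = sorted(votes, key=lambda d: (votes[d], d in both), reverse=True)
    return ranked[:k]

# Hypothetical vote pool for one question with candidate distractors d1..d4.
human = ["d1", "d2", "d2", "d4"]
ai    = ["d2", "d3", "d4"]
print(select_final_distractors(human, ai))  # e.g. ['d2', 'd4', 'd1']
```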

### Segment 4 (15:00 - 20:00)

At the end, we reviewed the sources of the selected distractors. This allowed us to calculate how many were generated by AI versus by teachers and to derive the final evaluation results. This is the prompt we used for the evaluation stage; note that we used GPT-4 for generating the distractors, and other AI models, including GPT-4o and DeepSeek, for the distractor evaluation process.

This table summarizes the characteristics of the AI-generated distractors. Length refers to how many words each distractor contains; the next columns indicate reading difficulty, distractor similarity, which measures how similar each distractor is to the other distractors, and answer similarity, which captures how close a distractor is to the correct answer. Overall, the table shows that the distractors generated by AI vary widely in length and reading difficulty, and on average they have very low similarity to each other, which means that GPT-4 produces diverse options. The answer similarity values also show a broad range, which means that some distractors are quite close to the correct answer while others are much farther away (a sketch of how such metrics could be computed appears at the end of this segment).

This table compares the scores of AI-generated and human-created distractors across each Bloom's taxonomy level. We can see a pattern here: for remember, apply, and create, GPT-4 performs very similarly to human teachers. For example, for apply, the scores are almost identical, 141 versus 140, and for remember, GPT-4 is slightly higher. For understand, analyze, and evaluate, human distractors tend to perform better; the biggest gap appears in understand, where GPT-4 scored 163 while humans scored 202, and we also see more cases where human-created distractor scores are higher than the AI's in the analyze category. The "GPT higher" and "GPT lower" columns show on how many questions GPT outperformed or underperformed the humans on distractor scores.

Now let's look at the results for the programming course. Overall, we can see a clear pattern across the Bloom's taxonomy levels: for lower levels such as remember and understand, the performance of human and AI is relatively similar; however, as we move to higher-order cognitive levels such as analyze, evaluate, and create, the gap becomes more noticeable. Human-generated distractors tend to receive more selections than those generated by AI. For example, in the evaluate category, we see a clear difference between the two. So once again, while the AI performs relatively well on lower-level tasks, it still struggles more with generating high-quality distractors for higher-order reasoning questions. This is the result for the statistics course.
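Tying back to the characteristics table above, here is a minimal sketch of how the length and similarity columns could be computed. The Q&A at the end of the talk confirms that Levenshtein distance was used for the similarity measures, but the normalization below is an assumption, and the example strings (borrowed from the second study) are just for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (delete/insert/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete ca
                            curr[j - 1] + 1,           # insert cb
                            prev[j - 1] + (ca != cb))) # substitute ca -> cb
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized to [0, 1]; 1.0 means identical strings (an assumed convention)."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

answer = "positive and negative"
for d in ["neutral", "mixed", "immersion tab", "text review"]:
    print(f"{d!r}: {len(d.split())} words, answer similarity {similarity(d, answer):.2f}")
```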

### Segment 5 (20:00 - 25:00)

Overall, we observe a similar but more pronounced pattern compared to the programming course. So this is the summary of the first study: one takeaway is that while AI performs competitively on lower- and mid-level tasks, it still struggles more with higher-order cognitive tasks that require deeper reasoning.

Now let's move on to the second study, which focuses on another context, K-12 AI education, and examines how both experts and students perceive these distractors. In K-12 AI education the situation is pretty unique, and the field is still changing really fast. One big challenge is that we don't have a standard assessment framework yet, so designing high-quality distractors becomes much harder, and unlike more established subjects, AI education doesn't really have widely used textbooks or a consistent curriculum. In this kind of fast-moving and diverse learning environment, traditional multiple-choice question design strategies often do not work very well. This is the previous table I showed; in this study we extend the prior work by incorporating students' perspectives, and we proposed these three research questions. We explored how experts and students view AI-generated distractors. We built an online high school AI course where students spend about 250 minutes learning sentiment analysis and modeling AI decisions with algebra. We then selected five multiple-choice questions from the first two activities, generated distractors using generative AI, and had both experts and students evaluate them.

These five MCQs come from the first two activities, and here I'm showing one example on the slide. For every question, we made sure that the number of AI-generated distractors matched the number of instructor-written distractors. In this example, the correct answer is "positive and negative." The instructor provided two distractors, "neutral" and "mixed," and generative AI produced two additional distractors, "immersion tab" and "text review." Typically, each MCQ in this dataset consists of two components: contextual information like this and a corresponding MCQ item. Students engage with the contextual materials before answering the question and then receive automated feedback, which we call hints, after submitting their response by clicking the "check answer" button.

The distractor generation process was similar to the previous study, and this time we used GPT-4o for the task. This is how our prompt looks for generating the distractors; the key components are the same as in the previous study (a hedged sketch appears at the end of this segment). This is an example of a generated distractor, "immersion tab," and the corresponding feedback shown after students submit their response. This table summarizes all five questions; the last column lists all the distractors with their feedback, and a star means the distractor came from the generative AI.

So to answer our research questions, we recruited both domain experts and students. For the expert group, we recruited four participants from both higher education institutions and industry; some of them have more than 10 years of experience in educational research.
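Since the prompt slide itself isn't reproduced here, this is a hedged sketch of what the generation step might look like with the prompt components named above (question stem, correct answer, Bloom's level, instructional constraints); the wording, the model choice, and the helper function are illustrative assumptions, not the authors' actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_distractors(stem: str, correct_answer: str, bloom_level: str, n: int = 2):
    """Ask the model for n plausible-but-incorrect options for one MCQ."""
    prompt = (
        f"Question stem: {stem}\n"
        f"Correct answer: {correct_answer}\n"
        f"Bloom's taxonomy level: {bloom_level}\n"
        f"Write {n} plausible but incorrect distractors that reflect common "
        "student misconceptions. Return one distractor per line, with no "
        "explanations and without repeating the correct answer."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # the second study used GPT-4o; the first used GPT-4
        messages=[{"role": "user", "content": prompt}],
    )
    return [ln.strip("- ").strip() for ln in
            resp.choices[0].message.content.splitlines() if ln.strip()]

print(generate_distractors(
    "Which labels does the sentiment model output?",
    "positive and negative", "Understand"))
```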

### Segment 6 (25:00 - 30:00)

We asked the experts to complete a Google form, and for each multiple-choice question they were asked to select the top three or top four distractors from the full set. In addition, they provided qualitative feedback for each distractor explaining its strengths and weaknesses, for example why it was effective or what made it confusing. For the student group, all participants were high school students enrolled in our online AI curriculum, and instead of using a survey, we collected log data while they were working on the multiple-choice questions. This allowed us to capture how they interacted with the questions, and we also recorded their final answers for the five target MCQs.

For RQ1, we examined how experts selected AI-generated distractors compared to human-created ones. Overall, the ratio is 11 to 14, meaning that experts selected 11 AI-generated and 14 human-generated distractors as their top choices, as shown in the figure on the right. We also observed some variation across the five questions; the preference patterns are not entirely consistent from one question to another. From the qualitative comments, we found two key themes. First, the experts emphasized that effective distractors should have a clear connection to the source texts, either explicitly stated or strongly implied. Second, beyond simple text matching, experts also valued cognitive plausibility, that is, whether a distractor represents an option that a real student might reasonably choose.

Now let's look at the student selections. Across all five questions, human-created distractors generally attracted higher selection rates. For example, in question two, instructor-created distractor one was selected by about 60% of students, and similarly, in question five, the human-written distractor four had the highest selection rate at 71%. These results suggest that human-written distractors are often still more effective in capturing students' attention, likely because they are better aligned with the instructional context. At the same time, AI-generated distractors can still be, I would say, quite competitive in certain cases. For instance, question two is a check-all-that-apply question, and the AI-generated distractors two, three, and four were each selected by roughly 27% of students, which means they can also be misleading to learners.

This table presents the descriptive statistics and the results of the test for the five MCQs. We compared students who selected AI-generated distractors on their first attempt, labeled as the AI group, with those who selected human-generated distractors, labeled as the human group. Among the three measures, we found a significant difference only in time on task: students in the AI group spent more time on the question than those in the human group, and the corresponding effect size indicates a small-to-medium effect.
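The talk does not name the exact test behind this comparison, so here is one plausible sketch: a Welch t-test plus Cohen's d on hypothetical time-on-task data; treat all numbers as placeholders.

```python
import numpy as np
from scipy import stats

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled

# Hypothetical first-attempt time on task, in seconds, for the two groups.
ai_group    = np.array([95, 120, 88, 140, 110, 132])
human_group = np.array([70, 85, 92, 78, 101, 66])

t, p = stats.ttest_ind(ai_group, human_group, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.3f}, d = {cohens_d(ai_group, human_group):.2f}")
```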

### Segment 7 (30:00 - 35:00)

This figure visualizes how students transitioned between different answer choices across the first four attempts on question four. Each node represents a selected option, either the correct answer or a distractor labeled with the letter D, at a given attempt, and the links show how students moved from one option to another across attempts. Looking at the patterns, we can see some interesting differences. For students who initially selected the AI-generated distractor "the number of positive words in the review," about half of them reached the correct answer on the second attempt, and similarly, for those who chose another AI-generated distractor, "the total number of words in the review," all of them arrived at the correct answer by the third attempt. In contrast, students who initially selected human-created distractors showed more dispersed transition patterns: for the last one, students who chose "the input of the sentiment analysis model" did not converge quickly to the correct answer but instead spread across multiple options on the second attempt.

Here I summarize the key findings of this study. First, experts showed a slight preference for human-generated distractors, and in their evaluations they emphasized three main criteria: textual relevance, cognitive plausibility, and alignment with common misunderstandings. Second, student selections also largely aligned with experts' judgments: human-generated distractors were chosen more frequently, but AI-generated distractors still attracted a fair number of students. Third, students who selected AI-generated distractors spent significantly more time on the item; however, we did not observe significant differences in their hint usage or in their revisiting of learning materials.

In terms of practical impact, there are also three takeaways. First, generative AI can generate distractors at scale and even support analysis, which saves a lot of time, and we already know that. Second, student choice patterns are really useful because they help us diagnose common misunderstandings and pinpoint where students struggle. Third, this is especially valuable for emerging domains like K-12 CS and AI education, where assessments and curricular materials are still being built out.

Our future work focuses on three directions. The first is developing and releasing a benchmark dataset. The second is using the error explanations as context signals for prompt design or for large language model fine-tuning. And the third is building an integrated tool for distractor generation and evaluation that can be deployed in online learning platforms. So thank you all for your attention, and I welcome any questions.

— Hi, thank you so much for the amazing talk. I have a short question. In the first paper you used Levenshtein distance to calculate the similarity between the generated distractors and the actual answer. But Levenshtein distance can give two words that are similar in meaning a huge distance. So did you use any technique that also computed semantic similarity or embedding-based similarity?
— So, as a descriptive exploration in the first study, we did not use more advanced semantic similarity

### Segment 8 (35:00 - 38:00)

approaches; we only used the Levenshtein similarity.
— Right, got it. Yeah, I was just asking because sometimes AI can generate, for example, if the correct answer is "car," a similar option like "automobile." They will have a big edit distance, but they actually mean the same thing.
— Yeah, I see your point. That is actually a very good suggestion for future work, especially for how we describe the AI-generated distractors. Thank you for the question.
— Thank you so much. This was such an informative session and I genuinely really enjoyed it. I was looking at the chat box during the session but I didn't want to interrupt. So we had a question: can you please go back to the research questions? And another question that we have is: how do you reconcile disagreements between AI-generated distractors and human-generated distractors?
— Sorry, what was the question again?
— How do you reconcile disagreements between AI-generated distractors and human-generated ones?
— So in this study we collected perspectives from both the experts and the students, right? For the experts, the main difference they found is that the human-generated distractors are more aligned with the course material, because they are more closely related to what is mentioned in the reading materials. And from the students' perspective, because students do not know which options are from the AI and which are from the teachers, what we see is more about their behavioral engagement, like what the consequences of the first selection attempt on different distractors are.
— Thank you so much, Zifeng. It was really an honor to have you, and a very interesting topic. Thank you so much for joining us.
— Thank you. Oh, I'm glad to be here. Yeah, thank you.
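Following up on the questioner's suggestion, here is a minimal sketch of embedding-based semantic similarity, which the study did not use; the `sentence-transformers` package and the model name are illustrative choices.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works
for a, b in [("car", "automobile"), ("car", "banana")]:
    emb = model.encode([a, b])
    # Cosine similarity is high for synonyms even when edit distance is large.
    print(a, b, float(util.cos_sim(emb[0], emb[1])))
```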
