Getting started with Automated Scoring

This blog article from Sue Lottridge of Cambium Assessment, who authored the chapter on Automated Scoring of Constructed Response Items in the TBA Guidelines, introduces automated scoring for those considering it for the first time.


When assessment programs first think about incorporating automated scoring into their processes, it can feel intimidating. And, truthfully, there is a lot to consider. The ITC/ATP Guidelines for Technology Based Assessment offer excellent advice on the automated scoring of constructed response items. Even with guidelines, it can be helpful to think through key considerations as you embark on this journey. As someone who supports clients as they incorporate automated scoring in their assessment programs, I discuss four key considerations below. Hopefully, you will find this to be a useful companion to the Guidelines!

Context. Considering your own context is a critical first step. First, what is the appetite of the leadership, test-takers, and others for the use of artificial intelligence in the scoring process? Folks vary greatly in their perceptions of AI and its use; considering this up front can help to identify concerns and opportunities that can shape your general approach. Second, is your program well-established in terms of its item and test development, scoring, and reporting? New programs can be great opportunities to start with automated scoring but can also offer challenges: item and rubric design and scoring may not be fully baked, test-takers may be unfamiliar with the items and not yet able to fully demonstrate their knowledge and skills, and there simply may not be enough data to use for engine training. More established programs may run into issues around legacy policies and procedures that can be difficult to change, but they can offer large item banks and datasets on which to train the engine. Finally, your items need to be amenable to automated scoring in terms of design, response capture, and hand-scoring quality. Consulting colleagues, research literature, or experts in automated scoring can help you determine whether it is appropriate for your program.

Rationale. Next, it is critical to outline your rationale for using automated scoring. Automated scoring is almost always driven by a need to reduce cost, to retain open-ended items on a test, and to ensure scoring consistency within and across test administrations. Faster reporting is often also a reason; however, in summative contexts, the incorporation of automated scoring in a program may not reduce reporting times for all test-takers. This is because these systems are usually embedded in a hybrid human/automated scoring model that requires some amount of human scoring (often 20-30%). The necessary inclusion of human scoring introduces a delay in reporting for those responses routed for human review, or for all responses if a program seeks to release scores at the same time. Having your rationale front and center can help to shape decisions ranging from tactical ones, such as the percent of responses routed for human scoring, to more strategic ones, such as how you communicate the reasons for its use to users.
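To make the tactical decision above a little more concrete, here is a minimal sketch of routing a fixed share of responses to human scorers. It assumes simple random sampling, which is only one possible routing rule; a program might well route by other criteria. The function name, the `human_rate` parameter, and the response IDs are all hypothetical.

```python
import random

def route_for_human_scoring(response_ids, human_rate, seed=0):
    """Assign each response to 'human' or 'engine' scoring.

    Assumption: routing is by simple random sampling at a fixed rate.
    A seeded generator keeps the assignment reproducible.
    """
    ids = list(response_ids)
    rng = random.Random(seed)
    n_human = round(len(ids) * human_rate)
    human_ids = set(rng.sample(ids, n_human))
    return {rid: ("human" if rid in human_ids else "engine") for rid in ids}

# Hypothetical example: 100 responses, 25% routed for human scoring.
routes = route_for_human_scoring(range(100), human_rate=0.25)
n_routed = sum(1 for dest in routes.values() if dest == "human")
```

With a rate like this, the responses routed to "human" are the ones whose scores would be delayed, unless the program holds all scores for a simultaneous release.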

Understanding your items, rubrics, and scoring procedures. Having a deep understanding of your items, rubrics, and how they are scored serves as the foundation for the choices you make around automated scoring. For example, what condition codes (e.g., flags indicating responses that do not meet the minimal criteria to be eligible for rubric scoring) are you using in your program? What is your hand-scoring design in terms of single reads and second reads, and what statistics do you use to measure how reliable your hand-scoring process is? How are final scores assigned to responses for items? What data do you have available to train the engine? When you work with an automated scoring vendor, you will need to have this information on hand to ask informed questions and make informed decisions about what flags the engine uses to identify condition codes and how they align with yours, what benchmarks and criteria you will use to evaluate the engine, and how you set up your hybrid automated/human scoring design.
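As a rough illustration of the reliability statistics mentioned above, the sketch below computes exact agreement and quadratic weighted kappa between a first and second read. These two statistics are assumed here as examples; your program may use others. The function names and the sample scores are hypothetical.

```python
from collections import Counter

def exact_agreement(read1, read2):
    """Proportion of responses where both readers gave the same score."""
    return sum(a == b for a, b in zip(read1, read2)) / len(read1)

def quadratic_weighted_kappa(read1, read2, min_score, max_score):
    """Chance-corrected agreement that penalizes larger score gaps more.

    Assumes integer rubric scores in [min_score, max_score], max > min.
    """
    n = len(read1)
    span = (max_score - min_score) ** 2
    observed = Counter(zip(read1, read2))      # joint score counts
    marg1, marg2 = Counter(read1), Counter(read2)  # per-reader counts
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span
            num += weight * observed[(i, j)] / n
            den += weight * (marg1[i] / n) * (marg2[j] / n)
    # If both readers gave every response one identical score, den is 0.
    return 1.0 if den == 0 else 1.0 - num / den

# Hypothetical first and second reads on a 0-2 rubric.
first_read = [0, 1, 2, 2]
second_read = [0, 1, 2, 1]
agreement = exact_agreement(first_read, second_read)            # 0.75
qwk = quadratic_weighted_kappa(first_read, second_read, 0, 2)   # about 0.8
```

The same functions can be applied to engine scores against human scores, which gives one way to compare engine performance with your human-human baseline.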

Expect to collaborate. Adding automated scoring into the scoring design adds complexity to the overall process. There are a lot of decisions to make, both big and small, that require collaboration with the automated scoring experts. These decisions will occur throughout the assessment planning, implementation, and wrap-up phases. Decisions made during the preparation phase can include which items to score, which data to use to train the engine, which studies to run to evaluate performance, condition code choices, hybrid scoring designs, and communication with internal and external folks. Upon deployment, the decisions will involve monitoring the hybrid scoring process, resolving any issues that arise, and being able to answer and address questions from the field. After scoring, technical reports will need to be written, and presentations will need to be given to technical advisory committees and others.

If this feels overwhelming, you are not alone! Incorporating automated scoring is not as easy as just training and deploying models; it necessarily – and not surprisingly – involves careful consideration given that we want our tests to be as fair, accurate, and reliable as possible. Best of luck! 


