Beyond Words - Language Blog

Multiple-Choice Test Development 101

We’ve all had the experience of taking a multiple-choice test — you’re given a question and then have to choose the correct answer from a group of choices, usually “a,” “b,” “c,” and “d.”

From a test-taker’s perspective, these can sometimes be very intimidating. Does the following line of reasoning sound familiar?

Okay, the correct answer is there somewhere, and “a” looks pretty good — but wait, so does “c.” But “c” was the answer to the last three questions. Would they have made “c” the correct answer to four questions in a row? I doubt it. So maybe it’s “a.” But it could be “c” also. I just don’t know… It’s definitely not “d,” but “b” is also looking like a possibility now. Yeah, “a” could be right…

While test-takers have the difficult job of actually taking the test, the test-developers have a difficult task as well. Certainly, they are concerned with how the test questions are written, but another chief concern is how the answer choices perform. This includes all of them — the right ones and the wrong ones.

Each test item is made up of three parts: the stem (the question), the key (the correct answer choice), and the distractors (the incorrect answer choices). The key and the distractors are sometimes referred to collectively as the alternatives. The idea behind each item is to present a question that a test-taker with the requisite level of knowledge about the particular skill area would answer correctly when given a collection of possible answers. A test-taker with less than the requisite level of knowledge might still choose the right answer (sometimes by guessing), but there is some likelihood that he or she will select one of the distractors — if the question and answers are created properly, that is. Simply put, each question has the job of distinguishing between test-takers who have the level of knowledge needed to pass the test and those who do not.

When writing each question, test-developers need to ensure that there is one — and only one — best answer to each question. But they also need to ensure that the distractors are doing their job of, well, distracting the less-skilled test-takers. Each distractor must be a plausible option; otherwise it does not serve this purpose. Take a look at the following example:

1. According to the passage, the Earth’s core contains significant levels of which elements?

a. nickel, iron, and gold
b. iron, sulfur, and carbon
c. oxygen, lithium, and neon
d. red, white, and blue

In this example, option “d” is clearly ridiculous and could be quickly eliminated, giving less-skilled test-takers a greater probability of selecting the correct option. Therefore, this answer choice is not doing its job of helping to distinguish among candidates. The test-developer would want an alternative that followed the form of the other three options.

So how does the developer know how well the answer choices are performing?

After the administration of a test, the responses can be analyzed to see what percentage of candidates selected each option. The analysis is performed to determine a figure that test-developers call Item Facility (IF).

The IF is used to quantify the difficulty of each test item. It is equal to the number of correct answers divided by the number of respondents. An IF value of 1.00 means that all respondents got the question correct, and therefore the item may be too easy. Conversely, an IF of 0.00 means no one got the question correct, and the item may be too difficult. In norm-referenced testing, items with extreme values would be considered of little value for distinguishing performance and would be removed or modified, as this type of testing is designed to spread out examinees’ scores across a normal distribution.
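The calculation above is simple enough to sketch in a few lines of code. Here is a minimal illustration in Python — the function name, the sample responses, and the answer key are all invented for the example — that computes both the IF for an item and the share of respondents who chose each option:

```python
from collections import Counter

def item_analysis(responses, key):
    """Compute Item Facility (IF) and the share of respondents
    selecting each option for one multiple-choice item.

    responses: list of the options chosen, e.g. ["a", "c", "b", ...]
    key: the correct option for this item, e.g. "b"
    """
    counts = Counter(responses)
    total = len(responses)
    # IF = number of correct answers / number of respondents
    item_facility = counts[key] / total
    # Fraction of respondents choosing each alternative
    option_shares = {opt: counts[opt] / total for opt in sorted(counts)}
    return item_facility, option_shares

# Hypothetical item answered by 10 respondents; the key is "b"
responses = ["b", "b", "a", "b", "c", "b", "b", "d", "b", "a"]
if_value, shares = item_analysis(responses, key="b")
print(if_value)  # 0.6
print(shares)    # {'a': 0.2, 'b': 0.6, 'c': 0.1, 'd': 0.1}
```

In this made-up sample, the IF of 0.6 sits at the edge of the range often targeted for norm-referenced items, while the per-option shares would let a developer spot a distractor that no one is choosing.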

The ideal IF range for these test items is between 0.4 and 0.6, meaning that the majority of test-takers’ scores are clustered under the “bell-shaped curve.” However, criterion-referenced testing has to do with the mastery of the skill or subject matter being tested, and in these cases, it is common to see more extreme IF values.

From this type of analysis, among others, test-developers can go back and improve the quality of the items so that the whole test performs well in determining who has the skills to succeed at whatever the test is measuring.