Li Chen 1,2, Jing Han 1, Jing Wang 3, Yan Tu 2, Lei Bao 1
1 Department of Physics, The Ohio State University, Columbus, Ohio 43210, United States
2 Southeast University, Nanjing, Jiangsu, China
3 Eastern Kentucky University, Richmond, KY 40475, United States
*Authors to whom correspondence should be addressed.
Received: 2011-08-23 / Accepted: 2011-12-19 / Published: 2011-12-30
Abstract: Item Response Theory (IRT) is a popular assessment method widely used in educational measurement, and several software packages are commonly used to perform IRT analysis. In physics education research, using IRT to analyze concept tests is gaining popularity. It is therefore useful to understand whether, and to what extent, software packages may perform differently on physics concept tests. In this study, we compare the results of the 3-parameter IRT model in R and in MULTILOG using data from college students on a physics concept test, the Force Concept Inventory. The results suggest that, while both methods generally produce consistent estimates of the item parameters, some systematic variations can be observed. For example, both methods produce nearly identical estimates of item difficulty, whereas the discrimination estimated with R is systematically higher than that estimated with MULTILOG. The guessing parameters, which depend on whether "pre-processing" is implemented in MULTILOG, also vary observably. This variability in the estimates raises concerns about the validity of IRT methods for evaluating students' scaled abilities. Further analysis has therefore been conducted to determine the range of differences between the student abilities estimated with the two models. A comparison of the goodness of fit of the various estimations is also discussed: R produces better fits at low proficiency levels but falls behind at the high end of the ability spectrum.
Research Areas: Learning theory