Abstract:
Since scoring oral language proficiency is performed by raters, they are an essential part of performance assessment. One important feature of raters is their teaching and rating experience, which has attracted considerable attention. In most previous studies on rater training, extremely severe or lenient raters benefited more from training programs, and training accordingly produced a significant reduction in the severity/leniency of their rating behavior. However, these studies mostly applied FACETS to only one or two facets, and few used a pre-/post-training design. Moreover, empirical studies have reported contrasting outcomes, without showing clearly which group of raters rates more reliably. In this study, 20 experienced and inexperienced raters rated the oral performances produced by 200 test-takers before and after a training program. The results indicated that training leads to higher interrater consistency and reduces bias in the use of rating scale categories. Moreover, since rater variability can hardly be eradicated completely even with training, rater training is better regarded as a procedure for making raters more self-consistent (intrarater reliability) than for making them consistent with each other (interrater reliability). The findings of this study indicated that both inexperienced and experienced raters’ rating quality improved after training; however, inexperienced raters showed greater gains in consistency and larger reductions in bias. Hence, there is no evidence that inexperienced raters should be excluded from rating solely because they lack adequate experience. Furthermore, inexperienced raters cost less than experienced ones, making them more economical for decision-makers. Therefore, instead of spending a large budget on experienced raters, decision-makers would do better to use that budget to establish better training programs.
Keywords: Bias; Interrater consistency; Intrarater consistency; Multi-faceted Rasch measurement (MFRM); Rater expertise

Introduction

In scoring second language speaking performance, rater variability has been identified as a potential source of measurement error which might interfere with the measurement of test-takers’ true speaking ability (McNamara, 1996).
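For reference, the many-facet Rasch model that underlies FACETS analyses of this kind is conventionally written as follows (this is Linacre's standard rating-scale formulation; the symbol names are the customary ones, not taken from this article):

```latex
% Many-facet Rasch (rating scale) model:
% probability of test-taker n receiving category k rather than k-1
% from rater j on task i
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
% B_n : ability of test-taker n
% D_i : difficulty of task i
% C_j : severity of rater j
% F_k : threshold of scale category k relative to category k-1
```

On this view, rater training aims to narrow the spread of the severity estimates C_j and shrink bias interactions, rather than to eliminate rater variability altogether.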
A variety of studies on experienced and inexperienced raters’ performances have indicated higher inter-rater consistency following training (Ahmadi & Sadeghi, 2016; Attali, 2016; Bijani & Fahim, 2011; Cumming, 1990).
In a study by Bijani (2010) on the effect of rater training on raters’ inconsistency in scoring test-takers’ written language proficiency, the consistency of inexperienced raters improved much more after training compared to experienced raters.
Ahmadi and Sadeghi (2016) studied a group of experienced and inexperienced language teachers and provided both with one-day training in rating oral performance tests.
Table 1 displays the measurement report for New and Old raters, along with the chi-square results and significance level, at the pre-training phase.
At the pre-training phase, New raters showed a larger number of significant bias interactions with test-takers than Old raters (337 vs. 307).
Table: Bias Interaction Frequency between Raters at Each Expertise Level and Test-takers for Various Bias Logit Ranges (Pre-Training)

To examine each rater group’s bias towards each particular category of the rating scale, a bias interaction analysis was performed for each category.
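In FACETS terms, a bias analysis of this kind estimates an interaction term added to the basic model. A sketch, using conventional notation (not this article's own symbols) for a rater-by-category interaction:

```latex
% Basic model extended with a rater x category bias term:
\log\!\left(\frac{P_{njk}}{P_{nj(k-1)}}\right) = B_n - C_j - F_k - \phi_{jk}
% \phi_{jk} : bias of rater j when using scale category k;
% a significant \phi_{jk} flags systematic over- or under-use
% of that category by that rater (or rater group).
```

Counting how many of these interaction terms reach significance in each logit range is what yields frequency tables like the one above.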