First Steps with Item Response Theory

Item time!

Dr. Ottavia M. Epifania

Rovereto (TN)

2023-11-17

Item fit


The fit of each item to the model can be evaluated:

\(S - X^2\) (Orlando & Thissen, 2000): Statistic based on the \(\chi^2\). If significant, the item does not fit the model (not recommended)


Root Mean Squared Deviation (RMSD): Difference between what is expected under the model and the observed data (the lower the better).

\(< .15\): acceptable fit of the item to the model

\(< .10\): optimal fit of the item to the model


  • Rasch model only \(\rightarrow\) Infit and outfit statistics

Evaluating item fit I

library(ggplot2)

# Read the responses; the first two columns are dropped below to keep the item columns
data = read.csv("data/itemClass.csv", header = TRUE, sep = ",")
# Proportion of correct responses for each item
prop_item = data.frame(item = names(colMeans(data[, -c(1:2)])), 
           proportion = colMeans(data[, -c(1:2)]))

# Bar plot of the proportions, items ordered from lowest to highest proportion correct
ggplot(prop_item, 
       aes(x = reorder(item, proportion), 
           y = proportion)) + 
  geom_bar(stat = "identity") + theme_light() + 
  ylab("Proportion correct") + ylim(0, 1) +
  theme(legend.position = "none", 
        axis.title = element_text(size = 26), 
        axis.title.x = element_blank(), 
        axis.text = element_text(size = 22))
library(TAM)
library(CDM)

m1pl = tam.mml(data[, grep("item", colnames(data))], verbose = F)
m2pl = tam.mml.2pl(data[, grep("item", colnames(data))], irtmodel = "2PL", verbose = F)
m3pl = tam.mml.3pl(data[, grep("item", colnames(data))], est.guess = grep("item", colnames(data)), verbose = F)
IRT.compareModels(m1pl, m2pl, m3pl) 
$IC
  Model   loglike Deviance Npars Nobs      AIC      BIC     AIC3     AICc
1  m1pl -5599.390 11198.78    11 1000 11220.78 11274.77 11231.78 11221.05
2  m2pl -5597.621 11195.24    20 1000 11235.24 11333.40 11255.24 11236.10
3  m3pl -5597.738 11195.48    31 1000 11257.48 11409.62 11288.48 11259.52
      CAIC       GHP
1 11285.77 0.5610390
2 11353.40 0.5617621
3 11440.62 0.5628738

$LRtest
  Model1 Model2       Chi2 df         p
1   m1pl   m2pl  3.5389531  9 0.9390614
2   m1pl   m3pl  3.3051431 20 0.9999906
3   m2pl   m3pl -0.2338101 11 1.0000000

attr(,"class")
[1] "IRT.compareModels"
fit_m1pl = tam.modelfit(m1pl, progress = F)
fit_m1pl$statlist 
  X100.MADCOV       SRMR      SRMSR     MADaQ3 pmaxX2
1   0.4318818 0.02178404 0.02800907 0.02874571      1

Evaluating item fit II

item_fit_1pl = IRT.itemfit(m1pl)
str(item_fit_1pl)
List of 11
 $ MD             :'data.frame':    10 obs. of  2 variables:
  ..$ item  : chr [1:10] "item1" "item2" "item3" "item4" ...
  ..$ Group1: num [1:10] -2.21e-06 1.09e-06 -4.23e-06 -4.09e-06 1.97e-06 ...
 $ RMSD           :'data.frame':    10 obs. of  2 variables:
  ..$ item  : chr [1:10] "item1" "item2" "item3" "item4" ...
  ..$ Group1: num [1:10] 0.01389 0.00794 0.01089 0.00595 0.01088 ...
 $ RMSD_bc        :'data.frame':    10 obs. of  2 variables:
  ..$ item  : chr [1:10] "item1" "item2" "item3" "item4" ...
  ..$ Group1: num [1:10] -0.00371 -0.01037 -0.00785 -0.01225 -0.00296 ...
 $ MAD            :'data.frame':    10 obs. of  2 variables:
  ..$ item  : chr [1:10] "item1" "item2" "item3" "item4" ...
  ..$ Group1: num [1:10] 0.01235 0.00612 0.00945 0.00538 0.00908 ...
 $ chisquare_stat :'data.frame':    10 obs. of  2 variables:
  ..$ item  : chr [1:10] "item1" "item2" "item3" "item4" ...
  ..$ Group1: num [1:10] 1.234 0.594 0.667 0.312 1.135 ...
 $ call           : language CDM::IRT.RMSD(object = object)
 $ G              : num 1
 $ RMSD_summary   :'data.frame':    1 obs. of  5 variables:
  ..$ Parm: chr "Group1"
....
item_fit_1pl$chisquare_stat
     item    Group1
1   item1 1.2342630
2   item2 0.5938130
3   item3 0.6668817
4   item4 0.3117211
5   item5 1.1353671
6   item6 0.3983689
7   item7 2.2867280
8   item8 0.4899905
9   item9 1.2476006
10 item10 0.3184401
item_fit_1pl$RMSD_summary 
    Parm          M          SD         Min        Max
1 Group1 0.01071287 0.004171214 0.005948698 0.01983321
item_fit_1pl$RMSD 
     item      Group1
1   item1 0.013892105
2   item2 0.007935321
3   item3 0.010894367
4   item4 0.005948698
5   item5 0.010882772
6   item6 0.008170491
7   item7 0.019833211
8   item8 0.009002479
9   item9 0.013681206
10 item10 0.006888076
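
The RMSD cut-offs given earlier can be applied directly to these values. The following is only a small sketch: the RMSD values are copied from the output above, and the label "poor" for values of .15 or more is an assumption made here, since the slides only name the acceptable and optimal ranges.

```r
# RMSD values copied from the item_fit_1pl$RMSD output above
rmsd = c(item1 = 0.013892105, item2 = 0.007935321, item3 = 0.010894367,
         item4 = 0.005948698, item5 = 0.010882772, item6 = 0.008170491,
         item7 = 0.019833211, item8 = 0.009002479, item9 = 0.013681206,
         item10 = 0.006888076)

# Cut-offs from the slides: < .10 optimal, < .15 acceptable
# ("poor" for values >= .15 is a label assumed for this sketch)
rmsd_label = cut(rmsd, breaks = c(-Inf, 0.10, 0.15, Inf),
                 labels = c("optimal", "acceptable", "poor"), right = FALSE)

data.frame(item = names(rmsd), RMSD = round(rmsd, 3), fit = rmsd_label)
```

All ten items fall well below .10, so under these cut-offs every item shows optimal fit to the 1PL model.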

Differential Item Functioning

An example



The mass of Iodine 131 decreases by 1/2 every 8 days because of radioactive decay.

In a laboratory, there are 2 grams of Iodine 131. How many grams would there be after 16 days?




Correct response: 0.5 grams

Item Characteristic Curve (ICC)



The same item, presented to two different groups paired on their level of the latent trait, does not have the same probability of being endorsed!


The subjects are paired according to their level of the latent trait. Are there any differences in their performance on the item?


Theoretically: Different subjects but with the same level of the latent trait (i.e., paired) should have similar performances on the item


If this expectation is not met \(\rightarrow\) Differential Item Functioning (DIF)

Reference vs. focal group



The comparison to investigate items with DIF is between two groups:


  • Reference group: It is the “baseline” group. For instance, if the test/questionnaire has been validated in a second language, the reference group is the original group

  • Focal group: It is the focus of the DIF investigation, where we suspect the items of the test might be working differently

Uniform DIF

The item “favors” one of the two groups (either the focal or the reference one) consistently along the whole latent trait

Non-uniform DIF

The item favors one of the groups, but the advantage is not constant along the latent trait
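
The two kinds of DIF can be illustrated with a pair of ICCs per group. This is a minimal sketch under a 2PL parameterization; all parameter values are hypothetical and chosen only to make the pattern visible.

```r
# 2PL item characteristic curve: probability of a correct response
# given theta, discrimination a and difficulty b
icc = function(theta, a, b) plogis(a * (theta - b))

theta = seq(-4, 4, by = 0.1)

# Uniform DIF (hypothetical values): same discrimination, different difficulty
p_ref_u = icc(theta, a = 1, b = 0)
p_foc_u = icc(theta, a = 1, b = 0.8)

# Non-uniform DIF (hypothetical values): different discrimination, same difficulty
p_ref_n = icc(theta, a = 1.5, b = 0)
p_foc_n = icc(theta, a = 0.6, b = 0)

# Uniform DIF: the reference group is favored at every level of theta
all(p_ref_u > p_foc_u)

# Non-uniform DIF: the sign of the difference changes along theta,
# that is, the two curves cross
range(sign(p_ref_n - p_foc_n))
```

With equal discriminations the curves never cross, so the advantage always goes to the same group; with different discriminations the favored group depends on where the subject sits on the latent trait.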

DIF Investigation

IRT-based methods: Subjects are paired according to the estimates of their latent trait level \(\theta\)

Score-based methods: Subjects are paired according to the observed score

Uniform DIF: 1PL, 2PL, 3PL

Non-uniform DIF: 2PL, 3PL

Likelihood Ratio Test

Thissen, Steinberg, & Wainer (1998)

Two IRT models:

  1. A “no DIF” model \(\rightarrow\) Parameters are constrained to be equal in the two groups

  2. A “DIF” model \(\rightarrow\) Parameters are left free to vary between the two groups

It is analogous to the LRT between two nested linear models, one that includes the group variable (focal vs. reference) as a predictor and one that does not
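
The linear-model analogy can be sketched with two nested logistic regressions on a single simulated item. This is an illustration only, not the IRT-based LRT itself: the data are simulated, the difficulty shift of 0.8 is hypothetical, and theta is treated as known rather than estimated.

```r
set.seed(1)
n = 1000
group = factor(rep(c("reference", "focal"), each = n / 2),
               levels = c("reference", "focal"))
theta = rnorm(n)

# Hypothetical item: difficulty 0 in the reference group, 0.8 in the focal group
b = ifelse(group == "focal", 0.8, 0)
resp = rbinom(n, size = 1, prob = plogis(theta - b))

# "No DIF" model: the group variable is not used
no_dif = glm(resp ~ theta, family = binomial)
# "DIF" model: the group variable enters as a predictor
dif = glm(resp ~ theta + group, family = binomial)

# Likelihood ratio test between the two nested models (1 degree of freedom)
lrt = anova(no_dif, dif, test = "LRT")
lrt
```

A significant test indicates that adding the group improves the prediction of the response beyond theta, which is the logic of a uniform DIF test.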

Lord

Lord (1980)

The item parameters are estimated in both the reference and the focal groups

If there is a significant difference in the item estimates between groups \(\rightarrow\) DIF

Beyond significance: Lord’s \(\Delta\) (in absolute value):

\(|\Delta| < 1.00\): Negligible DIF

\(1.00 < |\Delta| < 1.5\): Moderate DIF

\(|\Delta| > 1.5\): High DIF

Raju’s Area

Raju (1988)


It considers the DIF as the area between the ICCs of the item in the two groups


If the area between the ICCs is 0, then the item presents no DIF


It is based on a \(Z\) statistic under the hypothesis that the area between the ICCs of the item in the two groups is 0
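
For the 1PL, the unsigned area between two ICCs with the same discrimination equals the absolute difference between the two difficulties. The sketch below checks this numerically; the two difficulty values are illustrative (they correspond to item5 after rescaling, as obtained later in these slides), and the integration bounds of ±30 are an arbitrary choice that is wide enough for the tails to be negligible.

```r
# 1PL item characteristic curve (discrimination fixed to 1)
icc = function(theta, b) plogis(theta - b)

# Illustrative difficulties in the focal and (rescaled) reference group
b_foc = -1.309367
b_ref = -2.21591698

# Unsigned area between the two ICCs, by numerical integration
area = integrate(function(theta) abs(icc(theta, b_ref) - icc(theta, b_foc)),
                 lower = -30, upper = 30)$value

c(area = area, abs_difference = abs(b_foc - b_ref))
```

The two numbers agree up to numerical error, so under the 1PL testing whether the area is 0 amounts to testing whether the two difficulties are equal.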

A note on Lord and Raju


The parameters estimated in the focal and reference groups cannot be directly compared \(\rightarrow\) The parameters in one of the groups must be rescaled.

The rescaling can be done according to the equal means anchoring method (Cook & Eignor, 1991)

This method is already applied in the difR package

Equal means anchoring

Cook & Eignor (1991)

First, a constant must be computed:

\[c = \bar{b}_R - \bar{b}_F\]

Then, it is subtracted from the estimates of the items in the reference group:

\[b'_{Ri} = b_{Ri} - c\]
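
As a quick check of why this works, averaging the rescaled difficulties over the items shows that the reference group ends up with the same mean difficulty as the focal group, which places the two sets of estimates on a common scale:

\[\bar{b}'_R = \bar{b}_R - c = \bar{b}_R - (\bar{b}_R - \bar{b}_F) = \bar{b}_F\]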

DIF – Import data

The data set is difClass.csv

data = read.csv("data/difClass.csv", header = T, sep = ",")
head(data)
  id gender item1 item2 item3 item4 item5 item6 item7 item8 item9 item10
1  1      m     0     1     0     1     1     0     0     1     1      0
2  2      m     0     1     0     0     1     0     0     0     0      0
3  3      m     0     1     1     1     1     1     0     1     1      0
4  4      m     1     0     0     1     1     0     0     0     1      0
5  5      m     1     1     0     0     1     1     1     1     0      0
6  6      m     0     0     0     0     1     0     0     0     0      0

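library(tidyr)    # pivot_longer()
library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)

# Long format: one row per respondent-item pair
long = pivot_longer(data, !1:2, names_to = "item", 
             values_to = "correct")
# Proportion of correct responses (and SD) per item and gender
prop_gender = long %>% 
  group_by(item, gender) %>%  
  summarise(prop = mean(correct), sd = sd(correct))

ggplot(prop_gender, 
       aes(x = item, y = prop, fill = gender)) + 
  geom_bar(stat = "identity", position = position_dodge()) + ylim(0, 1)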
m1pl = tam.mml(data[, grep("item", colnames(data))], verbose = F)
m2pl = tam.mml.2pl(data[, grep("item", colnames(data))], irtmodel = "2PL", verbose = F)
m3pl = tam.mml.3pl(data[, grep("item", colnames(data))], est.guess = grep("item", colnames(data)),
                     verbose = F)
IRT.compareModels(m1pl, m2pl, m3pl)
$IC
  Model   loglike Deviance Npars Nobs      AIC      BIC     AIC3     AICc
1  m1pl -5599.390 11198.78    11 1000 11220.78 11274.77 11231.78 11221.05
2  m2pl -5597.621 11195.24    20 1000 11235.24 11333.40 11255.24 11236.10
3  m3pl -5597.736 11195.47    31 1000 11257.47 11409.61 11288.47 11259.52
      CAIC       GHP
1 11285.77 0.5610390
2 11353.40 0.5617621
3 11440.61 0.5628736

$LRtest
  Model1 Model2       Chi2 df         p
1   m1pl   m2pl  3.5389531  9 0.9390614
2   m1pl   m3pl  3.3079257 20 0.9999905
3   m2pl   m3pl -0.2310274 11 1.0000000

attr(,"class")
[1] "IRT.compareModels"
fit_m1pl = tam.modelfit(m1pl, progress = F)
fit_m1pl$statlist
  X100.MADCOV       SRMR      SRMSR     MADaQ3 pmaxX2
1   0.4318818 0.02178404 0.02800907 0.02874571      1
item_fit_1pl = IRT.itemfit(m1pl)
item_fit_1pl$RMSD_summary
    Parm          M          SD         Min        Max
1 Group1 0.01071287 0.004171214 0.005948698 0.01983321
 fit_m1pl$Q3_summary
  type             M         SD         min        max      SGDDM     wSGDDM
1   Q3 -6.608155e-02 0.03590523 -0.14242523 0.01903314 0.06742440 0.06742440
2  aQ3  3.468830e-18 0.03590523 -0.07634368 0.08511468 0.02874571 0.02874571

DIF – LRT

Strictly speaking, this is not the LRT but an approximation: a logistic regression DIF analysis that uses the LRT as its test statistic.

library(difR)

# EAP estimates of theta from the 1PL model, used as the matching variable
est_theta = IRT.factor.scores(m1pl)$EAP
lrt_dif = difGenLogistic(data[, !colnames(data) %in% c("id", "gender")],
                           group = as.factor(data$gender), focal.names = "f",
                           type = "udif",
                           alpha = .001, p.adjust.method = "BH",
                           match = est_theta,
                           criterion = "LRT")

Detection of uniform Differential Item Functioning
using Logistic regression method, without item purification
and with LRT DIF statistic

Matching variable: specified matching variable 
 
No set of anchor items was provided 
 
Multiple comparisons made with Benjamini-Hochberg adjustement of p-values 
 
Logistic regression DIF statistic: 
 
       Stat.   P-value Adj. P     
item1   9.5769  0.0020  0.0033 ** 
item2   7.8345  0.0051  0.0064 ** 
item3   2.8907  0.0891  0.0891 .  
item4   6.6418  0.0100  0.0111 *  
item5  28.1792  0.0000  0.0000 ***
item6  36.3876  0.0000  0.0000 ***
item7  18.8507  0.0000  0.0000 ***
item8   9.8524  0.0017  0.0033 ** 
item9  27.3310  0.0000  0.0000 ***
item10  8.9442  0.0028  0.0040 ** 

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  

Detection threshold: 10.8276 (significance level: 0.001)

Items detected as uniform DIF items:
      
 item5
 item6
 item7
 item9

 
Effect size (Nagelkerke's R^2): 
 
Effect size code: 
 'A': negligible effect 
 'B': moderate effect 
 'C': large effect 
 
       R^2    ZT JG
item1  0.0092 A  A 
item2  0.0086 A  A 
item3  0.0030 A  A 
item4  0.0069 A  A 
item5  0.0368 A  B 
item6  0.0394 A  B 
item7  0.0231 A  A 
item8  0.0098 A  A 
item9  0.0293 A  A 
item10 0.0130 A  A 

Effect size codes: 
 Zumbo & Thomas (ZT): 0 'A' 0.13 'B' 0.26 'C' 1 
 Jodoin & Gierl (JG): 0 'A' 0.035 'B' 0.07 'C' 1 
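
The "Adj. P" column can be approximately reproduced with base R's p.adjust(). This is only a sketch: the p-values are copied from the output above as printed, so they are rounded to four decimals and the match with the adjusted column is approximate by construction.

```r
# P-values as printed in the output above (rounded to 4 decimals)
p_raw = c(item1 = 0.0020, item2 = 0.0051, item3 = 0.0891, item4 = 0.0100,
          item5 = 0.0000, item6 = 0.0000, item7 = 0.0000, item8 = 0.0017,
          item9 = 0.0000, item10 = 0.0028)

# Benjamini-Hochberg adjustment for multiple comparisons
p_bh = p.adjust(p_raw, method = "BH")
round(p_bh, 4)

# Items whose adjusted p-value is below the chosen significance level
names(p_bh)[p_bh < .001]
```

With alpha = .001, only item5, item6, item7, and item9 remain flagged after the adjustment, as in the difR output.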


DIF – Raju

rajDif = difRaju(data[, !colnames(data) %in% c("id")], 
                  group = "gender",  focal.name = "f", 
                  model = "1PL", 
                  alpha = .001, p.adjust.method = "BH")

Detection of Differential Item Functioning using Raju's method 
with 1PL model and without item purification

Type of Raju's Z statistic: based on unsigned area 
 
Engine 'ltm' for item parameter estimation 
 
Common discrimination parameter: fixed to 1

No set of anchor items was provided 
 
Multiple comparisons made with Benjamini-Hochberg adjustement of p-values 
 
Raju's statistic: 
 
       Stat.   P-value Adj. P     
item1  -2.7358  0.0062  0.0104 *  
item2  -2.5503  0.0108  0.0154 *  
item3  -1.5133  0.1302  0.1302    
item4  -2.2916  0.0219  0.0244 *  
item5   4.4742  0.0000  0.0000 ***
item6   5.0150  0.0000  0.0000 ***
item7   3.6098  0.0003  0.0008 ***
item8   2.4605  0.0139  0.0173 *  
item9  -4.6929  0.0000  0.0000 ***
item10 -2.9145  0.0036  0.0071 ** 

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  

Detection thresholds: -3.2905 and 3.2905 (significance level: 0.001)

Items detected as DIF items: 
      
 item5
 item6
 item7
 item9

Effect size (ETS Delta scale): 
 
Effect size code: 
 'A': negligible effect 
 'B': moderate effect 
 'C': large effect 
 
       mF-mR   deltaRaju  
item1  -0.4190  0.9846   A
item2  -0.4244  0.9973   A
item3  -0.2458  0.5776   A
item4  -0.3677  0.8641   A
item5   0.9066 -2.1305   C
item6   0.8568 -2.0135   C
item7   0.6592 -1.5491   C
item8   0.3799 -0.8928   A
item9  -0.7692  1.8076   C
item10 -0.5764  1.3545   B

Effect size codes: 0 'A' 1.0 'B' 1.5 'C' 
 (for absolute values of 'deltaRaju') 
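
The detection thresholds printed by difR follow from the chosen significance level. A short sketch, assuming a two-sided test on a standard normal for Raju's Z and a chi-square with 1 degree of freedom for the logistic regression statistic:

```r
alpha = 0.001

# Two-sided critical value of the standard normal (Raju's Z)
z_crit = qnorm(1 - alpha / 2)

# Critical value of the chi-square with 1 degree of freedom
chi2_crit = qchisq(1 - alpha, df = 1)

round(c(z_crit = z_crit, chi2_crit = chi2_crit, z_crit_squared = z_crit^2), 4)
```

The squared normal threshold coincides with the chi-square threshold, which is why the two criteria are equivalent at the same alpha.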
 

DIF – Lord

lordDif = difLord(data[, !colnames(data) %in% "id"], 
                  group = "gender",  focal.name = "f", 
                  model = "1PL", 
                   alpha = .001, p.adjust.method = "BH")

Detection of Differential Item Functioning using Lord's method 
with 1PL model and without item purification

Engine 'ltm' for item parameter estimation 
 
Common discrimination parameter: fixed to 1

No set of anchor items was provided 
 
Multiple comparisons made with Benjamini-Hochberg adjustement of p-values 
 
Lord's chi-square statistic: 
 
       Stat.   P-value Adj. P     
item1   7.4848  0.0062  0.0104 *  
item2   6.5039  0.0108  0.0154 *  
item3   2.2901  0.1302  0.1302    
item4   5.2515  0.0219  0.0244 *  
item5  20.0188  0.0000  0.0000 ***
item6  25.1505  0.0000  0.0000 ***
item7  13.0309  0.0003  0.0008 ***
item8   6.0539  0.0139  0.0173 *  
item9  22.0232  0.0000  0.0000 ***
item10  8.4941  0.0036  0.0071 ** 

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1  
 
Items detected as DIF items: 
      
 item5
 item6
 item7
 item9

Effect size (ETS Delta scale): 
 
Effect size code: 
 'A': negligible effect 
 'B': moderate effect 
 'C': large effect 
 
       mF-mR   deltaLord  
item1  -0.4190  0.9846   A
item2  -0.4244  0.9973   A
item3  -0.2458  0.5776   A
item4  -0.3677  0.8641   A
item5   0.9066 -2.1305   C
item6   0.8568 -2.0135   C
item7   0.6592 -1.5491   C
item8   0.3799 -0.8928   A
item9  -0.7692  1.8076   C
item10 -0.5764  1.3545   B

Effect size codes: 0 'A' 1.0 'B' 1.5 'C' 
 (for absolute values of 'deltaLord') 
 

Details on Raju and Lord

mF-mR: Difference between the item difficulty estimates in the focal and reference groups (the estimates in the reference group are already rescaled)

deltaLord/deltaRaju: Effect size, obtained by multiplying mF-mR by \(-2.35\) (Penfield & Camilli, 2007)

The size of the effect can be interpreted according to the values reported in the R output
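
The conversion and the A/B/C coding can be sketched as follows. The mF-mR values are copied from the output above as printed (four decimals), so the resulting deltas can differ from the printed ones in the last digit.

```r
# mF-mR values copied from the difLord / difRaju output above
mF_mR = c(item1 = -0.4190, item2 = -0.4244, item3 = -0.2458, item4 = -0.3677,
          item5 = 0.9066, item6 = 0.8568, item7 = 0.6592, item8 = 0.3799,
          item9 = -0.7692, item10 = -0.5764)

# ETS Delta scale: multiply the difference by -2.35
delta = -2.35 * mF_mR

# Effect size codes reported in the output, applied to absolute values:
# 0 'A' 1.0 'B' 1.5 'C'
effect = cut(abs(delta), breaks = c(0, 1, 1.5, Inf),
             labels = c("A", "B", "C"), right = FALSE)

data.frame(mF_mR, delta = round(delta, 4), effect)
```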

The item parameters can be obtained as:

lordDif$itemParInit

Caution advised!

The object obtained from lordDif$itemParInit has a number of rows equal to twice the number of items \(I\):

Rows \(1, \ldots, I\): Estimates of the items in the REFERENCE GROUP

Rows \(I+1, \ldots, 2I\): Estimates of the items in the FOCAL GROUP

Rows \(1, \ldots, I\): These estimates are not rescaled

item_par = lordDif$itemParInit
item_par[1:10, ]
                b     se(b)
Item1   0.1003314 0.1081049
Item2  -1.0813481 0.1167024
Item3   0.9287588 0.1141034
Item4   0.8852199 0.1135280
Item5  -2.5345451 0.1625745
Item6   0.5418750 0.1100347
Item7   1.0626916 0.1160698
Item8  -0.4051660 0.1092741
Item9  -0.7726385 0.1124324
Item10  2.1299052 0.1442241

Rows \(I+1, \ldots, 2I\)

item_par[11:nrow(item_par), ]
                   b     se(b)
Item1   1.621222e-06 0.1084637
Item2  -1.187073e+00 0.1186087
Item3   1.001565e+00 0.1156194
Item4   8.361141e-01 0.1134093
Item5  -1.309367e+00 0.1209242
Item6   1.717348e+00 0.1307057
Item7   2.040499e+00 0.1409710
Item8   2.933326e-01 0.1090660
Item9  -1.223203e+00 0.1192650
Item10  1.872148e+00 0.1353199

Rescaling – Practice

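# Focal-group rows (11 to 20) first, then reference-group rows (1 to 10)
itemFR = data.frame(cbind(item_par[11:nrow(item_par), ], item_par[1:10, ]))
colnames(itemFR)[c(1,3)] = paste0(rep("b",2),  c("F", "R"))
# Equal means anchoring constant: mean(b_R) - mean(b_F)
itemFR$constant = mean(itemFR$bR) - mean(itemFR$bF)
# Rescaled difficulties of the reference group
itemFR$new_bR = itemFR$bR - itemFR$constant
# DIF as focal minus rescaled reference (corresponds to mF-mR in the difR output)
itemFR$DIF_correct = itemFR$bF - itemFR$new_bR
itemFR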
                  bF     se.b.         bR   se.b..1   constant      new_bR
Item1   1.621222e-06 0.1084637  0.1003314 0.1081049 -0.3186282  0.41895953
Item2  -1.187073e+00 0.1186087 -1.0813481 0.1167024 -0.3186282 -0.76271998
Item3   1.001565e+00 0.1156194  0.9287588 0.1141034 -0.3186282  1.24738700
Item4   8.361141e-01 0.1134093  0.8852199 0.1135280 -0.3186282  1.20384806
Item5  -1.309367e+00 0.1209242 -2.5345451 0.1625745 -0.3186282 -2.21591698
Item6   1.717348e+00 0.1307057  0.5418750 0.1100347 -0.3186282  0.86050314
Item7   2.040499e+00 0.1409710  1.0626916 0.1160698 -0.3186282  1.38131979
Item8   2.933326e-01 0.1090660 -0.4051660 0.1092741 -0.3186282 -0.08653783
Item9  -1.223203e+00 0.1192650 -0.7726385 0.1124324 -0.3186282 -0.45401035
Item10  1.872148e+00 0.1353199  2.1299052 0.1442241 -0.3186282  2.44853338
       DIF_correct
Item1   -0.4189579
Item2   -0.4243530
Item3   -0.2458222
Item4   -0.3677340
Item5    0.9065505
Item6    0.8568448
Item7    0.6591791
Item8    0.3798704
Item9   -0.7691922
Item10  -0.5763855
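
For the 1PL, the statistics printed by difRaju and difLord are numerically consistent with a simple ratio of the rescaled difference to its standard error. The sketch below checks this for item5 using the values printed above; it is a check against the output, not a description of difR's internal code.

```r
# Item5: estimates and standard errors copied from the itemFR output above
bF = -1.309367       # focal group
se_F = 0.1209242
new_bR = -2.21591698 # reference group, rescaled
se_R = 0.1625745

# Difference divided by its standard error
z = (bF - new_bR) / sqrt(se_F^2 + se_R^2)

# Compare with Raju's Z (4.4742) and Lord's chi-square (20.0188) for item5
round(c(z = z, chi2 = z^2), 4)
```

Under the 1PL, Lord's chi-square is therefore the square of Raju's Z, which is why the two methods return identical p-values and flag the same items.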