Fernandez C1,2 • Rusk S1,2 • Glattard N1,2 • Hensen B2 • Shokoueinejad M3 • Creado S2 • Hungerford J4
Inter-scorer variability is a persistent challenge in sleep medicine. The inter-scorer reliability (ISR) program aids clinicians in assessing inter-scorer variability and supports accreditation. Leave-one-out cross-validation (LOOCV) is a powerful technique borrowed from machine learning for evaluating how well a statistical analysis generalizes to an independent dataset. In the present study, we adapt the LOOCV approach to ISR analysis, proposing a novel application of the methodology: we introduce the concept of overfitting in the sleep-scoring context and characterize its impact on the reproducibility of ISR assessment.
A cohort (N=72) was selected using stratified sampling with proportionate allocation to control for sleep apnea severity, medical conditions, medications, and demographic factors. The cohort was scored by four independent sleep technologists (RPSGT). ISR was assessed using epoch-by-epoch agreement for sleep stages and for respiratory, arousal, and movement events under two LOOCV settings. First, average agreement was calculated with each clinician in turn serving as the designated reference scorer (DRS), compared against each of the three “held-out” clinicians. Second, average agreement was calculated by constructing a DRS from events marked by a 2/3 majority of the remaining clinicians, compared against the fourth, “held-out” clinician.
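The two LOOCV settings described above can be sketched as follows. This is a minimal illustration only: the binary event marks are toy data, the function names are hypothetical, and simple percent agreement stands in for the study's actual epoch-by-epoch agreement pipeline.

```python
import numpy as np

def pairwise_agreement(ref, other):
    """Fraction of epochs on which two scorers agree."""
    return float(np.mean(ref == other))

def loocv_individual(scores):
    """Setting 1: each clinician in turn is the designated reference
    scorer (DRS); average agreement against each held-out clinician."""
    n = len(scores)
    agreements = [pairwise_agreement(scores[i], scores[j])
                  for i in range(n) for j in range(n) if i != j]
    return float(np.mean(agreements))

def loocv_majority(scores):
    """Setting 2: hold out one clinician; build the DRS from epochs
    marked by a 2/3 majority of the remaining three clinicians, then
    compare the DRS to the held-out clinician."""
    n = len(scores)
    agreements = []
    for held_out in range(n):
        rest = np.array([scores[i] for i in range(n) if i != held_out])
        # Event present in the DRS if >= 2 of the 3 remaining scorers marked it.
        drs = (rest.sum(axis=0) >= 2).astype(int)
        agreements.append(pairwise_agreement(drs, scores[held_out]))
    return float(np.mean(agreements))

# Toy binary event marks (1 = event scored in epoch), 4 scorers x 8 epochs.
scores = np.array([
    [1, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 0, 0],
])
```

On this toy data the consensus-based setting yields a higher mean agreement than the individual setting, mirroring the direction of the reported results: the majority-vote DRS averages away idiosyncratic marks made by a single scorer.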
Across all event types, 42–60% of event-epochs were marked as containing an event by only one of the four clinicians, while all four clinicians agreed on only 6–14% of the event-epochs evaluated. No statistically significant differences were observed between the percentage of event-epochs marked by the 2/3 majority and the percentage marked by clinicians individually. However, observed agreement estimates were higher for all event types in the 2/3-majority setting than in the individual setting.
Cross-validation presents an opportunity to improve the generalization of agreement estimates in ISR assessments. This work demonstrates that consensus-based DRSs can be constructed and used for ISR assessments. Given the substantial percentage of epochs marked by a single clinician, a consensus-based reference can regularize overfit scenarios in which inter-scorer variability would otherwise be amplified by artifacts of an individual’s scoring. Cross-validation approaches may therefore enable measurement of scoring agreement with greater reproducibility.