Clinical Validation of AI Scoring in Adult and Pediatric Clinical PSG Samples Compared to Prospective, Double-Blind Scoring Panel

Chris R. Fernandez, MS¹ • Sam Rusk, BS¹ • Yoav N. Nygate, MS¹ • Nick Glattard, MS¹ • Fred Turkington, BS¹ • Nathaniel Watson, MD, MSc²

Introduction

Despite an appreciable rise in sleep wellness and sleep medicine A.I. research publications, public data corpora, institutional support, and health A.I. research funding opportunities, clinical validation evidence of controlled-retrospective, hybrid-retrospective-prospective, and prospective-RCT quality remains limited relative to the potential clinical impact of these technologies.

Furthermore, only a few practical examples of A.I. technologies that assist in sleep diagnosis and treatment are validated, in clinical use today, and widely adopted.

In this study, we contribute to this growing body of clinical A.I. validation evidence and experimental design methodologies with an interoperable A.I. scoring engine in Adult and Pediatric populations.

Methods

Stratified random sampling with proportionate allocation was applied to a database of N>10,000 retrospective diagnostic clinical polysomnography (PSG) studies, selected by evidence grading standards, to establish representative N=100 Adult and N=100 Pediatric samples. Controls were applied for OSA severity; diagnoses (sleep, psychiatric, neurologic, neurodevelopmental, cardiac, pulmonary, and metabolic disorders); medications (benzodiazepines, antidepressants, stimulants, opiates, and sleep aids); demographic groups of interest (sex, adult age, pediatric age, BMI, weight, and height); and patient-reported sleepiness.
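As an illustration of proportionate allocation, the sketch below draws a fixed-size sample in which each stratum's share matches its share of the source database. It is a minimal example under stated assumptions and not the study's actual sampling code: the function name, the single "sev" stratification key, and the toy database are hypothetical, and the study controlled for many more variables than this one.

```python
import random
from collections import Counter, defaultdict

def proportionate_stratified_sample(records, stratum_of, n, seed=0):
    """Draw n records so each stratum is represented in proportion to its size."""
    if n > len(records):
        raise ValueError("sample size exceeds database size")
    rng = random.Random(seed)

    # Group records by stratum label.
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    # Ideal (fractional) quota per stratum, then largest-remainder rounding
    # so the integer allocations sum exactly to n.
    total = len(records)
    quotas = {k: n * len(v) / total for k, v in strata.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    # Simple random sampling without replacement inside each stratum.
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, alloc[k]))
    return sample

# Hypothetical toy database: 1,000 studies labeled by one made-up severity key.
labels = ["none"] * 500 + ["mild"] * 300 + ["moderate"] * 150 + ["severe"] * 50
database = [{"id": i, "sev": s} for i, s in enumerate(labels)]
sample = proportionate_stratified_sample(database, lambda r: r["sev"], n=100, seed=1)
counts = Counter(r["sev"] for r in sample)  # none 50, mild 30, moderate 15, severe 5
```

Stratifying on several variables at once would amount to using a tuple of labels as the stratum key, at the cost of many small strata.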

Double-blind scoring was prospectively collected for each sample by 3 experienced RPSGT-certified sleep technologists randomized from a pool of 9 scorers.
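One way to randomize 3 scorers per study from a pool of 9 is sketched below. The abstract does not describe the randomization procedure (for example, whether scorer workload was balanced), so this is only an assumed, unbalanced scheme; the function name and the scorer and study identifiers are hypothetical.

```python
import random

def assign_scorers(study_ids, scorer_pool, per_study=3, seed=0):
    """Independently draw `per_study` distinct scorers for every study."""
    rng = random.Random(seed)
    return {sid: sorted(rng.sample(scorer_pool, per_study)) for sid in study_ids}

# Hypothetical identifiers: 9 scorers, 100 Adult plus 100 Pediatric studies.
pool = [f"scorer_{i}" for i in range(1, 10)]
studies = [f"adult_{i:03d}" for i in range(100)] + [f"peds_{i:03d}" for i in range(100)]
panel = assign_scorers(studies, pool, per_study=3, seed=42)
```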

Sensitivity (PA), Specificity (NA), Accuracy (OA), Kappa (K), and 95% bootstrap CIs are presented for sleep stages, OSA/CSA, hypopnea 3%/4%, arousals, limb movements, Cheyne-Stokes respiration, periodic breathing, atrial fibrillation, and other events, as well as for normative, mild, moderate, and severe OSA categories based on global-AHI and REM-AHI.
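The agreement statistics named above can be computed from a 2x2 table of reference versus A.I. labels, with a percentile bootstrap for the confidence interval. The sketch below is illustrative only: it assumes binary labels (1 = event present), uses Cohen's kappa for K, and resamples individual observations, whereas the abstract does not state the kappa variant, the bootstrap method, or the resampling unit. The function names and toy labels are made up.

```python
import random

def agreement_metrics(reference, predicted):
    """Sensitivity, specificity, accuracy, and Cohen's kappa for binary labels."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    n = tp + tn + fp + fn
    nan = float("nan")
    acc = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "accuracy": acc,
        "kappa": (acc - pe) / (1 - pe) if pe != 1 else nan,
    }

def bootstrap_ci(reference, predicted, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling paired observations with replacement."""
    rng = random.Random(seed)
    n = len(reference)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = agreement_metrics([reference[i] for i in idx],
                                  [predicted[i] for i in idx])[metric]
        if value == value:  # drop NaN from degenerate resamples
            stats.append(value)
    if not stats:
        raise ValueError("metric undefined in every resample")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[max(int((1 - alpha / 2) * len(stats)) - 1, 0)]
    return lower, upper

# Made-up toy labels: 40 positives (36 detected), 60 negatives (57 rejected).
reference = [1] * 40 + [0] * 60
predicted = [1] * 36 + [0] * 4 + [0] * 57 + [1] * 3
metrics = agreement_metrics(reference, predicted)
# sensitivity 0.90, specificity 0.95, accuracy 0.93, kappa about 0.854
ci_low, ci_high = bootstrap_ci(reference, predicted, "accuracy", seed=7)
```

For multi-class sleep staging, the same functions could be applied one stage at a time (stage versus all other stages). If observations within a study are correlated, resampling whole studies rather than single observations would be the more conservative choice.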

Results for Sleep Staging and OSA Severity Diagnostic Accuracy are summarized.

Results

A.I. scoring performance met, and in most cases exceeded, the PA, NA, OA, and K point estimates and confidence intervals of the initial clinical validation study (N=72 Adults, 2017) for the 26 event types and 8 AHI-categories evaluated.

The Adult sample showed 87%/94% Sensitivity/Specificity across all stages (Wake/N1/N2/N3/REM) and 94%/96% Sensitivity/Specificity for AHI>=15.

The Pediatric sample showed 87%/93% Sensitivity/Specificity for staging and 89%/98% Sensitivity/Specificity for AHI>=15.

Observed Accuracy was >90% in both the Adult and Pediatric samples for all 26 events and 7 AHI-categories analyzed; the exception was REM-AHI>=5 (85%/82% Adults/Pediatrics).

Conclusion

We provide clinical validation evidence that demonstrates interoperable A.I. scoring performance in representative Adult and Pediatric clinical PSG samples when compared to a prospective, double-blind scoring panel.

¹ EnsoData Research, ² Department of Neurology, University of Washington School of Medicine