P206 Artificial Intelligence reproduces central reading for automated scoring of colonoscopies from multiple clinical trials in Ulcerative Colitis.

Gutierrez Becker, B.(1)*; Arús-Pous, J.(1); Yao, H.(2); Luscher, J.(2); Richmond, D.(2); Fisher, E.(3); Bojic, D.(4); Bigorgne, A.(3); Prunotto, M.(3)

(1) Roche, Pharma Research & Early Development, Data and Analytics, Basel, Switzerland; (2) Genentech, AI/ML Research Biology, South San Francisco, United States; (3) Roche, Technology and Translational Research Product Development Immunology, Infectious Diseases and Ophthalmology, Basel, Switzerland; (4) Roche, Product Development Clinical I2O Gastroenterology, Basel, Switzerland


Artificial Intelligence (AI) has been used successfully for the automated scoring of colonoscopy videos with the Mayo Clinic Endoscopic Subscore (MCES) in several models [1]. However, no study to date has evaluated the performance of such algorithms on a new cohort obtained from a separate clinical trial. In this study, we aimed to build an AI scoring algorithm able to reproduce central reading from three Phase III clinical trials in Ulcerative Colitis (UC).


As a training dataset, still frames were automatically extracted from sigmoidoscopy videos of two etrolizumab clinical trials: Hickory (NCT02100696) [2] and Laurel (NCT02165215) [3]. A quality control AI algorithm [4] was applied, and the resulting dataset (N=1,897 videos) was paired with the central reading MCES for each colon section. A multiclass MCES scoring algorithm [4] was trained on this dataset, with the challenging task of assigning a multiclass score (MCES from 0 to 3) instead of a binary score (MCES ≤1 or not). Performance was assessed on 2,292 videos from independent cohorts: Hibiscus I (NCT02163759) and Hibiscus II (NCT02171429) [5], and Gardenia (NCT02136069) [6], with N=637, 636 and 1,019 videos, respectively. These videos were pre-processed in the same manner as those in the training dataset. This dataset constitutes the largest cohort to date for the evaluation of an MCES scoring algorithm.
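The difference between the multiclass target and the binary target described above can be illustrated with a minimal sketch. The helper names and the example scores are hypothetical and are not taken from the study's code or data; only the score range (0 to 3) and the binary cut-off (MCES of 1 or lower) come from the text.

```python
# Hypothetical helpers for illustration only; not the study's implementation.
VALID_MCES = (0, 1, 2, 3)


def to_binary(mces: int) -> int:
    """Collapse a multiclass MCES (0-3) into the binary target: 1 if MCES <= 1, else 0."""
    if mces not in VALID_MCES:
        raise ValueError(f"MCES must be one of {VALID_MCES}, got {mces}")
    return 1 if mces <= 1 else 0


def one_vs_rest(labels, target):
    """Indicator per video: 1 if the central-reading MCES equals `target`, else 0."""
    return [1 if y == target else 0 for y in labels]


# Made-up central-reading scores for six videos.
central = [0, 1, 2, 3, 3, 1]
binary = [to_binary(y) for y in central]
is_mces_3 = one_vs_rest(central, 3)
```

The multiclass task keeps all four grades apart, whereas the binary task only separates MCES 0 and 1 from MCES 2 and 3. The one-vs-rest indicator is the form of label typically needed when a per-grade metric is computed.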


The Area Under the Receiver Operating Characteristic curve (AUROC) was evaluated independently for each trial. Table 1 summarizes AUROC values per MCES for each clinical trial. The resulting mean AUROC values were 0.79 for Hibiscus I and 0.77 for both Hibiscus II and Gardenia. The average Cohen's kappa between the automatic measurements and the central readings was 0.40, compared with 0.41 between central and local readers.
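For readers unfamiliar with these metrics, the following is a minimal sketch of how a per-MCES AUROC, its mean, and Cohen's kappa could be computed. It rests on several assumptions not stated in the abstract: that each video has one score per MCES class, that per-class AUROC is computed one-vs-rest, that the mean is an unweighted average over the four classes, and that kappa is unweighted. All data below are made up and all function names are hypothetical.

```python
from collections import Counter


def auroc(y_true, scores):
    """Binary AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up example: central-reading MCES and per-class model scores for six videos.
central = [0, 0, 1, 2, 3, 3]
class_scores = [
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.7, 0.2, 0.0],
    [0.2, 0.6, 0.2, 0.0],
    [0.0, 0.2, 0.6, 0.2],
    [0.0, 0.1, 0.2, 0.7],
    [0.0, 0.1, 0.5, 0.4],
]

# One-vs-rest AUROC for each MCES grade, then the unweighted mean (an assumption).
per_class_auroc = {}
for grade in range(4):
    indicator = [1 if y == grade else 0 for y in central]
    grade_scores = [row[grade] for row in class_scores]
    per_class_auroc[grade] = auroc(indicator, grade_scores)
mean_auroc = sum(per_class_auroc.values()) / len(per_class_auroc)

# Predicted grade = highest-scoring class; agreement with central reading via kappa.
predicted = [max(range(4), key=lambda g: row[g]) for row in class_scores]
kappa = cohen_kappa(central, predicted)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a natural way to compare algorithm-versus-central agreement with central-versus-local agreement on the same scale. For ordinal scores such as the MCES, a weighted kappa is a common alternative; the abstract does not say which variant was used.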


Our automated MCES scoring algorithm reproduces central reading in three independent clinical trial cohorts, with agreement comparable to that observed between central and local readers. The evaluation was performed on data from trials different from those used to develop the algorithm. This is the first study to evaluate AI-based automated MCES scoring in UC on a cohort of this size.

1. Stafford et al. Inflamm. Bowel Dis. 28.10: 1573-1583.
2. Peyrin-Biroulet et al. Lancet Gastroenterol. Hepatol. 7.2 (2022): 128-140.
3. Vermeire et al. Lancet Gastroenterol. Hepatol. 7.1 (2022): 28-37.
4. Gutierrez Becker et al. Ther. Adv. Gastrointest. Endosc. 14 (2021).
5. Rubin et al. Lancet Gastroenterol. Hepatol. 7.1 (2022): 17-27.
6. Danese et al. Lancet Gastroenterol. Hepatol. 7.2 (2022): 118-127.