P027 Machine learning approaches to identify prognosis indicators from microbiome data
M. Madgwick*1,2, P. Sudhakar1,2,3, T. Korcsmáros1,2
1Earlham Institute, Norwich, UK, 2Quadram Institute, Norwich, UK, 3KU Leuven Department of Chronic Diseases, Leuven, Belgium
Inflammatory bowel disease (IBD) is associated with alterations in the intestinal microbiome. However, the precise nature of these microbial changes remains unclear. Given the vast number of microbes present in the gut, novel and powerful computational techniques are required to distinguish important microbial changes from noise. Machine learning (ML) offers a data-driven approach to identifying these discrete, dynamic changes within the microbiome, while systems biology (SB) provides mechanistic context for the findings of the ML algorithms. By combining ML and SB approaches, we aim to characterise key microbial factors in ulcerative colitis (UC) pathogenesis.
Interpreting the functional and mechanistic importance of microbiome features requires higher resolution than 16S rRNA sequencing provides. However, the lack of whole-genome shotgun (WGS) data at the scale required for ML-based classification is a bottleneck. To overcome this and to develop the ML pipeline, we generated a large artificial patient cohort by using the SMOTE algorithm to oversample a small UC WGS cohort. The artificial dataset was created so as to preserve the complexity and distribution functions observed in real WGS datasets. This yielded enough samples to train a deep learning model. We used artificial neural networks (ANNs) to extract discrete underlying data structures from the microbiome data, thus eliminating noise from the feature space. Dynamic changes within a patient's microbiome were predicted with a heterogeneous ensemble (Random Forest, Gradient Boosting, etc.) to match the complexity of the underlying relations in the microbiome.
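The oversampling step described above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random point on the line segment between a real sample and one of its k nearest neighbours. This is not the pipeline used in this work. The function name `smote`, its parameters, and the three-taxon abundance values are hypothetical and chosen purely for illustration, and the sketch ignores class labels, whereas SMOTE is usually applied per class.

```python
import random
from math import dist


def smote(samples, n_new, k=3, seed=0):
    """Return n_new synthetic samples, each interpolated between a randomly
    chosen real sample and one of its k nearest neighbours (Euclidean)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        # Rank every other sample by its distance to the chosen base sample.
        others = [s for j, s in enumerate(samples) if j != i]
        neighbours = sorted(others, key=lambda s: dist(base, s))[:k]
        neighbour = rng.choice(neighbours)
        # Random position along the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical relative abundances of three taxa in five patients
# (illustrative values only, not data from this study).
uc_cohort = [
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.20, 0.50, 0.30],
    [0.25, 0.45, 0.30],
    [0.40, 0.40, 0.20],
]

artificial = smote(uc_cohort, n_new=100)
```

Because every synthetic sample is a convex combination of two real samples, each feature stays within the range observed in the real cohort, and relative abundance profiles that sum to one continue to do so up to floating-point error. In practice an established implementation, such as the one in the imbalanced-learn package, would likely be preferable to a hand-written routine.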
Using our ANN to encode the data, we identified candidate prognosis indicators from this artificial dataset. The ML pipeline recovered the top-performing features from the synthetic dataset, indicating that it captures the underlying structure of the dataset. As a next step, we have collected and interrogated publicly available microbiome data (NIH Integrative Human Microbiome Project) so that the ML model can be applied to actual UC cohorts.
We have developed an integrated ML-based microbiome pipeline to identify prognostic indicators for UC from artificial data. Furthermore, using SB approaches, we were able to interpret the predicted key microbial features and communities by inferring connections between microbial and host proteins relevant to UC. This pipeline will enable us to analyse and assess real UC patient microbiome data and to identify prognostic indicators for disease subtypes and personalised treatments.