Abstract
One of the aims of Systems Biology is to propose mathematical models that best capture the dynamic behavior of intracellular processes. In this regard, the last two decades have brought about a shift in the field: technological advances now give researchers access to a wide range of high-throughput technologies at an affordable cost. These techniques, such as genome-wide transcriptomics and proteomics, make it possible to interrogate thousands of variables simultaneously. However, in parallel with these technological advances, there is a growing need for mathematical models suited to integrating measurements obtained from different cellular processes.
In this thesis we aim to model combinations of three commonly used types of high-throughput data: epigenetic (namely ATAC-seq and DNA methylation), transcriptomic (RNA-seq) and proteomic (mass spectrometry) data. In the first work we analyze paired ATAC-seq and RNA-seq data to integrate measurements of (i) chromatin openness, (ii) transcription factor (TF) availability and (iii) gene expression. To model these data, we use elementary causal motifs, a class of mathematical models suited to representing causal interactions between three nodes. Our analysis shows that the elementary causal motifs identified in the data are enriched for biologically relevant TF-gene interactions. Moreover, a significant overlap is observed between the causal motifs identified in datasets representing similar cell stimuli, suggesting that causal motifs represent a robust biological signal.
This work is then extended to include another class of high-throughput data: mass spectrometry. More precisely, we propose a framework to model the flow of events that goes from chromatin remodeling to splice variant expression, and from splice variants to protein synthesis. As the underlying graph is more complex than in the previous case, a more general mathematical framework is considered: Bayesian networks. Interestingly, this work shows that most of the putative associations between chromatin regions, splice variants and proteins gathered by the scientific community so far are supported by the data. Moreover, as in the previous work, the causal interactions identified in the data highlight relevant biological features: causal chains between chromatin regions, splice variants and proteins are enriched for splice variants that have a major role in protein synthesis.
From a technical point of view, causal motifs are characterized by a property known as conditional independence, which can be used to identify causal interactions in the data. However, conditional independencies are challenging to assess, particularly when the available data are limited. It is therefore of interest to investigate properties that allow us to predict conditional independence. In particular, we propose two properties, structural balance and inverse balance, which are closely connected to what is known in the literature as positive association and multivariate total positivity of order 2 (MTP2), respectively. Our analysis shows that both heuristics are useful in predicting conditional independence, both from a theoretical perspective and in experimental data.
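As an illustration of the two notions mentioned in this paragraph, the following minimal sketch simulates a three-node causal chain X -> Y -> Z and checks (i) that X and Z are marginally correlated but have near-zero partial correlation given Y, and (ii) the standard Gaussian characterization of MTP2, namely that all off-diagonal entries of the precision (inverse covariance) matrix are non-positive. This is not the method used in the thesis: the Gaussian assumption, the edge weights, the sample size and the helper names (`partial_correlations`, `chain_covariance`, `is_gaussian_mtp2`) are all assumptions made for the example, and zero partial correlation is equivalent to conditional independence only under that Gaussian assumption.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation of each pair of columns given all remaining columns."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr


def chain_covariance(a, b):
    """Population covariance of the chain x -> y -> z with unit-variance noise.

    Hypothetical linear model: y = a*x + e2, z = b*y + e3.
    """
    lower = np.array([[1.0, 0.0, 0.0],
                      [-a, 1.0, 0.0],
                      [0.0, -b, 1.0]])
    precision = lower.T @ lower
    return np.linalg.inv(precision)


def is_gaussian_mtp2(covariance, tol=1e-8):
    """Gaussian MTP2 check: every off-diagonal precision entry is <= 0 (up to tol)."""
    precision = np.linalg.inv(covariance)
    off_diagonal = precision[~np.eye(precision.shape[0], dtype=bool)]
    return bool(np.all(off_diagonal <= tol))


# Simulate the chain with illustrative (assumed) edge weights of 0.8.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
data = np.column_stack([x, y, z])

marginal_xz = np.corrcoef(x, z)[0, 1]           # clearly non-zero
partial_xz = partial_correlations(data)[0, 2]   # close to zero: X indep. of Z given Y

mtp2_positive_chain = is_gaussian_mtp2(chain_covariance(0.8, 0.8))
mtp2_mixed_sign_chain = is_gaussian_mtp2(chain_covariance(-0.8, 0.8))

print(f"marginal corr(X, Z)   = {marginal_xz:.3f}")
print(f"partial corr(X, Z|Y)  = {partial_xz:.3f}")
print(f"MTP2, positive chain  : {mtp2_positive_chain}")
print(f"MTP2, mixed-sign chain: {mtp2_mixed_sign_chain}")
```

In this toy setting the chain with only positive edge weights satisfies the sign condition, while flipping the sign of one edge violates it, which is the kind of sign-based regularity that balance-type heuristics exploit.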
Lastly, a network-based approach is used to integrate DNA methylation and RNA-seq data in a case-control study centered on multiple sclerosis, in order to identify common regulatory patterns in DNA methylation and gene expression during the course of pregnancy. The strategy is based on the rationale that proteins that are interconnected in the protein-protein interaction network are more likely to be involved in similar cellular functions. Indeed, the analysis highlights that similar pathways are altered at the epigenetic and transcriptomic levels, leading to a set of genes that are likely involved in the modification of disease symptoms observed during pregnancy.