Network Analysis in Systems Epidemiology
Abstract
Traditional epidemiological studies have identified a number of risk factors for various diseases using regression-based methods that examine the association between an exposure and an outcome (i.e., one-to-one correspondences). One of the major limitations of this approach is the “black-box” aspect of the analysis, in the sense that this approach cannot fully explain complex relationships such as biological pathways. With high-throughput data in current epidemiology, comprehensive analyses are needed. The network approach can help to integrate multi-omics data, visualize their interactions or relationships, and make inferences in the context of biological mechanisms. This review aims to introduce network analysis for systems epidemiology, its procedures, and how to interpret its findings.
INTRODUCTION
Epidemiology contributes to the identification of risk factors for various diseases. However, conventional (traditional) analyses in epidemiology use regression, which examines the association between an exposure and an outcome as a one-to-one correspondence. This approach has a major limitation (referred to as the “black-box” nature of the analysis) in that it cannot fully explain complex relationships such as biological pathways [1-3].
To reveal the mechanisms previously hidden in the “black box,” a new framework has emerged: systems epidemiology. Laszlo and Krippner [4] defined a “system” as “a complex of interacting components together with the relationships among them that permit the identification of a boundary-maintaining entity or process” through systems theories. “Systems epidemiology” is a concept derived from “systems biology,” which is a holistic and integrated approach to understand complex biological processes and phenotypes, and has been defined as a new integrative approach in human studies using high-throughput multi-omics data [2,5-7]. Subsequently, Dammann et al. [8] defined systems epidemiology as “an epidemiologic approach to identify risk factors including systems-level (such as omics-level) exposure measurements at multiple levels, for instance, socio-demographic, clinical, or biological levels via network analyses of interrelationships among risk factors and computational simulation of risk scenarios in parallel to data-driven biostatistical risk modeling”.
As omics techniques have been developed, high-throughput data have become available for current epidemiological studies. Various levels of omics data include genomics, transcriptomics, proteomics, metabolomics, and microbiome data [6]. These data types have tens to hundreds of thousands of variables. However, until recently, many studies have performed simple regression-based analyses, for example by testing each variable in genome-wide or metabolomics data separately and then correcting for multiple comparisons. It is well known that disease does not occur independently as a result of a single factor. To conduct a comprehensive analysis in terms of systems epidemiology, an alternative approach is needed. The network approach could help to integrate multi-omics data, visualize their interactions or relationships, and make inferences in the context of biological mechanisms [9,10].
NETWORK STRUCTURE, VISUALIZATION, AND ANALYSIS
A network is a structural and graphical form consisting of nodes that indicate variables and edges that represent the relationships between the variables. Nodes are also referred to as vertices, and edges are also called links [11]. Edges can connote various statistical estimates such as correlation coefficients and they can show their directionality (positive or negative) as well as their magnitudes. The network structure depends on the edge threshold, such as the p-value or coefficient values, and the interpretation can also vary since nodes can only appear in the network when they are connected to an edge. Network analysis is possible only after network construction is completed.
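As a minimal sketch of how the edge threshold shapes the network, the following Python example builds a node and edge list from a small correlation matrix. The variable names, the coefficient values, and the function name build_network are illustrative assumptions, not data or code from any cited study.

```python
# Minimal sketch: build a network (nodes + edges) from a correlation matrix.
# All names and coefficient values below are made up for illustration.

def build_network(names, corr, min_abs_r=0.5):
    """Keep an edge when |r| >= min_abs_r; keep a node only if it has an edge."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = corr[i][j]
            if abs(r) >= min_abs_r:
                edges.append((names[i], names[j], {
                    "direction": "positive" if r > 0 else "negative",
                    "magnitude": abs(r),
                }))
    # Nodes appear in the network only when they are connected to an edge.
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    return nodes, edges

names = ["glucose", "lactate", "alanine", "urea"]
corr = [
    [1.00, 0.72, 0.30, 0.10],
    [0.72, 1.00, -0.61, 0.05],
    [0.30, -0.61, 1.00, 0.20],
    [0.10, 0.05, 0.20, 1.00],
]

nodes_loose, edges_loose = build_network(names, corr, min_abs_r=0.5)
nodes_strict, edges_strict = build_network(names, corr, min_abs_r=0.7)
print(nodes_loose, len(edges_loose))
print(nodes_strict, len(edges_strict))
```

In this toy example, the stricter threshold removes the lactate and alanine edge, so alanine also disappears from the network, while urea is absent at both thresholds because it has no edge.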
Correlation coefficients are commonly used to represent relationships between variables [12]. Depending on whether the data meet the assumptions of parametric methods (e.g., normality), researchers can choose Pearson, Spearman, or Kendall correlation coefficients. However, these methods do not adjust for confounding effects from the other variables, so spurious edges might appear. The partial correlation method can provide a coefficient in which the effects of other variables are controlled [11,13]. Thus, partial correlations are recommended as a method that reduces spurious edges and can suggest plausible, although not confirmed, causal relationships.
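The difference between a simple and a partial correlation can be illustrated with three variables. The sketch below uses the standard first-order partial correlation formula, controlling for a single variable z; the data are made up so that x and y are each driven mainly by z, and the function names are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Illustrative data: x and y both track z, with small unrelated deviations.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5]
y = [1.5, 2.5, 2.5, 3.5, 5.5, 6.5, 6.5, 7.5]

raw = pearson(x, y)
adjusted = partial_corr(x, y, z)
print(raw, adjusted)
```

Here the simple correlation between x and y is strong, but it shrinks towards zero once z is controlled, which is the situation in which a correlation-based network would show a spurious edge.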
Occasionally, a network has a tremendous number of nodes and edges. The more variables are displayed in a network, the more information will be obtained, but highly complex networks can be difficult to visualize and interpret. Thus, stricter thresholds can help in some cases. A stricter threshold can be set for the p-value (<0.05, <0.01, or lower) or the coefficient value (>0.5, >0.7, or higher), or the least absolute shrinkage and selection operator (‘lasso’) can be applied [11,13].
Although numerous tools have been developed for network analysis [9,14], the most representative are Cytoscape (https://cytoscape.org/) and R software (https://www.r-project.org/), for which many tutorials have been published [11,15-17]. R has various packages for network analysis, for instance, corr or pcor to calculate correlation coefficients [18,19] and ggraph, igraph, qgraph, or Rgraphviz to visualize networks [20-23]. Network analysis can therefore be performed within a single platform from start to finish. In contrast, Cytoscape requires a correlation matrix, calculated in another statistical analysis tool, to be converted into an appropriate input format. A previous study [12] described the process of preparing this input format in detail. Nevertheless, the network can be handled, edited, and annotated much more easily in Cytoscape since it is a graphical-user-interface–based program [12,15].
Recently, some noteworthy web-based tools for network analysis centered on metabolomics data or metabolic pathways have been developed, such as the Metabolic network Analysis and Pathway Prediction Server (MAPPS) and the integrated Metabolomics Analysis Platform (iMAP). MAPPS provides various analytical resources including pathway prediction based on public databases, metabolic reachability, metabolite-specific reactions, network building and comparison [24]. iMAP also provides functions for network construction, visualization, and analysis with a user-friendly interface [25]. Although both tools focus primarily on metabolite data, these tools allow users to analyze omics data with additional transcriptomics or proteomics data sets. However, the papers presenting those tools still describe the use of Cytoscape for topological analysis or more personalized modifications [24,25].
INTERPRETATION OF NETWORK ANALYSIS
Once network visualization is complete, basic inferences are possible based on the graphical structure and relationships between the variables. Although some clusters with relatively many nodes gathered together or overall structural characteristics (e.g., density or sparseness) can be observed, these do not provide an in-depth interpretation of the relationships between variables.
Interpretation of the network is possible through various parameters that can be obtained by network analysis. Two representative parameters are degree and betweenness, which have been defined in greater depth elsewhere [11]. In brief, the degree is defined as the number of edges connected to a node. This parameter therefore denotes the centrality of a node and its level of involvement in the network. The nodes with the highest degrees can be interpreted as “hub” nodes that play central roles in the relationships being analyzed. Betweenness is defined as the number of shortest paths between other pairs of nodes that pass through a given node, and it quantifies the importance of a node as a connector. A node that lies on many shortest paths between other nodes acts as a bridge through which different parts of the network are linked. Thus, a higher betweenness value indicates that a node plays a key role in the network. The node with the highest degree does not always have the highest betweenness. Therefore, researchers usually sort by degree and then find the highest betweenness or vice versa [12].
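To make the distinction between degree and betweenness concrete, the following sketch computes both for a small hypothetical network in which two tightly connected clusters are joined through a single bridge node m. Betweenness is computed with Brandes' algorithm for unweighted, undirected graphs; the node labels and the graph itself are assumptions for illustration, and dedicated tools such as NetworkAnalyzer or igraph may report normalized values instead.

```python
from collections import deque

def degree(adj):
    """Number of edges connected to each node."""
    return {v: len(neighbors) for v, neighbors in adj.items()}

def betweenness(adj):
    """Unnormalized betweenness (Brandes' algorithm, unweighted, undirected)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Two fully connected clusters {a,b,c,d} and {e,f,g,h} joined via bridge m.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("e", "f"), ("e", "g"), ("e", "h"), ("f", "g"), ("f", "h"), ("g", "h"),
         ("d", "m"), ("m", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

deg = degree(adj)
btw = betweenness(adj)
print(deg["d"], btw["d"])
print(deg["m"], btw["m"])
```

In this toy network, nodes d and e have the highest degree, whereas the bridge node m has the highest betweenness despite having only two edges.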
When differences are examined between 2 or more groups, networks can be constructed for each group and then compared (Figure 1). The first method uses Cytoscape, which provides a topological comparison. After calculating the correlation matrix in each group to be compared, each network should be constructed in Cytoscape (Figure 1A and B). Then, using “merge” in “Tools,” 2 networks can be combined into 1 network using various options. When users select the “difference” option, a new network is created after excluding the edges and nodes that overlap between the 2 networks. By designating each network in turn as the reference network, a network containing only the edges and nodes unique to that group is obtained (Figure 1C and D). In this way, a structural interpretation is possible based on the unique relationships in each network [12].
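The logic of this topological comparison can be sketched as a set difference on edges. The example below is a simplification written in Python, not the Cytoscape implementation; the function names and the metabolite edges are hypothetical.

```python
def unique_edges(reference, other):
    """Edges present in the reference network but absent from the other one.

    Edges are undirected, so (u, v) and (v, u) are treated as the same edge.
    """
    ref = {frozenset(e) for e in reference}
    oth = {frozenset(e) for e in other}
    return ref - oth

def nodes_of(edges):
    """Nodes that remain connected to at least one edge."""
    return set().union(*edges)

# Hypothetical edge lists for two groups (illustrative only).
normal = [("lactate", "alanine"), ("alanine", "glycine")]
hypoxia = [("alanine", "lactate"), ("lactate", "GABA"), ("GABA", "creatinine")]

only_normal = unique_edges(normal, hypoxia)
only_hypoxia = unique_edges(hypoxia, normal)
print(sorted(nodes_of(only_normal)))
print(sorted(nodes_of(only_hypoxia)))
```

The shared lactate and alanine edge is excluded from both results, so each output contains only the relationships unique to one group.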
Another method involves analyzing the statistical difference of correlation coefficients (i.e., differential correlations). Fukushima [26] introduced the method of calculating differential correlations using the DiffCorr package in R. In brief, differential correlation coefficients and p-values can be calculated based on the Fisher z-test between 2 correlation matrices after Fisher transformation of the coefficients. Users can proceed with visualizing the network in R, or they can import this data frame to Cytoscape to construct a network that presents the differential correlations between 2 groups. This network can be interpreted as indicating links with significantly different relationships between groups (Figure 2).
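For a single pair of variables, the Fisher z-test described above can be sketched as follows. This is a minimal standard-library Python illustration of the test itself, not the DiffCorr implementation; the function name and the example coefficients and sample sizes are assumptions.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference between two correlations.

    r1, r2: correlation coefficients of the same variable pair in 2 groups
    n1, n2: sample sizes of the 2 groups (each must exceed 3)
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))             # two-sided normal p-value
    return z, p

# Illustrative values: a strong correlation in group 1, a weak one in group 2.
z_stat, p_value = fisher_z_test(0.8, 50, 0.2, 50)
print(z_stat, p_value)
```

Applying this test to every pair of variables, followed by correction for multiple comparisons, yields the edge list for a differential correlation network.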
NETWORK ANALYSIS EXAMPLES
The methods described above have been practically applied in several studies, and diverse interpretations have been made according to the network visualization method. For example, Batushansky et al. [12] aimed to reveal metabolic differences between normal conditions and hypoxic conditions in breast carcinoma cell lines. Two networks based on the correlation coefficients of metabolites were constructed, and the “merge” tool in Cytoscape was used to make unique networks under normal conditions and hypoxic conditions. The authors suggested potentially important metabolites in each network via parameters such as degree and betweenness that were obtained from NetworkAnalyzer in Cytoscape. Because there were more unique edges in the unique network of hypoxic conditions than in that of normal conditions, the authors interpreted hypoxic conditions as involving more metabolic paths in cell metabolism. They also proposed that the higher degree and betweenness of lactate, gamma-aminobutyric acid, alanine, and creatinine in each network offered a possible explanation for the different mechanisms [12].
The difference in the networks between 2 groups can also be examined by differential correlations as a statistical approach [26], whereas the “merge” tool in Cytoscape provides a topological comparison as described above. Using the differential correlation method, Li et al. [27] demonstrated differences in metabolite networks between men and women, Wang et al. [28] revealed differences in metabolite networks between age groups (<50 and ≥50 years old) in men and women, and Costello et al. [29] found novel metabolites that showed differences between 2 phenotypes regarding joint replacement. These studies colored edges according to whether the differential correlations were positive or negative. In these cases, the original relationships between variables (negative or positive correlations) cannot be read from the network. An alternative would be to use both colors (red and blue) and line styles (solid and dotted) to reflect not only the original relationships between nodes (e.g., the direction of the correlation coefficients), but also which group had the stronger relationship (e.g., the direction of the differential correlation). The magnitude of the coefficients can also be shown as the width of edges.
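One possible encoding of these edge attributes is sketched below in Python. The specific mapping (color for the sign of the correlation in the group with the stronger relationship, line style for which group is stronger, and width for the size of the difference) is an assumption chosen for illustration, as is the function name edge_style; other mappings would be equally valid.

```python
def edge_style(r_group1, r_group2):
    """Map a pair of group-specific correlations to hypothetical edge attributes."""
    group1_stronger = abs(r_group1) >= abs(r_group2)
    stronger_r = r_group1 if group1_stronger else r_group2
    return {
        # Color: direction of the original correlation in the stronger group.
        "color": "red" if stronger_r > 0 else "blue",
        # Line style: which group had the stronger relationship.
        "line": "solid" if group1_stronger else "dotted",
        # Width: magnitude of the difference between the 2 coefficients.
        "width": abs(r_group1 - r_group2),
    }

style_a = edge_style(0.8, 0.2)     # positive, stronger in group 1
style_b = edge_style(-0.1, -0.7)   # negative, stronger in group 2
print(style_a)
print(style_b)
```

The resulting attributes could then be attached to the edges of a network in R or imported as edge columns into Cytoscape.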
Multiple networks according to thresholds such as p-values or coefficients can be constructed to zoom in and focus on stronger relationships between variables in the network. Huang et al. [30] constructed networks based on differential correlations between type 2 diabetes patients and a control group using 27 biomarkers related to type 2 diabetes. By comparing 3 different networks according to the thresholds (the magnitude of coefficients and p-value), the authors found that leptin was strongly linked to adiponectin and insulin-like growth factor binding protein 2, and that leptin played a central role in diabetes development.
The network conveys diverse information, mainly through edge attributes such as color, shape, and width as described above. However, nodes can also represent statistical estimates through their color or shape. Floegel et al. [31] constructed a network that showed relationships between metabolites. The authors colored the nodes to indicate associations between metabolites and various lifestyle factors, including diet, physical activity, and obesity. With this approach, the network contains information not only about relationships among nodes, but also about associations between nodes and other factors such as exposures or outcomes of interest.
CONCLUSION
Taken together, network analysis is advantageous in that it can show the relationships among multiple variables in an integrated approach. In addition, clusters composed of variables can be identified through the visual structure, and variables that play an important role in the network can be found from parameters obtained through network analysis. Through this process, a potential mechanism can be suggested, and, therefore, further research (or experiments) focusing on a specific factor or pathway can be proposed.
However, there are also points meriting caution in network analysis. Depending on the data transformation method and the edge presentation method (correlations, partial correlations, or differential correlations), the structure of the final network will be different, which can lead to a loss of information and thus misinterpretation. Moreover, the network can be used to propose a potential mechanism, but not to establish it.
To date, most studies using network analysis have been conducted at a single layer (e.g., metabolomics, genomics, or blood biomarkers). This tendency may be due to the difficulties in finding ideal statistical methods to merge omics data, which have different properties in terms of normality or scale. Nevertheless, attempts should be made to integrate and analyze multi-omics data through the development of suitable statistical methods in order for systems epidemiology to reach its considerable potential.
Ethics Statement
This paper is a special article based on literature review, so it did not need ethical approval.
Notes
CONFLICT OF INTEREST
The authors have no conflicts of interest associated with the material presented in this paper.
FUNDING
This study was supported by a National Research Foundation of Korea grant funded by the Korean government (NRF2018R1A2A3075397) and by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI19C1178).
ACKNOWLEDGEMENTS
None.
AUTHOR CONTRIBUTIONS
Conceptualization: JYP, JC, JYC. Funding acquisition: JYC. Writing – original draft: JYP. Writing – review & editing: JC, JYC.