2.3.1 Graph theory
Complex dynamical systems are formed by many interacting components that generate the so called emergent properties, which exceed superposition of individual components.
This is very clear in biology where the behaviour of cells (and much more complex systems) depends on the interaction of thousands of molecular elements.
The analysis of these complex networks has become one of the main tools in molecular systems biology.
Network theory (and its core of graph theory) allows the translation of the biological reality of the cell complexity into a mathematical description that can be modelled and analysed.
This section highlights the basic concepts of graph theory and the main areas in network analysis.
Nodes and edges
The word networks refers to an informal representation of a group of elements that are connected to each other.
The mathematical entity normally used to analyse networks is the graph.
A graph is a mathematical abstraction that has nodes and edges, representing elements and interactions respectively.
A graph G = (N,E) is composed of a group of nodes or vertices N and a group of edges E.
Each edge is assigned to two nodes.
An edge e that connects the nodes a and b is denoted by a, b.
Nodes a and b are then adjacent and the edge e is incident to them.
The number of edges incident to a node is called degree.
A edge connecting a node with itself (a, a) is called loop.
A sequence (n1, e1, n2, ..., ek, nk) of connected and consecutive nodes and edges is called a walk or path, although some authors reserve the term path for a walk without repeating nodes or edges.
A path that starts and ends in the same node is usually called cycle.
Two nodes are connected if there is a path between them.
The length of a path is usually given by its number of edges.
The shortest path between two nodes is the path with minimal length.
The distance between two nodes is the length of the shortest path between them.
For instance, in Figure 2.11a P1 = ((1, 2), (2, 3), (3, 4)) is a path from node 1 to node 4 of distance 3.
However, 29 the shortest path between those two nodes is P2 = ((1, 3), (3, 4)), which has a distance of 2.
Additionally, the path P3 = ((3, 2), (2, 1), (1, 3)) is an example of cycle.
The simplest way to measure density of nodes in a network or in a specific area of it is the clustering coefficient (Watts and Strogatz 1998).
This parameter shows how tightly a node is linked to its neighbours.
For each node, this coefficient is usually defined as the proportion of links between the nodes within its neighbourhood divided by the number of links that could actually exist between them.
It gives an idea of how probable is that if node A is connected to node B and B is connected to node C, then A is also connected to C.
Figure 2.11. Undirected (a), directed (b) and directed weighted (c) variants of a graph.
If every node in the graph is connected then the graph is called also connected.
The largest connected component is the largest subgraph of the original graph that is connected.
In Figure 2.11a, for instance, the node set N1 = (1, 2, 3, 4) is the biggest connected component (it is larger than N2 = (5, 6, 7)).
Graphs can be undirected, if the nodes are linked by edges without any directional information or directed.
In directed graphs, the out-degree of a node is defined as the number of edges going out of it while the in-degree is the number of edges coming in.
The degree of a node is the sum of both.
In addition, in weighted graphs, each edge carries a value that represents the strength or flow of such interaction.
Figure 2.11 shows three graphs with the same layout but with increasing information: Figure 2.11a shows an undirected graph, Figure 2.11b displays a directed graph and finally Figure 2.11c illustrates a directed weighted graph.
It is possible to have undirected weighted graphs but they are less common, since a magnitude such as flow or cost usually carries also a particular direction.
Visual and mathematical representations
Displaying graphs to truthfully represent real connected systems is a tough task.
Real phenomena can be very complex and therefore their representation and mathematical description is not straight forward.
In molecular biology many interactions occur among more than two elements, for instance the substrates and products of a metabolic reaction or the formation of a protein complex.
Hypergraphs (Klamt, Haus, and Theis 2009; Zhou and Nakhleh 2011) are used to represent such interactions.
Hypergraphs are defined as normal graphs in which edges link together more than two nodes.
As it was commented in Section 2.2.2 this approach displays the system in a simpler and more intuitive way but has a mayor disadvantage.
Hypergraphs are not commonly used in graph theory and most of the methods and algorithms developed can not be used on them.
The classic way to dodge this issue is using bipartite graphs.
Figure 2.12 shows the difference between a hypergraph and a bipartite graph for a metabolic reaction (figures built using the Escher visualization tool from King et al. 2015b).
Precise and truthful network representations is one of the core issues dealt with in Chapter 5.
Figure 2.12. Phosphofructokinase reaction that catalyses the conversion of D-fructose-6- phosphate to D-fructose-1,6-biphosphate.(a) shows the hypergraph and (b) the bipartite graph.
Network properties
Graph theory has developed a series of parameters to better describe and understand a network’s structure.
The most simple parameters are the distribution or mean across the network of the node parameters already defined: degree distribution, shortest path and clustering coefficient.
The degree distribution gives the probability that a node has an exact number of nodes.
This representation is one of the most common ways of classifying networks.
The degree distribution of most biological networks approximates a power law (with an exponent usually between two and three).
These are commonly called scale-free networks (Barabási and Albert 1999).
In scale-free networks a few well connected nodes called hubs keep bind together most of the network generating an low average shortest path.
These hubs are a common feature in every biochemical network: very connected metabolites such as ATP or pyruvate, transcription factors influencing the expression of many genes or proteins interacting with many other proteins.
Coupled with a scale-free behaviour, many biochemical networks exhibit a high clustering coefficient (Wagner 2001; Wagner and Fell 2001; Wuchty 2001).
The clustering coefficient of a network is the average of the coefficients of every node.
Both characteristics together (scale-free behaviour and high clustering coefficient) define the "small world" effect (originally proposed in Watts and Strogatz 1998).
Typically, in a small world network the shortest path between two random nodes grows proportionally to the logarithm of the number the nodes in the network.
It is important to mention that for stating that some network parameter is high or low, a network benchmark is needed.
Traditionally, the Erdos-Rényi model for random networks (Erdos 1959) is used to compare the structure and features of other networks.
Assortativity is another frequent property in network description and analysis.
It refers to the preference of the network’s nodes to link to other that are similar to them in some way.
Node degree is normally used to define that similarity among nodes.
In general terms a network is said to be assortative if hubs tend to attach to other hubs more than to normal nodes.
The opposite is true for dissortative networks.
Most biological networks seem to show a dissortative behaviour, which may be related to how the network was formed and evolved.
Finally, centrality measures how "important" a node is in the network structure.
Importance here refers to the cohesiveness of the network and its flow structure.
Two main centrality measures exist: closeness and betweenness centrality.
All these properties are studied in detail for a protein-protein interaction network in Chapter 3.
Functional modules
It is frequent that networks have communities of highly interconnected nodes that are less connected to nodes in other communities.
This modular structure has been reported in all kinds of networks, being biochemical networks no exception (Guimera and Nunes Amaral 2005; Hartwell et al. 1999; Holme, Huss, and Jeong 2003; Papin, Reed, and Palsson 2004; Ravasz et al. 2002).
It is broadly accepted that the modular structure of complex networks plays a essential role in their functionality, just as protein structure plays a critical role in its function.
Consequently there is a need to develop methods that are able to search and identify those modules.
This problem takes many names in the literature from module identification to community detection.
The idea of functional modules in biological networks appears in many shapes and forms across recent bibliography (Girvan and Newman 2002; Hartwell et al. 1999).
It is also given a lot of attention in this thesis.
The core premise is simple: biological networks are not structure uniformly, they have sections that are densely connected with themselves but sparsely connected with the rest of the network (see Figure 2.13).
This core idea has been extensively studied in several biomolecular networks, using different data and following different methodologies.
Mering et al. 2003 used genomic data to cluster metabolic enzymes and then compared them with metabolic pathways.
Spirin and Mirny 2003 determined the multibody structure of protein-protein interaction networks from yeast proteomic data.
In the seminal paper Ideker et al. 2002, authors integrated protein-protein and protein-DNA interaction networks with gene expression data to find regulatory pathways.
Guimera and Nunes Amaral 2005 used modularity to find functional modules in the metabolic networks of twelve organisms.
In Dittrich et al. 2008, authors used an heuristic approach to calculate maximum scoring network regions in protein-protein interaction networks.
Figure 2.13. Community detection process in a high density network (edited from Guimera and Nunes Amaral 2005). (a) Original representation. (b) Communities start to appear. (c) Identified communities.
Many modularity-based methods (Newman 2006; Newman and Girvan 2004) have been successfully applied to identify functional modules or communities in protein interaction networks by searching for high density connectivity subnetworks.
However, it has been recently recently shown that optimizing modularity can over- or under partition the network, failing to find the most natural community structure (Fortunato and Barthélemy 2007).
To overcome this issue, a wave of approaches based on the modularity metric was developed including ad hoc parameters that can be tuned to focus on specific-size modules or communities (Lancichinetti, Fortunato, and Kertesz 2009; Reichardt and Bornholdt 2006).
On the other hand, community detection methods based on Markov random walks on graphs started to be developed and successfully applied to the problem at hand (Satuluri, Parthasarathy, and Ucar 2010; Voevodski, Teng, and Xia 2009).
Combining these two trends (resolution capability to find which communities are relevant in each time scale and application of Markov random walks of graphs) the Markov Stability methodology was developed (Delvenne, Yaliraki, and Barahona 2010; Delvenne et al. 2013).
The stability of a partition (defining partition has a particular network structured in different communities) measures its quality as a community structure based on the clustered autocovariance of a dynamic Markov process taking place on the network.
This methodology is explained in more detail in the next subsection and applied to the metabolic network of E. coli in Chapter 5.
Another branch of research that deals with community detection, in this case restricted to metabolic network, is known as elementary modes, extreme pathways and minimal generators (Ferreira et al.2011; Llaneras and Picó 2010).
They are not used in this thesis but it is worth it to mention them because they have gathered some attention these last few years.
The three groups aim to identify the relevant network-based pathways in a metabolic network.
Essentially, there are two properties that these groups of pathways (which is just another name for communities of reactions) can hold: they can generate the flux space or they can comprise all the nondecomposable pathways in the network.
Another tool for identifying functional modules is used in this thesis in conjunction with graph theory: multivariate statistics.
Since these techniques come from a slightly different mathematical background, they are exposed in another section (Section 2.3.3).
The exact mathematical formulation for Multivariate Statistics techniques is shown in Chapter 3.
Community detection based on Markov Stability
The communities were extracted in each network using the Markov Stability (MS) community detection framework (Delvenne, Yaliraki, and Barahona 2010; Delvenne et al. 2013).
This framework uses diffusion processes on the network to find groups of nodes (i.e., communities or functional modules) that retain flows for longer than one would expect on a comparable random network; in addition, MS incorporates directed flows seamlessly into the analysis (Beguerisse-Díaz et al. 2014; Lambiotte, Delvenne, and Barahona 2014).
The inclusion of direction in the flows is vital for an accurate representation of metabolism.
The diffusion process used is a continuous-time Markov process on the network.
From the adjacency matrix A a rate matrix is constructed for the process: M = K−1 outA, where Kout is the diagonal matrix of out-strengths, kout,i = P j ai,j .
When a node has no outgoing edges then simply kout,i = 1.
In general, a directed network will not be strongly-connected and thus a Markov process on M will not have a unique steady state.
To ensure the uniqueness of the steady state a teleportation component must be added to the dynamics by which a random walker visiting a
Figure 2.14. Evolution of the multiscale community structure of a simple hierarchical network (from Lambiotte, Delvenne, and Barahona 2014). The number of communities decrease with increasing Markov time. At certain time points the partition with maximum stability jumps from a higher number of communities to a lower number. For instance, until Markov time around 0.12 the stablest partition contains 16 communities (red curve) but at that point the partition with 8 communities (green line) overtakes its predecessor and becomes the stablest partition. Evaluation of the Markov Stability shows that, as time grows, the optimal partition goes from 16 communities to 8 to 4 to 2 over different time intervals. The figure shows a network with 24 = 16 nodes with edges shaded according to their strength. By symmetry, the natural partitions are into 16 single nodes, 8 pairs (colours), 4 tetrads (shapes) and 2 groups of 8 nodes (upper and lower hemispheres).
node can follow an outgoing edge with probability _ or jump (teleport) uniformly to any other node in the network with probability 1 − _ (Page et al. 1999). The rate matrix of a Markov process with teleportation is:where the N × 1 vector a is an indicator for dangling nodes: if node i has no
outgoing edges then ai = 1, and ai = 0 otherwise. Here = 0.85 is used. The
Markov process is described by the ODE:where L = IN − B. The solution of (2.2) is x(t) = e−tLT x(0) and its stationary
state (i.e., x˙ = 0) is x = , where is the leading left eigenvector of B.
A hard partition of the network into C communities can be encoded into the N×C
matrix H, where hic = 1 if node i belongs to community c and zero otherwise.
The C × C clustered autocovariance of (2.2) isThe entry (c, s) of R(t,H) is how likely it is that a random walker that started
the process in community c at finds itself in community s at time t.
Crucially,
the diagonal elements of R(t,H) show how good are the communities in H at
retaining flows. The stability of the partition is thenThe communities are found in the network by optimising (2.
4) over the space of partitions for a given time t using the Louvain greedy optimisation heuristic (Blondel et al. 2008).
The Louvain algorithm does not guarantee a globally optimal partition of the network into communities; for this reason (2.4) is optimized 100 times for each value of t.
The consistency of the resulting partition is assessed using the Variation of Information (VI) metric (Meila 2007), as described in Lambiotte, Delvenne, and Barahona 2014; Schaub et al. 2012.
A low value of the VI implies that the 100 partitions obtained from Louvain are similar; a VI of exactly zero means that all the partitions in each of the 100 optimisations are identical.
The value of the Markov time t, i.e.
the duration of the Markov process, is also a resolution parameter for the partition of the network into communities (Delvenne, Yaliraki, and Barahona 2010; Schaub et al. 2012).
In the limit t ! 0, Markov stability will assign each node to its own community; as t grows, larger communities are obtained because the random walkers have more time to explore the network.
Finally, when t ! 1 all nodes merge into a single community comprising the entire network (Delvenne et al. 2013).
A range of values of t is scanned to explore the multiscale community structure of the network.
Figure 2.15. Network analysed with edges coloured according to the community structure found in the flow-redistribution matrix using the Markov stability method (from Schaub et al. 2014). The partition into 6 communities is stable over a long span of the Markov times with vanishing variation of information, thus signalling its robustness.