2.3.3 Multivariate statistics

Multivariate statistics is a subdivision of statistics that includes the simultaneous observation and analysis of more than one outcome variable.

These techniques are used to carry out trade studies across multiple dimensions while taking into account the effects of all variables on the system's response.

Some very common types of data analysis, such as linear and multiple regression, are not multivariate analysis in this sense because they consider only the univariate conditional distribution of a single outcome variable given the other variables.

Techniques in multivariate analysis include multivariate analysis of variance (MANOVA), multivariate regression and factor analysis, among others.

The methods used in this thesis are arguably the most successful and widely used examples of multivariate analysis: Principal Component Analysis (PCA) and Partial Least-Squares regression (PLS).

These have been used extensively in systems biology in a wide range of applications, from data fusion (Folch-Fortuny et al. 2016) and pathway determination (Ferreira et al. 2011; Folch-Fortuny et al. 2015) to network decomposition (Barrett, Herrgard, and Palsson 2009) and metabolic flux analysis (González-Martínez et al. 2014).

PCA will be used in Chapter 4 for constraint-based metabolic model characterization and PLS will also be used in Chapter 4 for analysing the robustness of the conclusions and in Chapter 3 for finding functional modules in a viral PPIN.

Principal Component Analysis

PCA is probably the most widespread multivariate statistical method, used in virtually all scientific fields.

Its modern formulation comes from Hotelling (1933).

An excellent recent introductory review of the method can be found in Abdi and Williams (2010).

PCA analyses a data table representing observations described by several quantitative variables that are often inter-correlated.

Its main objective is to extract the most relevant information from the data and to display it as a set of new orthogonal variables known as principal components.

The main goals of PCA can be summarized as:

• Gathering the most relevant information from the data.

• Condensing the size of the data set.

• Simplifying the description of the data set.

• Dissecting the structure of the observations and the variables.

The data table to be studied by PCA contains I observations or individuals described by J variables and it is represented by the I x J matrix X.

This matrix X has rank L, where L ≤ min{I, J}.

There is a standard pre-processing of the data known as centering and re-scaling.

Centering means subtracting the mean of each column, so that each column is centered around 0.

Re-scaling refers to setting the variance of each variable to 1 in order to make the variables comparable (see Figure 2.20).
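A minimal sketch of this two-step pre-processing, assuming a small made-up data matrix and NumPy (neither is prescribed by the thesis), could look as follows:

```python
import numpy as np

# Hypothetical data matrix: I = 5 observations, J = 3 variables
X = np.array([[2.0, 10.0, 0.5],
              [3.0, 12.0, 0.7],
              [4.0,  9.0, 0.6],
              [5.0, 11.0, 0.9],
              [6.0, 13.0, 0.8]])

# Centering: subtract the mean of each column so every variable has mean 0
X_centered = X - X.mean(axis=0)

# Re-scaling: divide by the standard deviation of each column so every
# variable has unit variance and becomes comparable to the others
X_auto = X_centered / X_centered.std(axis=0, ddof=1)

print(X_auto.mean(axis=0))          # approximately 0 for every column
print(X_auto.std(axis=0, ddof=1))   # 1 for every column
```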

The first principal component is required to explain the largest variability possible.

The second component must be orthogonal to the first and explain the largest possible remaining variability.

Subsequent components (if necessary) are calculated in the same way.

The values of these new variables are known as factor scores and represent the projection of the observations onto the principal components.

Finding the components comes from the singular value decomposition (SVD) of the data table X:

$$X = P \Delta Q^T$$

where P is the I x L matrix of left singular vectors, Q^T is the L x J matrix of right singular vectors and Δ is the L x L diagonal matrix of singular values. In PCA the I x L matrix of factor scores, F, is defined as

$$F = P \Delta$$

Figure 2.20. Two-step pre-processing in PCA and PLS methods.

The matrix Q, known as the loading matrix, returns the coefficients of the linear combinations used to compute the factor scores.

Geometrically, the principal components can be interpreted as a rotation of the original axes.

The factor scores give the length of the projections of the observations on the new components.

The loadings are then interpreted as direction cosines of the new components from the original variables.

From this point of view, the matrix X can be expressed as a bilinear decomposition of the factor score matrix F and the loading matrix Q (Equation 2.11):

$$X = F Q^T \qquad (2.11)$$
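A minimal sketch of these computations, assuming NumPy and a randomly generated data matrix (purely illustrative, not the data used in this thesis), could be:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                        # hypothetical I x J data matrix
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # centering and re-scaling

# SVD of the pre-processed data: X = P @ np.diag(delta) @ Qt
P, delta, Qt = np.linalg.svd(X, full_matrices=False)

# Factor scores F = P Delta: projections of the observations onto the components
F = P * delta        # broadcasting multiplies each column of P by its singular value

# Loading matrix Q: coefficients of the linear combinations of the variables
Q = Qt.T

# The bilinear decomposition X = F Q^T (Equation 2.11) holds exactly
assert np.allclose(X, F @ Q.T)
```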

Figure 2.21 shows a simple example of the PCA methodology.

The data matrix X has three variables: x1, x2 and x3.

PCA found two principal components: t1 and t2.

The first component t1 is the straight line that best approximates the data (solving a least-squares problem that minimizes the distance between the data points and the line).

The direction of t1 is determined by the loading vector p1.

The new coordinate for observation i is ti1.

The second principal component t2 is another straight line, orthogonal to t1, that best approximates the remaining data.

The direction of t2 is determined by the loading vector p2.

Both principal components define a hyperplane in the space defined by the variables of X.

The projections of the observations onto this hyperplane are the score vectors for the first (t1) and second (t2) components.

Figure 2.21. Simple example of the PCA methodology.
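A hypothetical numerical counterpart of this example, assuming scikit-learn (an implementation choice not made in the thesis) and simulated data with three correlated variables, could be:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Simulated data with three correlated variables x1, x2, x3 lying close to a plane
x1 = rng.normal(size=50)
x2 = 0.8 * x1 + 0.1 * rng.normal(size=50)
x3 = -0.5 * x1 + 0.1 * rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

X_auto = StandardScaler().fit_transform(X)   # centering and unit-variance scaling

pca = PCA(n_components=2).fit(X_auto)
T = pca.transform(X_auto)      # scores: columns t1 and t2
P_load = pca.components_.T     # loadings: columns p1 and p2 (directions of t1, t2)

t_i1 = T[0, 0]                 # new coordinate of observation i = 0 on t1
print(pca.explained_variance_ratio_)   # fraction of variance explained by t1 and t2
```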

The maximum number of components is equal to L, the rank of X.

This corresponds to the number of non-zero singular values present in Δ.

These singular values are ranked from highest to lowest along the diagonal of Δ.

The squared value of each singular value (the eigenvalue) gives the variance explained by the corresponding component.

Therefore, only the first components with high singular values are actually useful.

In most cases a few components explain most of the variance of the data and accordingly there is a significant compression and clarification of the data.

It is possible to approximately reconstruct the original data matrix X from the scores t_a and the loadings p_a, with a being the number of selected components (a ≤ L), as stated in Equation 2.12:

$$X = \sum_{k=1}^{a} t_k p_k^T + E \qquad (2.12)$$

where E is the residual matrix that contains the part of each observation not explained by the principal components.

When a = L, Equations 2.11 and 2.12 are equivalent.
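A brief sketch of this truncated reconstruction, again assuming NumPy and simulated data, could be:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # pre-processing

P, delta, Qt = np.linalg.svd(X, full_matrices=False)
T = P * delta         # score vectors t_a as columns
Pload = Qt.T          # loading vectors p_a as columns

a = 2                                      # number of selected components (a <= L)
X_hat = T[:, :a] @ Pload[:, :a].T          # rank-a approximation of X
E = X - X_hat                              # residual matrix E of Equation 2.12

# When a = L the residual vanishes and Equations 2.11 and 2.12 coincide
L = len(delta)
assert np.allclose(X, T[:, :L] @ Pload[:, :L].T)
```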

In addition, several graphical representations are available to explore the results of the PCA analysis:

• T2-Hotelling and SPE

• Scores t/t

• Loadings p/p

• Scores t / observations (or experiments)

• Loadings p / variable X

They will be explained in detail when used with the experimental data in Chapter 4.
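As a preview, the first two diagnostics can be computed from their standard textbook definitions roughly as sketched below (NumPy, simulated data; the exact plots used in this thesis are those presented in Chapter 4):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

P, delta, Qt = np.linalg.svd(X, full_matrices=False)
T = P * delta
a = 3                                   # selected number of components
T_a = T[:, :a]
lam = T_a.var(axis=0, ddof=1)           # variance of each score column

# Hotelling's T2: sum of squared, variance-normalised scores per observation
T2 = np.sum(T_a**2 / lam, axis=1)

# SPE (squared prediction error): squared norm of each observation's residual
E = X - T_a @ Qt[:a, :]
SPE = np.sum(E**2, axis=1)
```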

Partial Least-Squares regression

PLS, also known as projection to latent structures, combines ideas from PCA and multiple linear regression (Abdi 2010).

Its main goal is to analyse a set of dependent variables from a set of independent variables.

This analysis or prediction is carried out by extracting from the original variables a set of orthogonal latent variables which represent the best predictive power.

Following the notation defined previously with PCA, I observations described by K dependent variables are stored in an I x K matrix called Y.

The values of the J independent variables on these I observations are collected in the I x J matrix X.

The goal of PLS is to describe or predict Y from X and analyse their common structure.

By contrast, PCA decomposes X in order to obtain components that explain X better.

PLS, however, finds components from X that best predict Y.

The method searches for a set of latent variables and carries out a simultaneous decomposition of X and Y with the constraint that these components must explain as much covariance as possible between X and Y.

PLS regression decomposes both X and Y as a product of a common set of orthogonal factors and a set of specific loadings. The independent variables are decomposed as:

$$X = T P^T$$

By analogy with PCA, T is the score matrix and P is the loading matrix.

Y is estimated as

$$\hat{Y} = T B C^T$$

where B is a diagonal matrix with the "regression weights" as diagonal elements and C is the "weight matrix" of the dependent variables.

The columns of T are the latent vectors (just as the columns in F were the scores in PCA).

When their number is equal to the rank of X, they represent an exact decomposition of X.

Additional constraints are needed to define T.

In the case of PLS this means finding two sets of weights w and c in order to create a linear combination of the columns of X and Y such that these two linear combinations have maximum covariance.

First, the pair of weight vectors is obtained:

$$t = X w \qquad u = Y c$$

always meeting the constraints that w^T w = 1, t^T t = 1 and t^T u is maximal.

When the first latent vector is found, its contribution is subtracted from both X and Y (deflation) and the process starts all over again, until X becomes a null matrix.
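A compact sketch of this iterative procedure (a NIPALS-style implementation written only for illustration; the function name and the simulated data are hypothetical) could be:

```python
import numpy as np

def pls_nipals(X, Y, n_components, tol=1e-10, max_iter=500):
    """Minimal NIPALS-style PLS sketch (illustrative, not optimised)."""
    X, Y = X.copy(), Y.copy()
    T, P, C, b = [], [], [], []
    for _ in range(n_components):
        u = Y[:, [0]]                      # initialise u with one column of Y
        for _ in range(max_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)         # constraint w^T w = 1
            t = X @ w
            t /= np.linalg.norm(t)         # constraint t^T t = 1
            c = Y.T @ t
            c /= np.linalg.norm(c)
            u_new = Y @ c
            converged = np.linalg.norm(u_new - u) < tol
            u = u_new
            if converged:
                break
        p = X.T @ t                        # X loading for this latent variable
        b_a = (t.T @ u).item()             # regression weight (diagonal of B)
        X = X - t @ p.T                    # deflate X ...
        Y = Y - b_a * (t @ c.T)            # ... and Y, then repeat
        T.append(t); P.append(p); C.append(c); b.append(b_a)
    return np.hstack(T), np.hstack(P), np.hstack(C), np.diag(b)

# Hypothetical usage: two dependent variables predicted from five predictors
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
Y = X[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(30, 2))
X, Y = X - X.mean(axis=0), Y - Y.mean(axis=0)   # centering (re-scaling optional)
T, P, C, B = pls_nipals(X, Y, n_components=2)
Y_hat = T @ B @ C.T                              # Y estimated as T B C^T
```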

In general terms the quality of the prediction does not always increase with the number of latent variables used in the approximation.

Usually, the quality increases first and then decreases.

When the quality of the prediction decreases as the number of latent variables increases, the model is overfitting the data.

Therefore it is very relevant to construct the model with the optimal number of latent variables.

The most common approach is the Q² ratio, which relates the residual sum of squares (RSS) and the predicted residual sum of squares (PRESS).
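As an illustration, a cross-validated Q² of the form 1 - PRESS/TSS (one of several closely related definitions; scikit-learn's PLSRegression is used here only as an example implementation and is not the software referenced by the thesis) could be estimated as:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 10))
Y = X[:, :3] @ rng.normal(size=(3, 1)) + 0.2 * rng.normal(size=(60, 1))

def q2(X, Y, n_components, n_splits=7):
    """Cross-validated Q2 = 1 - PRESS / TSS (one common variant)."""
    press = 0.0
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = PLSRegression(n_components=n_components).fit(X[train], Y[train])
        press += np.sum((Y[test] - model.predict(X[test])) ** 2)
    tss = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - press / tss

# The prediction quality typically rises and then falls as latent variables are added
for a in range(1, 6):
    print(a, round(q2(X, Y, a), 3))
```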

Concrete examples of this criterion and the graphic representations of PLS (which are analogous to those in PCA) are shown in Chapters 3 and 4.
