Dimensionality Reduction Methods for Molecular Motion

Topics in this Module

The study of many biological processes at the molecular level involves understanding how biological molecules (especially proteins) behave dynamically. The three-dimensional shape of these molecules, which we call a conformation, usually determines the chemical action they perform. Both the stable (also called native) shape of a biomolecule and dynamical deviations from it are important to understand how it interacts with other molecules such as pharmaceutical drugs or other complexes. For these reasons, understanding the main shapes and motions of these molecules is of utmost importance.

Current structural biology experimental methods are restricted in the amount of information they can provide regarding protein motions because they were designed mainly to determine the three-dimensional static representation of a molecule. For this reason, in silico methods (i.e., run in a computer) are used to extensively sample protein conformations. As a summary, the most popular methods to gather protein conformations are:

- X-Ray Crystallography. The most established and accurate method of determining the three-dimensional structure of a protein is X-ray crystallography. This technique is based on the collection of diffraction data generated by exposing a protein crystal to an X-ray beam. The main limitation of this experimental technique is that it is necessary to obtain protein crystals in order to collect experimental data. Unfortunately, creating protein crystals is a very lengthy and laborious process which is not always successful.
- Nuclear Magnetic Resonance (NMR). The second most common method of determining the structure of a protein is NMR {Wuthrich, 1986 #184}. This method uses a spectroscopy approach to collect the experimental data necessary for structure determination. This method is in general not as accurate as X-ray crystallography and its use is limited to small and medium-sized proteins. However, it provides useful information about protein dynamics directly and avoids some of the problems of X-ray crystallography such as protein crystallization.
- In silico sampling. An alternative to using experimental methods to derive structural data is using computational methods such as Molecular Dynamics (MD) or Monte Carlo (MC) simulations, or other forms of computational sampling. In fact, computational methods are used to augment existing experimental data since MD simulations typically start from a three-dimensional protein structure determined by X-ray crystallography or NMR. MD uses an empirical force field to approximate the potential energy of a protein shape. Once a force field model has been chosen, the time evolution of the system is determined by numerically solving the resulting equations of motion. One of the main disadvantages of MD is that it is very computationally expensive, making it impossible (with current technology) to run all-atom simulations of big proteins for time-scales relevant to the majority of biological processes. Nevertheless, simulations can provide us with invaluable data since they are the only method of observing proteins in “real time”. Recently, the development of so-called coarse-grained models (which model a group of atoms as a single entity) has allowed the sampling of longer times. MD is a good data source for sampling purposes because it can provide a large number of conformations of a molecule. For an introduction to Molecular Dynamics simulations, please refer to [6].

The computations required to simulate protein motion in silico are very expensive and involve non-trivial force field evaluations, as explained above. These simulations provide us with the $\mathrm{(x,y,z)}$ positions of all atoms in the molecule; for a molecule with N atoms, this amounts to 3N numbers per conformation. For interestingly sized molecules (such as proteins) the number of atoms is large and thus the dimensionality of the obtained data is extremely high. That is, a conformation sample for a protein with N atoms can be thought of as a 3N-dimensional point.

However, many biological processes are known to be very structured at the molecular level, since the constituent atoms self-organize to achieve their bio-chemical goal. An example of such a process is protein folding, the process by which a protein achieves its thermodynamically-stable three-dimensional shape to perform its biological function (a depiction of a protein folding process is shown in figure 1). To study such processes based on data gathered through simulations, there is a need to "summarize" the high-dimensional conformational data. Simply visualizing the time-series of a moving protein as produced by simulation packages does not provide a lot of insight into the process itself. One way of summarizing these motions is to turn conformations into a low-dimensional representation, such as a vector with very few components, that somehow give the "highlights" of the process. This data analysis process -turning high-dimensional data into low-dimensional data for better interpretation- is called dimensionality reduction and is discussed next.

When molecular shapes are sampled throughout some physical-chemical process that involves the motion of the molecule, there is a need to simplify the high-dimensional (albeit redundant) representation of a molecule given as a 3N-dimensional point, since it is believed that the actual degrees of freedom (DoFs) of the process are much less, as explained before. The resulting, simplified representation has to be useful to classify the different conformations along one or more "directions" or "axes" that provide enough discrimination between them.

Dimensionality Reduction techniques aim at analyzing a set of points, given as input, and producing the corresponding low-dimensional representation for each. The goal is to discover the true dimensionality of a data set that is only apparently high-dimensional. There exist mathematical tools to perform automatic dimensionality reduction, based on arbitrary input data in the form of high-dimensional points (not just molecules). Although different techniques achieve their goals in different ways, and have both advantages and disadvantages, the most general definition for dimensionality reduction could be stated as:

Dimensionality Reduction

- INPUT: A set of M-dimensional points.
- OUTPUT: A set of $d$-dimensional points, one for each of the input points, where $\mathrm{d<<M}$.

Some dimensionality reduction methods can also produce other useful information, such as a "direction vector" that can be used to interpolate atomic positions continuously along the main motions (like in PCA, see below). For both a general discussion and specific methodology on dimensionality reduction, refer to [7], and for more information on non-linear methods, see [8].

As a simple example of dimensionality reduction, consider the case of a bending string of beads, as depicted in figure 2. The input data has 3x7=21 dimensions (if given as the $\mathrm{(x,y,z)}$ coordinates of each bead) but the beads always move collectively from the "bent" arrangement (left) to the "straight" arrangement (right). Under this simplified view, the process can be considered as one-dimensional, and a meaningful axis for it would represent the "degree of straightness" of the system. Using this axis, each string of beads can be substituted by one single number, its "coordinate" along the proposed axis. Thus, the location of a shape along this axis can quickly indicate in what stage of the bending process it is.

When dimensionality reduction methods are applied to molecular motion data, the goal is to find the main "directions" or "axes" collectively followed by the atoms, and the placement of each input conformation along these axes. The meaning of such axes can be intuitive or abstract, depending on the technique used and how complex the system is. We can reword the definition of dimensionality reduction when working with molecular motion samples as:

Dimensionality Reduction of Molecular Motion Data

- INPUT: A set of molecular conformations sampled from some physical process, given as the $\mathrm{(x,y,z)}$ coordinates for each atom. These are 3N-dimensional points for a molecule with N atoms.
- OUTPUT: A set of $d$ coordinates for each input conformation, such that $\mathrm{d<<3N}$. These $d$ coordinates should help classify the conformations throughout the main stages of the studied process.

Dimensionality reduction methods can be either linear or non-linear. Linear methods typically compute the low-dimensional representation of each input point by a series of mathematical operations involving linear combinations and/or linear matrix operations. Non-linear methods use either non-linear mathematics or modify linear methods with algorithmic techniques that encode the data's "curvature" (such as Isomap, explained later). Both categories of methods have advantages and disadvantages, which will become clear through the rest of this module. More information on different linear and non-linear techniques for dimensionality reduction are given in [7]. Also, a more extensive classification of dimensionality reduction techniques into several categories, and examples of different algorithms can be found navigating this hypertext article.

The remainder of this module describes two dimensionality reduction techniques and their application to molecular motion data. These are Principal Components Analysis (PCA), a linear method, and ISOmetric feature MAPping (Isomap), a non-linear method.

We will start our discussion of Principal Components Analysis, or PCA, considering a very general data set of points as depicted in figure 3. Each point in this simple data set is given as a 3-dimensional vector $\mathrm{(x,y,z)}$ (the discussion will later be turned to the molecular motion domain, and the interpretation of such data). Even though this data set is given as 3-dimensional points, it is obvious from the figure that the data points are distributed mostly on a two-dimensional surface. Our objective is then to find the inherent, 2-dimensional parameterization of this data set. (For a full discussion on PCA, see Principal Components Analysis by I. T. Jolliffe).

For data points of dimensionality M, the goal of PCA is to compute M so-called Principal Components (PCs), which are M-dimensional vectors that are aligned with the directions of maximum variance (in the mathematical sense) of the data. These PCs have the following properties:

- The PCs are ordered by data variance. In other words, the first PC is aligned with the direction of maximum variance, the second PC in the next direction contributing to the most variance, and so on.
- The PCs form an orthonormal basis, that is, they are all mutually perpendicular and have unit length. This gives PCs the useful property of being uncorrelated.

For example, in figure 3 b), the PCs have been superimposed with the data set (PCs are drawn with different lengths to illustrate the amount of data variance they account for, but remember they are actually of unit length). Since the PCs are orthonormal, they form a vector basis in terms of which the data set can be expressed. An alternative, equivalent view, is that the PCs become aligned with the canonical axes X,Y,Z,... if the data set is rotated or rigidly transformed so that the directions of maximum cumulative variance are made to coincide with the canonical base. For the simple example, the reader can agree that the last direction of maximum variance, the 3rd in this case, accounts for little or no data variability. It is customary, when projecting the original data set into the principal components, to discard the components that do not add a significant contribution to the data variance. In figure 3 c), the third component has been discarded and the projection of the data onto the first two components is shown. Discarding the least-important components is how dimensionality reduction is performed.

For M-dimensional input data, the Principal Components (PCs) are M-dimensional vectors and have two main uses. These are to:

- Project the input data onto the PCs. Taking the dot product of an input data point with any PC returns the scalar value of the projection of the point onto the PC. Since the PCs have unit length, this projection serves as the coordinate of the input point along the PC in question. In principle, M-dimensional input data can be projected onto its M PCs, but typically we are not interested in the lesser ones (the data set would be nicely aligned with the PCs, but we would still be using M coordinates for each point). Using just the first few PCs as a basis and computing the projections onto say, the first $d$ PCs yields the best $d$-dimensional representation for each point, from a maximum variance point of view (e.g., $\mathrm{d=2}$ in figure 3-c). This is the actual dimensionality reduction.
- Interpolate or synthesize new points. The PCs themselves point in the direction of maximum variance, as explained above. For this reason, ${\mathrm{PC}}_{i}$ can be used as a direction vector along which new points can be synthesized by choosing parameter values ${a}_{i}$ and then producing artificial M-dimensional points by doing the linear combination ${a}_{1}{\mathrm{PC}}_{1}+{a}_{2}{\mathrm{PC}}_{2}+\mathrm{...}$ . Points synthesized in this way would lie approximately on the low-dimensional hyperplane spanned by the original data set. The projections of the original points correspond to particular values for these new "coordinates" ${a}_{i}$. Being able to interpolate other points not in the original data set is a useful property that other dimensionality reduction methods do not have.

There is one detail left to introduce before going into the mechanics of computing the principal components. Notice that in figure 3-a the data set is not shown at any particular position in 3D space, that is, the data set could be centered at any point in space, not necessarily around the origin $\mathrm{(0,0,0)}$. However, in order to compute the principal components, compute the low-dimensional projections, and synthesize new points, the data set is assumed to be centered in every direction. In other words, the data set needs to have its centroid removed before the computations. If new points need to be synthesized using the PCs, the centroid can be added to place the newly syntesized points in the correct region of space.

To compute the principal components, let $X$ be an nxM matrix that contains n M-dimensional data points in its columns. Furthermore, assume that the mean for all dimensions is zero, i.e., the data are centered. The goal is to find an MxM orthonormal transformation matrix $P$ containing the PCs, such that:

- $Y={P}^{T}X$, where the columns of $Y$ are the projections onto the PCs.
- $P{P}^{T}=I$, that is, $P$ is orthonormal.
- $Y{Y}^{T}=D$, the covariance matrix of the projected points $Y$, is a diagonal matrix, so that the resulting projections are uncorrelated.

The resulting covariance matrix $Y{Y}^{T}$ can be re-written as:

We want $Y{Y}^{T}$ to be a diagonal matrix $D$ so we can write:

So then, by multiplying by $P$ to the left and ${P}^{T}$ to the right:

Since $P{P}^{T}=I$. Note that the Singular Value Decomposition (SVD) of $X{X}^{T}$ yields:

Where $V$ and $W$ are the left and right eigenvectors of $X{X}^{T}$, and $S$ is a diagonal matrix with the eigenvalues. But since $X{X}^{T}$ is a symmetric matrix by construction, the left and right eigenvectors coincide, so $W=V$. So we can write:

This fact, together with the fact that $P$ and $V$ are orthonormal, means that $P=V$ and $D=S$ (because both $D$ and $S$ are diagonal). In other words, the Principal Components, or PCs, of the data set $X$ are given by the eigenvectors of the covariance matrix $X{X}^{T}$ of the original (centered) data. Moreover, the diagonal matrix of eigenvalues, $S$, is equal to the matrix $D$, which is the covariance of the projected points $Y$. Since it is a diagonal matrix, the diagonal contains the variance of the projected data set along each dimension. By retrieving the eigenvectors (or PCs) in order of decreasing eigenvalue, the PCs that explain the most data variance are automatically chosen before the rest. Since most SVD computations return the eigenvectors and eigenvalues in this order anyway, the resulting PCs are already sorted by variance.

It is important to note at this point that the eigenvalues actually correspond to the variance along each PC. By computing the ratio of each eigenvalue ${s}_{i}$ to the total sum, one can obtain the fraction of total variance explained by each PC when the data is projected onto them. Subtracting the sum of variance fraction for the first $d$ PCs from 1, we can obtain what is known as the residual variance ${r}_{d}$, that is, the amount of variance in the original data left unexplained by discarding the PCs corresponding to the lower M-$d$ eigenvalues:

which is a typical measure of the error made by approximating the data set using only the first $d$ principal components.

The above derivation can be written as an algorithm fairly easily.

PCA algorithm

- Let the input data consist of n observations ${x}_{i}$, each of dimensionality M. Construct an nxM matrix $X$ of centered observations by subtracting the data mean from each point, so that ${X}_{\mathrm{ij}}={x}_{\mathrm{ij}}-{\mathrm{<x>}}_{j}$
- Construct the covariance matrix $C=X{X}^{T}$
- Compute the top $d$ eigenvalues and corresponding eigenvectors of $C$, for example by performing an SVD of $C$.
- The first $d$ Principal Components (PCs) of the data are given by the eigenvectors, which can be placed in a $d$xM matrix $P$. The residual variance can be computed from the eigenvalues as explained above.
- To project the original (centered) points into the optimal $d$-dimensional hyperplane, compute the dot product of each point with the PCs to obtain the projections ${y}_{i}$. This can be written as $Y={P}^{T}X$.

PCA is very well established as a dimensionality reduction technique and efficient algorithms with guaranteed convergence for its computation are readily available. Software packages that perform SVD and PCA are freely available and trivial to use. For example Matlab has built-in commands for both, and moderately experienced C and Fortran programmers can use the popular and extremely-efficient LAPACK linear algebra package. The concepts explained in this section have not assumed any particular number of dimensions. Even though a 3D example was given at the beginning, the concepts can be lifted to deal with spaces of arbitrary dimensionality, where the input data set can have a large dimension M. PCA has the advantage over other available methods that the principal components have a direct physical interpretation, especially when working with molecular motion data.

In structural bioinformatics we want to apply PCA to a set of molecular conformations, which will serve as our high-dimensional points. The input dimensionality of each point is 3N, where N is the number of atoms in the molecule. We will have n such conformations, that have been gathered through some form of sampling (for example through molecular dynamics simulations), and we want to reduce the dimensionality of each "point" (conformation) for analysis purposes. The data used as input for PCA is in the form of several atomic position vectors corresponding to different structural conformations which together constitute a vector set. Each vector in the conformational vector set has dimension 3N and is of the form [${x}_{1}$, ${y}_{1}$, ${z}_{1}$, ${x}_{2}$, ${y}_{2}$, ${z}_{2}$,..., ${x}_{N}$, ${y}_{N}$, ${z}_{N}$], where [${x}_{\mathrm{i}}$, ${y}_{\mathrm{i}}$, ${z}_{\mathrm{i}}$] corresponds to the Cartesian coordinates of the ${i}^{\mathrm{th}}$ atom.

However, before running the PCA procedure outlined above, all conformations need to be aligned with a reference structure first, as discussed in the module Molecular Distance Measures. The reason why alignment is important is that most simulation packages model the exchange of heat between the molecule and a thermal bath, in the form of random velocity perturbations on the atoms. These perturbations will in general add non-zero linear and angular momenta to the molecule structure, and not all simulation packages remove them. As a result, molecular conformations that are almost identical in shape but translated/rotated with respect to each other will have significantly different coordinates, and will be considered as different 3N-dimensional points. The reference structure (to which all structures should be aligned to) can be chosen from the set (e.g., the native structure) or provided from outside the set. Results may vary a little, but aligning all conformations to the same reference structure yields comparable results in general.

After all conformations have been aligned, the PCA procedure can be used exactly as detailed above. The first step is to determine the average vector from the conformational vector set, so it can be subtracted from all conformations to build the matrix $X$. This is done by computing an average conformation that contains the average for all 3N dimensions of the data set. It is important to note that this "average conformation" no longer represents a physically feasible molecule, since the Cartesian coordinates of all atoms are being averaged throughout the entire data set. When this "average conformation" is subtracted from the aligned input conformations, the "centered" data now becomes atomic displacements, not positions, and so building a covariance matrix for this data makes sense.

The PCs for molecular conformation data can be used for the two purposes explained before, i.e., to obtain a low-dimensional representation of each point and to synthesize (or interpolate) new conformations by following the PCs. Now, the PCs have a physical meaning: they represent the "main directions" followed by the molecule's 3N degrees of freedom, or, in other words, the directions followed collectively by the atoms. Interpolating along each PC makes each atom follow a linear trajectory, that corresponds to the direction of motion that explains the most data variance. For this reason, the PCs are often called Main Modes of Motion or Collective Modes of Motion when computed from molecular motion data. Interpolating along the first few PCs has the effect of removing atomic "vibrations" that are normally irrelevant for the molecule's bigger scale motions, since the vibration directions have little data variance and would correspond to the last (least important) modes of motion. It is now possible to define a lower-dimensional subspace of protein motion spanned by the first few principal components and to use these to project the initial high-dimensional data onto this subspace. Since the PCs are displacements (and not positions), in order to interpolate conformations along the main modes of motion one has to start from one of the known structures and add a multiple of the PCs as perturbations. In mathematical terms, in order to produce conformations interpolated along the first PC one can compute:

Where $c$ is a molecular conformation from the (aligned) data set and the interpolating parameter can be used to add a deviation from the structure $c$ along the main direction of motion. The parameter can be either positive or negative. However, it is important to keep in mind that large values of the interpolating parameter will start stretching the molecule beyond physically acceptable shapes, since the PCs make all atoms follow straight lines and will fairly quickly destroy the molecule's topology. A typical way of improving the physical feasibility of interpolated conformations is to subject them to a few iterations of energy minimization following some empirical force field.

Although it is possible to determine as many principal components as the number of original variables (3N), PCA is typically used to determine the smallest number of uncorrelated principal components that explain a large percentage of the total variation in the data, as quantified by the residual variance. The exact number of principal components chosen is application-dependent and constitutes a truncated basis of representation.

As an example of the application of PCA to produce low-dimensional points for high-dimensional input, consider the Cyanovirin-N (CV-N) protein depicted in figure 4 a), corresponding to PDB code 2EZM. This protein acts as a virucidal for many viruses, such as HIV. Protein folding simulations of this protein can be used as input for a PCA analysis. A model of this protein that considers atoms at the alpha-carbon positions only has 101 such atoms, for a total of 303 degrees of freedom. Folding/unfolding simulations starting from the native PDB structure produce abundant conformation samples of CV-N along the folding reaction.

Figure 4 b) shows the projection of each simulated conformation onto the first two principal components. Each point in the plot corresponds to exactly one of the input conformations, which have been classified along the first two PCs. Note that even the first PC alone can be used to approximately classify a conformation as being "folded" or "unfolded", and the second PC explains more of the data variability by assigning a wider spread to the unfolded region (for which the coordinate variability is much larger). The low-dimensional projections can be used as a basis to compute probability and free-energy plots, for example as in Das et al. Figure 4 c) shows such a plot for CV-N, which makes the identification of clusters easier. Using PCA to interpolate conformations along the PCs is not a good idea for protein folding data since all of the protein experiences large, non-linear motions of its atoms. Using only a few PCs would quickly destroy the protein topology when large-scale motions occur. In some cases, however, when only parts of a protein exhibit a small deviation from its equilibrium shape, the main modes of motion can be exploited effectively as the next example shows.

As an example of the application of PCA to analyze the main modes of motion, consider the work of Teodoro et al., which worked on the HIV-1 protease, depicted in figure 5. The HIV-1 protease plays a vital role in the maturation of the HIV-1 virus by targeting amino acid sequences in the gag and gag-pol polyproteins. Cleavage of these polyproteins produces proteins that contribute to the structure of the virion, RNA packaging, and condensation of the nucleoprotein core. The active site of HIV-1 protease is formed by the homodimer interface and is capped by two identical beta-hairpin loops from each monomer, which are usually referred to as "flaps". The structure of the HIV-1 protease complexed with an inhibitor is shown in figure 5. The active site structure for the bound form is significantly different from the structure of the unbound conformation. In the bound state the flaps adopt a closed conformation acting as clamps on the bound inhibitors or substrates, whereas in the unbound conformation the flaps are more open. Molecular dynamics simulations can be run to sample the flexibility of the flaps and produce an input data set of HIV-1 conformations to which PCA can be applied. A backbone-only representation of HIV-1 has 594 atoms, which amounts to 1,782 degrees of freedom for each conformation.

Applying the PCA procedure as outlined before to a set of HIV-1 samples from simulation produces 1,782-dimensional principal components. Since the physical interpretation of the PCs is quite intuitive in this case, the PC coordinates can be split in groups of 3 to obtain the $\mathrm{(x,y,z)}$ components for each of the 594 atoms. These components are 3-dimensional vectors that point in the direction each atom would follow along the first PC. In figure 6 a), the per-atom components of the first PC have been superimposed in purple.

Figure 6 b) shows the effect of interpolating HIV-1 conformations along the first PC, or first mode of motion. Starting from an aligned conformation from the original data set, multiples of the first PC can be added to produce interpolated conformations. Note that the first mode of motion corresponds mostly to the "opening" and "closing" of the flaps, as can be inferred from the relative magnitued of the first PC components in the flap region. Thus, interpolating in the direction of the first PC produces an approximation of this motion, but using only one degree of freedom. This way, the complex dynamics of the system and the 1,782 apparent degrees of freedom have been approximated by just one, effectively reducing the dimensionality of the representation.

Figure 7 (solid line) shows the residual variance left unexplained by discarding the lower-ranked principal components. Residual variance plots always decrease, and in this case, the first PC accounts for approximately 40% of the total variance along the motion (the dashed line shows the percentage of total variance explained up to the given PC). Also, the first 10 PCs account for more than 70% of the total data variance. Given the dominance of only a few degrees of freedom it is possible to represent the flexibility of the protein in a drastically reduced search space.

PCA falls in the category of what is called a linear method, since mathematically, the PCs are computed as a series of linear operations on the input coordinates. Linear methods such as PCA work well only when the collective atom motions are small (or linear), which is hardly the case for most interesting biological processes. Non-linear dimensionality reduction methods do exist, but are normally much more computationally expensive and have other disadvantages as well. However, non-linear methods are much more effective in describing complex processes using much fewer parameters.

As a very simple and abstract example of when non-linear dimensionality reduction would be preferred over a linear method such as PCA, consider the data set depicted in figure 8 a). The data is apparently two-dimensional, and naively considering the data variance in this way leads to two "important" principal components, as figure 8 b) shows. However, the data has been sampled from a one-dimensional, but non-linear, process. Figure 8 c) shows the correct interpretation of this data set as non-linear but one-dimensional.

Most interesting molecular processes share this characteristic of being low-dimensional but highly non-linear in nature. For example, in the protein folding example depicted in figure 1, the atom positions follow very complicated, curved paths to achieve the folded shape. However, the process can still be thought of as mainly one-dimensional. Linear methods such as PCA would fail to correctly identify collective modes of motion that do not destroy the protein when followed, much like in the simple example of figure 8.

Several non-linear dimensionality reduction techniques exist, that can be classified as either parametric or non-parametric.

- Parametric methods need to be given a model to try to fit the data, in the form of a mathematical function called a kernel. Kernel PCA is a variant of PCA that projects the points onto a mathematical hypersurface provided as input together with the points. When using parametric methods, the data is forced to lie on the supplied surface, so in general this family of methods do not work well with molecular motion data, for which the non-linearity is unknown.
- Non-parametric methods use the data itself in an attempt to infer the non-linearity from it, much like deducing the inherent connectivity in figure 8 c). The most popular methods are Isometric Feature Mapping (Isomap) from Tenenbaum et al. and Locally Linear Embedding (LLE) from Roweis et al..

The Isomap algorithm is more intuitive and numerically stable than LLE and will be discussed next.

Isomap was introduced by Tenenbaum et al. in 2000. It is based on an improvement over a previous technique known as MultiDimensional Scaling (MDS). Isomap aims at capturing the non-linearity of a data set from the set itself, by computing relationships between neighboring points. Both MDS and the non-linear improvement will be presented first, and then the Isomap algorithm will be detailed.

MDS. Multidimensional Scaling (see Cox et al.) is a technique that produces a low-dimenisional representation for an input set of n points, where the dimensions are ordered by variance, so it is similar to PCA in this respect. However, MDS does not require coordinates as input, but rather, distances between points. More precisely, MDS requires as input a data set of points, and a distance measure $\mathrm{d(i,j)}$ between any pair of points ${x}_{i}$ and ${x}_{j}$. MDS works best when the distance measure is a metric. MDS starts by computing all possible pairwise distances between the input points, and placing them in a matrix $D$ so that ${D}_{\mathrm{ij}}=\mathrm{d(i,j)}$. The goal of MDS is to produce, for every point, a set of n Euclidean coordinates such that the Euclidean distance between all pairs match as close as possible the original pairwise distances ${D}_{\mathrm{ij}}$, as depicted in figure 9.

If the distance measure between points has the metric properties as explained above, then n Euclidean coordinates can always be found for a set of n points, such that the Euclidean distance between them matches the original distances. In order to obtain these Euclidean coordinates, such coordinates are assumed to have a mean of zero, i.e., they are centered. In this way, the matrix of pairwise distances $D$ can be converted into a matrix of dot products by squaring the distances and performing a "double centering" on them to produce the matrix $B$ of pairwise dot products:

Let's assume that n Euclidean coordinates exist for each data point and that such coordinates can be placed in a matrix $X$. Then, multiplying $X$ by its transpose should equal the matrix $B$ of dot products computed before. Finally, in order to retrieve the coordinates in the unknown matrix $X$, we can perform an SVD of $B$ which can be expressed as:

Where the left and right singular vectors coincide because $B$ is a symmetric matrix. The diagonal matrix of eigenvalues can be split into two identical matrices, each having the square root of the eigenvalues, and by doing this a solution for $X$ can be found:

Since $X$ has been computed through an SVD, the coordinates are ordered by data variance like in PCA, so the first component of $X$ is the most important, and so on. Keeping this in mind, the dimensionality reduction can be performed by ignoring the higher dimensions in $X$ and only keeping the first few, like in PCA.

Geodesic distances. The notion of "geodesic" distance was originally defined as the length of a geodesic path, where the geodesic path between two points on the surface of the Earth is considered the "shortest path" since one is constrained to travel on the Earth's surface. The concept of geodesic distance can be generalized to any mathematical surface, and defined as "the length of the shortest path between two points that lie on a surface, when the path is constrained to lie on the surface." Figure 10 shows a schematic of this generalized concept of geodesic path and distance.

The Isomap algorithm. The Isomap method augments classical MDS with the notion of geodesic distance, in order to capture the non-linearity of a data set. To achieve its goal, Isomap uses MDS to compute few Euclidean coordinates that best preserve pairwise geodesic distances, rather than direct distances. Since the coordinates computed by MDS are Euclidean, these can be plotted on a Cartesian set of axes. The effect of Isomap is similar to "unrolling" the non-linear surface into a natural parameterization, as depicted in figure 11. Isomap approximates the geodesic distances from the data itself, by first building a neighborhood graph for the data. A neighborhood graph consists of the original set of points, together with a connection between "neighboring" points (neighboring points are defined as the closest points to a given point, as determined by the distance measure used) as shown in figure 11-b. After the neighborhood graph has been built, it can be used to approximate the geodesic distance between all pairs of points as the shortest path distance along the graph (red path in figure 11-b). Naturally, the sampling of the data set has to be enough to capture the inherent topology of the non-linear space for this approximation to work. MDS then takes these geodesic distances and produces the Euclidean coordinates for the set. Figure 11-c shows two Euclidean coordinates for each point in the example.

The Isomap method can be put in algorithmic form in the following way:

The Isomap algorithm

- Build the neighborhood graph. Take all input points and connect each point to the closest ones according to the distance measure used. Different criteria can be used to select the closest neighbors, such as the $k$ closest or all points within some threshold distance.
- Compute all pairwise geodesic distances. These are computed as the shortest paths on the neighborhood graph. Efficient algorithms for computing all-pairs-shortest-paths exist, such as Dijkstra's algorithm. All geodesic distances can be put in a matrix $D$.
- Perform MDS on geodesic distances. Take the matrix $D$ and apply MDS to it. That is, apply the double-centering formula explained above to produce $B$, and compute the low-dimensional coordinates by computing $B$'s eigenvectors and eigenvalues.

The Isomap algorithm captures the non-linearity of the data set automatically, from the data itself. It returns a low-dimensional projection for each point; these projections can be used to understand the underlying data distribution better. However, Isomap does not return "modes of motion" like PCA does, along which other points can be interpolated. Also, Isomap is much more expensive than PCA, since building a neighborhood graph and computing all-pairs-shortest-paths can have quadratic complexity on the number of input points. Plus, ther is the hidden cost of the distance measure itself, which can also be quite expensive. Regardless of its computational cost, Isomap has proven extremely useful in describing non-linear processes with few parameters. In particular, the work in Das et al. applied Isomap successfully to the study of a protein folding reaction.

In order to apply Isomap to a set of molecular conformations (gathered, for example, through molecular dynamics simulations) all that is needed is a distance measure between two conformations. The most popular distance measure for conformations is lRMSD, discussed in the module Molecular Distance Measures. There is no need to pre-align all the conformations in this case, since the lRMSD distance already includes pairwise alignment. Thus, the Isomap algorithm as described above can be directly applied to molecular conformations. Choosing an appropriate value for a neighborhood parameter (such as $k$) may require a bit of experience, though, and it may depend on the amount and variability of the data. It should be noted that now, the shape of the low-dimensional surface where we hypothesize the points lie is unknown to us. In the swiss roll example above, it was obvious that the data was sampled from a two-dimensional surface. For molecular data, however, we do not know, a priori, what the surface looks like. But we know that the process should be low-dimensional and highly non-linear in nature. Of course, the distance measure used as an underlying operation has an impact on the final coordinates as well. The validation of the Isomap method using the lRMSD distance on protein folding data was done in Das et al., where a statistical analysis method was used to prove its usefulness.

Figure 12 shows the results of applying Isomap to the same molecular trajectory used for the CV-N example in the PCA section. CV-N is a protein that is known to have an "intermediate" state in the folding mechanism, and it is also known to fold through a very ordered process following a well-defined "folding route".

The free-energy plot in figure 12 (right) clearly shows the superior quality of the Isomap coordinates when compared with PCA. Isomap clearly identifies the CV-N intermediate (seen as a free-energy minimum in blue) in between the unfolded and folded states. Also, the highest probability route connecting the intermediate to the folded state is clearly seen in this rendering. The PCA coordinates did not identify an itermediate or a folding route, since the PCA coordinates are simply a linear projection of the original coordinates. Isomap, on the contrary, finds the underlying connectivity of the data and always returns coordinates that clearly show the progression of the reaction along its different stages.

The Isomap procedure can be applied to different molecular models. As another, smaller example, consider the Alanine Dipeptide molecule shown in figure 13 (left). This is an all-atom molecule that has been studied thoroughly and is known to have two predominant shapes: an "extended" shape (also called C5 and a variant called PII) and a "helical" shape (also called alpha, with three variants: right-handed, P-like, and left-handed). The application of the Isomap algorithm to this molecule, without any bias by a priori known information, yields the low-dimensional coordinates on the left of figure 13 (shown again as a free-energy plot). The first two coordinates (top right) are already enough to identify all five major states of the peptide, and the possible transitions between them. Applying PCA to such a simple molecule yields comparable results (not shown), but PCA cannot differentiate the left-handed helix shape from the right-handed one. Only the geodesic formulation of Isomap can.

Although the Isomap coordinates describe non-linear molecular processes extremely well, their computation is very expensive, as mentioned earlier. The most expensive step is the computation of the neighborhood graph. Some approximations are possible to speed up its computation, at the expense of some accuracy in the identification of the "true" nearest neighbors for each point. The work in Plaku et al. proposed the use of an approximation technique that allows the computation of nearest-neighbors for high-dimensional data sets orders of magnitude faster, by quickly reducing the dimensionality of all points, and then querying for neighbors. The tradeoff between precision and speed is controllable through several algorithm parameters.