Rev: 10 dec 02, notes added 19 Jan 03

Evolving Draft -- Not for Distribution
This is a draft of a paper being written by CSC181 at Dowling College

Note added 19 Jan 03: The NCUR abstract has been accepted. See http://www.unca.edu/ncur-proc/Abstract.html for the rules we must follow. The biggest problem is a strict 8 page limit. I have prepared a partial draft DRAFT_NCUR_Paper.pdf with the descriptions of the programs greatly condensed. This leaves us about 1 3/4 pages for discussion and conclusions. Please email me corrections and changes and material you think should be included in that remaining space.

Deadline for all input: 28 January 2003.

Deadline for us to get the final paper out the door: 7 February 2003

-- HJB

Note added 10 Dec: Added summary drawings to results. -- HJB

Note added 6 Dec: Material from Parag on Masker, Daniel on DMS and Petko in MSMS. -- HJB

Note added 1 Dec: Material from Ventsi on ASC added. Intercomparable results are needed ASAP. -- HJB

Gentlemen: Here it is. I've done a lot of editing. It is now tomorrow, so I will stop and let you take your turn. I hope each of you will read this, think about it and write good material. Improve what I have done and work towards a uniform style. Check my spelling and references. Provide good, solid material for the sections we still need. I hope you will email me a steady stream of material over the next two weeks. -- HJB

Comparison of Recent Algorithms for Computing Molecular Surfaces

by

Petko S. Ivanov, Parag Jain, Daniel Osei-Kuffuor,
Vencislav S. Stanev, Rohit Tripathi, Peter N. Zhivkov,
Herbert J. Bernstein

Department of Mathematics and Computer Science
Dowling College, Oakdale, NY 11769-1999

Abstract

Understanding of so-called molecular surfaces is important in structural biology and in drug design. In 1971, Lee and Richards defined a molecular surface as the surface resulting from rolling a probe sphere on the van der Waals surface of a molecule. We compare recent algorithms for computing molecular surfaces. This work is ongoing, but results to date show that there are significant differences in the performance of some recently proposed algorithms, with some of them achieving performance suitable for use in interactive graphics programs (at least for smaller molecules), and some being suitable only for off-line computations of surfaces. A surprising result of our investigations has been the difficulty in computing surfaces of molecules composed of nucleic acids as opposed to amino acids. We conjecture that the difference is due to the lower degree of folding of the former. An exception is the VEGA program by Pedretti, which does not have an anomaly in the timing for nucleic acids, but which is, unfortunately one of the slower programs with what appears to be n-squared growth of time. The faster programs include Sanner's MSMS, Vorobjev's SIMS and Eisenhaber's NSC and ASC.

Introduction

When using molecular graphics to dock two molecules, as in interactive drug design, it helps to assign a surface to one or both molecules. While not strictly physical, such a surface conveniently limits the extent to which molecules may interpenetrate, helping to estimate key-in-lock fits. (See [Dickerson, Geis 1969, p68]). Early surface renderings were done with space-filling Corey-Pauling-Koltun (CPK) [Corey, Pauling 1953] models, creating an implicit surface from the visible exterior of intersecting van der Waals spheres. Such a surface is useful, but has distracting features. Smoother surface representation of a surface allows better understanding of the fit between molecules. The commonly accepted defintion of a such a smoothed molecular surface was provided by Lee and Richards [Lee, Richards 1971]. Lee and Richards defined a molecular surface as the surface resulting from rolling a probe sphere on the van der Waals spheres of a CPK model of the molecule. The result is a surface which agrees with the CPK model where the van der Waals spheres stick out, but replaces the inward pointing sharp crevices of a CPK model with smooth torii and spherical triangles. For a review of the concepts and algorithmic developments related to molecular surfaces from 1971 until 1996 see M. Connolly's review [Connolly 1996]. Calculation of molecular surfaces can be computationally intensive. As interest focuses on larger and larger molecules, questions of efficiency of various approaches to such calculations become important. In this paper we consider the performance of several recent algorithms for computing molecular surfaces.

Some algorithms produce "better" results than others. ....This section to be filled in by P. Zhivkov and others

Our concern is with the use of molecular surface algorithms for rendering. It is natural to maintain a database of atomic positions in a general purpose rendering system. In computing molecular surfaces this then favors a two stage approach, of first identifying atoms near enough to one another to help to shape common portions of the surface, and then to render the surface components involved. Deciding which of n atoms are near one another by naive pairwise comparison leads to n-squared growth in time. In the mid 1960's Levinthal introduced what would now be called a divide-and-conquer approach to the calculation. At that time he called it "cubing", and that name is still used in molecular graphics. For an accessible reference, see [Katz, Levinthal 1972].

An alternative approach to deciding on which atoms are neighbors to one another would be to calculate the Voronoi diagram ***[ref ???]*** or Delaunay diagram ***[ref ???]*** of the molecule. These computational geometry techniques are functionally equivalent to cubing-based neighbor calculations. In some cases they are explicitly based on divide-and-conquer style cubing. For this portion of the surface calculation, explicit voronoi diagram based techniques do not seem to have a computational advanatage.

The second phase of the calculation is the actual rendering of the surface. Another, related calculation is the computation of the surface area or the volume contained within the surface. Voronoi diagram techniques are useful in efficiently deriving the surface area and the volume, but for rendering, in the absense of specialized hardware, the second phase appears to be dominated by the placement of pixels.

Algorithm and Programs Considered

We have conducted timing experiments with Sanner's MSMS, Vorobjev's SIMS, Eisenhaber's ASC, Bystroff's version of Le Grand and Merz's Masker, Huang's DMS and Pedretti, Villa and Vistoli's VEGA.

MSMS

Maximal Speed Molecular Surface (MSMS) [Sanner 1995, 1996] is based on a recursive divide and conquer algorithm. MSMS constructs a surface called the "r-reduced surface of a set of atoms s" (Rr(S)) to fascilitate molecular surface computations. Rr(S)' outer component is computed in O[n*log(n)] and its corresponding r-Excluded surface component (Er(S)) in O[n] for n "calibrated spheres". "A set of spheres is calibrated if the ratio of the probe radius to the spheres representing atoms is bound by a constant and the distance between spheres representing atoms has a lower and an upper bound." [Sanner 2002] In building the Rr(S), because the surface is calibrated, the number of current edge's neighbors to consider is O[log(n)]. This is achieved using a hierarchical subdivision of space obtained by recursively dividing the bounding box of the set of atom's centers into two subsets of comparable sizes. Since Rr(S) has O[n], edges Rr(S) can be computed in O[n*log(n)].

Er(S) is built from Rr(S) in complexity is a linear function to the number of faces and edges of Rr(S).

SIMS

Smooth Invariant Molecular Surfaces (SIMS) [Vorobjev 1997] is largely based on Connolly's method [Connolly 1983] with some modifications. It starts with a set of atomic coordinates, atomic radii, and a probe sphere radius. Neighbor identification for a specific atom I inside the sphere of extended radius (the radius of atom I + the radius of the probe sphere) is performed by cubing. Next, the computation of the MS proceeds as in Connoly's method: all contact triplets that can be touched simultaneously by the probe sphere are identified and respective probe sphere positions are calculated (vertices of tangency, etc) employing Connoly's formulae. These form spherical triangles from reentrant concave (RC) faces. All solvent probe positions for pairs of atoms are, next, calculated (probe sphere starting and stopping vertices, radii of rotation, etc), yielding saddle faces which are joined to the RC faces at common boundary arcs. And last, the contact portions, which result in a cutoff from the van derWaals spheres by the atom-torus contact planes, are calculated. The areas of all surface element are calculated analytically with Connolly's formulae.

Singularities are smoothed by using a second probe sphere of smaller radius ( <= 5 Å) rolled over the outward surface of the intersecting solvent probe spheres. The singularity is replaced by this new surface. Very homogeneous dot surfaces are produced by a constant arc length method. Invariance under rotation is achieved by using local coordinate systems. The claimed time complexity is claimed to be O[n].

ASC

The Analytic Surface Computation (ASC) package from Eisenhaber also includes a Numercial Surface Calculation (NSC). NSC is based on the Double Cubing Lattice Method (DCLM) [Eisenhaber 1995].

DCLM involves cubing techniques applied twice. The first time is when determining the neighboring atoms in the molecule and the second one is when computing overlapping surfaces.

Step number one involves cubing with a fixed side cube equal to 2rmax, where r is the radius of the probe sphere + the radius of the atom under consideration. rmax is simply the biggest r of all atoms. DCLM starts with determining the minimum and the maximum of the x-, y- and z- coordinates of all atoms in the molecule under consideration. The smallest rectangular box which is able to fit the molecular structure and has dimensions equal to a multiple of 2rmax is calculated. The box is then subdivided into cubes of size 2rmax and each atom is assigned a cube number (if an atom lies in box (ix, iy, iz) then the box number is equal to ix + (iy*bx) + (iz*bx*by). All atoms are sorted according to box number. The result of this operations is that all atoms belong to a cube and all possible occluding neighbors are either is the same cube or the 26 adjacent cubes. This concludes step one.

Step two is checking for occlusion between atoms. The surface of each atom is represented by evenly distributed dots (the so called dot-surface). Each of those dots has to be checked for occlusion. To speed up the process a second cubing lattice in a cube that is inscribed by the unit sphere is used.To compute the surface of an atom i, an occlusion check on all dots k that belong to i is done. To minimize search time for a neighbor occluding the dot the algorithm first tries the neighbor that occluded the previous dot.In order this heuristic to work the dots should be ordered by spatial proximity.

Step three is calculating the surface area.The Area A is obtained by

where macc(i) is the number of dots on atom I not occluded by neighboring atoms.

ASC [ref ???] involves three steps:

  1. "formation of the neighboring list of atoms- classes of atoms are introduced( a residue or a space cube) and then the distance to the class center is checked while having in mind the dimensions of the class. Atoms are grouped in residues in the ASC inmplementation.
  2. "determination of list of fully or partly accessible intersection circles and vertices this step has to following features- assignment of states to intersection circles,two tests as sufficient criteria to complete nonacessibility for small and midium sized intersection circles, a test to check whether two symetric vertices with small vertex heights are buried and a search procedure that with almost no sqare root extractions."
  3. closure of cycles and area calculation This step involves checking for cycles in the intersection circles and removing them( finding the outter shell of the surface) and actual calculation of the surface

MASKER

Masker [Bystroff 2001] is Christopher Bystroff's version of LeGrand and Merz's program [LeGrand, Merz 1993] of the same name that used Boolean masks to compute a solvent accessible surface (SAS) -- an expanded van der Waals surface. Bystroff modified the original algorithm to estimate the solvent excluded molecular surface. Numerical estimates of arc lengths of intersecting atomic SAS are used to estimate the toroidal surface, and intersections between those arcs are used to estimate the reentrant surface area.

A Boolean masks approach for computing the SAS was first proposed by LeGrand and Merz. In their approach, a set of points were evenly spaced on the surface of a sphere centered at the origin and having radius 1.0, and each was represented by a single bit in a multi-byte word. A set of 'masks' was created by placing a probe sphere of unit radius at all positions around the origin, and all distances from zero to two. Points within the probe radius had their bits set to zero, while the remaining bits were set to one. Therefore, one multi-byte word (mask) corresponded to one probe sphere location. A binary AND operation over any number of masks resulted in another mask whose nonzero bits represent the SAS for one atom. The SAS is estimated quickly and accurately by summing the 1-bits and scaling. Bystroff's improved MASKER takes the method of LeGrand and Merz one step further by using masks to calculate SES. The problem is broken down to its three component parts. The contact surface is a scalar multiple of SAS. The toroidal surface can be estimated based on the length of exposed circular edge at the intersection of two atom surfaces. The reentrant surfaces are calculated by placing a mask at each position where a probe is in contact with exactly three atoms. Intersections and self-burial of toroidal and reentrant surfaces are accounted for by additional masking operations, as are the pairwise partial derivatives.

According to Bystroff, the runtime [of Masker] is a polynomial with an exponent of approximately 1.42 and the process is dominated by the masking operations which consumes about 73% of the CPU time.

DMS

Distributed Molecular Surface (DMS) by Conrad Huang is part of the MIDAS system [Ferrin, Huang, Jarvis, Langridge 1988]. DMS can be run on one machine or on multiple servers simultaneously. The algorithm is structured to facilitate parallel computation. The atom list is sorted first by the x coordinate, then by the y-coordinate and finally by the z coordinates. Neighbors are then computed and then surface components. The initial sort allows for computation in x-coordinate slabs. This provides a division of labor in one dimension, but with n-squared growth in the other two dimensions. Lists of surface points are generated.

The sequence of events is:

VEGA

VEGA [Pedretti, Villa, Vistoli 2002] is a molecular graphics program that includes the ability to compute surfaces. The calculation is based on a creation of dot surfaces and removal of dots that fall in the interior of the molecule.

" A sphere of 1 Å radius is generated at the (0, 0, 0) coordinates using the dot density specified by the user. For each atom (Ai), this sphere is scaled multiplying the dot coordinates by the correspondent Van Der Waals radius (Ri) and translated to the atom coordinates (Xi, Yi, Zi). The dots overlapped to near atoms are discarded. This procedure is performed measuring the distance between the surface dots of the "Ai" atom and all atomic coordinates for all "n" atoms: if it's less than the "An" Van der Waals radius, the dot is overlapped. This algorithm takes O(n2) computation time."

Experimental Methodology

The objective of the experiment is to determine the performance of the algorithms described above. The independent variable is number of atoms and the dependent variable is time consumed for area computation. The input data is divided into two different categories according to shape of the molecule : the first one is composed of proteins ( folded, dense molecules) and the second one is DNA/RNA (long strung out molecules). All input data has been downloaded from the Protein Data Bank (www.rcsb.org)[Reference] in pdb format.

The experiment was conducted on a Dell Dimension 2300 workstation. The machine runs Red Hat Linux 8.0. It has an Intel 845GL chipset Pentium 4 2GHz processor with 512 MB SDRAM. Only the vital processes of the Operating System were left running while performing the computations and the user time consumed by the different algorithms was measured. Standard atom radii were used as defined in [Bondi 1964] with a probe radius of 1.4Å

Experimental Results

The experimental results are summarized in the following three figures.  Each figure is drawn with the y-axis being a log-plot time in seconds.  The x-axis is the number of atoms used.  The first plot shows times to compute surfaces of Proteins, the second plot shows times to compute surfaces of DNAs and the third plot shows times to compute surfaces of RNAs.  For each program a curve of the form t = a nb has been fit. While this is not the "natural" fit for all the algorithms, it is a reasonable fit for many of them, and allows for a performance comparison among them.

Results from ASC:

The input data for proteins consists of 30 pdbs. The table below represent the results for ASC and NSC together. The first column is the pdb ID as listed in the Protein Data Bank. The second column is number of atoms. Where there are two entries in this column the first is the number of atoms as given in the pdb format and the second one is the actual atoms selected for computation. For the small pdbs the difference is insignificant but for the large pdbs it is noted. The ASC_1 column represent the time for the first step in ASC. The ASC_2 column represents time taken for step 1 + step 2. The third ASC column has the total time ASC takes for its three main steps. The NSC column has the total time taken for computing the surface of a molecule using the NSC algorithm.

protein number of atoms ASC_1   ASC_2   ASC_3   NSC
4INS    1,158           00.01   0.08    0.09    0.04
1FUA    1,727           0.03    0.16    0.17    0.07
3SEB    2,125           0.03    0.21    0.23    0.10
1AF7    2,361           0.03    0.23    0.25    0.12
1MX2    2,567           0.03    0.24    0.26    0.13
1ADS    2,665           0.04    0.27    0.29    0.12
1A4R    3,087           0.05    0.32    0.35    0.16
1A1N    3,511           0.05    0.33    0.38    0.15
1AOP    4,153           0.06    0.40    0.44    0.18
1FMT    4,845           0.07    0.51    0.55    0.25
1E7O    5,122           0.08    0.49    0.56    0.20
4PGA    5,387           0.09    0.56    0.65    0.24
1A6V    5,792           0.09    0.62    0.68    0.27
1A6Z    6,099           0.11    0.66    0.75    0.31
1AD9    6,760           0.11    0.73    0.82    0.34
1A2K    7,029           0.12    0.74    0.87    0.36
1HMV    7,399           0.12    0.79    1       0.38
1A2Z    7,652           0.12    0.81    0.96    0.37
1BBR    8,084           0.13    0.82    0.91    0.38
1AFC    8,312           0.12    0.84    0.90    0.39
1AFV    9,024           0.15    0.99    1.17    0.5
1A0S    9,939           0.18    1.12    1.52    0.47
1IAS    13,187          0.22    1.52    1.86    0.62
1AIG    14,142  13,014  0.22    1.47    1.92    0.69
1I84    16,580  16,491  0.25    1.74    2.74    0.79
1jmu    23,657  23,268  0.43    2.71    5.69    1.35
1A49    34,001  31,824  0.58    3.67    5.83    1.58
1RCX    39,368  37,456  0.84    4.83    8.85    1.95
2btv    49,061  49,061  0.93    5.58    12.02   2.50
1M8Q    76,872  74,892  1.35    8.33    14.66   3.80

Results from MSMS (PI):

/***********************************************************************
*       file -> pdb file                                                *
*       # atoms -> number of atoms, used in MS calculations             *
*       r-reduced -> Reduced Surface Calculations Time                  *
*       ases -> Analytical Solvent Excluded Surface                     *
*               surf -> Surface Time                                    *
*               sing -> Singularities Time                              *
*       triang -> Triangulation Time                                    *
*       total -> Total Time                                             *
************************************************************************/

DNA
============

file    # atoms r-reduced       ases(surf+sing) triang  total
-----------------------------------------------------------------
1KCI    192     0.02            0.01+0.00       0.02    0.06
1M6A    236     0.03            0.00+0.01       0.02    0.07
1LJX    300     0.03            0.01+0.00       0.03    0.08
1KSE    434     0.07            0.01+0.01       0.05    0.14
1MTG    561     0.08            0.01+0.01       0.05    0.17
1J8L    600     0.08            0.01+0.01       0.06    0.17
1M6F    639     0.07            0.01+0.00       0.06    0.16
1JPQ    642     0.05            0.01+0.01       0.05    0.13
1MKL    671     0.11            0.01+0.01       0.08    0.23
1G8N    675     0.07            0.01+0.01       0.07    0.18
1N1O    682     0.07            0.01+0.01       0.07    0.17
1K5F    713     0.11            0.02+0.01       0.07    0.23
1N2W    766     0.13            0.01+0.02       0.08    0.26
1MXK    815     0.12            0.02+0.01       0.08    0.25
1G3V    829     0.11            0.01+0.02       0.10    0.25
1K9H    886     0.15            0.02+0.02       0.10    0.31
1IH4    1083    0.12            0.01+0.02       0.09    0.26
1IH6    1202    0.12            0.02+0.02       0.10    0.29
1JRN    1234    0.11            0.01+0.01       0.10    0.26
1I3W    1318    0.12            0.01+0.01       0.09    0.27
1ILC    1572    0.21            0.03+0.02       0.17    0.49
1N1K    3780    1.14            0.02+0.02       0.09    1.31
1LEJ    10395   2.07            0.03+0.02       0.12    2.35
1N14    16460   2.48            0.03+0.02       0.13    2.80
1JU0    21856   4.16            0.05+0.05       0.23    4.70


RNA
============

file    # atoms r-reduced       ases(surf+sing) triang  total
-----------------------------------------------------------------
1EKA    512     0.07            0.01+0.01       0.04    0.15
1J4Y    548     0.08            0.01+0.01       0.05    0.17
1MHK    570     0.07            0.01+0.01       0.06    0.17
1F1T    860     0.01            0.01+0.01       0.09    0.23
1DUH    988     0.13            0.02+0.01       0.12    0.31
1JJM    1089    0.13            0.03+0.01       0.13    0.34
1KXK    1499    0.20            0.03+0.03       0.14    0.44
1FIR    1706    0.22            0.03+0.03       0.17    0.51
1I9V    1708    0.21            0.04+0.03       0.17    0.51
1EHZ    1818    0.19            0.03+0.02       0.16    0.45
1EVV    1896    0.21            0.04+0.03       0.19    0.52
1DUQ    2313    0.29            0.04+0.04       0.24    0.68
1IDV    3230    0.67            0.01+0.01       0.05    0.77
1DDY    3376    0.35            0.05+0.04       0.25    0.77
1ET4    4945    0.49            0.06+0.05       0.41    1.13
1MFJ    6410    1.52            0.03+0.01       0.12    1.77
1JUR    6435    0.52            0.03+0.02       0.13    0.76
1L8V    6960    0.93            0.14+0.10       0.67    2.07
1HR2    7048    0.90            0.13+0.10       0.65    2.02
1FCW    8260    1.02            0.12+0.10       0.64    2.10
1K4A    9060    2.30            0.02+0.02       0.08    2.51
1FG0    10667   1.54            0.22+0.17       1.12    3.42
1MFY    15840   2.04            0.03+0.02       0.14    2.36
1E95    17430   2.59            0.04+0.02       0.16    2.96
1M5L    18525   2.56            0.08+0.07       0.36    3.31
1JU1    22290   3.38            0.05+0.05       0.22    3.91
1FYO    ERROR: findBHcloseAtomsInNode: result array too small
488D            SEG FAULT


PROTEINS
============

file    # atoms r-reduced       ases(surf+sing) triang  total
-----------------------------------------------------------------
4INS    1181    0.13            0.02+0.02       0.12    0.33
1FUA    1742    0.13            0.03+0.01       0.14    0.35
3SEB    2125    0.16            0.02+0.01       0.15    0.39
1AF7    2361    0.19            0.03+0.02       0.20    0.50
1MX2    2567    0.22            0.03+0.03       0.21    0.55
1ADS    2665    0.18            0.04+0.02       0.17    0.46
1A4R    3087    0.28            0.04+0.04       0.28    0.71
1A1N    3511    0.33            0.04+0.02       0.25    0.71
1AOP    4185    0.25            0.04+0.03       0.25    0.65
1FMT    4845    0.41            0.05+0.04       0.35    0.97
1E70    5196    0.29            0.04+0.04       0.26    0.71
4PGA    5387    0.33            0.05+0.04       0.32    0.85
*1A6V   5802    0.44            0.06+0.06       0.43    1.11
1A6Z    6099    0.63            0.09+0.08       0.54    1.51
1AD9    6760    0.61            0.09+0.07       0.55    1.51
1A2K    7029    0.59            0.08+0.07       0.52    1.43
1HMV    7399    0.76            0.12+0.09       0.69    1.88
1A2Z    7666    0.50            0.07+0.06       0.44    1.22
1BBR    8084    0.77            0.10+0.07       0.60    1.72
1AFC    8312    0.10            0.02+0.00       0.08    0.26
1AFV    9024    0.89            0.13+0.10       0.78    2.15
*1A0S   9939    0.78            0.11+0.09       0.62    1.82
1IAS    13106   1.32            0.20+0.15       1.13    3.17
1AIG    13535   0.63            0.10+0.07       0.52    1.53
1I84    16536   1.71            0.28+0.20       1.67    4.36
1jmu    23268   2.10            0.30+0.23       1.77    4.98
1A49    31872   2.82            0.43+0.31       2.39    6.77
1RCX    37934   2.37            0.35+0.26       1.92    5.66
*1M8Q   75497   8.05            1.15+0.90       6.81    19.30
*2btv           SEG FAULT

Results from SIMS (PZ):


Proteins:
---------

4INS.pdb Total atoms= 1181 Total time: 1.146 s
1FUA.pdb Total atoms= 1742 Total time: 1.742 s
3SEB.pdb Total atoms= 2125 Total time: 2.014 s
1AF7.pdb Total atoms= 2361 Total time: 2.316 s
1MX2.pdb Total atoms= 2567 Total time: 2.504 s
1ADS.pdb Total atoms= 2665 Total time: 2.510 s
1A4R.pdb Total atoms= 3087 Total time: 3.363 s
1A1N.pdb Total atoms= 3511 Total time: 3.314 s
1AOP.pdb Total atoms= 4185 Total time: 3.609 s
1FMT.pdb Total atoms= 4845 Total time: 4.883 s
1E70.pdb Total atoms= 5196 Total time: 4.277 s
4PGA.pdb Total atoms= 5387 Total time: 4.854 s
1A6V.pdb Total atoms= 5802 Total time: 5.684 s
1A6Z.pdb Total atoms= 6099 Total time: 6.908 s
1AD9.pdb Total atoms= 6760 Total time: 7.611 s
1A2K.pdb Total atoms= 7029 Total time: 7.385 s
1HMV.pdb Total atoms= 7399 Total time: 9.404 s
1A2Z.pdb Total atoms= 7666 Total time: 7.074 s
1BBR.pdb Total atoms= 8084 Total time: 8.221 s
1AFC.pdb Total atoms= 8312 Total time: 9.467 s
1AFV.pdb Total atoms= 9024 Total time: 10.512 s
1A0S.pdb Total atoms= 9939 Total time: 10.217 s
1IAS.pdb Total atoms= 13181 Total time: 18.146 s
1AIG.pdb Total atoms= 14142 Total time: 17.010 s
1I84.pdb Total atoms= 16580 Total time: 30.838 s
1jmu.pdb Total atoms= 23657 Total time: 43.811 s
1A49.pdb Total atoms= 34001 Total time: 61.234 s
1RCX.pdb Total atoms= 39368 Total time: 46.691 s
2btv.pdb Total atoms= 49061 Total time: 93.037 s
1M8Q.pdb Total atoms= 76872 Total time: > 15 min

----------------------------------------------------------------

RNAs:
------

1EKA.pdb Total atoms= 512 Total time: 0.664 s
1J4Y.pdb Total atoms= 548 Total time: 0.674 s
1MHK.pdb Total atoms= 570 Total time: 0.668 s
1F1T.pdb Total atoms= 860 Total time: 0.920 s
1DUH.pdb Total atoms= 988 Total time: 1.014 s
1JJM.pdb Total atoms= 1089 Total time: 1.148 s
488D.pdb Total atoms= 1412 Total time: 0.977 s
1KXK.pdb Total atoms= 1499 Total time: 1.477 s
1FIR.pdb Total atoms= 1706 Total time: 1.707 s
1I9V.pdb Total atoms= 1708 Total time: 1.672 s
1EHZ.pdb Total atoms= 1821 Total time: 1.693 s
1EVV.pdb Total atoms= 1896 Total time: 1.947 s
1DUQ.pdb Total atoms= 2313 Total time: 2.211 s
1DDY.pdb Total atoms= 3376 Total time: 3.314 s
1ET4.pdb Total atoms= 4945 Total time: 4.691 s
1L8V.pdb Total atoms= 6960 Total time: 7.689 s
1HR2.pdb Total atoms= 7048 Total time: 8.092 s
1FCW.pdb Total atoms= 8260 Total time: 10.912 s
1FG0.pdb Total atoms= 10704 Total time: 15.092 s

1IDV.pdb Total atoms= 3230 Total time: 101.441 s
1MFJ.pdb Total atoms= 6410 Total time: 221.402 s
1JUR.pdb Total atoms= 6435 Total time: 87.754 s
1K4A.pdb Total atoms= 9060   ***
1MFY.pdb Total atoms= 15840 ***
1E95.pdb Total atoms= 17430 ***
1M5L.pdb Total atoms= 18525 ***
1FYO.pdb Total atoms= 21675 ***
1JU1.pdb Total atoms= 22290 ***


DNAs:
------
1KCI.pdb Total atoms= 192 Total time: 0.311 s
1M6A.pdb Total atoms= 236 Total time: 0.393 s
1LJX.pdb Total atoms= 300 Total time: 0.404 s
1KSE.pdb Total atoms= 434 Total time: 0.629 s
1MTG.pdb Total atoms= 561 Total time: 0.732 s
1J8L.pdb Total atoms= 600 Total time: 0.641 s
1M6F.pdb Total atoms= 639 Total time: 0.658 s
1JPQ.pdb Total atoms= 642 Total time: 0.650 s
1MKL.pdb Total atoms= 671 Total time: 0.838 s
1G8N.pdb Total atoms= 675 Total time: 0.680 s
1N1O.pdb Total atoms= 682 Total time: 0.668 s
1K5F.pdb Total atoms= 713 Total time: 0.908 s
1N2W.pdb Total atoms= 766 Total time: 0.932 s
1MXK.pdb Total atoms= 815 Total time: 0.973 s
1G3V.pdb Total atoms= 829 Total time: 0.879 s
1K9H.pdb Total atoms= 886 Total time: 1.049 s
1IH4.pdb Total atoms= 1083 Total time: 1.021 s
1IH6.pdb Total atoms= 1202 Total time: 1.105 s
1JRN.pdb Total atoms= 1234 Total time: 1.156 s
1I3W.pdb Total atoms= 1318 Total time: 1.223 s
1ILC.pdb Total atoms= 1572 Total time: 1.527 s
1N1K.pdb Total atoms= 3780 Total time: 140.152 s ***
1LEJ.pdb Total atoms= 10395 Total time: 612.496 s ***
1N14.pdb Total atoms= 16460 ***
1JU0.pdb Total atoms= 21856 ***

------------------------------------------------------------------

The starred entries are PDBs that I had trouble with. Particularly, they 
made the program exit with a fatal error: "maxnbr too small". As before, I 
tried to increase the MAXNBR constant in the program, recompiled it, and 
tried again. The successful ones were computed in huge amounts of time 
(140.152s and 612.496s as compared to 1.0-2.0s). The majority, however, 
produced the same error as before. I could not increase the constant MAXNBR 
too much because the Fortran compiler could not handle it. It turned out 
that there are two-dimensional arrays with the dimensions of Max # of atoms 
and Max # of neighbors defined in the program. The MAXNBR constant is 
actually Max # of neighbors. The amount of space (MAXATOMS * MAXNBR) for 
particular large values of MAXNBR turns out to be over the capabilities of 
the compiler, and unfortunately I was unable to run those PDBs which 
required MAXATOMS = 20000 and MAXBNR > 2000 (overall > 38 MB) or similar 
values for the constants.

I also studied the source code extensively. It turns out that if the default 
value of MAXNBR (= 150) is suitable for all kinds of PDBs then the author's 
claim that the running time is O(#atoms) is quite reasonable. This is not 
the case, however. Some of the PDBs required more than 3000 neighbors per 
atom. Since, the program performs specific calculations for each atom and 
over the list of its neighbors (identification of pairs and triples of 
contact with the probe sphere and surface elements calculations), I would 
conjecture that O(#atoms * max#neighbors) is a more realistic upper bound on 
the running time. (Max # of neighbors is not really a constant).

Results from VEGA (RT):

The input to Vega consisted of these 16 proteins, 6 RNAs, and 7 DNAs (the package had been rebuilt after removing I/O routines). The program computed the Van der Waals surface using a probe sphere of radius 1.4 Angstroms.

.PDB       Atoms    Time to compute (Concordant values of three readings).

Proteins:

4INS.pdb    829       9.807
1FUA.pdb   1608      20.437
3SEB.pdb   1948      29.141
1AF7.pdb   2224      38.924
1MX2.pdb   2375      46.451
1ADS.pdb   2513      49.211
1A4R.pdb   2976      88.162
1A1N.pdb   3153     135.309
1AOP.pdb   3622     142.211
1FMT.pdb   4694     299.611
1E70.pdb   4086     413.85
4PGA.pdb   4970     561.16
1A6V.pdb   5236     660.99
1A6Z.pdb   6076     928.739
1AD9.pdb   6740    1670.53
1A2K.pdb   6789    1602.6 

** Note: The time taken to compute surface of 1A2K shows a slight anomaly. This is because I was unable to complete three runs of Vega on it. The number represents the timing of one run only.

RNAs:

1EKA.pdb    512       1.768
1J4Y.pdb    548       2.041
1MHK.pdb    548       2.279
1F1T.pdb    816       5.127
1JJM.pdb    982       8.42
1MFJ.pdb   6410      77.186


DNAs:

1K6I.pdb    192       0.264
1M6A.pdb    236       0.316
1LJX.pdb    300       0.604
1KSE.pdb    374       1.197
1G8N.pdb    484       3.023
1G3V.pdb    769       4.881
1K9H.pdb    886       5.322


Results from DMS (DO-K):

PROTEINS

Molecule (.pdb)	No. of Atoms	Time	No. of Unknown Atoms
4INS	1181	0.443	
1FUA	1742	0.607	
1AF7	2361	1.014	
1MX2	2567	1.068	
1ADS	2665	0.988	26
1A4R	3087	1.525	1
1AIN	3511	1.496	
1AOP	4185	1.658	
1FTM	4845	2.34	
1E70	5196	2.017	
4PGA	5387	2.439	
1A6V	5802	3.115	
1A6Z	6099	4.438	
1AD9	6760	4.971	
1A2K	7029	4.531	3
1A2Z	7666	3.836	
1BBR	8084	4.814	
1AFC	8312	6.389	
1AFV	9024	7.912	
1A0S	9939	7.016	
1IAS	13181	18.012	
1AIG	14142	13.871	8
1I84	16580	31.162	
1A03	57920		
1M8Q	76872	461.502	


RNAs

Molecules (.pdb)	No. of Atoms	Time	No. of Unknown Atoms
1EKA	512	0.182	
1J4Y	548	0.182	
1MHK	570	0.205	
1F1T	860	0.311	
1DUH	988	0.391	2
1JJM	1089	0.426	12
488D	1412	1.455	
1KXK	1499	0.648	2
1FIR	1706	0.736	1
1I9V	1708	0.715	3
1EHZ	1821	0.76	9
1EVV	1896	0.805	10
1DUQ	2313	1.109	
1IDV	3230		
1DDY	3376	1.98	
1ET4	4945	2.646	
1MFJ	6410	10.76++	
1JUR	6435		
1L8V	6960	5.809	10
1HR2	7048	5.756	42
1FCW	8260	6.834	
1K4A	9060		
1FG0	10704	14.488	
1MFY	15840		
1E95	17430		
1M5L	18525		
1FYO	21675		
1JU1	22290		


DNAs

Molecule (.pdb)	No. of Atoms	Time	No. of Unknown Atoms
1KCI	192	0.057	
1M6A	236	0.068	
1LJX	300	0.072	
1KSE	434	0.16	
1MTG	516	0.195	
1J8L	600	0.195	1
1M6F	639	0.223	1
1JPQ	642	0.205	
1MKL	671	0.254	
1G8N	675	0.199	1
1N10	682	0.211	1
1K5F	713	0.285	
1N2W	766	0.324	
1MXK	815	0.379	
1G3V	829	0.311	
1K9H	886	0.355	
1IH4	1083	0.424	
1IH6	1202	0.459	
1JRN	1234	0.479	
1I3W	1318	0.48	
1ILC	1572	0.666	
1NIK	3780		
1LEJ	10395		
1N14	16460		
1JU0	21856		




Results from MASKER (PJ):

Protein|Num of|	Time
Name   |Atoms |	Taken
=======================

4INS	1181	2.49
1FUA	1742	3.74
1AF7	2361	5.84
1MX2	2567	6.62
1ADS	2665	6.26
1A4R	3087	9.94
1A1N	3511	9.17
1AOP	4185	9.28
1FMT	4845	17.70
1E70	5196	10.83
4PGA	5387	16.18
1A6V	5802	21.44
1A6Z	6099	30.65
1AD9	6760	35.37
1A2K	7029	32.01
1A2Z	7666	25.98
1BBR	8084	34.83
1AFC	8312	48.68
1AFV	9024	58.06
1A0S	9939	60.50
1IAS	13181	117.46
1AIG	14142	113.28
1I84	16580	192.29
1A03	57920	410.13
1M8Q	76872	3594.40

DNA   | Num of| Time
Name  | Atoms | Taken
=======================

1KCI	192	0.55
1M6A	236	0.55
1LJX	300	0.64
1KSE	434	0.95
1MTG	561	1.18
1J8L	600	1.26
1M6F	639	//Error: no result
1JPQ	642	1.16
1MKL	671	1.62
1G8N	675	1.35
1N1O	682	1.35
1K5F	713	1.55
1N2W	766	1.72
1MXK	815	1.82
1G3V	829	2.02
1K9H	886	2.06
1IH4	1083	2.43
1IH6	1202	2.58
1JRN	1234	2.33
1I3W	1318	2.66
1ILC	1572	4.59
1N1K	3780	7.83
1LEJ	10395	33.13
1N14	16460	72.60
1JU0	21856	91.96

RNA   | Num of| Time
Name  | Atoms | Taken
=======================

1EKA	512	1.20
1J4Y	548	1.22
1MHK	570	1.27
1F1T	860	2.00
1DUH	988	2.46
1JJM	1089	2.67
488D	1412	//Error: no result
1KXK	1499	3.97
1FIR	1706	4.58
1I9V	1708	4.74
1EHZ	1821	4.39
1EVV	1896	4.87
1DUQ	2313	7.23
1IDV	3230	7.46
1DDY	3376	12.03
1ET4	4945	16.25
1MFJ	6410	16.34
1JUR	6435	13.88
1L8V	6960	35.86
1HR2	7048	35.78
1FCW	8260	44.11
1K4A	9060	35.42
1FG0	10704	83.31
1MFY	15840	57.26
1E95	17430	65.60
1M5L	18525	79.51
1FYO	21675	125.02
1JU1	22290	94.66

ALL -- intercomparable results. Each person -- fill in your data -- do the runs, feed the results to a common graph

Discussion

ASC/NSC discussion (VS)

It is interesting to note the different complexities of the algorithms in the ASC stages. The first one ( determin the neighboring atoms list) takes a small fraction of the time for the whole computation. With increasing the number of atoms it does not change its behaviour. At the same time stage three takes almost negligable time for small pdbs while when dealing with the large files it consumes almost half of the whole time. Another interesting result is the very quick perfomance of NSC. As of now that is the fastest of the algorithms examined. It outperforms Sanner's MSMS by a factor of 2.

We were unable to run the ASC/NSC software on DNA/RNA pdbs. Each person leads the discussion on their own section. Everyone else comments


MSMS Discussion (PI)

On some of the problematic molecules with MSMS my suspision is that there is a problem with the .pdb file, which could have possibly be fixed with an .mmCIF file converted to .pdb. I do not think that there is a problem with MSMS, as there is nothing specifically problematic for the failing molecules in terms of size, or coordinates (as I ran MSMS with all extreme case and it seem to handle them nicely).

MSMS is interesting as it clearly shows the difference in the type of molecules. The DNA/RNA set shows approximately the same total time in terms of number of atoms in the molecule. With both of them, Singularities, Surface and Triangulation times are generally small, and for big molecules the dominant time is the Reduced Surface Calculation Time, which in the big molecules comprises close to 87% of the total time. With proteins (folded, dense molecules) the Reduced Surface Computations take on general less time than RNA/DNA (long strung out molecules) for same number of atoms. The conspicuous difference is in terms of ASES computations and the Triangulations, which with proteins take about 60%-70% of the total time (therefore roughly doubling the total time as compared to RNA/DNA).

DMS Discussion (DO-K)

Discussion: As the table of results show for the molecules, DMS generally does a pretty good job on small molecules as far as timing is concerned. However, as the number of atoms increases, the time complexity worsens. A key feature of the DMS algorithm that makes for this trend is that, during the computation of the molecular surface, the algorithm compares and sorts the atoms by coordinates and thus looks just at slices instead of all n atoms when looking for neighbors. Hence for most molecules, the algorithm performs better than O(n^2). Moreover, the major time complexity involved in the surface computation occurs when computing overlapping atoms. DMS first computes these, and then looks at other atoms not so positioned, and hence easier to calculate their surfaces. Therefore, as Huang claims, the surface computation algorithms strictly speaking, .should not be n^2 since they look mainly at the overlapping atoms, of which there should be no more than 20 for .normal. molecules.. By normal molecules, I assume he meant general proteins of reasonable size and structure. Huang however agrees that the algorithm is generally O(n^2).

As can be seen from the table, there are a few anomalies as far as the trend in the number of atoms corresponding to the time taken for surface computation is concerned. For instance, for proteins, 1ADS with 2665 atoms took 0.998 seconds to complete surface calculation, whereas 1MX2 with 2567 atoms took 1.068 seconds. The reason for this, from the above explanation of the algorithm is somewhat easy to comprehend. I would assume that 1MX2 has more overlapping atoms than 1ADS and hence takes a longer time to compute. Notice also that 1ADS has 26 unknown atoms. That is, atoms whose radii have not been accounted for in the radii list. However, the default radii used for unknown atoms, 1.90 angstroms, is much larger than most of the common atoms in most molecules. Hence if the radii of the unknown atoms played a crucial role, then one would think that 1ADS should have taken the longer time to complete surface computation, since it has more atoms. Thus, this implies that the orientation of the atoms in the molecules, and hence the structure of the molecule, does have an effect on the time complexity of whichever algorithm is doing the computation. Although some algorithms may be more efficient than others in this calculation, they still encounter slight overheads while accounting for the structure of the molecules.

Moreover, a quick glance at the DNAs and RNAs show a significant difference in the time as well as some anomalies in the trend, both of which could also be argued as due to the structure of the molecule. Also, during a discussion, Vince mentioned that most of these surface calculation algorithms were probably written primarily for .normal. molecules . proteins . to be precise, and hence were not well suited for calculation of complex structured RNA and DNA surfaces. And the truth about this statement further strengthens the assertion that structure does count.

MASKER discussion (PJ)

Masker was run on a set of 25 proteins, 25 DNAs, and 28 RNAs. Unnecessary output was suppressed before performing the test runs to get better approximation of the computation time. The three tables below represent the results from the test runs on the proteins, DNAs and the RNAs. The first column in each table is the name of the molecule, the second column is the number of atoms in that molecule and the third column is the time that Masker took to compute the molecular surface for that molecule. It is apparent that the results have some outliers or anomalous times; it is worth pointing that surface computation done on most of such molecules using other programs, like ASC, give anomalous times also. We can hypothesize that the reason behind such a tendency is the strung out structure of some molecules.

Also noticeable is the fact that Masker failed to complete surface computation on two of the chosen molecules -- 1M6F and 488D.

***Whether other surface computation programs do succeed in computing molecular surfaces for these two molecules would be an interesting observation.*** (PJ)

The factor that determines the speed of the computation done with MASKER is the masking operations. Bystroff believes that these binary operations will in the near future be encoded in special hardware. He states that additional speed-up would also be possible by initially computing the "reduced surface" (Sanner et al., 1996), and by using a neighbor-tree data structure for the atoms. However, Bystroff's conjecture would not transform into a claim until necessary hardware and software modifications are made.

Another question that Masker raises is if the brute force approach of the algorithm makes it more likely to be immune to algorithmic bugs that are associated with certain surface singularities. There is no definite answer to this question right now because Bystroff has discovered that Masker is not completely immune to singularities -- sometimes a reentrant surface disappears too early [Bystroff 2002]. But these errors are small, about the size of the numerical precision and they would get smaller if larger mask size (8192 bits) are used.

Vega Discussion (RT)

"Vega is leaving the planet!", Dr. Bernstein on the atoms-time curve.

Vega's algorithm is one of the most simple and calculation-intensive of all. It uses a brute-force method to compute the molecular surface which is very similar to the ones used by RasTop. As a result the time taken by Vega increases proportionally with the (number of atoms)2. It is not known if Vega has problems dealing with singularities, because the atoms-time curve does not show any strange behavior with the collection used. The curve shows a steady growth, and with I/O routies intact, running on a 800MhZ Pentium III, it showed a neat curve of O(n2). An observation about Vega is that it not only computes, but also displays and manipulates 3D molecular surfaces in its Windows version. Thus molecular surface calculation is but one of the various tasks that Vega can perform.

Conclusions

Suggestions appreciated by email Conclusion re MASKER (PJ):

It should be noted that MSMS and also ASC are much faster than the current version of Masker. Masker has greater computational overhead, since it must read a library of masks and mask data before beginning. The memory requirements of MASKER are also greater, again because of the library of masks.

From the computations conducted with Masker, it could be inferred that the runtime of Masker is a polynomial of degree two.

References

[Bondi 1964] Bondi, A. (1964) "Van der Waals volumes and radii", J. Phys. Chem. 68: 441-451.

[Bystroff 2001] Bystroff, C. (2001), "Masker:Improved Solvent Excluded Molecular Surface Area Estimations using Boolean Masks",Protein Engineering (in press). Renesseler Institute of Polytechnic, Troy, NY.

[Bystroff 2002] Bystroff, C. (2002) private communication.

[Connolly 1983] Connolly, M. L. (1983) "MS: Molecular Surface Program", QCPE Program 429, Quantum Chemistry Program Exchange, Univ. of Indiana, Bloomington, IN http://www.netsci.org/Science/Compchem/feature14.html

[Connolly 1996] Connolly, M. L. (1996) "Molecular Surfaces: A Review", http://www.netsci.org/Science/Compchem/feature14.html

[Corey, Pauling 1953] Corey, R. B. and L. Pauling (1953) "Molecular models of amino acids, peptides and proteins." Rev. Sci. Instr. 24: 621-627

[Dickerson, Geis 1969] Dickerson, R. E., Geis, I (1969) "The Structure and Action of Proteins", W. A. Benjamin, Menlo Park, CA, 120 pp.

[Eisenhaber 1995] Eisenhaber, F., Philip Lijnzaad, P., Argos, P., Sander, C. & Scharf, M. (1995) "The Double Cube Lattice Method: Efficient Approaches to Numerical Integration of Surface Area and Volume and to Dot Surface contouring of Molecular Assemblies", Journal of Computational Chemistry, volume 16, N3, pp.273-284. [Ferrin, Huang, Jarvis, Langridge 1988] Ferrin, T. E., Huang,C. C., Jarvis, L. E., Langridge, R. (1988) "The MIDAS display system", J. Mol. Graphics 6, 13-27.

[Katz, Levinthal 1972] Katz, L., C. Levinthal, C. (1072) "Interactive Computer Graphics and the Representation of Complex Biological Structures", Annual Reviews in Biophysics and Bioengineering, 1: 465-504.

[Lee, Richards 1971] Lee, B., Richards, F. M. (1971) "The Interpretation of Protein Structures: Estimation of Static Accessibility", J. Mol. Biol. 55: 379-400.

[LeGrand, Merz 1993] LeGrand S. M. & Merz, K. M. J. (1993) "Rapid Approximation to Molecular Surface Area via the Use of Boolean Logic and Look-up Tables", Journal of Computational Chemistry 14, Pages 349-52.

[Pedretti, Villa, Vistoli 2002] Pedretti, L. Villa, G. Vistoli (2002) "Vega: A Versatile Program to Convert, Handle and Visualize Molecular Structure on windows-based PCs" J. Mol. Graph., Vol. 21, 47-49.

[Sanner 1995] Sanner, M. F.; Olson, A. J.; Jean-Claude Spehner (1995) "Fast and Robust Computation of Molecular Surfaces", Proc. 11th ACM Symp. Comp. Geom, C6-C7.

[Sanner 1996] Sanner, M., Olson, A., Spehner, J.-C. (1996) "Reduced Surface: An Efficient Way to Compute Molecular Surfaces", Biopolymers 38: 305-320.

[Sanner 2002] Sanner M., private communication.

[Vorobjev 1997] Vorobjev, Y. N. (1997) "SIMS: Computation of Smooth Invariant Molecular Surface", Biophys. J. 73:722-732