Deadline for all input: 28 January 2003.
Deadline for us to get the final paper out the door: 7 February 2003
-- HJB
Note added 10 Dec: Added summary drawings to results. -- HJB
Note added 6 Dec: Material from Parag on Masker, Daniel on DMS and Petko in MSMS. -- HJB
Note added 1 Dec: Material from Ventsi on ASC added. Intercomparable results are needed ASAP. -- HJB
Gentlemen: Here it is. I've done a lot of editing. It is now tomorrow, so I will stop and let you take your turn. I hope each of you will read this, think about it and write good material. Improve what I have done and work towards a uniform style. Check my spelling and references. Provide good, solid material for the sections we still need. I hope you will email me a steady stream of material over the next two weeks. -- HJB
Understanding of so-called molecular surfaces is important in structural biology and in drug design. In 1971, Lee and Richards defined a molecular surface as the surface resulting from rolling a probe sphere on the van der Waals surface of a molecule. We compare recent algorithms for computing molecular surfaces. This work is ongoing, but results to date show that there are significant differences in the performance of some recently proposed algorithms, with some of them achieving performance suitable for use in interactive graphics programs (at least for smaller molecules), and some being suitable only for off-line computations of surfaces. A surprising result of our investigations has been the difficulty in computing surfaces of molecules composed of nucleic acids as opposed to amino acids. We conjecture that the difference is due to the lower degree of folding of the former. An exception is the VEGA program by Pedretti, which does not have an anomaly in the timing for nucleic acids, but which is, unfortunately one of the slower programs with what appears to be n-squared growth of time. The faster programs include Sanner's MSMS, Vorobjev's SIMS and Eisenhaber's NSC and ASC.
When using molecular graphics to dock two molecules, as in interactive drug design, it helps to assign a surface to one or both molecules. While not strictly physical, such a surface conveniently limits the extent to which molecules may interpenetrate, helping to estimate key-in-lock fits. (See [Dickerson, Geis 1969, p68]). Early surface renderings were done with space-filling Corey-Pauling-Koltun (CPK) [Corey, Pauling 1953] models, creating an implicit surface from the visible exterior of intersecting van der Waals spheres. Such a surface is useful, but has distracting features. Smoother surface representation of a surface allows better understanding of the fit between molecules. The commonly accepted defintion of a such a smoothed molecular surface was provided by Lee and Richards [Lee, Richards 1971]. Lee and Richards defined a molecular surface as the surface resulting from rolling a probe sphere on the van der Waals spheres of a CPK model of the molecule. The result is a surface which agrees with the CPK model where the van der Waals spheres stick out, but replaces the inward pointing sharp crevices of a CPK model with smooth torii and spherical triangles. For a review of the concepts and algorithmic developments related to molecular surfaces from 1971 until 1996 see M. Connolly's review [Connolly 1996]. Calculation of molecular surfaces can be computationally intensive. As interest focuses on larger and larger molecules, questions of efficiency of various approaches to such calculations become important. In this paper we consider the performance of several recent algorithms for computing molecular surfaces.
Some algorithms produce "better" results than others. ....This section to be filled in by P. Zhivkov and others
Our concern is with the use of molecular surface algorithms for rendering. It is natural to maintain a database of atomic positions in a general purpose rendering system. In computing molecular surfaces this then favors a two stage approach, of first identifying atoms near enough to one another to help to shape common portions of the surface, and then to render the surface components involved. Deciding which of n atoms are near one another by naive pairwise comparison leads to n-squared growth in time. In the mid 1960's Levinthal introduced what would now be called a divide-and-conquer approach to the calculation. At that time he called it "cubing", and that name is still used in molecular graphics. For an accessible reference, see [Katz, Levinthal 1972].
An alternative approach to deciding on which atoms are neighbors to one another would be to calculate the Voronoi diagram ***[ref ???]*** or Delaunay diagram ***[ref ???]*** of the molecule. These computational geometry techniques are functionally equivalent to cubing-based neighbor calculations. In some cases they are explicitly based on divide-and-conquer style cubing. For this portion of the surface calculation, explicit voronoi diagram based techniques do not seem to have a computational advanatage.
The second phase of the calculation is the actual rendering of the surface. Another, related calculation is the computation of the surface area or the volume contained within the surface. Voronoi diagram techniques are useful in efficiently deriving the surface area and the volume, but for rendering, in the absense of specialized hardware, the second phase appears to be dominated by the placement of pixels.
Maximal Speed Molecular Surface (MSMS) [Sanner 1995, 1996] is based on a recursive divide and conquer algorithm. MSMS constructs a surface called the "r-reduced surface of a set of atoms s" (Rr(S)) to fascilitate molecular surface computations. Rr(S)' outer component is computed in O[n*log(n)] and its corresponding r-Excluded surface component (Er(S)) in O[n] for n "calibrated spheres". "A set of spheres is calibrated if the ratio of the probe radius to the spheres representing atoms is bound by a constant and the distance between spheres representing atoms has a lower and an upper bound." [Sanner 2002] In building the Rr(S), because the surface is calibrated, the number of current edge's neighbors to consider is O[log(n)]. This is achieved using a hierarchical subdivision of space obtained by recursively dividing the bounding box of the set of atom's centers into two subsets of comparable sizes. Since Rr(S) has O[n], edges Rr(S) can be computed in O[n*log(n)].
Er(S) is built from Rr(S) in complexity is a linear function to the number of faces and edges of Rr(S).
Smooth Invariant Molecular Surfaces (SIMS) [Vorobjev 1997] is largely based on Connolly's method [Connolly 1983] with some modifications. It starts with a set of atomic coordinates, atomic radii, and a probe sphere radius. Neighbor identification for a specific atom I inside the sphere of extended radius (the radius of atom I + the radius of the probe sphere) is performed by cubing. Next, the computation of the MS proceeds as in Connoly's method: all contact triplets that can be touched simultaneously by the probe sphere are identified and respective probe sphere positions are calculated (vertices of tangency, etc) employing Connoly's formulae. These form spherical triangles from reentrant concave (RC) faces. All solvent probe positions for pairs of atoms are, next, calculated (probe sphere starting and stopping vertices, radii of rotation, etc), yielding saddle faces which are joined to the RC faces at common boundary arcs. And last, the contact portions, which result in a cutoff from the van derWaals spheres by the atom-torus contact planes, are calculated. The areas of all surface element are calculated analytically with Connolly's formulae.
Singularities are smoothed by using a second probe sphere of smaller radius ( <= 5 Å) rolled over the outward surface of the intersecting solvent probe spheres. The singularity is replaced by this new surface. Very homogeneous dot surfaces are produced by a constant arc length method. Invariance under rotation is achieved by using local coordinate systems. The claimed time complexity is claimed to be O[n].
The Analytic Surface Computation (ASC) package from Eisenhaber also includes a Numercial Surface Calculation (NSC). NSC is based on the Double Cubing Lattice Method (DCLM) [Eisenhaber 1995].
DCLM involves cubing techniques applied twice. The first time is when determining the neighboring atoms in the molecule and the second one is when computing overlapping surfaces.
Step number one involves cubing with a fixed side cube equal to 2rmax, where r is the radius of the probe sphere + the radius of the atom under consideration. rmax is simply the biggest r of all atoms. DCLM starts with determining the minimum and the maximum of the x-, y- and z- coordinates of all atoms in the molecule under consideration. The smallest rectangular box which is able to fit the molecular structure and has dimensions equal to a multiple of 2rmax is calculated. The box is then subdivided into cubes of size 2rmax and each atom is assigned a cube number (if an atom lies in box (ix, iy, iz) then the box number is equal to ix + (iy*bx) + (iz*bx*by). All atoms are sorted according to box number. The result of this operations is that all atoms belong to a cube and all possible occluding neighbors are either is the same cube or the 26 adjacent cubes. This concludes step one.
Step two is checking for occlusion between atoms. The surface of each atom is represented by evenly distributed dots (the so called dot-surface). Each of those dots has to be checked for occlusion. To speed up the process a second cubing lattice in a cube that is inscribed by the unit sphere is used.To compute the surface of an atom i, an occlusion check on all dots k that belong to i is done. To minimize search time for a neighbor occluding the dot the algorithm first tries the neighbor that occluded the previous dot.In order this heuristic to work the dots should be ordered by spatial proximity.
Step three is calculating the surface area.The Area A is obtained by
where macc(i) is the number of dots on atom I not occluded by neighboring atoms.
ASC [ref ???] involves three steps:
A Boolean masks approach for computing the SAS was first proposed by LeGrand and Merz. In their approach, a set of points were evenly spaced on the surface of a sphere centered at the origin and having radius 1.0, and each was represented by a single bit in a multi-byte word. A set of 'masks' was created by placing a probe sphere of unit radius at all positions around the origin, and all distances from zero to two. Points within the probe radius had their bits set to zero, while the remaining bits were set to one. Therefore, one multi-byte word (mask) corresponded to one probe sphere location. A binary AND operation over any number of masks resulted in another mask whose nonzero bits represent the SAS for one atom. The SAS is estimated quickly and accurately by summing the 1-bits and scaling. Bystroff's improved MASKER takes the method of LeGrand and Merz one step further by using masks to calculate SES. The problem is broken down to its three component parts. The contact surface is a scalar multiple of SAS. The toroidal surface can be estimated based on the length of exposed circular edge at the intersection of two atom surfaces. The reentrant surfaces are calculated by placing a mask at each position where a probe is in contact with exactly three atoms. Intersections and self-burial of toroidal and reentrant surfaces are accounted for by additional masking operations, as are the pairwise partial derivatives.
According to Bystroff, the runtime [of Masker] is a polynomial with an exponent of approximately 1.42 and the process is dominated by the masking operations which consumes about 73% of the CPU time.
Distributed Molecular Surface (DMS) by Conrad Huang is part of the MIDAS system [Ferrin, Huang, Jarvis, Langridge 1988]. DMS can be run on one machine or on multiple servers simultaneously. The algorithm is structured to facilitate parallel computation. The atom list is sorted first by the x coordinate, then by the y-coordinate and finally by the z coordinates. Neighbors are then computed and then surface components. The initial sort allows for computation in x-coordinate slabs. This provides a division of labor in one dimension, but with n-squared growth in the other two dimensions. Lists of surface points are generated.
The sequence of events is:
VEGA [Pedretti, Villa, Vistoli 2002] is a molecular graphics program that includes the ability to compute surfaces. The calculation is based on a creation of dot surfaces and removal of dots that fall in the interior of the molecule.
" A sphere of 1 Å radius is generated at the (0, 0, 0) coordinates using the dot density specified by the user. For each atom (Ai), this sphere is scaled multiplying the dot coordinates by the correspondent Van Der Waals radius (Ri) and translated to the atom coordinates (Xi, Yi, Zi). The dots overlapped to near atoms are discarded. This procedure is performed measuring the distance between the surface dots of the "Ai" atom and all atomic coordinates for all "n" atoms: if it's less than the "An" Van der Waals radius, the dot is overlapped. This algorithm takes O(n2) computation time."
The objective of the experiment is to determine the performance of the algorithms described above. The independent variable is number of atoms and the dependent variable is time consumed for area computation. The input data is divided into two different categories according to shape of the molecule : the first one is composed of proteins ( folded, dense molecules) and the second one is DNA/RNA (long strung out molecules). All input data has been downloaded from the Protein Data Bank (www.rcsb.org)[Reference] in pdb format.
The experiment was conducted on a Dell Dimension 2300 workstation. The machine runs Red Hat Linux 8.0. It has an Intel 845GL chipset Pentium 4 2GHz processor with 512 MB SDRAM. Only the vital processes of the Operating System were left running while performing the computations and the user time consumed by the different algorithms was measured. Standard atom radii were used as defined in [Bondi 1964] with a probe radius of 1.4Å
The experimental results are summarized in the following three figures.Â
Each figure is drawn with the y-axis being a log-plot time in seconds.Â
The x-axis is the number of atoms used. The first plot shows times to
compute surfaces of Proteins, the second plot shows times to compute
surfaces of DNAs and the third plot shows times to compute surfaces of
RNAs. For each program a curve of the form t = a
nb has been fit. While this is not the "natural" fit for all
the algorithms, it is a reasonable fit for many of them, and allows for
a performance comparison among them.
Results from ASC:
The input data for proteins consists of 30 pdbs. The table below represent the results for ASC and NSC together. The first column is the pdb ID as listed in the Protein Data Bank. The second column is number of atoms. Where there are two entries in this column the first is the number of atoms as given in the pdb format and the second one is the actual atoms selected for computation. For the small pdbs the difference is insignificant but for the large pdbs it is noted. The ASC_1 column represent the time for the first step in ASC. The ASC_2 column represents time taken for step 1 + step 2. The third ASC column has the total time ASC takes for its three main steps. The NSC column has the total time taken for computing the surface of a molecule using the NSC algorithm.
protein number of atoms ASC_1 ASC_2 ASC_3 NSC 4INS 1,158 00.01 0.08 0.09 0.04 1FUA 1,727 0.03 0.16 0.17 0.07 3SEB 2,125 0.03 0.21 0.23 0.10 1AF7 2,361 0.03 0.23 0.25 0.12 1MX2 2,567 0.03 0.24 0.26 0.13 1ADS 2,665 0.04 0.27 0.29 0.12 1A4R 3,087 0.05 0.32 0.35 0.16 1A1N 3,511 0.05 0.33 0.38 0.15 1AOP 4,153 0.06 0.40 0.44 0.18 1FMT 4,845 0.07 0.51 0.55 0.25 1E7O 5,122 0.08 0.49 0.56 0.20 4PGA 5,387 0.09 0.56 0.65 0.24 1A6V 5,792 0.09 0.62 0.68 0.27 1A6Z 6,099 0.11 0.66 0.75 0.31 1AD9 6,760 0.11 0.73 0.82 0.34 1A2K 7,029 0.12 0.74 0.87 0.36 1HMV 7,399 0.12 0.79 1 0.38 1A2Z 7,652 0.12 0.81 0.96 0.37 1BBR 8,084 0.13 0.82 0.91 0.38 1AFC 8,312 0.12 0.84 0.90 0.39 1AFV 9,024 0.15 0.99 1.17 0.5 1A0S 9,939 0.18 1.12 1.52 0.47 1IAS 13,187 0.22 1.52 1.86 0.62 1AIG 14,142 13,014 0.22 1.47 1.92 0.69 1I84 16,580 16,491 0.25 1.74 2.74 0.79 1jmu 23,657 23,268 0.43 2.71 5.69 1.35 1A49 34,001 31,824 0.58 3.67 5.83 1.58 1RCX 39,368 37,456 0.84 4.83 8.85 1.95 2btv 49,061 49,061 0.93 5.58 12.02 2.50 1M8Q 76,872 74,892 1.35 8.33 14.66 3.80
Results from MSMS (PI):
/*********************************************************************** * file -> pdb file * * # atoms -> number of atoms, used in MS calculations * * r-reduced -> Reduced Surface Calculations Time * * ases -> Analytical Solvent Excluded Surface * * surf -> Surface Time * * sing -> Singularities Time * * triang -> Triangulation Time * * total -> Total Time * ************************************************************************/ DNA ============ file # atoms r-reduced ases(surf+sing) triang total ----------------------------------------------------------------- 1KCI 192 0.02 0.01+0.00 0.02 0.06 1M6A 236 0.03 0.00+0.01 0.02 0.07 1LJX 300 0.03 0.01+0.00 0.03 0.08 1KSE 434 0.07 0.01+0.01 0.05 0.14 1MTG 561 0.08 0.01+0.01 0.05 0.17 1J8L 600 0.08 0.01+0.01 0.06 0.17 1M6F 639 0.07 0.01+0.00 0.06 0.16 1JPQ 642 0.05 0.01+0.01 0.05 0.13 1MKL 671 0.11 0.01+0.01 0.08 0.23 1G8N 675 0.07 0.01+0.01 0.07 0.18 1N1O 682 0.07 0.01+0.01 0.07 0.17 1K5F 713 0.11 0.02+0.01 0.07 0.23 1N2W 766 0.13 0.01+0.02 0.08 0.26 1MXK 815 0.12 0.02+0.01 0.08 0.25 1G3V 829 0.11 0.01+0.02 0.10 0.25 1K9H 886 0.15 0.02+0.02 0.10 0.31 1IH4 1083 0.12 0.01+0.02 0.09 0.26 1IH6 1202 0.12 0.02+0.02 0.10 0.29 1JRN 1234 0.11 0.01+0.01 0.10 0.26 1I3W 1318 0.12 0.01+0.01 0.09 0.27 1ILC 1572 0.21 0.03+0.02 0.17 0.49 1N1K 3780 1.14 0.02+0.02 0.09 1.31 1LEJ 10395 2.07 0.03+0.02 0.12 2.35 1N14 16460 2.48 0.03+0.02 0.13 2.80 1JU0 21856 4.16 0.05+0.05 0.23 4.70 RNA ============ file # atoms r-reduced ases(surf+sing) triang total ----------------------------------------------------------------- 1EKA 512 0.07 0.01+0.01 0.04 0.15 1J4Y 548 0.08 0.01+0.01 0.05 0.17 1MHK 570 0.07 0.01+0.01 0.06 0.17 1F1T 860 0.01 0.01+0.01 0.09 0.23 1DUH 988 0.13 0.02+0.01 0.12 0.31 1JJM 1089 0.13 0.03+0.01 0.13 0.34 1KXK 1499 0.20 0.03+0.03 0.14 0.44 1FIR 1706 0.22 0.03+0.03 0.17 0.51 1I9V 1708 0.21 0.04+0.03 0.17 0.51 1EHZ 1818 0.19 0.03+0.02 0.16 0.45 1EVV 1896 0.21 0.04+0.03 0.19 0.52 1DUQ 2313 0.29 0.04+0.04 0.24 0.68 1IDV 3230 0.67 0.01+0.01 0.05 0.77 1DDY 3376 0.35 0.05+0.04 0.25 0.77 1ET4 4945 0.49 0.06+0.05 0.41 1.13 1MFJ 6410 1.52 0.03+0.01 0.12 1.77 1JUR 6435 0.52 0.03+0.02 0.13 0.76 1L8V 6960 0.93 0.14+0.10 0.67 2.07 1HR2 7048 0.90 0.13+0.10 0.65 2.02 1FCW 8260 1.02 0.12+0.10 0.64 2.10 1K4A 9060 2.30 0.02+0.02 0.08 2.51 1FG0 10667 1.54 0.22+0.17 1.12 3.42 1MFY 15840 2.04 0.03+0.02 0.14 2.36 1E95 17430 2.59 0.04+0.02 0.16 2.96 1M5L 18525 2.56 0.08+0.07 0.36 3.31 1JU1 22290 3.38 0.05+0.05 0.22 3.91 1FYO ERROR: findBHcloseAtomsInNode: result array too small 488D SEG FAULT PROTEINS ============ file # atoms r-reduced ases(surf+sing) triang total ----------------------------------------------------------------- 4INS 1181 0.13 0.02+0.02 0.12 0.33 1FUA 1742 0.13 0.03+0.01 0.14 0.35 3SEB 2125 0.16 0.02+0.01 0.15 0.39 1AF7 2361 0.19 0.03+0.02 0.20 0.50 1MX2 2567 0.22 0.03+0.03 0.21 0.55 1ADS 2665 0.18 0.04+0.02 0.17 0.46 1A4R 3087 0.28 0.04+0.04 0.28 0.71 1A1N 3511 0.33 0.04+0.02 0.25 0.71 1AOP 4185 0.25 0.04+0.03 0.25 0.65 1FMT 4845 0.41 0.05+0.04 0.35 0.97 1E70 5196 0.29 0.04+0.04 0.26 0.71 4PGA 5387 0.33 0.05+0.04 0.32 0.85 *1A6V 5802 0.44 0.06+0.06 0.43 1.11 1A6Z 6099 0.63 0.09+0.08 0.54 1.51 1AD9 6760 0.61 0.09+0.07 0.55 1.51 1A2K 7029 0.59 0.08+0.07 0.52 1.43 1HMV 7399 0.76 0.12+0.09 0.69 1.88 1A2Z 7666 0.50 0.07+0.06 0.44 1.22 1BBR 8084 0.77 0.10+0.07 0.60 1.72 1AFC 8312 0.10 0.02+0.00 0.08 0.26 1AFV 9024 0.89 0.13+0.10 0.78 2.15 *1A0S 9939 0.78 0.11+0.09 0.62 1.82 1IAS 13106 1.32 0.20+0.15 1.13 3.17 1AIG 13535 0.63 0.10+0.07 0.52 1.53 1I84 16536 1.71 0.28+0.20 1.67 4.36 1jmu 23268 2.10 0.30+0.23 1.77 4.98 1A49 31872 2.82 0.43+0.31 2.39 6.77 1RCX 37934 2.37 0.35+0.26 1.92 5.66 *1M8Q 75497 8.05 1.15+0.90 6.81 19.30 *2btv SEG FAULT
Results from SIMS (PZ):
Proteins: --------- 4INS.pdb Total atoms= 1181 Total time: 1.146 s 1FUA.pdb Total atoms= 1742 Total time: 1.742 s 3SEB.pdb Total atoms= 2125 Total time: 2.014 s 1AF7.pdb Total atoms= 2361 Total time: 2.316 s 1MX2.pdb Total atoms= 2567 Total time: 2.504 s 1ADS.pdb Total atoms= 2665 Total time: 2.510 s 1A4R.pdb Total atoms= 3087 Total time: 3.363 s 1A1N.pdb Total atoms= 3511 Total time: 3.314 s 1AOP.pdb Total atoms= 4185 Total time: 3.609 s 1FMT.pdb Total atoms= 4845 Total time: 4.883 s 1E70.pdb Total atoms= 5196 Total time: 4.277 s 4PGA.pdb Total atoms= 5387 Total time: 4.854 s 1A6V.pdb Total atoms= 5802 Total time: 5.684 s 1A6Z.pdb Total atoms= 6099 Total time: 6.908 s 1AD9.pdb Total atoms= 6760 Total time: 7.611 s 1A2K.pdb Total atoms= 7029 Total time: 7.385 s 1HMV.pdb Total atoms= 7399 Total time: 9.404 s 1A2Z.pdb Total atoms= 7666 Total time: 7.074 s 1BBR.pdb Total atoms= 8084 Total time: 8.221 s 1AFC.pdb Total atoms= 8312 Total time: 9.467 s 1AFV.pdb Total atoms= 9024 Total time: 10.512 s 1A0S.pdb Total atoms= 9939 Total time: 10.217 s 1IAS.pdb Total atoms= 13181 Total time: 18.146 s 1AIG.pdb Total atoms= 14142 Total time: 17.010 s 1I84.pdb Total atoms= 16580 Total time: 30.838 s 1jmu.pdb Total atoms= 23657 Total time: 43.811 s 1A49.pdb Total atoms= 34001 Total time: 61.234 s 1RCX.pdb Total atoms= 39368 Total time: 46.691 s 2btv.pdb Total atoms= 49061 Total time: 93.037 s 1M8Q.pdb Total atoms= 76872 Total time: > 15 min ---------------------------------------------------------------- RNAs: ------ 1EKA.pdb Total atoms= 512 Total time: 0.664 s 1J4Y.pdb Total atoms= 548 Total time: 0.674 s 1MHK.pdb Total atoms= 570 Total time: 0.668 s 1F1T.pdb Total atoms= 860 Total time: 0.920 s 1DUH.pdb Total atoms= 988 Total time: 1.014 s 1JJM.pdb Total atoms= 1089 Total time: 1.148 s 488D.pdb Total atoms= 1412 Total time: 0.977 s 1KXK.pdb Total atoms= 1499 Total time: 1.477 s 1FIR.pdb Total atoms= 1706 Total time: 1.707 s 1I9V.pdb Total atoms= 1708 Total time: 1.672 s 1EHZ.pdb Total atoms= 1821 Total time: 1.693 s 1EVV.pdb Total atoms= 1896 Total time: 1.947 s 1DUQ.pdb Total atoms= 2313 Total time: 2.211 s 1DDY.pdb Total atoms= 3376 Total time: 3.314 s 1ET4.pdb Total atoms= 4945 Total time: 4.691 s 1L8V.pdb Total atoms= 6960 Total time: 7.689 s 1HR2.pdb Total atoms= 7048 Total time: 8.092 s 1FCW.pdb Total atoms= 8260 Total time: 10.912 s 1FG0.pdb Total atoms= 10704 Total time: 15.092 s 1IDV.pdb Total atoms= 3230 Total time: 101.441 s 1MFJ.pdb Total atoms= 6410 Total time: 221.402 s 1JUR.pdb Total atoms= 6435 Total time: 87.754 s 1K4A.pdb Total atoms= 9060 *** 1MFY.pdb Total atoms= 15840 *** 1E95.pdb Total atoms= 17430 *** 1M5L.pdb Total atoms= 18525 *** 1FYO.pdb Total atoms= 21675 *** 1JU1.pdb Total atoms= 22290 *** DNAs: ------ 1KCI.pdb Total atoms= 192 Total time: 0.311 s 1M6A.pdb Total atoms= 236 Total time: 0.393 s 1LJX.pdb Total atoms= 300 Total time: 0.404 s 1KSE.pdb Total atoms= 434 Total time: 0.629 s 1MTG.pdb Total atoms= 561 Total time: 0.732 s 1J8L.pdb Total atoms= 600 Total time: 0.641 s 1M6F.pdb Total atoms= 639 Total time: 0.658 s 1JPQ.pdb Total atoms= 642 Total time: 0.650 s 1MKL.pdb Total atoms= 671 Total time: 0.838 s 1G8N.pdb Total atoms= 675 Total time: 0.680 s 1N1O.pdb Total atoms= 682 Total time: 0.668 s 1K5F.pdb Total atoms= 713 Total time: 0.908 s 1N2W.pdb Total atoms= 766 Total time: 0.932 s 1MXK.pdb Total atoms= 815 Total time: 0.973 s 1G3V.pdb Total atoms= 829 Total time: 0.879 s 1K9H.pdb Total atoms= 886 Total time: 1.049 s 1IH4.pdb Total atoms= 1083 Total time: 1.021 s 1IH6.pdb Total atoms= 1202 Total time: 1.105 s 1JRN.pdb Total atoms= 1234 Total time: 1.156 s 1I3W.pdb Total atoms= 1318 Total time: 1.223 s 1ILC.pdb Total atoms= 1572 Total time: 1.527 s 1N1K.pdb Total atoms= 3780 Total time: 140.152 s *** 1LEJ.pdb Total atoms= 10395 Total time: 612.496 s *** 1N14.pdb Total atoms= 16460 *** 1JU0.pdb Total atoms= 21856 *** ------------------------------------------------------------------ The starred entries are PDBs that I had trouble with. Particularly, they made the program exit with a fatal error: "maxnbr too small". As before, I tried to increase the MAXNBR constant in the program, recompiled it, and tried again. The successful ones were computed in huge amounts of time (140.152s and 612.496s as compared to 1.0-2.0s). The majority, however, produced the same error as before. I could not increase the constant MAXNBR too much because the Fortran compiler could not handle it. It turned out that there are two-dimensional arrays with the dimensions of Max # of atoms and Max # of neighbors defined in the program. The MAXNBR constant is actually Max # of neighbors. The amount of space (MAXATOMS * MAXNBR) for particular large values of MAXNBR turns out to be over the capabilities of the compiler, and unfortunately I was unable to run those PDBs which required MAXATOMS = 20000 and MAXBNR > 2000 (overall > 38 MB) or similar values for the constants. I also studied the source code extensively. It turns out that if the default value of MAXNBR (= 150) is suitable for all kinds of PDBs then the author's claim that the running time is O(#atoms) is quite reasonable. This is not the case, however. Some of the PDBs required more than 3000 neighbors per atom. Since, the program performs specific calculations for each atom and over the list of its neighbors (identification of pairs and triples of contact with the probe sphere and surface elements calculations), I would conjecture that O(#atoms * max#neighbors) is a more realistic upper bound on the running time. (Max # of neighbors is not really a constant).
Results from VEGA (RT):
The input to Vega consisted of these 16 proteins, 6 RNAs, and 7 DNAs (the package had been rebuilt after removing I/O routines). The program computed the Van der Waals surface using a probe sphere of radius 1.4 Angstroms.
.PDB Atoms Time to compute (Concordant values of three readings). Proteins: 4INS.pdb 829 9.807 1FUA.pdb 1608 20.437 3SEB.pdb 1948 29.141 1AF7.pdb 2224 38.924 1MX2.pdb 2375 46.451 1ADS.pdb 2513 49.211 1A4R.pdb 2976 88.162 1A1N.pdb 3153 135.309 1AOP.pdb 3622 142.211 1FMT.pdb 4694 299.611 1E70.pdb 4086 413.85 4PGA.pdb 4970 561.16 1A6V.pdb 5236 660.99 1A6Z.pdb 6076 928.739 1AD9.pdb 6740 1670.53 1A2K.pdb 6789 1602.6
** Note: The time taken to compute surface of 1A2K shows a slight anomaly. This is because I was unable to complete three runs of Vega on it. The number represents the timing of one run only.
RNAs: 1EKA.pdb 512 1.768 1J4Y.pdb 548 2.041 1MHK.pdb 548 2.279 1F1T.pdb 816 5.127 1JJM.pdb 982 8.42 1MFJ.pdb 6410 77.186 DNAs: 1K6I.pdb 192 0.264 1M6A.pdb 236 0.316 1LJX.pdb 300 0.604 1KSE.pdb 374 1.197 1G8N.pdb 484 3.023 1G3V.pdb 769 4.881 1K9H.pdb 886 5.322
Results from DMS (DO-K):
PROTEINS Molecule (.pdb) No. of Atoms Time No. of Unknown Atoms 4INS 1181 0.443 1FUA 1742 0.607 1AF7 2361 1.014 1MX2 2567 1.068 1ADS 2665 0.988 26 1A4R 3087 1.525 1 1AIN 3511 1.496 1AOP 4185 1.658 1FTM 4845 2.34 1E70 5196 2.017 4PGA 5387 2.439 1A6V 5802 3.115 1A6Z 6099 4.438 1AD9 6760 4.971 1A2K 7029 4.531 3 1A2Z 7666 3.836 1BBR 8084 4.814 1AFC 8312 6.389 1AFV 9024 7.912 1A0S 9939 7.016 1IAS 13181 18.012 1AIG 14142 13.871 8 1I84 16580 31.162 1A03 57920 1M8Q 76872 461.502 RNAs Molecules (.pdb) No. of Atoms Time No. of Unknown Atoms 1EKA 512 0.182 1J4Y 548 0.182 1MHK 570 0.205 1F1T 860 0.311 1DUH 988 0.391 2 1JJM 1089 0.426 12 488D 1412 1.455 1KXK 1499 0.648 2 1FIR 1706 0.736 1 1I9V 1708 0.715 3 1EHZ 1821 0.76 9 1EVV 1896 0.805 10 1DUQ 2313 1.109 1IDV 3230 1DDY 3376 1.98 1ET4 4945 2.646 1MFJ 6410 10.76++ 1JUR 6435 1L8V 6960 5.809 10 1HR2 7048 5.756 42 1FCW 8260 6.834 1K4A 9060 1FG0 10704 14.488 1MFY 15840 1E95 17430 1M5L 18525 1FYO 21675 1JU1 22290 DNAs Molecule (.pdb) No. of Atoms Time No. of Unknown Atoms 1KCI 192 0.057 1M6A 236 0.068 1LJX 300 0.072 1KSE 434 0.16 1MTG 516 0.195 1J8L 600 0.195 1 1M6F 639 0.223 1 1JPQ 642 0.205 1MKL 671 0.254 1G8N 675 0.199 1 1N10 682 0.211 1 1K5F 713 0.285 1N2W 766 0.324 1MXK 815 0.379 1G3V 829 0.311 1K9H 886 0.355 1IH4 1083 0.424 1IH6 1202 0.459 1JRN 1234 0.479 1I3W 1318 0.48 1ILC 1572 0.666 1NIK 3780 1LEJ 10395 1N14 16460 1JU0 21856
Results from MASKER (PJ):
Protein|Num of| Time Name |Atoms | Taken ======================= 4INS 1181 2.49 1FUA 1742 3.74 1AF7 2361 5.84 1MX2 2567 6.62 1ADS 2665 6.26 1A4R 3087 9.94 1A1N 3511 9.17 1AOP 4185 9.28 1FMT 4845 17.70 1E70 5196 10.83 4PGA 5387 16.18 1A6V 5802 21.44 1A6Z 6099 30.65 1AD9 6760 35.37 1A2K 7029 32.01 1A2Z 7666 25.98 1BBR 8084 34.83 1AFC 8312 48.68 1AFV 9024 58.06 1A0S 9939 60.50 1IAS 13181 117.46 1AIG 14142 113.28 1I84 16580 192.29 1A03 57920 410.13 1M8Q 76872 3594.40 DNA | Num of| Time Name | Atoms | Taken ======================= 1KCI 192 0.55 1M6A 236 0.55 1LJX 300 0.64 1KSE 434 0.95 1MTG 561 1.18 1J8L 600 1.26 1M6F 639 //Error: no result 1JPQ 642 1.16 1MKL 671 1.62 1G8N 675 1.35 1N1O 682 1.35 1K5F 713 1.55 1N2W 766 1.72 1MXK 815 1.82 1G3V 829 2.02 1K9H 886 2.06 1IH4 1083 2.43 1IH6 1202 2.58 1JRN 1234 2.33 1I3W 1318 2.66 1ILC 1572 4.59 1N1K 3780 7.83 1LEJ 10395 33.13 1N14 16460 72.60 1JU0 21856 91.96 RNA | Num of| Time Name | Atoms | Taken ======================= 1EKA 512 1.20 1J4Y 548 1.22 1MHK 570 1.27 1F1T 860 2.00 1DUH 988 2.46 1JJM 1089 2.67 488D 1412 //Error: no result 1KXK 1499 3.97 1FIR 1706 4.58 1I9V 1708 4.74 1EHZ 1821 4.39 1EVV 1896 4.87 1DUQ 2313 7.23 1IDV 3230 7.46 1DDY 3376 12.03 1ET4 4945 16.25 1MFJ 6410 16.34 1JUR 6435 13.88 1L8V 6960 35.86 1HR2 7048 35.78 1FCW 8260 44.11 1K4A 9060 35.42 1FG0 10704 83.31 1MFY 15840 57.26 1E95 17430 65.60 1M5L 18525 79.51 1FYO 21675 125.02 1JU1 22290 94.66
ASC/NSC discussion (VS)
It is interesting to note the different complexities of the algorithms in the ASC stages. The first one ( determin the neighboring atoms list) takes a small fraction of the time for the whole computation. With increasing the number of atoms it does not change its behaviour. At the same time stage three takes almost negligable time for small pdbs while when dealing with the large files it consumes almost half of the whole time. Another interesting result is the very quick perfomance of NSC. As of now that is the fastest of the algorithms examined. It outperforms Sanner's MSMS by a factor of 2.
We were unable to run the ASC/NSC software on DNA/RNA pdbs. Each person leads the discussion on their own section. Everyone else comments
MSMS Discussion (PI)
On some of the problematic molecules with MSMS my suspision is that there is a problem with the .pdb file, which could have possibly be fixed with an .mmCIF file converted to .pdb. I do not think that there is a problem with MSMS, as there is nothing specifically problematic for the failing molecules in terms of size, or coordinates (as I ran MSMS with all extreme case and it seem to handle them nicely).
MSMS is interesting as it clearly shows the difference in the type of molecules. The DNA/RNA set shows approximately the same total time in terms of number of atoms in the molecule. With both of them, Singularities, Surface and Triangulation times are generally small, and for big molecules the dominant time is the Reduced Surface Calculation Time, which in the big molecules comprises close to 87% of the total time. With proteins (folded, dense molecules) the Reduced Surface Computations take on general less time than RNA/DNA (long strung out molecules) for same number of atoms. The conspicuous difference is in terms of ASES computations and the Triangulations, which with proteins take about 60%-70% of the total time (therefore roughly doubling the total time as compared to RNA/DNA).
DMS Discussion (DO-K)
Discussion: As the table of results show for the molecules, DMS generally does a pretty good job on small molecules as far as timing is concerned. However, as the number of atoms increases, the time complexity worsens. A key feature of the DMS algorithm that makes for this trend is that, during the computation of the molecular surface, the algorithm compares and sorts the atoms by coordinates and thus looks just at slices instead of all n atoms when looking for neighbors. Hence for most molecules, the algorithm performs better than O(n^2). Moreover, the major time complexity involved in the surface computation occurs when computing overlapping atoms. DMS first computes these, and then looks at other atoms not so positioned, and hence easier to calculate their surfaces. Therefore, as Huang claims, the surface computation algorithms strictly speaking, .should not be n^2 since they look mainly at the overlapping atoms, of which there should be no more than 20 for .normal. molecules.. By normal molecules, I assume he meant general proteins of reasonable size and structure. Huang however agrees that the algorithm is generally O(n^2).
As can be seen from the table, there are a few anomalies as far as the trend in the number of atoms corresponding to the time taken for surface computation is concerned. For instance, for proteins, 1ADS with 2665 atoms took 0.998 seconds to complete surface calculation, whereas 1MX2 with 2567 atoms took 1.068 seconds. The reason for this, from the above explanation of the algorithm is somewhat easy to comprehend. I would assume that 1MX2 has more overlapping atoms than 1ADS and hence takes a longer time to compute. Notice also that 1ADS has 26 unknown atoms. That is, atoms whose radii have not been accounted for in the radii list. However, the default radii used for unknown atoms, 1.90 angstroms, is much larger than most of the common atoms in most molecules. Hence if the radii of the unknown atoms played a crucial role, then one would think that 1ADS should have taken the longer time to complete surface computation, since it has more atoms. Thus, this implies that the orientation of the atoms in the molecules, and hence the structure of the molecule, does have an effect on the time complexity of whichever algorithm is doing the computation. Although some algorithms may be more efficient than others in this calculation, they still encounter slight overheads while accounting for the structure of the molecules.
Moreover, a quick glance at the DNAs and RNAs show a significant difference in the time as well as some anomalies in the trend, both of which could also be argued as due to the structure of the molecule. Also, during a discussion, Vince mentioned that most of these surface calculation algorithms were probably written primarily for .normal. molecules . proteins . to be precise, and hence were not well suited for calculation of complex structured RNA and DNA surfaces. And the truth about this statement further strengthens the assertion that structure does count.
MASKER discussion (PJ)
Masker was run on a set of 25 proteins, 25 DNAs, and 28 RNAs. Unnecessary output was suppressed before performing the test runs to get better approximation of the computation time. The three tables below represent the results from the test runs on the proteins, DNAs and the RNAs. The first column in each table is the name of the molecule, the second column is the number of atoms in that molecule and the third column is the time that Masker took to compute the molecular surface for that molecule. It is apparent that the results have some outliers or anomalous times; it is worth pointing that surface computation done on most of such molecules using other programs, like ASC, give anomalous times also. We can hypothesize that the reason behind such a tendency is the strung out structure of some molecules.
Also noticeable is the fact that Masker failed to complete surface computation on two of the chosen molecules -- 1M6F and 488D.
***Whether other surface computation programs do succeed in computing molecular surfaces for these two molecules would be an interesting observation.*** (PJ)
The factor that determines the speed of the computation done with MASKER is the masking operations. Bystroff believes that these binary operations will in the near future be encoded in special hardware. He states that additional speed-up would also be possible by initially computing the "reduced surface" (Sanner et al., 1996), and by using a neighbor-tree data structure for the atoms. However, Bystroff's conjecture would not transform into a claim until necessary hardware and software modifications are made.
Another question that Masker raises is if the brute force approach of the algorithm makes it more likely to be immune to algorithmic bugs that are associated with certain surface singularities. There is no definite answer to this question right now because Bystroff has discovered that Masker is not completely immune to singularities -- sometimes a reentrant surface disappears too early [Bystroff 2002]. But these errors are small, about the size of the numerical precision and they would get smaller if larger mask size (8192 bits) are used.
Vega Discussion (RT)
"Vega is leaving the planet!", Dr. Bernstein on the atoms-time curve.Vega's algorithm is one of the most simple and calculation-intensive of all. It uses a brute-force method to compute the molecular surface which is very similar to the ones used by RasTop. As a result the time taken by Vega increases proportionally with the (number of atoms)2. It is not known if Vega has problems dealing with singularities, because the atoms-time curve does not show any strange behavior with the collection used. The curve shows a steady growth, and with I/O routies intact, running on a 800MhZ Pentium III, it showed a neat curve of O(n2). An observation about Vega is that it not only computes, but also displays and manipulates 3D molecular surfaces in its Windows version. Thus molecular surface calculation is but one of the various tasks that Vega can perform.
It should be noted that MSMS and also ASC are much faster than the current version of Masker. Masker has greater computational overhead, since it must read a library of masks and mask data before beginning. The memory requirements of MASKER are also greater, again because of the library of masks.
From the computations conducted with Masker, it could be inferred that the runtime of Masker is a polynomial of degree two.
[Bondi 1964] Bondi, A. (1964) "Van der Waals volumes and radii", J. Phys. Chem. 68: 441-451.
[Bystroff 2001] Bystroff, C. (2001), "Masker:Improved Solvent Excluded Molecular Surface Area Estimations using Boolean Masks",Protein Engineering (in press). Renesseler Institute of Polytechnic, Troy, NY.
[Bystroff 2002] Bystroff, C. (2002) private communication.
[Connolly 1983] Connolly, M. L. (1983) "MS: Molecular Surface Program", QCPE Program 429, Quantum Chemistry Program Exchange, Univ. of Indiana, Bloomington, IN http://www.netsci.org/Science/Compchem/feature14.html
[Connolly 1996] Connolly, M. L. (1996) "Molecular Surfaces: A Review", http://www.netsci.org/Science/Compchem/feature14.html
[Corey, Pauling 1953] Corey, R. B. and L. Pauling (1953) "Molecular models of amino acids, peptides and proteins." Rev. Sci. Instr. 24: 621-627
[Dickerson, Geis 1969] Dickerson, R. E., Geis, I (1969) "The Structure and Action of Proteins", W. A. Benjamin, Menlo Park, CA, 120 pp.
[Eisenhaber 1995] Eisenhaber, F., Philip Lijnzaad, P., Argos, P., Sander, C. & Scharf, M. (1995) "The Double Cube Lattice Method: Efficient Approaches to Numerical Integration of Surface Area and Volume and to Dot Surface contouring of Molecular Assemblies", Journal of Computational Chemistry, volume 16, N3, pp.273-284. [Ferrin, Huang, Jarvis, Langridge 1988] Ferrin, T. E., Huang,C. C., Jarvis, L. E., Langridge, R. (1988) "The MIDAS display system", J. Mol. Graphics 6, 13-27.
[Katz, Levinthal 1972] Katz, L., C. Levinthal, C. (1072) "Interactive Computer Graphics and the Representation of Complex Biological Structures", Annual Reviews in Biophysics and Bioengineering, 1: 465-504.
[Lee, Richards 1971] Lee, B., Richards, F. M. (1971) "The Interpretation of Protein Structures: Estimation of Static Accessibility", J. Mol. Biol. 55: 379-400.
[LeGrand, Merz 1993] LeGrand S. M. & Merz, K. M. J. (1993) "Rapid Approximation to Molecular Surface Area via the Use of Boolean Logic and Look-up Tables", Journal of Computational Chemistry 14, Pages 349-52.
[Pedretti, Villa, Vistoli 2002] Pedretti, L. Villa, G. Vistoli (2002) "Vega: A Versatile Program to Convert, Handle and Visualize Molecular Structure on windows-based PCs" J. Mol. Graph., Vol. 21, 47-49.
[Sanner 1995] Sanner, M. F.; Olson, A. J.; Jean-Claude Spehner (1995) "Fast and Robust Computation of Molecular Surfaces", Proc. 11th ACM Symp. Comp. Geom, C6-C7.
[Sanner 1996] Sanner, M., Olson, A., Spehner, J.-C. (1996) "Reduced Surface: An Efficient Way to Compute Molecular Surfaces", Biopolymers 38: 305-320.
[Sanner 2002] Sanner M., private communication.
[Vorobjev 1997] Vorobjev, Y. N. (1997) "SIMS: Computation of Smooth Invariant Molecular Surface", Biophys. J. 73:722-732