This is a 2-year project with two essential subprojects:
It was urgent to start this work promptly, so that first releases could be ready for the IUCr Congress in Osaka in summer 2008. We expected that discussions and review at that Congress might result in recommendations for additional work outside the scope of the proposal, such as further extensions to DDLm needed for support of synchrotron image data in the IUCr imgCIF dictionary. In that case, appropriate adjustments would have been negotiated at that time. At the time of the Osaka meeting the basic DDLm proposal was accepted, but the precise specification of DDLm was not yet settled. Most importantly, three different reference implementations of dREL were available -- one by Nick Spadaccini, one by James Hester and one by John Westbrook. In October 2008 this project decided to align its development with the reference implementation by James Hester.
As versions of the new packages mature they will be released to the community as open source software, without charge, to encourage wide use. The software will be released using the GNU GPL or a similar license. "CIF Applications" articles will be submitted to help make the community aware of these new and upgraded tools.
Work done in the second quarter (November 2007 -- January 2008): The activities for the project in this quarter consisted of a cleanup of the base level of CBFlib (to be formally released shortly as CBFlib_0.7.9), and a start on the changes to the CBFlib C-based parser and to the ciftbx Fortran-based parser. The major work was on the design of the logic to handle parsing the "list", "array", "tuple" and "table" entities, and identification of most of the mappings from DDLm to DDL2 needed to create hooks to the existing validation code. In addition the main server for the software development activities in the lab (arcib) was upgraded, providing a more stable development environment with more disk space.
Work done in the third quarter (February 2008 -- April 2008): The activities for the project in this quarter consisted of new code in CBFlib, starting with the release in CBFlib 0.7.9 of the code done in the prior quarter. The work being done then was to be released during the next quarter in CBFlib 0.8 to have it ready for review in Osaka. The C code to parse the three DDLm bracketed constructs ((), [], and {}) was completed and tested. The approach taken was to parse the nested constructs, removing comments and extra whitespace to produce a clean string that is easy to reparse for the important single-level cases needed for dictionary validation, but which preserves the full tree structure when needed. In addition, Mr. Todorov has done the regular-expression based infrastructure to validate tag values against the single container DDLm types. The same structure would then be applied iteratively to the multi-valued types. The code for both the parse and the validation infrastructure was available in the CBFlib_bleeding_edge module in the cbflib project on blondie.dowling.edu.
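The parse described above can be illustrated with a minimal Python sketch. It is not the CBFlib C code: the function names `parse_bracketed` and `render` are hypothetical, comments are assumed to start at `#` and run to the end of the line, and whitespace or a comma is assumed to end an item. It returns both the nested tree and the clean single-line string.

```python
# Illustrative sketch only; names and simplifications are assumptions,
# not the CBFlib implementation.
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPEN_TO_CLOSE.values())


def render(node):
    """Rebuild a clean, single-line string from a parsed tree."""
    if isinstance(node, str):
        return node
    opener, items = node
    return opener + ",".join(render(item) for item in items) + OPEN_TO_CLOSE[opener]


def parse_bracketed(text):
    """Return (tree, clean_string) for the first bracketed construct in text."""
    stack = []   # constructs that are open but not yet closed
    token = []   # pieces of the item currently being read

    def flush():
        # Finish the current item, if one is in progress.
        if token:
            stack[-1][1].append("".join(token))
            token.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "#":                       # comment runs to end of line
            flush()
            newline = text.find("\n", i)
            i = len(text) if newline < 0 else newline
            continue
        if ch.isspace() or ch == ",":       # separators end the current item
            flush()
        elif ch in OPEN_TO_CLOSE:
            flush()
            node = (ch, [])
            if stack:
                stack[-1][1].append(node)
            stack.append(node)
        elif not stack:
            raise ValueError(f"expected an opening bracket at offset {i}")
        elif ch in CLOSERS:
            flush()
            if OPEN_TO_CLOSE[stack[-1][0]] != ch:
                raise ValueError(f"mismatched {ch!r} at offset {i}")
            node = stack.pop()
            if not stack:                   # outermost construct is complete
                return node, render(node)
        elif ch in "'\"":                   # quoted string copied verbatim
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError(f"unterminated quote at offset {i}")
            token.append(text[i:end + 1])
            i = end + 1
            continue
        else:
            token.append(ch)
        i += 1
    raise ValueError("construct is not closed")
```

The clean string is convenient for the single-level cases, while the tree keeps the nesting for callers that need it.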
Work done in the fourth quarter (May 2008 -- July 2008): The activities for the project in this quarter have consisted of new code in CBFlib and CIFtbx preparing for release 0.8 of CBFlib and release 4 of CIFtbx, both with code to read and validate data items against expanded DDLm dictionaries. The necessary code infrastructure to parse bracketed items and to test data items against the DDLm regular expressions is now in CBFlib. The regular expressions themselves are being validated and adjusted as necessary and CIFTEST2 is being expanded case by case to provide test cases for likely incorrect uses of the new DDLm types. The current pre-release of CBFlib 0.8 in the repository on blondie.dowling.edu is also being used as the CIF processing component in the current pre-release of RasMol to help ensure adequate testing against existing CIF data. We will post as complete a version of CBFlib 0.8 to the web as is ready just before the Osaka IUCr meeting later this month. The CIFtbx infrastructure to parse bracketed items and to test data items is also ready, but the testing will have to be at a simpler level than in the C-version, not using regular expressions, but simply testing on a coarser level similar to the one adopted for testing against the DDL2 data types. That code will also be posted to the web before the IUCr meeting.
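As a rough illustration of testing data items against type regular expressions, the following Python sketch checks values against a small table of patterns. The type names, the patterns and the tag names are hypothetical stand-ins; the real expressions come from the DDLm dictionaries and the real checks are in CBFlib.

```python
import re

# Hypothetical type names and patterns, for illustration only; the real
# expressions are defined by the DDLm dictionaries.
TYPE_PATTERNS = {
    "Integer": re.compile(r"[+-]?[0-9]+"),
    "Real": re.compile(r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?"),
    "Code": re.compile(r"[A-Za-z0-9_]+"),
}


def value_matches_type(value, type_name):
    """True if the whole value matches the pattern for type_name."""
    return TYPE_PATTERNS[type_name].fullmatch(value) is not None


def validate_items(items, tag_types):
    """Return (tag, value, type_name) for every item that fails its check.

    items is a list of (tag, value) pairs; tag_types maps a tag to a type
    name. Tags without a declared type are not checked.
    """
    failures = []
    for tag, value in items:
        type_name = tag_types.get(tag)
        if type_name is not None and not value_matches_type(value, type_name):
            failures.append((tag, value, type_name))
    return failures
```

A test suite such as CIFTEST2 would then supply deliberately incorrect values and confirm that each one is reported.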
The work on CIFtbx4 has raised an interesting issue of how not to break existing Fortran applications in the transition to DDLm. The problem is that older Fortran applications cannot accept strings of arbitrary length. To deal with this, we are following the approach already used in CIFtbx to deal with text fields -- delivering the text in chunks of limited length with a flag set true if there are more chunks to be examined. This will allow existing applications to view bracketed constructs as if they were text fields presenting one item per line. For applications that can be converted to be DDLm aware, new variables giving the depth into a bracketed construct and the index across on the current level should provide appropriate control.
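The chunked delivery described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the CIFtbx Fortran interface; the function name `deliver_in_chunks` and the `more` flag name are assumptions.

```python
def deliver_in_chunks(text, max_len):
    """Yield (chunk, more) pairs; more is True while further chunks remain."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not text:
        # An empty value is still delivered once, with nothing to follow.
        yield "", False
        return
    for start in range(0, len(text), max_len):
        chunk = text[start:start + max_len]
        yield chunk, start + max_len < len(text)
```

A caller with a fixed-length buffer keeps requesting chunks until the flag comes back false, which is the same pattern an existing application already follows for text fields.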
Work done in the fifth quarter (August 2008 -- October 2008): After due consideration of the discussions in the COMCIFS meeting and the imgCIF workshop in Osaka in August, and further comparison of available DDLm software in September, we have settled on the version of DDLm as implemented in James Hester's PyCIFRW as the reference implementation against which we will maintain compatibility and are making the necessary changes in CBFlib and CIFtbx 4.
James Hester's parser was obtained from http://anbf2.kek.jp/downloads/drel-ply-ez_260908.tar.gz with the working copy for this project at http://blondie.dowling.edu/projects/drel-parser/. His parser uses PLY (Python Lex-Yacc), so PLY needs to be installed. It can be downloaded from http://www.dabeaz.com/ply/. The current version of PyCIFRW, which this parser is a part of, can be obtained at http://anbf2.kek.jp/CIF/.
Work done in the sixth quarter (November 2008 -- January 2009):
The major work for this quarter was the creation of cifget, a wget-like utility, by E. Zlateva to gather DDLm dictionaries to be imported from the web. The additional software discussed below should be ready for release in the upcoming quarter.
As noted above, we now have a utility similar to wget to gather DDLm imported dictionaries from the web so that applications (especially Fortran applications) do not need to have web access. A tarball of this utility is available at http://arcib.dowling.edu/cifiucr/cifget_31Jan09.tar.gz. This utility is an important subsystem for both the Fortran code and the C-code. In addition we have been working on the handling of comments and whitespace in both the Fortran code and the C code, and the handling of methods for validation in the C code.
Project Status 2 February 2009.
The most important work for this period was the creation of CIFGET. CIFGET is a DDLm dictionary import expansion utility, which downloads remote DDLm dictionary trees and generates expanded dictionaries that can be used when reading, writing, or validating cif files. It is a shell script, called from the command line in the following format:
./cifget [url] [expanded dictionary destination]
where [url] is the web location of the source dictionary that needs to be expanded.
CIFGET runs a Perl script, called cget, which downloads the requested dictionary file. cget is largely modeled on the logic of GNU wget. It mimics the wget recursive download logic, which allows it to follow links and download the dictionary tree in the same way that wget would download a tree of HTML files. To avoid infinite recursion between files that reference each other, after a url's contents have been downloaded, the url is saved into an array. The array is checked before each download to see whether the current url is already in it. cget parses a dictionary file and finds import tag attributes, using regular expressions. The urls are extracted from the import tags. In addition to downloading the dictionary tree, cget also copies all dictionaries to a working dictionary directory, where Xchek will run to create an expanded dictionary.
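The loop-breaking logic can be sketched as follows. This is a Python illustration rather than the Perl code of cget, it uses a work list instead of recursion, and it makes two loud assumptions: any quoted name ending in `.dic` is treated as an import, and the download itself is passed in as a `fetch` function so the sketch can run without network access. The names `gather_dictionaries` and `IMPORT_FILE` are hypothetical.

```python
import re
from urllib.parse import urljoin

# Assumption: an imported file appears as a quoted name ending in ".dic".
IMPORT_FILE = re.compile(r"'([^']+\.dic)'")


def gather_dictionaries(start_url, fetch):
    """Download start_url and everything it imports; return {url: text}.

    fetch is any callable that takes a URL and returns the file contents.
    """
    visited = []            # URLs already downloaded, checked before each fetch
    contents = {}
    pending = [start_url]
    while pending:
        url = pending.pop()
        if url in visited:
            continue        # breaks cycles between mutually importing files
        text = fetch(url)
        visited.append(url)
        contents[url] = text
        for name in IMPORT_FILE.findall(text):
            # Relative names are resolved against the importing file's URL.
            pending.append(urljoin(url, name))
    return contents
```

Because every URL is recorded after its first download, two dictionaries that import each other are each fetched exactly once.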
CIFGET uses Xchek to create an expanded dictionary from the local-file dictionary mirrors. It takes as a source dictionary the dictionary that was originally requested by the user from the web. Xchek is run from the working dictionary directory, and the expanded dictionary file is saved in the user-specified location. During testing, builds of some expanded dictionaries were prevented by a typographical error in one of the dictionaries on the IUCr web site. Therefore, testing of this package should be done against
http://arcib.dowling.edu/~bernsteh/.cifiucr/cif/ddlm/DDLm_30jan09/TEST_DIC/
until the typo is corrected. The change needed is to core_struc.dic on the IUCr web site:
lines 1682-1683:
_import_list.id (('Att','Cromer_Mann_coeff','com_att.dic'],
['Val','Cromer_Mann_b4','com_val.dic'))
should read
_import_list.id (('Att','Cromer_Mann_coeff','com_att.dic'),
('Val','Cromer_Mann_b4','com_val.dic'))
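A mismatch of this kind is easy to detect mechanically. The following Python sketch, with the hypothetical name `first_bracket_error`, reports the first bracket that does not match its opener while ignoring brackets inside quoted strings. It is an illustration only and is not part of CIFGET or Xchek.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}


def first_bracket_error(text):
    """Return (offset, char) for the first mismatched bracket, or None.

    An unclosed construct is reported as (len(text), "").
    """
    stack = []
    in_quote = None
    for offset, ch in enumerate(text):
        if in_quote:
            if ch == in_quote:          # closing quote found
                in_quote = None
        elif ch in "'\"":
            in_quote = ch
        elif ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return offset, ch
    return (len(text), "") if stack else None
```

Applied to the two versions of the `_import_list.id` value quoted above, the check flags the first `]` in the faulty text and returns `None` for the corrected one.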
In this reporting period, work continued on the Fortran code focusing on the design for handling white space and comments, especially in dealing with bracketed constructs. The existing scheme in the Fortran code in CIFtbx for dealing with comments and white space has been event driven, delivering comments and white space on the fly when scanning forward. White space is recorded as status information associated with each token in the form of a column number. This scheme is being retained and supplemented by the following scheme, derived from a scheme discussed on the imgCIF list in September 2008, but using DDLm bracketed constructs to carry the information more compactly and more intuitively:
((coltp,prologt,colt), (colvp,prologv,colv), (((cole1p,prologe1,cole1),(cole1e,epiloge1)),…), (colve,epilogv))
where coltp is the column number at which any comment before the tag begins, prologt is any comment before the tag, colt is the column at which the tag begins, colvp is the column at which any comment before the tag value begins, prologv is any comment before the tag value, colv is the column at which the tag value begins, colve is the column at which any comment after the tag value begins and epilogv is any comment after the tag value. The elements in the middle provide the same information for a bracketed construct. If the tag is part of a loop, then ws_
Work by E. Zlateva and H. J. Bernstein
Project Status 1 November 2008.
In this period we have been working on the write logic for the bracket-delimited constructs in CIFtbx 4 and, as an unexpected consequence of the spirited exchange with James Hester on the best way to handle magic numbers for imgCIF, have worked out what seems to be a reasonable solution to the event-based parsing of bracketed constructs with embedded comments. The approach in the Fortran-based software is as follows.
The software will offer the option of parsing the bracketed constructs in either of two alternate ways:
White space other than comments will not be delivered as an event. Instead it will be marked by column numbers of the elements that are delivered. There are cases in which the detailed information about elements and comments is critical, but for such activities as extracting and copying portions of DDLm CIFs transfer line-by-line is more efficient.
It is not practical to embed full web access into the Fortran code. Therefore, the logic for dictionary expansion in the Fortran code is being based on local files, with the gathering of dictionaries from remote locations being handled by a separate C program based on wget (see below).
Work by H. J. Bernstein
Project Status 2 August 2008.
In the prior status report (below), we wrote,
"The lessons learned in the coding for C have caused us to rethink the code currently in Fortran. In the past we have preserved all comments in a CIF while doing the Fortran parse, so that the original CIF with all comments can be recreated even when reformatted. As we have discovered in the C code, it is important to strip the embedded comments in the bracketed constructs, and it also may be necessary to have a full tree-expansion of nested bracketed constructs. Maintaining a full tree structure and three-fold replication of all the bracketed constructs is workable in C and even in Fortran-95, but is a non-trivial change in Fortran-77 if reasonable performance is to be achieved. We are exploring alternatives and will resolve the matter in the next quarter."
We have explored the alternatives, which for Fortran-77 would require extending the current use of direct access files to store the tree. The performance hit was too great and we plan not to bring the full tree-structure into the Fortran version, but to stay with the less-demanding event-based logic discussed above. We should discuss this in Osaka.
Work by H. J. Bernstein
Project Status 1 May 2008.
The lessons learned in the coding for C have caused us to rethink the code currently in Fortran. In the past we have preserved all comments in a CIF while doing the Fortran parse, so that the original CIF with all comments can be recreated even when reformatted. As we have discovered in the C code, it is important to strip the embedded comments in the bracketed constructs, and it also may be necessary to have a full tree-expansion of nested bracketed constructs. Maintaining a full tree structure and three-fold replication of all the bracketed constructs is workable in C and even in Fortran-95, but is a non-trivial change in Fortran-77 if reasonable performance is to be achieved. We are exploring alternatives and will resolve the matter in the next quarter.
Work by H. J. Bernstein
Project Status 1 February 2008.
In order to validate CIFs against DDLm dictionaries, we need to parse those dictionaries. Xchek has a minimal partial parse of DDLm, but a full check is best done using a full parse. To upgrade from the existing DDL1 and DDL2 parsers, code has to be added to handle the new syntax for values, which, in addition to the existing unquoted words, single and double quoted strings and "\n;" quoted text fields in DDL1 and DDL2, includes the new "{ item, item, ...}", "[ item, item, ...]", and "( item, item, ...)" constructs, allowing for the possibility that these mechanisms may be nested, requiring some use of stacks to save state. We are adding new code and variables to CIFtbx to support these new constructs. This raises an interesting issue in the handling of existing DDL1 and DDL2 CIFs: should the new constructs be allowed in parsing such CIFs? Allowing them helps to encourage the community to move up to DDLm, but weakens strict validation of CIFs for compliance with DDL1 and DDL2. The compromise we propose is to make acceptance of the new constructs the default and to add a control flag to allow applications to revert to strict checking against DDL1 and DDL2.
We hope to have the first pass of these parser upgrades to ciftbx for these constructs ready for distribution by mid-February and the upgrade to CBFlib ready for distribution by the end of February. The code will be posted on blondie.dowling.edu, our GFORGE server.
Work by H. J. Bernstein and G. Todorov.
Project Status 31 October 2007.
Prior to the actual start of the project, Syd Hall had provided a release of Xcheck. That kit was unpacked and reviewed and permission was requested from Hall, Spadaccini and Westbrook to move the code under the GPL. G. Todorov set up a website at blondie.dowling.edu/projects/ddlm that contains all available information on the specifications, prototype dictionaries etc. Blondie is a collaborative development environment using GFORGE. Much of the validation functionality of Xcheck is closely related to the vcif2 validation code in CBFlib (see below).
We believe, after a review of the currently available software, that xcheck.f and indic.f are the only available tools for parsing ddlm and are a good starting point. They do not implement dREL yet.
Work by H. J. Bernstein and G. Todorov.
Project Status 2 February 2009.
We have prepared the hooks in CBFlib to use James Hester’s PyCIFRW-based DDLm parser from within CBFlib and will be working further on that code in the upcoming quarter using the expanded dictionaries provided by CIFGET. Work has continued in CBFlib on the handling of comments and whitespace in the manner described above.
Work by E. Zlateva and H. Bernstein
Project Status 1 November 2008.
As noted above, as a result of the COMCIFS discussions in Osaka and subsequent inquiries by email, we have settled on James Hester's PyCIFRW-based DDLm implementation as the reference implementation for our work. Work has begun on prototype integration with CBFlib via system calls as an interface to Python when methods need to be executed. Faster IPC calls will be substituted later. A similar approach to the one being followed in CIFtbx is being followed for the management of bracketed constructs with embedded comments in CBFlib. Comments at higher levels are being handled with the "ws_" constructs discussed in the interchange with James Hester subsequent to the COMCIFS meeting.
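The system-call style of interface can be illustrated with a small sketch in which one process hands a method to a separate Python interpreter and reads the result back. This is not the CBFlib integration code: the names `RUNNER` and `run_method` are hypothetical, the "method" is a plain Python expression standing in for a translated dREL method, and JSON on standard input and output is an assumed exchange format. The restricted `eval` is for illustration only and is not a security boundary.

```python
import json
import subprocess
import sys

# Program run in the child interpreter: read a request, evaluate the
# expression with the supplied values, write the result as JSON.
RUNNER = (
    "import json, sys\n"
    "request = json.load(sys.stdin)\n"
    "result = eval(request['expression'], {'__builtins__': {}}, request['values'])\n"
    "json.dump(result, sys.stdout)\n"
)


def run_method(expression, values):
    """Evaluate expression in a separate Python process and return the result."""
    completed = subprocess.run(
        [sys.executable, "-c", RUNNER],
        input=json.dumps({"expression": expression, "values": values}),
        capture_output=True,
        text=True,
        check=True,      # raise CalledProcessError if the child fails
    )
    return json.loads(completed.stdout)
```

Starting a new interpreter for every method call is simple but slow, which is why a faster IPC mechanism is the natural later replacement.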
In order to deal with import directives both for CBFlib and CIFtbx, E. Zlateva is working on a separate utility based on wget to gather local copies of the necessary dictionaries recursively with loop breaking in the same way that wget handles html file links. Once the dictionaries are downloaded, CBFlib will parse _import tags and copy required definitions into an expanded dictionary.
Version 0.8 of CBFlib was released prior to the IUCr meeting in Osaka. The current work is being integrated into the CBFlib_bleeding_edge module that will become the 0.9 release before the end of this calendar year.
Work by E. Zlateva and H. Bernstein
Project Status 2 August 2008.
In this quarter the paper on vcif2 was accepted by the Journal of Applied Crystallography [G. Todorov and H. J. Bernstein, "VCIF2: extended CIF validation software," J. Appl. Cryst. (2008). 41, 808-810]. Checking of data values against DDLm dictionaries in CBFlib began. New test cases have been added to CIFTEST2. The external specifications of the methods checking are to be discussed at the COMCIFS meetings in Osaka, and we expect to implement what is decided during the coming quarter.
Work by H. J. Bernstein, G. Todorov, E. Zlateva, N. Darakev.
A note on the DDLm import logic and dictionary layering: There will be a discussion of DDLm import logic and dictionary layering at the Osaka meeting. The view has been presented that dictionaries should be handled in small segments with multi-pass real-time web-loading of multiple dictionaries to compose one virtual dictionary. An alternative approach is for applications to use complete, "expanded" dictionaries. There are problems with both approaches. In the first case, there are serious platform and performance issues. In the second case, there are serious issues of dictionary synchronization. In this project we are addressing these issues in a modular way. We have started modifications to the open source web-page mirroring program wget to allow it to handle web-based dictionary caching into local mirrors. We are adopting the logic in the original Xcheck to then work from local-file dictionary mirrors to local expanded dictionaries. This should allow appropriate local choices in balancing the performance and synchronization issues and should make it much easier to support a wide range of platforms. We will discuss this further in Osaka.
Project Status 1 May 2008.
In this quarter a paper on vcif2 was submitted to the Journal of Applied Crystallography. The draft is being revised in response to reviewer comments. The necessary infrastructure was added to CBFlib for checking of data values against DDLm regular expressions. As a practical matter, code to automatically handle methods given in DDLm dictionaries will have to be based either on Python or Java. Python has long been an open source language. Inasmuch as Java has recently become open source, we are exploring both alternatives.
Work by H. J. Bernstein and G. Todorov.
Project Status 1 February 2008.
Xcheck and Cyclops prepare lists of names from the dictionaries and compare them to the names in documents. In the course of doing that the syntax of the dictionaries is checked, but not the syntax of the document. In vcif3 (vcif2 adapted to DDLm and dREL) we will be checking the syntax of the document as well. Most of the checking is conceptually identical to the vcif2 checking, but the information is conveyed by different tags or in a slightly different way. The major difference is that DDLm and dREL allow methods to algorithmically state the relationships among values of different tags. This last aspect is most easily handled in a C-like context, such as CBFlib (the base for the current vcif2). This requires a rework for CBFlib not only of the parser (a task similar to what is being done to the ciftbx parser) but of the data structures to support efficient access to the elements of lists, arrays, tuples and tables. We have begun the parser changes. The data structure and method interpretation changes will almost certainly extend into the second year of the project. In this quarter the current vcif2 has been more fully documented and the design of the new data structures began.
Work by H. J. Bernstein and G. Todorov.
Project Status 31 October 2007.
As noted above, a website was set up at blondie.dowling.edu/projects/ddlm that contains all available information on the specifications, prototype dictionaries etc.
The original material provided and the October update were both added to the site on blondie. G. Todorov reviewed this material and it appears that a full ddlm dictionary is not available yet, which makes the implementation of the specifications more difficult. However dREL (the method definition language) is well defined and is a good starting point for implementation. The main differences of ddlm from ddl1 and ddl2 and test cases can be generated easily following the provided specifications.
G. Todorov applied the vcif2 validation to the new dictionaries. CBFlib reads the dictionaries without breaking, which means that the major part of the validation provided by xcheck already exists. That will be used as a base.
Work by G. Todorov.
Project Status 2 February 2009.
No significant issues or activity on infrastructure in this quarter.
Project Status 1 November 2008.
The infrastructure has been functional this quarter despite some serious problems with the sourceforge server and some minor problems with fan and UPS failures.
Work by H. J. Bernstein and N. Darakev
Project Status 2 August 2008.
In this quarter the remediation of hardware problems that were addressed in the previous quarter was completed. The backup server was replaced with a system with 3 TB of storage. In the previous quarter the CBFlib CVS was replicated from the GFORGE server to sourceforge in the cbflib project, and that sourceforge CBFlib is now heavily used. As noted in the prior report, by the time of the Osaka meeting we expect to complete the move of the code and web pages for this project to sourceforge for the convenience of the community. The primary development activities will continue on the GFORGE server.
Work by H. J. Bernstein, G. Todorov and N. Darakev
Project Status 1 May 2008.
In this quarter additional hardware problems arose. The GFORGE server was replaced and the necessary new disks to replace the file backup server are being tested now. The replacement of the backup file server should be completed in the next few days. A vcif.org domain was purchased, a sourceforge vcif project started, and the vcif2 validation web page was placed at www.vcif.org. As vcif3 matures, its features will be added at that site. The CBFlib CVS has been replicated from the GFORGE server to sourceforge in the cbflib project. By the time of the Osaka meeting we expect to have the code and web pages for this project all available on sourceforge for the convenience of the community. The primary development activities will continue on the GFORGE server.
Work by H. J. Bernstein and G. Todorov
Project Status 1 February 2008.
In this quarter the main development machine for our lab, arcib.dowling.edu was replaced with a faster, more reliable machine with more disk space.
Work by G. Todorov and D. O'Brien.
From August 2004 through August 2006, with funding from the International Union of Crystallography, we worked on improvements to CIFTEST, vcif, and CIFtbx as well as a new line folding package to help provide new and upgraded CIF software to facilitate publication in IUCr journals. The work drew on and intersected with other work supported by grants from the U.S. National Science Foundation and the U.S. Department of Energy. The result was a set of packages of open source software:
Website: http://www.bernstein-plus-sons.com/software/ciftest
Download: CIFTEST_2.1.tar.gz
Website: http://arcib.dowling.edu/vcif
Download: vcifHTML.tar.gz CBFlib_0.7.6.1.tar.gz CBFlib_0.7.6_Data_Files.tar.gz
Website: http://www.bernstein-plus-sons.com/software/ciffold
Download: CIFFOLD_0.5.4.tar.gz
Website: http://www.bernstein-plus-sons.com/software/ciftbx
Download: ciftbx_3.0.4.tar.gz
In order to support the evolving needs of the community for new and upgraded CIF software to facilitate publication in IUCr journals, we have established a CIF software support effort at Dowling College under the direction of Herbert J. Bernstein, a major CIF software developer, leveraging the infrastructure already in place in Dr. Bernstein's lab for bioinformatics software development.
G. Todorov presenting a poster on vcif2 and G. Todorov and G. Darakev talking to R. Grosse-Kunstleve about CBFlib at ACA 2006 in Hawaii.
Foils of presentation on the project for XX Congress IUCr, Florence, IT, 23-31 August 2005.
I. Awuah Asiamah, K. Mitev and G. Todorov preparing the presentation for IUCr 2005 and G. Todorov and K. Mitev presenting at IUCr 2005 MS 86.
Among the sub-projects are:
Current IUCr release: journals.iucr.org/iucr-top/cif/developers/trip/ by Brian McMahon 10 May 2000
Project Status 3 September 2006: The work done on the other packages, including CIFtbx 3.0.4 and CBFlib 0.7.6.1, has been integrated into a new release CIFTEST 2.1. The test package illustrates three important approaches to comparing output CIFs. In order to resolve the variations in the handling of one-row loops, for the cif2cif section CIFTEST uses cif2cbf as a filter to resolve the ambiguity as discussed in the prior report. In order to resolve the variations in handling of leading zeros, for the cif2pdb section CIFTEST uses sed scripts to deal with that issue. The release for CIFTEST is on the project web site and at:
http://www.bernstein-plus-sons.com/software/ciftest
Work by H. J. Bernstein based on the kit produced by G. Todorov.
Project Status 1 June 2006: The management of one-row loops has been further investigated and is proving both interesting and complex. In most cases, one-row loops appear to be best handled as tag-value pairs, except when presented with a significant number of matrix or vector elements. Such stylistic variations will require replacing the straight comparison of test cases with a pass through a program like cif2cbf to get to a uniform presentation. Work by H. J. Bernstein.
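The idea of normalizing before comparing can be sketched in Python. This is not cif2cbf: the in-memory representation (tuples tagged `"item"` or `"loop"`), the function name `normalize_one_row_loops` and the example tag names are assumptions made for illustration, and the sketch ignores the matrix and vector exception mentioned above.

```python
def normalize_one_row_loops(entries):
    """Rewrite every one-row loop as a series of tag-value items.

    entries is a list whose elements are either
      ("item", tag, value)  or  ("loop", [tags], [rows]).
    Loops with any other number of rows are left unchanged.
    """
    normalized = []
    for entry in entries:
        if entry[0] == "loop" and len(entry[2]) == 1:
            _, tags, rows = entry
            normalized.extend(
                ("item", tag, value) for tag, value in zip(tags, rows[0])
            )
        else:
            normalized.append(entry)
    return normalized
```

Two outputs that differ only in this stylistic choice compare equal once both have been passed through the same normalization.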
Project Status 1 April 2006: In the course of the new work on vcif (see below), some new test cases have been generated, including tests for the existing binary format. These will be integrated with ciftest when the vcif release is ready. One of the interesting issues is the handling of a one-row loop either as a loop or as a series of tag-value assignments. Work by H. J. Bernstein.
Project Status 1 February 2006: The framework for CIFTEST2 created by G. Todorov, making B. McMahon's trip package more general, has been extended to include the vcif test cases. Environment variables have been used to make it easier to customize the choices of programs to be used for the tests. The first release candidate of CIFTEST2.0 is at:
http://arcib.dowling.edu/~bernsteh/software/CIFTEST2.0
Work by G. Todorov and H. J. Bernstein.
Project Status 1 December 2005: The framework for CIFTEST2 has been created, which allows modular testing of multiple cases that involve a variety of CIF packages. The CIFFOLD test cases have been integrated into CIFTEST2. Testing has begun and a copy has been made available to B. McMahon. Formal release is expected in the next reporting period. Work by G. Todorov.
Project Status 1 October 2005: New test cases have been created as part of the work on CIFFOLD, and released in the CIFFOLD kit. These cases will be back integrated into CIFTEST shortly.
Project Status 1 August 2005: Continuing from the prior reporting period the intensive work on CIFFOLD has resulted in interesting new test cases for long lines and for cases that might break parsers. I. Awuah Asiamah is cleaning up and organizing these test cases. Some of the test cases have become part of the "make tests" section of CIFFOLD already. Work by I. Awuah Asiamah, K. Mitev and H. J. Bernstein.
Project Status 28 May 2005: The intensive work on CIFFOLD has resulted in interesting new test cases for long lines and for cases that might break parsers. These cases need to be cleaned up and revised to avoid intellectual property issues, but will become a significant contribution to the test suite. Work by I. Awuah Asiamah and K. Mitev.
Project Status 30 Jan 2005: The script, runtest, was revised to use new vcif and to handle command line arguments for all test cases. (see ciftest_1_2) New test cases (see below) not yet incorporated to ensure that current test behavior will be reproduced using the evolving vcif (see below).
Project Status 5 Dec 2004: S. Louris has prepared initial test cases for CIF 1.1 line folding (see ctc001).
Current IUCr release: www.uk.iucr.org/iucr-top/cif/software/vcif/index.html. The IUCr has re-released vcif under the GPL. That version has been posted at arcib.dowling.edu/software/vcif/ as vcif 1.1.
Project Status 3 September 2006: The work on vcif and CIFTEST was presented as a poster by G. Todorov at the summer 2006 ACA meeting in Hawaii. The poster was well-attended with lively discussions.
As discussed in the prior report, we settled on CBFlib as the base both for the vcif2 syntax checking and for parent-child and category checking. Type checking is hard-coded, rather than dictionary regular expression driven. The dictionary regular expressions were just not solid enough. Mr. Todorov has packaged the checking in the form of a web page and php script at
Work by H. Bernstein, G. Todorov and G. Darakev with consultation by K. Mitev.
Project Status 1 June 2006: The abstract for the summer 2006 ACA meeting in Hawaii on the work on vcif and CIFTEST has been accepted for a poster presentation.
We have settled on CBFlib as the base both for the vcif2 syntax checking and for parent-child and category checking. Type checking is hard-coded, rather than dictionary regex driven. The dictionary regular expressions are just not solid enough. The alias code was taken from release 0.7.5 to 0.7.6 and works. As noted above, we are reworking the line folding in CBFlib. The hash-table performance seems to be good enough to make the use of a formal database package unnecessary to achieve the goals of this project for most realistic CIFs, and the use of CBFlib under the GPL or LGPL should address the concerns Brian raised.
Work by H. Bernstein in consultation with G. Todorov, K. Mitev. Some CBFlib testing by A. Hammersley and J. Wright at ESRF.
Project Status 1 April 2006: An abstract for the summer 2006 ACA meeting in Hawaii has been prepared on the work on vcif and CIFTEST.
As noted in the prior bi-monthly report, Mr. Mitev proposed an interesting approach to validation involving the use of a postgres database for the layered dictionaries to then produce a ghost schema against which to validate CIFs. Brian McMahon raised some concerns about making the system dependent on auxiliary software that not all users might have available. We remain convinced that the key to full CIF validation, especially for DDL2-based CIFs, is to make use of the rules for relational databases, which is most efficiently done with an SQL server. However, Brian McMahon's concerns are valid. Therefore we are structuring the new code to allow for use of either an internal or an external database. Handling the second option, however, requires a significant upgrade to the API we are using (derived from the CBFlib API), which we have begun and which is also bearing fruit for imgCIF. We have extended the CBFlib parser to support save frames so that DDL2 dictionaries may be read, and have added hash table-based searches similar to the ones we use in CIFtbx to achieve acceptable performance. For your information, a current snapshot of the parser bison grammar is available at
http://arcib.dowling.edu/~bernsteh/.cifiucr/cbf_stx_1Apr06.y
We have also made arrangements with SSRL to bring the CBFlib API under the GPL or the LGPL as alternative licenses. The parse of both DDL1 and DDL2 dictionaries is working, and we are now working on the dictionary layering code and denormalization of the item and category attributes from scattered tables to a smaller number of larger hash-indexed tables. The first tests of the dictionary parse code will be of tag aliasing. This will be done in collaboration with colleagues at ESRF who are interested because of the utility of this feature in handling older deprecated imgCIF categories in newer CBF files. We hope to incorporate the new parser and dictionary handling into the CBFlib 0.7.5 release to be posted to the web shortly, so that this critical code will get more extensive testing. Work by H. Bernstein in consultation with G. Todorov, K. Mitev. Some CBFlib testing by A. Hammersley and J. Wright at ESRF.
Project Status 1 February 2006: Mr. Mitev has proposed an interesting approach to validation involving the use of a postgres database for the layered dictionaries to then produce a ghost schema against which to validate CIFs. This appears to be the best option for implementing the additional validation needed for vcif2. This will be investigated further in the next reporting period.
Project Status 1 December 2005: There is nothing new to report on vcif at this time.
Project Status 1 October 2005: There is nothing new to report on vcif at this time.
Project Status 1 August 2005: Since I. Awuah Asiamah has become the primary tester for the packages in this project, he is taking over responsibility for vcif. If the visa issues can be resolved in a timely manner, I. Awuah Asiamah will attend the IUCr Congress in Florence (as will K. Mitev, G. Todorov and H. J. Bernstein), which will be helpful in discussions relative both to CIFTEST and vcif.
Project Status 28 May 2005: Work done recently on CIFFOLD includes new code for syntactic validation, which is being considered for incorporation into vcif. This code is able to provide useful analyses even when the data block is not specified. Work by K. Mitev.
Project Status 30 Jan 2005: Changes made to bring code closer to current ANSI-C conventions and to avoid conflicts in building for some platforms, such as MS Windows. Command line option added to specify CIF level and command line processing revised to allow both long and short argument names. Updated code tested to ensure ability to process original CIFTEST cases with unchanged output. (see work in progress vcif002.patch -- incorporates changes below -- do not apply both patches). HJB
Project Status 5 Dec 2004: Mods prepared to extend line length (see work in progress vcif001.patch). Preliminary tests by S. Louris, continued by H. Bernstein.
Current IUCr release: none
Project Status 3 September 2006: On reflection and after some experimentation, we concluded that there was no need to add the deprecated "\;" semicolon escape convention to ciffold, since conversion for files using the incorrect convention can be done either by sed or by the old CBFlib, and it would be best not to encourage the writing of additional incompatible files. If there is objection to this approach at the IUCr, we can put out the code to support the old convention, but absent such objection, we believe the project goals for ciffold have been met.
The release for ciffold is on the project web site and at:
http://www.bernstein-plus-sons.com/software/ciffold
Work by H. J. Bernstein.
Project Status 1 June 2006: The logic in ciffold is being reviewed with respect to the handling of folding which places a semicolon in column 1 of the next line. Ciffold handles this by doing the fold one character earlier. Ciffold allows a semicolon to be moved to column 1 if it is not followed by a blank or tab. CBFlib follows a different convention, escaping such a semicolon with a leading backslash, which conflicts with the IUCr convention of treating a "\;" as an ogonek. CBFlib has been changed to conform with the Ciffold convention, but, since there are datasets in the field that have used the CBFlib convention, we will add it as a deprecated option in both packages. Work by H. J. Bernstein. The review of semicolon handling started with a discussion with Mr. Mitev.
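The semicolon rule described in this entry can be illustrated with a short sketch. This is not the ciffold code: the function name, the fixed 80-column width and the use of a trailing backslash as the fold mark are assumptions made for illustration, and only the rule as stated above (fold one character earlier when a semicolon followed by a blank or tab would land in column 1) is modeled.

```python
def fold_avoiding_semicolon(line, width=80):
    """Fold `line` into pieces of at most `width` characters.

    Every piece except the last ends with a backslash (assumed fold mark).
    A fold is moved one character earlier whenever the next piece would
    otherwise start with ';' followed by a blank or a tab.
    """
    pieces = []
    rest = line
    while len(rest) > width:
        cut = width - 1  # leave one column for the trailing backslash
        # Back up while the character that would land in column 1 is a
        # semicolon followed by a blank or tab.
        while cut > 1 and rest[cut] == ";" and rest[cut + 1:cut + 2] in (" ", "\t"):
            cut -= 1
        pieces.append(rest[:cut] + "\\")
        rest = rest[cut:]
    pieces.append(rest)
    return pieces


def unfold(pieces):
    """Rejoin folded pieces by dropping each trailing fold mark."""
    return "".join(p[:-1] for p in pieces[:-1]) + pieces[-1]
```

With this sketch, a line whose 80th character is a semicolon followed by a blank is folded after 78 characters rather than 79, so the semicolon stays in column 2 of the continuation line.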
Project Status 1 April 2006: The release candidate discussed in the prior bimonthly report has been made the default ciffold release at:
http://www.bernstein-plus-sons.com/software/ciffold
Work by K. Mitev and H. J. Bernstein
Project Status 1 February 2006: The problem identified by Mr. Mitev at the end of the prior reporting period was investigated. The problem was one of insufficient blank stripping at the ends of lines within text fields. Mr. Mitev proposed a solution which addressed that problem, but, in the course of integrating the fix, a subtle problem was found in the mapping of single line quoted strings presented as folded text fields to apostrophe-quoted and double-quote-quoted strings. In particular there are cases that should be left as folded text fields and that also bring to light important test cases to be added to vcif and CIFTEST. A fix for the problem is being tested in the next release candidate for CIFFOLD which should be posted for external release after additional testing. The release candidate is at:
http://arcib.dowling.edu/~bernsteh/software/CIFFOLD_0.5.4
Work by K. Mitev and H. J. Bernstein
Project Status 1 December 2005: At the end of this reporting period Mr. Mitev reported a potential problem with the handling of unfolding of folded text fields containing lines that end with the sequence backslash-blank. This is being investigated further by H. Bernstein.
Project Status 1 October 2005: During this reporting period CIFFOLD was presented in Florence and B. McMahon requested a new mode of operation in which only lines that were longer than 80 characters would be folded and other lines would remain unchanged to simplify reporting of changes to authors during the IUCr publication processes. The change was implemented in September by adding a new command line option ("-n") for minimal folding as well as a new ncurses menu page. The updated code was just released as version 0.5.3 and made the default at:
http://www.bernstein-plus-sons.com/software/ciffold
Work by H. J. Bernstein, with consultation by K. Mitev.
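The minimal folding mode described in this entry can be sketched as follows. This is an illustration only, not the ciffold implementation: the function names are hypothetical, the fold is a plain fixed-width split, and the trailing backslash is used as an assumed fold mark. A real CIF folder also has to handle text field delimiters and semicolons, which are left out here.

```python
def fold_long_line(line, width=80):
    """Split one line into pieces of at most `width` characters.

    Every piece except the last ends with a backslash (assumed fold mark).
    """
    pieces = []
    while len(line) > width:
        pieces.append(line[:width - 1] + "\\")
        line = line[width - 1:]
    pieces.append(line)
    return pieces


def fold_minimal(lines, width=80):
    """Fold only the lines longer than `width`; pass all others through."""
    out = []
    for line in lines:
        if len(line) > width:
            out.extend(fold_long_line(line, width))
        else:
            out.append(line)  # left untouched, so a diff shows only real folds
    return out
```

Because short lines are copied through unchanged, including any trailing blanks, a line-by-line comparison of input and output reports only the lines that actually needed folding.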
Project Status 1 August 2005: During this reporting period, the program CIFFOLD was tested and documented and progressed from release 0.4.3 to release 0.5.1 as problems were discovered and corrected and the quality of the output was improved (e.g. to fold between blank separated words in the style of CIFtbx3, rather than precisely on column 80 as in earlier releases of CIFFOLD). The code has been working well, and a link has been placed on the public web page for CIFFOLD to allow retrieval of folded versions of the long-line mmCIF datasets currently being released by the RCSB PDB. As of this writing the 0.5.1 release is the default release. The 0.5.2 release, currently available for testing at http://www.bernstein-plus-sons.com/software/CIFFOLD_0.5.2 changes the handling of folded long single or double quoted strings to include a terminal backslash in the resulting text field, so that an extra newline will not be added on reconstruction, and also deals with additional cases of embedded "; " sequences in text fields that might end up in column 1. If this release does well in testing, it should be the default release during the Congress. Work by K. Mitev, G. Todorov and H. J. Bernstein, with testing by I. Awuah Asiamah.
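Folding between blank separated words, rather than precisely on column 80, can be sketched as below. The function name and the choice to keep the blank at the end of the folded piece (so that rejoining reproduces the original text exactly) are assumptions for illustration; the actual CIFFOLD and CIFtbx3 logic is not shown in this report.

```python
def fold_at_blanks(text, width=80):
    """Fold `text` into pieces of at most `width` characters.

    The fold is placed just after the last blank that still leaves room
    for the trailing backslash (assumed fold mark). If a piece contains
    no usable blank, it is cut at the column limit instead.
    """
    pieces = []
    while len(text) > width:
        # Search positions 1 .. width-2, so that the piece up to and
        # including the blank, plus the backslash, fits in `width` columns.
        cut = text.rfind(" ", 1, width - 1)
        if cut == -1:
            cut = width - 2  # no blank available: hard fold
        pieces.append(text[:cut + 1] + "\\")
        text = text[cut + 1:]
    pieces.append(text)
    return pieces
```

Rejoining the pieces after removing each trailing backslash gives back the original string, since no characters other than the fold marks are added or dropped.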
Project Status 28 May 2005: After testing and comments by the students in the lab, a full release (CIFFOLD_0.1) was prepared and posted at www.bernstein-plus-sons.com/software/ciffold in mid April 2005. Further testing by the students and by B. McMahon resulted in rapid evolution of the code to a reasonable stable release (CIFFOLD_0.4.3) on 14 May 2005. The work on cif2cif using the CIFtbx3 release was fed back into the work on CIFFOLD. Work by K. Mitev, G. Todorov and H. J. Bernstein, with testing by R. Chachra, C. Chigbo, S. Louris and, especially, I. Awuah Asiamah. Helpful comments provided by B. McMahon.
Project Status 2 April 2005: K. Mitev has prepared a full GUI release for testing (see the current state of the release, ciffold003). The release is being actively tested by others in the lab to see if it can be gotten ready for release and for inclusion in the ITVG CDROM this month. A tar for others who wish to test this prerelease is available: CIFFOLD.tar.gz. Work by K. Mitev and G. Todorov, with testing so far by H. Bernstein, G. Todorov, R. Chachra, I. Awuah Asiamah and S. Louris.
Project Status 30 Jan 2005: K. Mitev and G. Todorov are working on a GUI front end and integrity checking. (see the work in progress ciffold002).
Project Status 5 Dec 2004: K. Mitev is working on this code (see the work in progress ciffold001).
Current IUCr release: www.iucr.org/iucr-top/cif/software/ciftbx3/README.html (from this project) and http://www.iucr.org/iucr-top/cif/software/ciftbx/README.html (the prior version).
Project Status 3 September 2006: The vcif2 validation code has been adapted from C to Fortran and incorporated into CIFtbx version 3.0.4 to provide "extended integrity checking comparable to that in vcif2". The full patch code to convert from CIFtbx 2.6.4 to CIFtbx 3.0.4 is available at
http://arcib.dowling.edu/cifiucr/ciftbx003.patch
and a release kit is available at
http://arcib.dowling.edu/cifiucr/ciftbx_3.0.4.cshar.Z
The full release of the enhanced CIFtbx3 is available on the project web site (see above) and at
http://www.bernstein-plus-sons.com/software/ciftbx
We are pleased to report that Syd Hall recently agreed to allow the LGPL as an alternate license to the GPL for the API. We have incorporated the necessary license revisions into this patch and the kit. Since completion of the project, we have released the current versions of cif2cif, Cyclops, cif2pdb and cif2xml based on CIFtbx 3.0.4 rather than on CIFtbx 3.0.3 and reflecting the improvement in the license situation.
The specific changes made to CIFtbx in the transition from the 3.0.3 release to the 3.0.4 release were:
The 3.0.4 release completed the extension of validation to include checking for missing parents and validation of data values against dictionary-specified ranges and enumerations. Failure to provide dictionary-specified mandatory items is also reported. The new dict_ check codes 'parck' and 'parno' turn on and off checking of parent-child relationships. The default is 'parck'. The new character variable dicpname_ returns the dictionary-specified parent of the name in dicname_. The meaning of the existing dict_ check code 'dtype' has been extended to include data input checks for compliance with dictionary-specified ranges and enumerations, and type checking is more rigorous than in the past. The new logical variable valid_ is set .true. if an input data item has been validated against the dictionary. If no dictionary type check is specified or the item does not conform, valid_ is .false. Warnings are also reported for validation failures. Additional internal name changes to avoid conflicts were made:
tbxxpcat procat
tbxxsstb <new>
tbxxfstb <new>
tbxxnid newdent
tbxxoid <new>
We believe the project goals for CIFtbx have been met. Work by H. J. Bernstein
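The kinds of checks added in the 3.0.4 release (mandatory items, missing parents, ranges and enumerations) can be pictured with a small sketch. This is written in Python for brevity and is not the CIFtbx Fortran code or its API: the dictionary layout, the field names and the warning texts are all hypothetical, and numeric values with standard uncertainties such as 5.43(2) are not handled.

```python
# Toy dictionary: the layout and field names are hypothetical.
DICTIONARY = {
    "_cell_length_a": {"type": "numb", "range": (0.0, None), "mandatory": True},
    "_symmetry_cell_setting": {
        "type": "char",
        "enum": {"triclinic", "monoclinic", "cubic"},  # shortened toy list
    },
    "_atom_site_aniso_label": {"type": "char", "parent": "_atom_site_label"},
}


def validate(data, dictionary):
    """Return warning strings for `data`, a dict of tag -> list of values."""
    warnings = []
    # Mandatory items that were never supplied.
    for tag, spec in dictionary.items():
        if spec.get("mandatory") and tag not in data:
            warnings.append(f"{tag}: mandatory item missing")
    for tag, values in data.items():
        spec = dictionary.get(tag)
        if spec is None:
            continue  # no dictionary entry, nothing to check against
        parent = spec.get("parent")
        if parent and parent not in data:
            warnings.append(f"{tag}: parent {parent} missing")
        for value in values:
            if spec.get("type") == "numb":
                try:
                    number = float(value)
                except ValueError:
                    warnings.append(f"{tag}: {value!r} is not numeric")
                    continue
                low, high = spec.get("range", (None, None))
                if (low is not None and number < low) or (
                    high is not None and number > high
                ):
                    warnings.append(f"{tag}: {value} out of range")
            if "enum" in spec and value not in spec["enum"]:
                warnings.append(f"{tag}: {value!r} not in enumeration")
    return warnings
```

An empty warning list plays roughly the role that valid_ being .true. plays for a single item in CIFtbx; the sketch reports problems for a whole data block at once, which is a simplification.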
Project Status 1 June 2006: Nothing to report for this period. We will return to this after the vcif validation upgrade to incorporate similar changes. Note, however, the discussion of semicolons in ciffold and CBFlib, below. Similar changes are being prepared for CIFtbx. Work by H. J. Bernstein
Project Status 1 April 2006: Nothing to report for this period. We will return to this after the vcif validation upgrade to incorporate similar changes. We note that CIFtbx_3.0.3 downloads have risen from 20 per week to 30 per week in this period.
Project Status 1 February 2006: The prior release of CIFtbx3 (3.0.2) and the new release (3.0.3) were tested on various platforms and, after careful review of the results CIFtbx 3.0.3 was made the default release on 18 January 2006 at
http://www.bernstein-plus-sons.com/software/ciftbx
It should be noted that the CIFtbx test cases are now also incorporated into CIFTEST.
Work on CIFtbx by G. Todorov, J. Jemilawon and H. J. Bernstein.
Project Status 1 December 2005: The next major phase of work with CIFtbx is augmentation of the integrity checking. In order to do this, further performance improvements are needed. In this time period the performance of CIFtbx was improved by reworking portions of the code to use counted strings to avoid unnecessary replication of trailing blanks in the larger buffers created for handling the long lines of CIF 1.1, and by increasing the number of pages kept resident, so that it will be feasible to make additional accesses to the data from dictionaries. The cif2pdb Makefile in the CIFtbx package was cleaned up. The code was released for testing as release 3.0.3 of CIFtbx at
http://www.bernstein-plus-sons.com/software/ciftbx_3.0.3/
Work on CIFtbx by H. J. Bernstein and G. Todorov.
Project Status 1 October 2005: During this reporting period the package was given further testing as the base for a new version of the program cif2pdb being used in another project. The package was upgraded early in August to release 3.0.2 to correct the handling of an index in dtype and to add new types from the PDB extensions dictionary. The updated version is now in use in support of the web page at
http://biomol.dowling.edu/WPDB
which is part of a project funded by the U. S. Department of Energy on creation of a new, wide PDB format. Work on CIFtbx by H. J. Bernstein. Work on the DOE project is a collaboration between F. C. Bernstein and H. J. Bernstein.
Project Status 1 August 2005: As reported in the prior reporting period, fully operational folding and unfolding with acceptable performance on long lines was integrated with CIFtbx and released as CIFtbx3. During this reporting period, CIFtbx3 was made the default release of CIFtbx, and, with S. R. Hall's approval, released under the GPL. The package was given extensive testing as the base for a new version of the program cif2pdb being used in another project, and worked correctly with excellent performance.
Project Status 28 May 2005: Fully operational folding and unfolding with acceptable performance on long lines was integrated with CIFtbx. The program cif2cif was upgraded to include options for folding and unfolding, making it an alternative to CIFFOLD and, more importantly, a template for Fortran programmers on how to adapt a Fortran application to use of CIFtbx3 for long lines. The first release of CIFtbx3 (ciftbx_3.0.0) was released at www.bernstein-plus-sons.com/software/ciftbx_3.0.0. There was an upgrade to www.bernstein-plus-sons.com/software/ciftbx_3.0.1 on 7 April 2005. This version seems to be reasonably stable. Links have been created from www.bernstein-plus-sons.com/software/CIFtbx3 to the ciftbx_3.0.1 release and from www.bernstein-plus-sons.com/software/CIFtbx2 to the ciftbx_2.6.4 release. The primary software distribution link for ciftbx will be upgraded from CIFtbx2 to CIFtbx3 during the next reporting period. We are pleased to note that the IUCr web site has the CIFtbx 3.0.1 release. Work by H. Bernstein with testing by K. Mitev and others.
Project Status 2 April 2005: The performance issue uncovered in the last cycle has been reasonably well addressed. The code for folding is written and the code for comment unfolding is written. With the addition of the code for text unfolding, this version may be ready for release and inclusion in the ITVG CDROM this month. (see work in progress ciftbx002.patch). Work by H. Bernstein.
Project Status 30 Jan 2005: Work on this package has brought to light serious performance issues in working with large numbers of large character strings in Fortran. The code of CIFtbx is being reworked to use representations of strings more appropriate to working in Fortran, combining trailing-blank-trimming and run-length-encoding.
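The idea of combining trailing-blank trimming with run-length encoding can be sketched as follows. The sketch is in Python rather than Fortran and the representation chosen here (a list of character and count pairs plus the original buffer length) is an assumption for illustration; the representation actually used inside CIFtbx is not described in this report.

```python
def encode(line):
    """Trim trailing blanks, then run-length encode what remains.

    Returns the list of (character, count) runs and the original length,
    so that the blank-padded fixed-length buffer can be rebuilt.
    """
    trimmed = line.rstrip(" ")
    runs = []
    for ch in trimmed:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([ch, 1])  # start a new run
    return [(ch, n) for ch, n in runs], len(line)


def decode(runs, length):
    """Rebuild the original blank-padded line from its runs and length."""
    text = "".join(ch * n for ch, n in runs)
    return text.ljust(length)
```

For a fixed-length buffer that holds a short tag followed by many padding blanks, only the few leading characters are stored, which is the saving that matters when buffers grow to hold the long lines of CIF 1.1.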
Project Status 5 Dec 2004: Mods in progress to extend line length and to do folding (see work in progress with code for folding ciftbx001.patch). Work by H. Bernstein.
This is a major set of inter-related projects, expected to take more than two years to complete. A phased release to Chester of partial preliminary versions of all of these packages will be made on this web site and feedback from Chester will be used to guide completion of the packages. Comments and suggestions by other interested parties would be appreciated.
As versions of these packages mature they will be released to the community as open source software without charge to encourage wide use. The software will be released using the GNU GPL license.