diff --git a/doc/html/XML_DTD/DesignNotes.html b/doc/html/XML_DTD/DesignNotes.html new file mode 100755 index 0000000000..12d587dd89 --- /dev/null +++ b/doc/html/XML_DTD/DesignNotes.html @@ -0,0 +1,419 @@ + + + + + + + + +

+The XML DTD for HDF5:  Design Notes

+April 28, 2000 +

+1.  Introduction

+The XML "Document Type Definition" (DTD) for HDF5 is a markup language +to describe the contents of an HDF5 file.[1]  This +DTD specifies a standard for using XML to describe the structure and contents +of a single HDF5 file.  The DTD can be used in a variety of +ways, by standard software and by application-specific software that builds +on standard XML features.  The DTD will enable descriptions of HDF5 +files to be used with and translated to other similar XML markup languages. +

This document discusses some of the key features of the HDF5 DTD, and +some of the design decisions that were considered during its development. +

The HDF5 data model is somewhat complex, with a great deal of flexibility +and expressive power.  The DTD is intended to be able to describe +almost any HDF5 file, and to describe most of the details of the file.  +For these reasons, the HDF5 DTD is more complicated than some similar DTDs, +such as XDF[7] and netCDF[8]. +

The DTD to some extent is redundant with the previously published "Data +Definition Language", and the accompanying "h5dump" and related tools.[2]  +XML descriptions will contain information similar or identical to that of the dumper +DDL.  The important difference is that the XML is machine readable, +but not necessarily human readable, and XML is a standard format that can +be exchanged with standard tools and other XML languages. +

The DTD defines a formal and machine verifiable syntax which +is rigorously enforced by validating XML tools.  This guarantees that +the producer and consumer can exchange the description.  The rules +of XML will guarantee that the description is syntactically correct and +follows the grammar defined in the DTD.  However, XML cannot assure +that a particular XML description is a correct description of the HDF5 +file, or even that it follows all the semantic rules of HDF5.  For +example, the XML description can assure that every Dataset element belongs +to at least one enclosing Group element, but can't assure that the Dataset +is in the correct Group, or that the Dataset has the correct name, type, +etc.  The overall correctness of the XML description must be assured +by tools that generate the XML. +

+2.  Requirements and Use Cases

+An important goal is for the DTD to be useful for a variety of purposes.  +For this reason, we considered a number of "Use Cases".  This analysis +showed that there are indeed many different uses for XML, which have different +requirements.  Our DTD is intended to support as many of these uses +as possible.  In this section, seven use cases are described and discussed. +

2.1  Case 1:  Viewing Structure and Contents of HDF5 File +Using a Web Browser +

XML descriptions of HDF5 files will be readable by standard Web browsers.  +Some standard Web browsers will be able to display the XML directly, and +many servers will be able to generate HTML from XML on the fly.   +We will also be able to construct style sheets to control the rendering +of information about HDF5 files. +

This use of XML will make it easy to get at least a general view of +the contents of an HDF5 file without any special software. +

2.2  Case 2:  XML as a Catalog Record +

One use of XML will be for catalog records, e.g., at a NASA DAAC or +similar archive.  The contents of HDF-5 files will be described in +XML records which will be stored in a database or otherwise served through +search services.  The XML will be delivered to clients or proxies +without the original HDF-5 file.  The client will use the records +to locate and obtain the datasets they want. +

In this use, the XML is likely to be used separately from the original +HDF-5 file, and any 'pointers' to the file must have complete URLs and +other information, in order to locate the actual dataset. +

In general, the purpose of these records is not to deliver all the data, +nor to reconstruct the contents of the file from the XML.  However, +the content of the attributes is likely to be vital.  Also, +there may be a desire for the records to be comparatively compact, as you +might be searching thousands of candidate datasets, and hence might receive +thousands of XML descriptions. +

It is difficult to know what kinds of searches will be required.  +However, it seems likely that details such as format version and storage +strategy are less likely to be of great interest, compared to searches +on the attributes. +

2.3  Case 3:  XML as an Intermediate Form for Programs +

Another case is using XML as a machine readable description of HDF-5 +files while manipulating the file itself.  An HDF5 editor is an example +of this kind of use.  Here, the XML and HDF-5 file are both accessible +to the program. +

It is difficult to generalize what applications might intend to do.  +For purposes of this use case, assume the following: +

The XML description of the file is used because the application +wants to use standard XML tools and interfaces to manipulate the objects.  +For instance, standard packages will read XML into DOM objects, which provide +not only standard data structures for the XML objects, but also standard +interfaces for manipulating the tree (insert, delete, etc.).  There +are already standard editors for XML trees.
+ +


The notion of this use case is that you would build tools for HDF-5 +by extending such standard XML functions.  The main trick will be +to keep the XML and the HDF-5 correlated.  When an XML object is created +or changed, the tool will perform the equivalent operation on the file. +
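
As a rough illustration of this use case, the following Python sketch edits a description through the standard DOM interface and calls a hook where a real tool would perform the equivalent operation on the HDF-5 file. The "Name" attribute, the add_dataset helper, and the on_create hook are assumptions made for this example; they are not taken from the DTD or from any HDF-5 library.

```python
from xml.dom import minidom

# Hypothetical, simplified description. <Group> and <Dataset> are element
# names mentioned in these notes; the "Name" attribute is an assumption.
DESCRIPTION = '<Group Name="/"><Dataset Name="temperature"/></Group>'


def add_dataset(doc, group, name, on_create):
    """Insert a <Dataset> under group, then mirror the change on the file."""
    dataset = doc.createElement("Dataset")
    dataset.setAttribute("Name", name)
    group.appendChild(dataset)
    # Placeholder for the equivalent operation on the HDF-5 file itself.
    on_create(name)
    return dataset


doc = minidom.parseString(DESCRIPTION)
root = doc.documentElement

# Record the file operations instead of touching a real file.
file_operations = []
add_dataset(doc, root, "pressure", file_operations.append)

names = [d.getAttribute("Name") for d in root.getElementsByTagName("Dataset")]
print(names)
print(file_operations)
```

Here the hook only records the requested operation; a real editor would have to decide what to do when the file operation fails after the tree has already been changed.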

Thus, in this case, it is very important for the XML to be closely related +to the structure of the HDF-5 file, and that this be maintained.  +This contrasts with Case 2 (a catalog) where the XML might be generated once, +and possibly used many times without ever accessing the file.  Also, +we definitely want the XML to point to the objects in the file, whether +the data is in the XML or not. +

Depending on the application, many details of the file may be needed, +certainly including things like storage strategy.  However, since +the file is available, these things can be obtained from the file rather +than XML.  This means that much of the detail could be optional for +the XML. +

2.4  Case 4:  Generation, Validation, and Reconstruction +of HDF-5 +

Another use of XML is as a tool for validating, comparing, or +generating HDF-5 files.  We have proposed tools for checking, correcting, +and diff-ing HDF-5 files, which might use XML as a canonical description +of the file.  Similarly, an 'h5gen' utility might well use XML as +the template to create HDF-5 files. +

These applications need to be able to represent essentially everything +about the HDF-5 file.  In the case of a validator or diff-er, even +boot block information is important. +

Also, it may well be the case that the data must be included in the +XML, either because the HDF-5 file is not available, or because it must +be arranged in a canonical form for comparison, e.g., to confirm that two +files have the same contents. +

While it is necessary for everything (or "everything important") to +be in the XML, it is not necessary that the XML representation itself follows +all of the rules of HDF-5.  For instance, it is not required that +the XML objects are in the same order as the HDF-5 objects (if such can +even be determined), or that storage offsets in the HDF5 file are faithfully +represented in the XML. +

2.5  Case 5:  XML as Intermediate to Other Formal Languages +and File Formats +

XML is ideally suited for automatic transformation into various formal +languages, either directly or via additional XML languages.  For example, +an XML description of an HDF5 file could be transformed into ODL.[citation?]  +Similarly, XML can be transformed to other XML languages, such as XDF[7]. +

XML may also be a good intermediate language for translating between +file formats.  For example, the XML description of HDF5 could be transformed +into the XML description for netCDF, and then the data could be written +as netCDF. +

It is likely that there will be "hub" languages, such as XDF, that are +very general languages for data.  Translating from HDF5-XML to XDF +will lose information, but will then make the data translatable to any +other format that can be mapped to XDF.  Similarly, data could be +imported to HDF5 from any format that can be translated to XDF, albeit +with some loss of information. +

It should also be noted that an XML description of HDF5 could be used +to transform or translate individual objects from a file.  For example, +an HDF5 file might contain several datasets, one of which can be mapped +to an OGIS gridded map.  In this case, software could read the XML, +locate the datasets that can be handled, and translate them to OGIS XML +or other OGIS representations.  In this way, similar kinds of data +can be made to work together regardless of storage format, and without +requiring that the entire file be limited to a particular kind or format +of data.  This would be a very powerful tool for sharing data. +

2.6  Case 6:  Store XML in Archive or in Dataset as Machine +Readable Documentation +

The XML description of an HDF5 file is a promising candidate to be a  +machine readable format to be stored in archives.  The XML would likely +be interpretable in the future, and could be mapped to whatever technology +is available. +

In this scenario, the XML should contain sufficient information to access +and translate the data if necessary. +

One variation of this theme is to store the descriptions of the files +in a repository, while the files may reside in some storage media.  +Or the XML might be stored in the file itself, as a machine readable table +of contents. +

2.7  Case 7:  Templates, Skeleton Files, etc. +

XML can be used as a medium for creating templates or skeletons for +HDF5 files.  For example, the skeleton of a data product could be +defined in XML, and read by software to produce the file and then fill +in the specific values.  This is a very useful tool for standardization.  +This is very similar to how the HCR tools for HDF-EOS worked.[citation] +

It might also be possible to have XML templates for parts of HDF5 files, +which can be composed to form datasets.  For instance, there could +be a library of XML templates for storing gridded data of various kinds, +which would be coordinated with software to efficiently store and retrieve +the data.  A user could compose a data product by selecting appropriate +templates to construct the dataset.  This could also provide code +modules to create and read the dataset. +

2.8  Implications +

These different use cases for XML require different (and sometimes conflicting) +information in the XML.  For instance, an XML catalog record is intended +to be a description of the dataset and its location.  This record +should be compact, and should have all the attributes, and a pointer to +the dataset at a data service, but the data should not be included.  +By contrast, an XML based validation tool needs to have a complete description +of the file, including the data (if present). +
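
To make the contrast concrete, the Python sketch below derives a compact catalog record from a complete description by dropping the data values, keeping the attributes, and adding a pointer to the file. The <Attribute> element, the "Name" and "Location" attributes, and the example URL are assumptions made for illustration only; only <Group>, <Dataset>, and <Data> are names mentioned in these notes.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical complete description, as a validation tool might need it.
FULL = """<Group Name="/">
  <Dataset Name="temperature">
    <Attribute Name="units">kelvin</Attribute>
    <Data>271.3 272.1 270.8</Data>
  </Dataset>
</Group>"""


def to_catalog_record(root, location):
    """Return a copy without <Data> elements, plus a pointer to the file."""
    record = copy.deepcopy(root)
    # Snapshot the elements first so removal does not disturb the traversal.
    for parent in list(record.iter()):
        for child in list(parent):
            if child.tag == "Data":
                parent.remove(child)
    # "Location" is an assumed attribute name for the pointer to the file.
    record.set("Location", location)
    return record


full = ET.fromstring(FULL)
record = to_catalog_record(full, "http://example.org/archive/sample.h5")
print(ET.tostring(record, encoding="unicode"))
```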

The HDF5 DTD is designed to support many uses.  In some cases, +there are alternative descriptions provided, e.g., data in the file can +be represented by a pointer to the original file or by a description of +the data values themselves--or both. +

+3.  Main Components of the HDF5 DTD

+The HDF5 DTD is intended to describe the structure and contents of an HDF5 +file.  For the most part, the DTD closely follows the HDF5 data model, +as described in [4] and [2].  The HDF5 data model +defines the shape and data types of datasets and attributes.  These +descriptions are similar to other general descriptions of scientific data +[5, 6, 7, 8, 11], +although HDF5 is more general than some of these.  The description of +the HDF5 objects is discussed in Section 3.1. +

An important feature of the HDF5 data model is the Group structure, +which allows the HDF5 file to be structured as a rooted directed graph, +analogous +to a Unix file system.  In the HDF5 file, objects may be shared, and +it is possible for objects to be a parent of their own ancestor (i.e., +the graph may have loops).  In other words, the structure of the HDF5 +file is not limited to be a tree.  In contrast, XML descriptions are +restricted to be a tree, so it was necessary to map the directed graph +of HDF5 onto a tree of XML elements.  This is discussed in Section +3.2. +

The XML standard does not define numeric types, nor representations +for arrays, tables, etc.  In the case where it is necessary to describe +actual data values (the value of an attribute, or values of an array), +there is no current standard to follow, so we were guided by the best practices +we could find.  Still, this is an area where our DTD must evolve in +the future.  These issues are discussed in Section 3.3. +

Finally, the DTD needs to support the ability to describe an HDF5 file +in detail.  This description must be able to include storage properties, +compression properties, and the like.  The DTD defines optional elements +for this information.  These are described in Section 3.4. +

3.1  Description of Datasets (Dataspace and Datatypes, and Attributes) +

The HDF5 data model provides a complete and well defined description +for most kinds of scientific data.  The DTD follows the HDF5 model +in a simple and clear way.  An HDF5 Dataset object is described +by an XML <Dataset> element; each <Dataset> has +a <Dataspace> and <Datatype> object, corresponding +to the HDF5 model. +

HDF5 has a very elaborate model of types, including arbitrary "compound +datatypes" (i.e., structured records with heterogeneous components) as +well as a completely general model of number representation.  Expressing +this in XML was easy, if somewhat elaborate.  It should be noted that +we made some seemingly arbitrary decisions about how to express the attributes +of a datatype:  sometimes an XML element is used and sometimes an +XML attribute is used. +

3.2  Description of the Structure (Groups) +

An HDF5 file is a rooted directed graph, with at least one Group, "/".  +Some files are very simple, containing a few datasets, all in the root +group.  Other files have elaborate grouping structures, organizing +the objects as a tree or graph.  Objects can be shared, i.e., they +can be members of more than one group.  In this case, the graph is +not a tree, because some objects have more than one parent.  It is +also possible for Groups to directly or indirectly contain an ancestor.  +In other words, the graph can have a loop in it. +

XML descriptions are trees, with exactly one root, and objects nested +in their parent.  XML has no concept of elements which have more than +one owner.  This raised the issue of how to map the graph structure +of the HDF5 file to a tree of XML elements. +

First, there is an issue of what is the desired relationship between +HDF5 objects and XML elements/objects.  It is clear that XML is general +enough to describe almost any structure.  For example, the "Resource +Description Framework" (RDF) can represent complex semantic networks.[10]  +So the issue is not a lack of expressive power in XML. +

The issue here is that standard XML software, e.g., SAX parsers and +the DOM, naturally create objects (data structures) which correspond to +the elements of the XML description.  To the degree that the objects +of HDF5 can be mapped to elements of XML, then general purpose XML-based +software will be presented with an approximation of the semantics of the +HDF5 objects, simply from the XML itself.  In other words, the HDF5 +objects are mapped naturally to XML elements, and general purpose XML tools +will approximately understand the structure of the HDF5. +

In this approach, the difficult problem is how to represent group membership.  +For a simple HDF5 file in which the objects are structured as a tree, +the objects can be represented as elements, and members of a group can +be nested in a <Group> element.  The XML nesting directly +expresses the HDF5 membership in a natural way.  But what should be +done to represent a more general graph, e.g., where a dataset is a member +of two different groups? +

One possibility is to represent the structure of the file in a general +set notation, with a set of nodes (vertices) and a set of arcs (edges).  +Each dataset and group is a "node", and the membership is represented as +"arcs".  There are many variants of this basic approach, and it is +easy to develop software that can read these two sets and construct the +graph.  This sort of representation is very natural for many algorithms +that manipulate graphs, and can be easily transformed into different data +structures.  However, standard XML software would have no notion of +the meaning of the edges and vertices, nor any clue to the structure of +the file. +

These approaches can be combined, nesting the objects as in XML, with +a special "link" or cross reference to represent a second occurrence of +the same object.  This hybrid approach has the advantage that in simple +cases the structure of the XML closely follows the structure of the HDF5 +file, while capturing the complex cases when needed. +

After considering each alternative in detail, a hybrid approach was +chosen.   For HDF5 objects that may be shared (Groups, Datasets, +Named Datatypes) the XML element is defined to be either a description +of the object or a "pointer" to an element that describes the object.  +A shared object should be described in exactly one element, and all other +instances should point to that element. +
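
The hybrid mapping can be sketched in Python as follows. This is an illustration under stated assumptions, not the generator used for the DTD: the in-memory OBJECTS table, the "Ptr" element suffix, and the "Name", "Id", and "Ref" attributes are all hypothetical names chosen for the example. The first time an object is reached it is described in full; every later occurrence, including one that would close a loop, becomes a pointer.

```python
import xml.etree.ElementTree as ET

# Hypothetical picture of an HDF5 file: object id -> (kind, members), where
# members are (link name, object id) pairs. Dataset "d1" is shared by two
# groups, and group "g1" contains the root again, so the graph has a loop.
OBJECTS = {
    "root": ("Group", [("a", "g1"), ("shared", "d1")]),
    "g1": ("Group", [("data", "d1"), ("up", "root")]),
    "d1": ("Dataset", []),
}


def describe(obj_id, name, objects, seen):
    """Describe an object once; later occurrences become pointer elements."""
    kind, members = objects[obj_id]
    if obj_id in seen:
        # Already described elsewhere: emit a pointer, do not recurse.
        return ET.Element(kind + "Ptr", {"Name": name, "Ref": obj_id})
    seen.add(obj_id)
    element = ET.Element(kind, {"Name": name, "Id": obj_id})
    for member_name, member_id in members:
        element.append(describe(member_id, member_name, objects, seen))
    return element


tree = describe("root", "/", OBJECTS, set())
print(ET.tostring(tree, encoding="unicode"))
```

Because an object is marked as seen before its members are visited, the walk terminates even when a group contains its own ancestor. Which occurrence receives the full description depends on traversal order, which is a choice left to the tool that generates the XML.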

It should be noted that the XML parser can verify that the "pointer" +points to a valid XML element, but not that it points to the correct element, +nor that there is only one description of a given HDF5 object.  For +instance, XML can confirm that a "pointer" has a reference to exactly one +element (HDF5 object), but that object could be any valid XML element, +including the link itself or any of the other elements of the DTD.  This +makes no sense according to the rules of HDF5, and those rules must be +enforced by the applications that create and use the XML description. +
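
A small Python sketch of the kind of check an application would have to supply itself is shown below. It assumes, purely for illustration, that descriptions carry an "Id" attribute, that pointers carry a "Ref" attribute, and that a pointer element is named after its target kind with a "Ptr" suffix; none of these names are taken from the DTD.

```python
import xml.etree.ElementTree as ET


def check_pointers(root):
    """Return problems with ids and pointers, using the assumed names above."""
    problems = []
    described = {}
    # Pass 1: every object id should be described in exactly one element.
    for element in root.iter():
        obj_id = element.get("Id")
        if obj_id is None:
            continue
        if obj_id in described:
            problems.append(f"{obj_id}: described more than once")
        described[obj_id] = element.tag
    # Pass 2: every pointer should reach a description of the matching kind.
    for element in root.iter():
        ref = element.get("Ref")
        if ref is None:
            continue
        target = described.get(ref)
        if target is None:
            problems.append(f"{ref}: pointer to nothing")
        elif target + "Ptr" != element.tag:
            problems.append(f"{ref}: {element.tag} points to a {target}")
    return problems


GOOD = '<Group Id="root"><Dataset Id="d1"/><DatasetPtr Ref="d1"/></Group>'
BAD = ('<Group Id="root"><Dataset Id="d1"/><Dataset Id="d1"/>'
       '<DatasetPtr Ref="root"/></Group>')

good_problems = check_pointers(ET.fromstring(GOOD))
bad_problems = check_pointers(ET.fromstring(BAD))
print(good_problems)
print(bad_problems)
```

Checks of this kind cover only the XML description; whether the description matches the actual HDF5 file is a separate question that requires access to the file.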

3.3  The Data Values +

While representing metadata with XML was fairly straightforward, it +was less obvious what should be done with the data values.  For different +purposes, it may be better to: +

+A second design choice is whether to mark up the data elements or include +data as a single block of undelimited text.  For example, the values +of a two dimensional array could be included either as a single block of +values, or tagged with XML elements for each row, or tagged individually +for each row and column. +
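
The three options for a two dimensional array can be compared with a short Python sketch. The <Data> element is mentioned in these notes, but the <Row> and <Value> elements and the whitespace-separated text layout are assumptions made for this example, not the markup defined by the DTD.

```python
import xml.etree.ElementTree as ET


def as_block(rows):
    """All values in one undelimited block of text."""
    data = ET.Element("Data")
    data.text = " ".join(str(v) for row in rows for v in row)
    return data


def by_row(rows):
    """One (hypothetical) <Row> element per row."""
    data = ET.Element("Data")
    for row in rows:
        ET.SubElement(data, "Row").text = " ".join(str(v) for v in row)
    return data


def by_value(rows):
    """One (hypothetical) <Value> element per row and column."""
    data = ET.Element("Data")
    for row in rows:
        row_element = ET.SubElement(data, "Row")
        for v in row:
            ET.SubElement(row_element, "Value").text = str(v)
    return data


ARRAY = [[1, 2, 3], [4, 5, 6]]
for build in (as_block, by_row, by_value):
    print(ET.tostring(build(ARRAY), encoding="unicode"))
```

The single block is the most compact, but a reader then needs the shape from the metadata to recover rows and columns; the fully tagged form exposes each value to generic XML tools at the cost of much larger documents.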

Examination of existing practice shows that there is no outstanding +agreement on these issues.  This is not surprising, since the choice +depends on the requirements of the intended use.  Interesting examples +of related work include: +

+We wanted to support as many variations as possible, so our design allows +many representations (including omission) for data values. +

One point to note about the HDF5 DTD:  many of the other approaches +(e.g., XDF) include substantial metadata about the shape and type of arrays.  +This information is provided in great detail by the HDF5 metadata, so our +markup of the data values is less elaborate than some other DTDs.  +On the other hand, certain facts such as the order of the dimensions and +elements in the XML description  must still be included, because +the XML is not required to be laid out in the order that the HDF5 file +specified. +

We were not able to create a satisfactory markup for data in the HDF5 +file for the first release.  The initial version of the DTD has a +limited <Data> element, which does not support all the desired +features.  This will be revised in a future release. +

3.4  File Format Details +

The DTD must be able to support applications that need to fully describe +the details of a specific HDF5 file.  For example, in order to verify +the correctness of a specific dataset in an archive, it may be necessary +to confirm the storage layout and compression parameters are correct, as +well as the structure, attributes, and data values. +

For these applications, optional elements are included in the DTD, including: +

+These elements are only partly defined in the first release of the DTD. +

+References

+ +
    +
  1. HDF5 DTD, http://hdf.ncsa.uiuc.edu/HDF5/XML
  2. DDL in BNF for HDF5, http://hdf.ncsa.uiuc.edu/HDF5/doc/ddl.html
  4. HDF5 Abstract Data Model, http://hdf.ncsa.uiuc.edu/HDF5/ADM_990506/
  5. VisAD, http://www.ssec.wisc.edu/~billh/visad.html
  6. XSIL: Extensible Scientific Interchange Language, http://www.cacr.caltech.edu/SDA/xsil/
  7. XDF (eXtensible Data Format), http://tarantella.gsfc.nasa.gov/xml/
  8. netcdf, http://hdf.ncsa.uiuc.edu/HDF5/XML/NetCDF/netcdf.dtd
  9. XML-data, http://www.w3.org/TR/1998/NOTE-XML-data/
  10. Resource Description Framework (RDF), http://www.w3.org/RDF/
  11. Scientific Data Management (SDM), http://www-xdiv.lanl.gov/XCI/PROJECTS/SDM
+ + +