diff --git a/doc/html/XML_DTD/DesignNotes.html b/doc/html/XML_DTD/DesignNotes.html new file mode 100755 index 0000000000..12d587dd89 --- /dev/null +++ b/doc/html/XML_DTD/DesignNotes.html @@ -0,0 +1,419 @@ + + +
+ + + + + +This document discusses some of the key features of the HDF5 DTD, and +some of the design decisions that were considered during its development. +
The HDF5 data model is somewhat complex, with a great deal of flexibility +and expressive power. The DTD is intended to be able to describe +almost any HDF5 file, and to describe most of the details of the file. +For these reasons, the HDF5 DTD is more complicated than some similar DTDs, +such as XDF[7] and netCDF[8]. +
The DTD to some extent is redundant with the previously published "Data +Definition Language", and the accompanying "h5dump" and related tools.[2] +XML descriptions will contain similar or identical information as the dumper +DDL. The important difference is that the XML is machine readable, +but not necessarily human readable, and XML is a standard format that can +be exchanged with standard tools and other XML languages. +
The DTD defines a formal and machine verifiable syntax which +is rigorously enforced by validating XML tools. This guarantees that +the producer and consumer can exchange the description. The rules +of XML will guarantee that the description is syntactically correct and +follows the grammar defined in the DTD. However, XML cannot assure +that a particular XML description is a correct description of the HDF5 +file, or even that it follows all the semantic rules of HDF5. For +example, the XML description can assure that every Dataset element belongs +to at least one enclosing Group element, but can't assure that the Dataset +is in the correct Group, or that the Dataset has the correct name, type, +etc. The overall correctness of the XML description must be assured +by tools that generate the XML. +
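As a rough illustration of a check that lies beyond what a validating parser can do, the sketch below compares the Dataset paths found in an XML description against a list of paths that a tool would have obtained from the HDF5 file itself. The `Name` attribute, the sample description, and the helper names are assumptions made for this example; the expected paths are supplied by hand rather than read from a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical description; the "Name" attribute is an assumption of this sketch.
XML_DESCRIPTION = """
<Group Name="/">
  <Dataset Name="temperature"/>
  <Group Name="model">
    <Dataset Name="pressure"/>
  </Group>
</Group>
"""


def dataset_paths(root):
    """Return the set of full paths of all Dataset elements nested under root."""
    paths = set()

    def walk(group, prefix):
        for child in group:
            if child.tag == "Dataset":
                paths.add(prefix + child.get("Name"))
            elif child.tag == "Group":
                walk(child, prefix + child.get("Name") + "/")

    walk(root, "/")
    return paths


def missing_and_extra(xml_text, expected_paths):
    """Compare the XML against paths known (by other means) to be in the file.

    Returns (paths missing from the XML, paths in the XML but not expected).
    """
    found = dataset_paths(ET.fromstring(xml_text))
    return expected_paths - found, found - expected_paths
```

Both versions of the description checked here would be accepted by a grammar that only requires Datasets to be nested in Groups; only the comparison against the expected paths reveals that a Dataset is in the wrong Group.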
2.1 Case 1: Viewing Structure and Contents of HDF5 File +Using a Web Browser +
XML descriptions of HDF5 files will be readable by standard Web browsers. +Some standard Web browsers will be able to display the XML directly, and +many servers will be able to generate HTML from XML on the fly. +We will also be able to construct style sheets to control the rendering +of information about HDF5 files. +
This use of XML will make it easy to get at least a general view of +the contents of an HDF5 file without any special software. +
2.2 Case 2: XML as a Catalog Record +
One use of XML will be for catalog records, e.g., at a NASA DAAC or +similar archive. The contents of HDF-5 files will be described in +XML records which will be stored in a database or otherwise served through +search services. The XML will be delivered to clients or proxies +without the original HDF-5 file. The client will use the records +to locate and obtain the datasets they want. +
In this use, the XML is likely to be used separately from the original +HDF-5 file, and any 'pointers' to the file must have complete URLs and +other information, in order to locate the actual dataset. +
In general, the purpose of these records is not to deliver all the data, +nor to reconstruct the contents of the file from the XML. However, +the content of the attributes is likely to be vital. Also, +there may be a desire for the records to be comparatively compact, as you +might be searching thousands of candidate datasets, and hence might receive +thousands of XML descriptions. +
It is difficult to know what kinds of searches will be required. +However, it seems likely that details such as format version and storage +strategy are less likely to be of great interest, compared to searches +on the attributes. +
2.3 Case 3: XML as an Intermediate Form for Programs +
Another use of XML is as a machine readable description of HDF-5 +files while manipulating the file itself. An HDF5 editor is an example +of this kind of use. Here, the XML and HDF-5 file are both accessible +to the program. +
It is difficult to generalize what applications might intend to do. +For purposes of this use case, assume the following: +
The XML description of the file is used because the application +wants to use standard XML tools and interfaces to manipulate the objects. +For instance, standard packages will read XML into DOM objects, which provides +not only standard data structures for the XML objects, but also standard +interfaces for manipulating the tree (insert, delete, etc.). There +are already standard editors for XML trees.+ +
The notion of this use case is that you would build tools for HDF-5
+by extending such standard XML functions. The main trick will be
+to keep the XML and the HDF-5 correlated. When an XML object is created
+or changed, the tool will perform the equivalent operation on the file.
+
Thus, in this case, it is very important for the XML to be closely related +to the structure of the HDF-5 file, and that this be maintained. +This contrasts to Case 2 (a catalog) where the XML might be generated once, +and possibly used many times without ever accessing the file. Also, +we definitely want the XML to point to the objects in the file, whether +the data is in the XML or not. +
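A minimal sketch of this idea, using the DOM implementation in the Python standard library: each change to the XML tree is paired with a record of the equivalent operation on the file. The `operations` list is only a stand-in for calls into an HDF5 library, and the `Name` attribute and function names are assumptions made for the example.

```python
from xml.dom import minidom

doc = minidom.parseString('<Group Name="/"><Dataset Name="a"/></Group>')
root = doc.documentElement

# Stand-in for operations on the HDF5 file; a real tool would call an
# HDF5 library here instead of recording tuples.
operations = []


def add_dataset(group, name):
    """Insert a Dataset element and record the matching file operation."""
    node = doc.createElement("Dataset")
    node.setAttribute("Name", name)
    group.appendChild(node)
    operations.append(("create_dataset", name))
    return node


def remove_dataset(group, name):
    """Delete the named Dataset element and record the matching file operation."""
    for node in list(group.childNodes):
        if (node.nodeType == node.ELEMENT_NODE
                and node.tagName == "Dataset"
                and node.getAttribute("Name") == name):
            group.removeChild(node)
            operations.append(("delete_dataset", name))
            return True
    return False
```

Routing every edit through functions like these is one way to keep the XML and the file correlated; an edit made directly on the DOM would bypass the file and let the two drift apart.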
Depending on the application, many details of the file may be needed, +certainly including things like storage strategy. However, since +the file is available, these things can be obtained from the file rather +than XML. This means that much of the detail could be optional for +the XML. +
2.4 Case 4: Generation, Validation, and Reconstruction +of HDF-5 +
Another use of XML is as a tool for validating, comparing, or +generating HDF-5 files. We have proposed tools for checking, correcting, +and diff-ing HDF-5 files, which might use XML as a canonical description +of the file. Similarly, an 'h5gen' utility might well use XML as +the template to create HDF-5 files. +
These applications need to be able to represent essentially everything +about the HDF-5 file. In the case of a validator or diff-er, even +boot block information is important. +
Also, it may well be the case that the data must be included in the +XML, either because the HDF-5 file is not available, or because it must +be arranged in a canonical form for comparison, e.g., to confirm that two +files have the same contents. +
While it is necessary for everything (or "everything important") to +be in the XML, it is not necessary that the XML representation itself follows +all of the rules of HDF-5. For instance, it is not required that +the XML objects are in the same order as the HDF-5 objects (if such can +even be determined), or that storage offsets in the HDF5 file are faithfully +represented in the XML. +
2.5 Case 5: XML as Intermediate to Other Formal Languages +and File Formats +
XML is ideally suited for automatic transformation into various formal +languages, either directly or via additional XML languages. For example, +an XML description of an HDF5 file could be transformed into ODL.[citation?] +Similarly, XML can be transformed to other XML languages, such as XDF[7]. +
XML may also be a good intermediate language for translating between +file formats. For example, the XML description of HDF5 could be transformed +into the XML description for netCDF, and then the data could be written +as netCDF. +
It is likely that there will be "hub" languages, such as XDF, that are +very general languages for data. Translating from HDF5-XML to XDF +will lose information, but will then make the data translatable to any +other format that can be mapped to XDF. Similarly, data could be +imported to HDF5 from any format that can be translated to XDF, albeit +with some loss of information. +
It should also be noted that an XML description of HDF5 could be used +to transform or translate individual objects from a file. For example, +an HDF5 file might contain several datasets, one of which can be mapped +to an OGIS gridded map. In this case, software could read the XML, +locate the datasets that can be handled, and translate them to OGIS XML +or other OGIS representations. In this way, similar kinds of data +can be made to work together regardless of storage format, and without +requiring that the entire file be limited to a particular kind or format +of data. This would be a very powerful tool for sharing data. +
2.6 Case 6: Store XML in Archive or in Dataset as Machine +Readable Documentation +
The XML description of an HDF5 file is a promising candidate to be a +machine readable format to be stored in archives. The XML would likely +be interpretable in the future, and could be mapped to whatever technology +is available. +
In this scenario, the XML should contain sufficient information to access +and translate the data if necessary. +
One variation of this theme is to store the descriptions of the files +in a repository, while the files may reside in some storage media. +Or the XML might be stored in the file itself, as a machine readable table +of contents. +
2.7 Case 7: Templates, Skeleton Files, etc. +
XML can be used as a medium for creating templates or skeletons for +HDF5 files. For example, the skeleton of a data product could be +defined in XML, and read by software to produce the file and then fill +in the specific values. This is a very useful tool for standardization. +This is very similar to how the HCR tools for HDF-EOS worked.[citation] +
It might also be possible to have XML templates for parts of HDF5 files, +which can be composed to form datasets. For instance, there could +be a library of XML templates for storing gridded data of various kinds, +which would be coordinated with software to efficiently store and retrieve +the data. A user could compose a data product by selecting appropriate +templates to construct the dataset. This could also provide code +modules to create and read the dataset. +
2.8 Implications +
These different use cases for XML require different (and sometimes conflicting) +information in the XML. For instance, an XML catalog record is intended +to be a description of the dataset and its location. This record +should be compact, and should have all the attributes, and a pointer to +the dataset at a data service, but the data should not be included. +By contrast, an XML based validation tool needs to have a complete description +of the file, including the data (if present). +
The HDF5 DTD is designed to support many uses. In some cases, +there are alternative descriptions provided, e.g., data in the file can +be represented by a pointer to the original file or by a description of +the data values themselves--or both. +
An important feature of the HDF5 data model is the Group structure, +which allows the HDF5 file to be structured as a rooted directed graph, +analogous +to a Unix file system. In the HDF5 file objects may be shared, and +it is possible for objects to be a parent of their own ancestor (i.e., +the graph may have loops). In other words, the structure of the HDF5 +file is not limited to a tree. In contrast, XML descriptions are +restricted to a tree, so it was necessary to map the directed graph +of HDF5 onto a tree of XML elements. This is discussed in Section +3.2. +
The XML standard does not define numeric types, nor representations +for arrays, tables, etc. In the case where it is necessary to describe +actual data values (the value of an attribute, or values of an array), +there is no current standard to follow, so we were guided by the best practices +we could find. Still, this is an area where our DTD must evolve in +the future. These issues are discussed in Section 3.3. +
Finally, the DTD needs to support the ability to describe an HDF5 file +in detail. This description must be able to include storage properties, +compression properties, and the like. The DTD defines optional elements +for this information. These are described in Section 3.4. +
3.1 Description of Datasets (Dataspace and Datatypes, and Attributes) +
The HDF5 data model provides a complete and well defined description +for most kinds of scientific data. The DTD follows the HDF5 model +in a simple and clear way. An HDF5 Dataset object is described +by an XML <Dataset> element; each <Dataset> has +a <Dataspace> and <Datatype> object, corresponding +to the HDF5 model. +
HDF5 has a very elaborate model of types, including arbitrary "compound +datatypes" (i.e., structured records with heterogeneous components) as +well as a completely general model of number representation. Expressing +this in XML was easy, if somewhat elaborate. It should be noted that +we made some seemingly arbitrary decisions about how to express the attributes +of a datatype: sometimes an XML element is used and sometimes an +XML attribute is used. +
3.2 Description of the Structure (Groups) +
An HDF5 file is a rooted directed graph, with at least one Group, "/". +Some files are very simple, containing a few datasets, all in the root +group. Other files have elaborate grouping structures, organizing +the objects as a tree or graph. Objects can be shared, i.e., they +can be members of more than one group. In this case, the graph is +not a tree, because some objects have more than one parent. It is +also possible for Groups to directly or indirectly contain an ancestor. +In other words, the graph can have a loop in it. +
XML descriptions are trees, with exactly one root, and objects nested +in their parent. XML has no concept of elements which have more than +one owner. This raised the issue of how to map the graph structure +of the HDF5 file to a tree of XML elements. +
First, there is an issue of what is the desired relationship between +HDF5 objects and XML elements/objects. It is clear that XML is general +enough to describe almost any structure. For example, the "Resource +Description Framework" (RDF) can represent complex semantic networks.[10] +So the issue is not a lack of expressive power in XML. +
The issue here is that standard XML software, e.g., SAX parsers and +the DOM, naturally create objects (data structures) which correspond to +the elements of the XML description. To the degree that the objects +of HDF5 can be mapped to elements of XML, then general purpose XML-based +software will be presented with an approximation of the semantics of the +HDF5 objects, simply from the XML itself. In other words, the HDF5 +objects are mapped naturally to XML elements, and general purpose XML tools +will approximately understand the structure of the HDF5. +
In this approach, the difficult problem is how to represent group membership. +For a simple HDF5 file in which the objects are structured as a tree, +the objects can be represented as elements, and members of a group can +be nested in a <Group> element. The XML nesting directly +expresses the HDF5 membership in a natural way. But what should be +done to represent a more general graph, e.g., where a dataset is a member +of two different groups? +
One possibility is to represent the structure of the file in a general +set notation, with a set of nodes (vertices) and a set of arcs (edges). +Each dataset and group is a "node", and the membership is represented as +"arcs". There are many variants of this basic approach, and it is +easy to develop software that can read these two sets and construct the +graph. This sort of representation is very natural for many algorithms +that manipulate graphs, and can be easily transformed into different data +structures. However, standard XML software would have no notion of +the meaning of the edges and vertices, nor any clue to the structure of +the file. +
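The set notation can be sketched as follows. The node names and the `build_graph` helper are invented for this illustration; the point is only that a program can rebuild the membership graph from the two sets, including a dataset that belongs to two groups.

```python
# Hypothetical file: dataset "d1" is a member of both "/g1" and "/g2".
nodes = {"/", "/g1", "/g2", "d1"}
arcs = {("/", "/g1"), ("/", "/g2"), ("/g1", "d1"), ("/g2", "d1")}


def build_graph(nodes, arcs):
    """Build child and parent adjacency maps from a node set and an arc set."""
    children = {n: set() for n in nodes}
    parents = {n: set() for n in nodes}
    for parent, child in arcs:
        if parent not in children or child not in children:
            raise ValueError(f"arc refers to unknown node: {(parent, child)}")
        children[parent].add(child)
        parents[child].add(parent)
    return children, parents
```

Nothing in the two sets is nested, which is why general purpose XML software reading such a description would see only flat lists of nodes and arcs.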
These approaches can be combined, nesting the objects as in XML, with +a special "link" or cross reference to represent a second occurrence of +the same object. This hybrid approach has the advantage that in simple +cases the structure of the XML closely follows the structure of the HDF5 +file, while capturing the complex cases when needed. +
After considering each alternative in detail, a hybrid approach was +chosen. For HDF5 objects that may be shared (Groups, Datasets, +Named Datatypes) the XML element is defined to be either a description +of the object or a "pointer" to an element that describes the object. +A shared object should be described in exactly one element, and all other +instances should point to that element. +
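The hybrid approach might be processed along the following lines. The element names `DatasetPtr` and `GroupPtr` and the attributes `id`, `ref`, and `Name` are placeholders chosen for this sketch and should not be read as the names defined by the DTD. Each shared object is described once, and every other occurrence is a pointer that is resolved back to that single description.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: "d1" is described once inside g1 and pointed to from g2;
# g2 also points back to the root group, which forms a loop.
HYBRID_XML = """
<Group Name="/" id="root">
  <Group Name="g1" id="g1">
    <Dataset Name="d1" id="d1"/>
  </Group>
  <Group Name="g2" id="g2">
    <DatasetPtr ref="d1"/>
    <GroupPtr ref="root"/>
  </Group>
</Group>
"""


def membership(xml_text):
    """Map each group id to the ids of its members, resolving pointers."""
    root = ET.fromstring(xml_text)
    described = {el.get("id"): el for el in root.iter() if el.get("id")}
    members = {}

    def visit(group):
        ids = []
        for child in group:
            if child.tag.endswith("Ptr"):
                target = child.get("ref")
                if target not in described:
                    raise KeyError(f"pointer to undescribed object: {target}")
                ids.append(target)
            else:
                ids.append(child.get("id"))
                if child.tag == "Group":
                    visit(child)
        members[group.get("id")] = ids

    visit(root)
    return members
```

In the simple case with no pointers, the result mirrors the XML nesting exactly; the pointers add the extra memberships that a tree alone cannot express.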
It should be noted that the XML parser can verify that the "pointer" +points to a valid XML element, but not that it points to the correct element, +nor that there is only one description of a given HDF5 object. For +instance, XML can confirm that a "pointer" has a reference to exactly one +element (HDF5 object), but that object could be any valid XML element, +including the link itself or any of the other elements of the DTD. This +makes no sense according to the rules of HDF5, and those rules must be +enforced by the applications that create and use the XML description. +
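One such application-level rule could be checked as sketched below: a pointer to a dataset should refer to a Dataset description, and a pointer to a group to a Group description. As before, `DatasetPtr`, `GroupPtr`, `id`, and `ref` are placeholder names assumed for the example, not names taken from the DTD.

```python
import xml.etree.ElementTree as ET


def pointer_errors(xml_text):
    """Return messages for pointers whose target is missing or of the wrong kind."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    errors = []
    for el in root.iter():
        if not el.tag.endswith("Ptr"):
            continue
        expected = el.tag[:-3]  # "DatasetPtr" -> "Dataset"
        ref = el.get("ref")
        target = by_id.get(ref)
        if target is None:
            errors.append(f"{el.tag} ref={ref!r}: no such element")
        elif target.tag != expected:
            errors.append(
                f"{el.tag} ref={ref!r}: points to {target.tag}, expected {expected}"
            )
    return errors
```

A check of this kind still cannot tell whether two separate descriptions refer to the same HDF5 object; that requires comparison with the file itself.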
3.3 The Data Values +
While representing metadata with XML was fairly straightforward, it +was less obvious what should be done with the data values. For different +purposes, it may be better to: +
Examination of existing practice shows that there is no outstanding +agreement on these issues. This is not surprising, since the choice +depends on the requirements of the intended use. Interesting examples +of related work include: +
+We wanted to support as many variations as possible, so our design allows +many representations (including omission) for data values. +One point to note about the HDF5 DTD: many of the other approaches +(e.g., XDF) include substantial metadata about the shape and type of arrays. +This information is provided in great detail by the HDF5 metadata, so our +markup of the data values is less elaborate than some other DTDs. +On the other hand, certain facts such as the order of the dimensions and +elements in the XML description must still be included, because +the XML is not required to be laid out in the order that the HDF5 file +specified. +
We were not able to create a satisfactory markup for data in the HDF5 +file for the first release. The initial version of the DTD has a +limited <Data> element, which does not support all the desired +features. This will be revised in a future release. +
3.4 File Format Details +
The DTD must be able to support applications that need to fully describe +the details of a specific HDF5 file. For example, in order to verify +the correctness of a specific dataset in an archive, it may be necessary +to confirm the storage layout and compression parameters are correct, as +well as the structure, attributes, and data values. +
For these applications, optional elements are included in the DTD, including: +