[svn-r7450] Purpose:
Add document describing issues relating to variable-length datatypes.
This commit is contained in:
151
doc/html/TechNotes/VLTypes.html
Normal file
151
doc/html/TechNotes/VLTypes.html
Normal file
@@ -0,0 +1,151 @@
|
||||
<html>
|
||||
<head>
|
||||
<title>
|
||||
Variable-Length Datatypes in HDF5
|
||||
</title>
|
||||
|
||||
<STYLE TYPE="text/css">
|
||||
|
||||
P { text-indent: 2em}
|
||||
P.item { margin-left: 2em; text-indent: -2em}
|
||||
P.item2 { margin-left: 2em; text-indent: 2em}
|
||||
|
||||
TABLE.format { border:solid; border-collapse:collapse; caption-side:top; text-align:center; width:80%;}
|
||||
TABLE.format TH { border:ridge; padding:4px; width:25%;}
|
||||
TABLE.format TD { border:ridge; padding:4px; }
|
||||
TABLE.format CAPTION { font-weight:bold; font-size:larger;}
|
||||
|
||||
TABLE.note {border:none; text-align:right; width:80%;}
|
||||
|
||||
TABLE.desc { border:solid; border-collapse:collapse; caption-size:top; text-align:left; width:80%;}
|
||||
TABLE.desc TR { vertical-align:top;}
|
||||
TABLE.desc TH { border-style:ridge; font-size:larger; padding:4px; text-decoration:underline;}
|
||||
TABLE.desc TD { border-style:ridge; padding:4px; }
|
||||
TABLE.desc CAPTION { font-weight:bold; font-size:larger;}
|
||||
|
||||
TABLE.list { border:none; }
|
||||
TABLE.list TR { vertical-align:top;}
|
||||
TABLE.list TH { border:none; text-decoration:underline;}
|
||||
TABLE.list TD { border:none; }
|
||||
|
||||
</STYLE>
|
||||
|
||||
<!-- #BeginLibraryItem "/ed_libs/styles_Format.lbi" -->
|
||||
<!--
|
||||
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
|
||||
* Copyright by the Board of Trustees of the University of Illinois. *
|
||||
* All rights reserved. *
|
||||
* *
|
||||
* This file is part of HDF5. The full HDF5 copyright notice, including *
|
||||
* terms governing use, modification, and redistribution, is contained in *
|
||||
* the files COPYING and Copyright.html. COPYING can be found at the root *
|
||||
* of the source code distribution tree; Copyright.html can be found at the *
|
||||
* root level of an installed copy of the electronic HDF5 document set and *
|
||||
* is linked from the top-level documents page. It can also be found at *
|
||||
* http://hdf.ncsa.uiuc.edu/HDF5/doc/Copyright.html. If you do not have *
|
||||
* access to either file, you may request a copy from hdfhelp@ncsa.uiuc.edu. *
|
||||
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
|
||||
-->
|
||||
|
||||
<link href="ed_styles/FormatElect.css" rel="stylesheet" type="text/css">
|
||||
<!-- #EndLibraryItem -->
|
||||
</head>
|
||||
|
||||
<body bgcolor="#FFFFFF">
|
||||
<H3>Introduction</H3>
|
||||
<P>Variable-length (VL) datatypes have a great deal of flexibility, but can
|
||||
be over- or mis-used. VL datatypes are ideal at capturing the notion
|
||||
that elements in an HDF5 dataset (or attribute) can have different
|
||||
amounts of information (VL strings are the canonical example),
|
||||
but they have some drawbacks that this document attempts
|
||||
to address.
|
||||
</P>
|
||||
|
||||
<H3>Background</H3>
|
||||
<P>Because fast random access to dataset elements requires that each
|
||||
element be a fixed size, the information stored for VL datatype elements
|
||||
is actually information to locate the VL information, not
|
||||
the information itself.
|
||||
</P>
|
||||
|
||||
<H3>When to use VL datatypes</H3>
|
||||
<P>VL datatypes are designed allow the amount of data stored in each
|
||||
element of a dataset to vary. This change could be
|
||||
over time as new values, with different lengths, were written to the
|
||||
element. Or, the change can be over "space" - the dataset's space,
|
||||
with each element in the dataset having the same fundamental type, but
|
||||
different lengths. "Ragged arrays" are the classic example of elements
|
||||
that change over the "space" of the dataset. If the elements of a
|
||||
dataset are not going to change over "space" or time, a VL datatype
|
||||
should probably not be used.
|
||||
</P>
|
||||
|
||||
<H3>Access Time Penalty</H3>
|
||||
<P>Accessing VL information requires reading the element in the file, then
|
||||
using that element's location information to retrieve the VL
|
||||
information itself.
|
||||
In the worst case, this obviously doubles the number of disk accesses
|
||||
required to access the VL information.
|
||||
</P>
|
||||
<P>However, in order to avoid this extra disk access overhead, the HDF5
|
||||
library groups VL information together into larger blocks on disk and
|
||||
performs I/O only on those larger blocks. Additionally, these blocks of
|
||||
information are cached in memory as long as possible. For most access
|
||||
patterns, this amortizes the extra disk accesses over enough pieces of
|
||||
VL information to hide the extra overhead involved.
|
||||
</P>
|
||||
|
||||
<H3>Storage Space Penalty</H3>
|
||||
<P>Because VL information must be located and retrieved from another
|
||||
location in the file, extra information must be stored in the file to
|
||||
locate
|
||||
each item of VL information (i.e. each element in a dataset or each
|
||||
VL field in a compound datatype, etc.).
|
||||
Currently, that extra information amounts to 32 bytes per VL item.
|
||||
</P>
|
||||
<P>
|
||||
With some judicious re-architecting of the library and file format,
|
||||
this could be reduced to 18 bytes per VL item with no loss in
|
||||
functionality or additional time penalties. With some additional
|
||||
effort, the space could perhaps could be pushed down as low as 8-10
|
||||
bytes per VL item with no loss in functionality, but potentially a
|
||||
small time penalty.
|
||||
</P>
|
||||
|
||||
<H3>Chunking and Filters</H3>
|
||||
<P>Storing data as VL information has some affects on chunked storage and
|
||||
the filters that can be applied to chunked data. Because the data that
|
||||
is stored in each chunk is the location to access the VL information,
|
||||
the actual VL information is not broken up into chunks in the same way
|
||||
as other data stored in chunks. Additionally, because the
|
||||
actual VL information is not stored in the chunk, any filters which
|
||||
operate on a chunk will operate on the information to
|
||||
locate the VL information, not the VL information itself.
|
||||
</P>
|
||||
|
||||
<H3>File Drivers</H3>
|
||||
<P>Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't
|
||||
allow objects with varying sizes to be created in the file, attemping
|
||||
to create
|
||||
a dataset or attribute with a VL datatype in a file managed by those
|
||||
drivers will cause the creation call to fail.
|
||||
</P>
|
||||
<P>Additionally, using
|
||||
VL datatypes and the 'multi' and 'split' file drivers may not operate
|
||||
in the manner desired. The HDF5 library currently categorizes the
|
||||
"blocks of VL information" stored in the file as a type of metadata,
|
||||
which means that they may not be stored with the other raw data for
|
||||
the file.
|
||||
</P>
|
||||
|
||||
<H3>Rewriting</H3>
|
||||
<P>When VL information in the file is re-written, the old VL information
|
||||
must be releases, space for the new VL information allocated and
|
||||
the new VL information must be written to the file. This may cause
|
||||
additional I/O accesses.
|
||||
</P>
|
||||
|
||||
</body>
|
||||
|
||||
</html>
|
||||
|
||||
Reference in New Issue
Block a user