Variable-length Datatypes



                       VARIABLE-LENGTH DATATYPES IN HDF5
                            (a temporary document)


Variable-length Datatype Overview And Justification:
----------------------------------------------------
    Variable-length (VL) datatypes are sequences of an existing datatype
(atomic, VL or compound) which are not fixed in length from one dataset location
to another.  They are similar to C character strings in essence - a sequence of
a type which is pointed to by a particular type of "pointer", although they are
implemented more closely to FORTRAN strings by including an explicit length in
the "pointer" instead of using a particular value to terminate the sequence.
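
    The length-plus-pointer layout described above can be pictured with a small
stand-alone C sketch.  The vl_elem struct below is a hypothetical stand-in that
mirrors the two fields of HDF5's hvl_t (a length and a pointer); the helper
names vl_fill_row and vl_row_sum are also invented for illustration and are not
part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in with the same two fields as HDF5's hvl_t. */
typedef struct {
    size_t len;  /* number of base-type elements in the sequence */
    void  *p;    /* pointer to the sequence data */
} vl_elem;

/* Fill row i of a ragged int array with (i + 1) values: 0, 1, ..., i.
 * Returns 0 on success, -1 if allocation fails. */
int vl_fill_row(vl_elem *row, size_t i)
{
    assert(row != NULL);
    int *data = malloc((i + 1) * sizeof *data);
    if (data == NULL)
        return -1;
    for (size_t k = 0; k <= i; k++)
        data[k] = (int)k;
    row->len = i + 1;
    row->p = data;
    return 0;
}

/* Sum one row using the stored length; no terminator value is needed. */
long vl_row_sum(const vl_elem *row)
{
    assert(row != NULL);
    const int *data = row->p;
    long sum = 0;
    for (size_t k = 0; k < row->len; k++)
        sum += data[k];
    return sum;
}
```

    Each row carries its own length, so rows of different sizes can sit side by
side in one array of vl_elem, which is the ragged-array case listed below.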

    VL datatypes are useful to the scientific community in many different ways,
some of which are listed below:
    - Ragged Arrays: Multi-dimensional ragged arrays can be implemented with
        the last (fastest changing) dimension being ragged by using a
        VL datatype as the type of the element stored. (Or as a field in a
        compound datatype)
    - Fractal Arrays: If a compound datatype has a VL field of another compound
        type with VL fields (a "nested" VL datatype), this can be used to
        implement ragged arrays of ragged arrays, to whatever nesting depth is
        required for the user.
    - Polygon Lists: A common storage requirement is to efficiently store arrays
        of polygons with different numbers of vertices.  VL datatypes can be
        used to efficiently and succinctly describe an array of polygons with
        different numbers of vertices.
    - Character Strings: Perhaps the most common use of VL datatypes will be to
        store C-like VL character strings in dataset elements or as attributes
        of objects.
    - Indices: An array of VL object references could be used as an index to
        all the objects in a file which contained a particular sequence of
        dataset values. Perhaps an array something like the following:
            Value1: Object1, Object3, Object9
            Value2: Object0, Object12, Object14, Object21, Object22
            Value3: Object2
            Value4: 
            Value5: Object1, Object10, Object12
                .
                .
    - Object Tracking: An array of VL dataset region references can be used as
        a method of tracking objects or features appearing in a sequence of
        datasets.  Perhaps an array of them would look like:
            Feature1: Dataset1:Region, Dataset3:Region, Dataset9:Region
            Feature2: Dataset0:Region, Dataset12:Region, Dataset14:Region,
                    Dataset21:Region, Dataset22:Region
            Feature3: Dataset2:Region
            Feature4: 
            Feature5: Dataset1:Region, Dataset10:Region, Dataset12:Region
                .
                .


Variable-length Datatype Memory Management:
-------------------------------------------
    Because each element of a dataset with a VL datatype may hold a sequence
of a different length, the memory for VL data must be dynamically
allocated.  Currently there are two methods of managing the memory for VL
datatypes: the standard C malloc/free memory allocation routines or a method
of calling user-defined memory management routines to allocate or free memory.
Since the memory allocated when reading (or writing) may be complicated to
release, an HDF5 routine is provided to traverse a memory buffer and free the
VL datatype information without leaking memory.


Why Variable-length Datatypes Can't Be Divided:
-----------------------------------------------
    VL datatypes are designed so that they cannot be subdivided by the library
with selections, etc.  This design was chosen due to the complexities in
specifying selections on each VL element of a dataset through a selection API
that is easy to understand.  Also, the selection APIs work on dataspaces, not
on datatypes.  At some point in time, we may want to create a way for
dataspaces to have VL components to them and we would need to allow selections
of those VL regions, but that is beyond the scope of this document.


What Happens If The Library Runs Out Of Memory While Reading?:
--------------------------------------------------------------
    It is possible for a call to H5Dread to fail while reading in VL datatype
information if the memory required exceeds that which is available.  In this
case, the H5Dread call will fail gracefully and any VL data which has been
allocated prior to the memory shortage will be returned to the system via the
memory management routines detailed below.  It may be possible to design a
"partial read" API function at a later date, if demand for such a function
warrants.
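
    The all-or-nothing behaviour described above can be sketched in plain C.
This is only an illustration of the idea (free everything allocated so far when
one allocation fails), not the library's internal code.  The vl_elem struct,
the alloc_fn type, vl_alloc_all and budget_alloc are all hypothetical names;
budget_alloc exists only so that a memory shortage can be simulated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in with the same two fields as HDF5's hvl_t. */
typedef struct { size_t len; void *p; } vl_elem;

/* Allocation callback; info carries caller-defined state. */
typedef void *(*alloc_fn)(size_t size, void *info);

/* Allocate lens[i] bytes for each of n elements.  If any allocation
 * fails, release what was already allocated, zero every element and
 * return -1, so the caller never sees a half-filled buffer. */
int vl_alloc_all(vl_elem *elems, const size_t *lens, size_t n,
                 alloc_fn alloc, void *info)
{
    assert(alloc != NULL);
    for (size_t i = 0; i < n; i++) {
        if (lens[i] == 0) {            /* empty sequence: nothing to allocate */
            elems[i].len = 0;
            elems[i].p = NULL;
            continue;
        }
        void *p = alloc(lens[i], info);
        if (p == NULL) {
            for (size_t j = 0; j < i; j++)
                free(elems[j].p);      /* free(NULL) is harmless */
            for (size_t j = 0; j < n; j++) {
                elems[j].len = 0;
                elems[j].p = NULL;
            }
            return -1;
        }
        elems[i].len = lens[i];
        elems[i].p = p;
    }
    return 0;
}

/* Test allocator: succeeds until the byte budget in *info runs out. */
void *budget_alloc(size_t size, void *info)
{
    size_t *budget = info;
    if (size > *budget)
        return NULL;
    *budget -= size;
    return malloc(size);
}
```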


Strings as Variable-length Datatypes:
-------------------------------------
    Since character strings are a special case of VL data that is implemented
in many different ways on different machines and programming languages, they are
handled somewhat differently from other VL datatypes in HDF5.
    HDF5 has native VL strings for each language API, which are stored the
same way on disk, but are exported through each language API in a natural way
for that language.  When retrieving VL strings from a dataset, users may choose
to have them stored in memory as a native VL string or in HDF5's hvl_t struct
for VL datatypes.
    VL strings may be created in one of two ways: by creating a VL datatype with
a base type of H5T_NATIVE_ASCII, H5T_NATIVE_UNICODE, etc., or by creating a string
datatype and setting its length to H5T_STRING_VARIABLE.  The second method is
used to access native VL strings in memory.  The library will convert between
the two types, but they are stored on disk using different datatypes and have
different memory representations.
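
    The difference between the two in-memory shapes (a native C string and a
length-counted VL sequence of characters) can be illustrated with a short
sketch.  It assumes the VL form stores the characters without a terminating
NUL, which this document does not state; vl_elem, vl_from_cstr and vl_to_cstr
are hypothetical names and this is not the library's conversion code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in with the same two fields as HDF5's hvl_t. */
typedef struct { size_t len; void *p; } vl_elem;

/* Copy a NUL-terminated string into a length-counted sequence.
 * Assumption: the terminator is not stored.  Returns 0 or -1. */
int vl_from_cstr(vl_elem *out, const char *s)
{
    assert(out != NULL && s != NULL);
    size_t n = strlen(s);
    char *buf = NULL;
    if (n > 0) {
        buf = malloc(n);
        if (buf == NULL)
            return -1;
        memcpy(buf, s, n);
    }
    out->len = n;
    out->p = buf;
    return 0;
}

/* Return a newly allocated NUL-terminated copy; the caller frees it. */
char *vl_to_cstr(const vl_elem *in)
{
    assert(in != NULL);
    char *s = malloc(in->len + 1);
    if (s == NULL)
        return NULL;
    if (in->len > 0)
        memcpy(s, in->p, in->len);
    s[in->len] = '\0';
    return s;
}
```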
    Multi-byte character representations, such as UNICODE or "wide" characters
in C/C++, will need the appropriate character and string datatypes created
so that they can be described properly through the datatype API.  Additional
conversions between these types and the current ASCII characters will also be
required.
    Variable-width character strings (which might be compressed data or some
other encoding) are not currently handled by this design.  We will evaluate
how to implement them based on user feedback.


Variable-length Datatype API:
-----------------------------
Creation:
    VL datatypes are created with the H5Tvlen_create() function as follows:
        hid_t H5Tvlen_create(hid_t base_type_id);
    The base datatype will be the datatype that the sequence is composed of,
characters for character strings, vertex coordinates for polygon lists, etc.
The base type specified for the VL datatype can be of any HDF5 datatype,
including another VL datatype, a compound datatype or an atomic datatype.


Query base type of VL datatype:
    It may be necessary to know the base type of a VL datatype before memory
is allocated, etc.  The base type is queried with the H5Tget_super()
function, described in the H5T documentation.


Query minimum memory required for VL information:
    In order to predict the memory usage that H5Dread may need to allocate to
store VL data while reading the data, the H5Dget_vlen_buf_size() function is
provided, as follows:
        herr_t H5Dget_vlen_buf_size(hid_t dataset_id, hid_t type_id,
            hid_t space_id, hsize_t *size) 
        (This function is not implemented in Release 1.2.)
    This routine checks the number of bytes required to store the VL data from
the dataset, using the space_id for the selection in the dataset on disk and
the type_id for the memory representation of the VL data in memory.  The *size
value is modified according to how many bytes are required to store the VL data
in memory.
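
    As a rough picture of what such a size query has to compute, the sketch
below assumes the requirement is simply the sum, over the selected elements, of
each sequence length times the in-memory size of the base type.  That is an
assumption for illustration; the real routine may account for memory
differently.  The name vl_data_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes of sequence data for n selected elements whose lengths
 * (in base-type elements) are given in lens[], assuming each base-type
 * element occupies base_size bytes in memory. */
size_t vl_data_bytes(const size_t *lens, size_t n, size_t base_size)
{
    assert(lens != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += lens[i] * base_size;
    return total;
}
```

    An application could compare such a total against its memory budget before
issuing the read.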


Specifying how to manage memory for the VL datatype:
    The memory management method is determined by dataset transfer properties
passed into H5Dread/H5Dwrite with the dataset transfer property list.
    Either the system malloc and free calls or user-defined substitutes can be
used via the H5Pset_vlen_mem_manage_type() call, as follows:
        herr_t H5Pset_vlen_mem_manage_type(hid_t plist_id, H5T_vlen_mem_t type)
    When user-defined memory management is chosen, allocation and free routines
must also be provided with the H5Pset_vlen_mem_manager() API call, as
follows:
        herr_t H5Pset_vlen_mem_manager(hid_t plist_id,
            H5MM_allocate_t alloc, void *alloc_info,
            H5MM_free_t free, void *free_info)
    The prototypes for these functions look like:
        typedef void *(*H5MM_allocate_t)(size_t size, void *info);
        typedef void (*H5MM_free_t)(void *mem, void *free_info);
    The alloc_info and free_info parameters can be used to pass along any
required information to the user's memory management routines.
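
    A pair of user-defined routines with the callback shapes shown above might
look like the following sketch.  To keep it stand-alone, the function-pointer
types are redeclared locally as vl_allocate_t and vl_free_t (hypothetical
names with the same signatures as the prototypes above); vl_mem_stats,
counting_alloc and counting_free are likewise invented for illustration.  The
info pointers are used here to carry a small statistics record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Local stand-ins with the same shape as the callback prototypes above. */
typedef void *(*vl_allocate_t)(size_t size, void *info);
typedef void (*vl_free_t)(void *mem, void *free_info);

/* Caller-defined state passed through the info pointers. */
typedef struct {
    size_t n_alloc;  /* successful allocations */
    size_t n_free;   /* non-NULL buffers released */
    size_t bytes;    /* total bytes requested */
} vl_mem_stats;

/* Allocate with malloc and record the request in the stats record. */
void *counting_alloc(size_t size, void *info)
{
    assert(info != NULL);
    vl_mem_stats *st = info;
    void *mem = malloc(size);
    if (mem != NULL) {
        st->n_alloc++;
        st->bytes += size;
    }
    return mem;
}

/* Release with free and record it in the stats record. */
void counting_free(void *mem, void *free_info)
{
    assert(free_info != NULL);
    vl_mem_stats *st = free_info;
    if (mem != NULL)
        st->n_free++;
    free(mem);
}
```

    Under the prototype given above, such routines and a pointer to the
statistics record would be passed as the alloc/alloc_info and free/free_info
arguments.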


Recovering memory from VL buffers read in:
    The complex memory buffers created for a VL datatype may be reclaimed with
the H5Dvlen_reclaim() function call, as follows:
        herr_t H5Dvlen_reclaim(hid_t type_id, hid_t space_id, hid_t plist_id,
            void *buf);
    The type_id must be the datatype stored in the buffer, space_id describes
the selection for the memory buffer to free the VL datatypes within, plist_id
is the dataset transfer property list which was used for the I/O transfer to
create the buffer and buf is the pointer to the buffer to free the VL memory
within.  The VL structures (hvl_t's) in the user's buffer are modified to zero
out the VL information after it has been freed.
    If "nested" VL datatypes were used to create the buffer, this routine frees
them from the "bottom" up, releasing all the memory without creating memory
leaks.
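
    The bottom-up order can be pictured with a recursive sketch in plain C.
It is an illustration of the traversal only, not the library's implementation:
vl_elem is a hypothetical stand-in for hvl_t, vl_reclaim is an invented name,
and a simple nesting depth stands in for the datatype description that the real
routine receives through type_id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in with the same two fields as HDF5's hvl_t. */
typedef struct { size_t len; void *p; } vl_elem;

/* Free the sequences held by n VL elements.  depth is the number of VL
 * nesting levels: at depth 1 each p points at plain data; at depth > 1
 * each p points at an array of vl_elem of depth - 1.  Inner buffers are
 * freed before the buffer that holds them, and each element is zeroed
 * afterwards.  Returns the number of buffers freed. */
size_t vl_reclaim(vl_elem *elems, size_t n, unsigned depth)
{
    assert(elems != NULL || n == 0);
    size_t freed = 0;
    for (size_t i = 0; i < n; i++) {
        if (elems[i].p != NULL) {
            if (depth > 1)
                freed += vl_reclaim(elems[i].p, elems[i].len, depth - 1);
            free(elems[i].p);
            freed++;
        }
        elems[i].p = NULL;
        elems[i].len = 0;
    }
    return freed;
}
```

    Freeing the inner sequences first matters because the pointers to them live
inside the outer buffer; releasing the outer buffer first would lose them.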


Code Examples:
--------------
For sample VL datatype code, see the tests in test/tvltypes.c in the
HDF5 distribution.