[svn-r467] Restructuring documentation.
409
doc/html/Compression.html
Normal file
@@ -0,0 +1,409 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
|
||||
<html>
|
||||
<head>
|
||||
<title>Compression</title>
|
||||
</head>
|
||||
|
||||
<body>
|
||||
<h1>Compression</h1>
|
||||
|
||||
<h2>1. Introduction</h2>
|
||||
|
||||
<p>HDF5 supports compression of raw data by compression methods
|
||||
built into the library or defined by an application. A
|
||||
compression method is associated with a dataset when the dataset
|
||||
is created and is applied independently to each storage chunk of
|
||||
the dataset.
|
||||
|
||||
The dataset must use the <code>H5D_CHUNKED</code> storage
|
||||
layout. The library doesn't support compression for contiguous
|
||||
datasets because of the difficulty of implementing random access
|
||||
for partial I/O, and compact dataset compression is not
|
||||
supported because it wouldn't produce significant results.
|
||||
|
||||
<h2>2. Supported Compression Methods</h2>
|
||||
|
||||
<p>The library identifies compression methods with small
|
||||
integers, with values less than 16 reserved for use by NCSA and
|
||||
values between 16 and 255 (inclusive) available for general
|
||||
use. This range may be extended in the future if it proves to
|
||||
be too small.
|
||||
|
||||
<p>
|
||||
<center>
|
||||
<table align=center width="80%">
|
||||
<tr>
|
||||
<th width="30%">Method Name</th>
|
||||
<th width="70%">Description</th>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td><code>H5Z_NONE</code></td>
|
||||
<td>The default is to not use compression. Specifying
|
||||
<code>H5Z_NONE</code> as the compression method results
|
||||
in better performance than writing a function that just
|
||||
copies data because the library's I/O pipeline
|
||||
recognizes this method and is able to short circuit
|
||||
parts of the pipeline.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td><code>H5Z_DEFLATE</code></td>
|
||||
<td>The <em>deflate</em> method is the algorithm used by
|
||||
the GNU <code>gzip</code> program. It's a combination of
|
||||
a 1977 Lempel-Ziv (LZ77) dictionary encoding followed by
|
||||
Huffman encoding. The aggressiveness of the
|
||||
compression can be controlled by passing an integer value
|
||||
to the compressor with <code>H5Pset_deflate()</code>
|
||||
(see below). In order for this compression method to be
|
||||
used, the HDF5 library must be configured and compiled
|
||||
in the presence of the zlib library, version 1.1.2 or
|
||||
later.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td><code>H5Z_RES_<em>N</em></code></td>
|
||||
<td>These compression methods (where <em>N</em> is in the
|
||||
range two through 15, inclusive) are reserved by NCSA
|
||||
for future use.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td>Values of <em>N</em> between 16 and 255, inclusive</td>
|
||||
<td>These values can be used to represent application-defined
|
||||
compression methods. We recommend that methods under
|
||||
testing should be in the high range and when a method is
|
||||
about to be published it should be given a number near
|
||||
the low end of the range (or even below 16). Publishing
|
||||
the compression method and its numeric ID will make a
|
||||
file sharable.</td>
|
||||
</tr>
|
||||
</table>
|
||||
</center>
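<p>The numbering scheme in the table can be sketched as a small
classifier. This is only an illustration: the names
<code>method_class_t</code> and <code>classify_method</code> are
hypothetical and are not part of the HDF5 API, and the upper bound
of 255 reflects the range as currently described, which may be
extended later.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a method number, following the
 * ranges given in the table above. Not part of the HDF5 API. */
typedef enum {
    METHOD_INVALID,      /* outside the currently defined range 0..255 */
    METHOD_RESERVED,     /* less than 16: reserved for use by NCSA */
    METHOD_APPLICATION   /* 16..255: available for general use */
} method_class_t;

method_class_t
classify_method (int method)
{
    if (method < 0 || method > 255) return METHOD_INVALID;
    if (method < 16) return METHOD_RESERVED;
    return METHOD_APPLICATION;
}
```

<p>An application could run such a check before choosing a number
for a method under test, picking a value near 255 as recommended
above.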
|
||||
|
||||
<p>Setting the compression for a dataset to a method which was
|
||||
not compiled into the library and/or not registered by the
|
||||
application is allowed, but writing to such a dataset will
|
||||
silently <em>not</em> compress the data. Reading a compressed
|
||||
dataset for a method which is not available will result in
|
||||
errors (specifically, <code>H5Dread()</code> will return a
|
||||
negative value). The errors will be displayed in the
|
||||
compression statistics if the library was compiled with
|
||||
debugging turned on for the "z" package. See the
|
||||
section on diagnostics below for more details.
|
||||
|
||||
<h2>3. Application-Defined Methods</h2>
|
||||
|
||||
<p>Compression methods 16 through 255 can be defined by an
|
||||
application. As mentioned above, methods that have not been
|
||||
released should use high numbers in that range while methods
|
||||
that have been published will be assigned an official number in
|
||||
the low region of the range (possibly less than 16). Users
|
||||
should be aware that using unpublished compression methods
|
||||
results in unsharable files.
|
||||
|
||||
<p>A compression method has two halves: one half handles
|
||||
compression and the other half handles uncompression. The
|
||||
halves are implemented as functions
|
||||
<code><em>method</em>_c</code> and
|
||||
<code><em>method</em>_u</code> respectively. One should not use
|
||||
the names <code>compress</code> or <code>uncompress</code> since
|
||||
they are likely to conflict with other compression libraries
|
||||
(such as zlib).
|
||||
|
||||
<p>Both the <code><em>method</em>_c</code> and
|
||||
<code><em>method</em>_u</code> functions take the same arguments
|
||||
and return the same values. They are defined with the type:
|
||||
|
||||
<dl>
|
||||
<dt><code>typedef size_t (*H5Z_func_t)(unsigned int
|
||||
<em>flags</em>, size_t <em>cd_size</em>, const void
|
||||
*<em>client_data</em>, size_t <em>src_nbytes</em>, const
|
||||
void *<em>src</em>, size_t <em>dst_nbytes</em>, void
|
||||
*<em>dst</em>/*out*/)</code>
|
||||
<dd>The <em>flags</em> are an 8-bit vector which is stored in
|
||||
the file and which is defined when the compression method is
|
||||
defined. The <em>client_data</em> is a pointer to
|
||||
<em>cd_size</em> bytes of configuration data which is also
|
||||
stored in the file. The function compresses or uncompresses
|
||||
<em>src_nbytes</em> from the source buffer <em>src</em> into
|
||||
at most <em>dst_nbytes</em> of the result buffer <em>dst</em>.
|
||||
The function returns the number of bytes written to the result
|
||||
buffer or zero if an error occurs. But if a result buffer
|
||||
overrun occurs the function should return a value at least as
|
||||
large as <em>dst_nbytes</em> (the uncompressor will see an
|
||||
overrun only for corrupt data).
|
||||
</dl>
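<p>To make the return-value contract concrete, here is a minimal
sketch of a run-length encoding pair written against the function
type above. The names <code>rle_c</code>, <code>rle_u</code> and
<code>rle_methods</code> are hypothetical, and the typedef is
repeated locally only so that the sketch compiles without the HDF5
headers. Each function returns the number of bytes written, zero
for malformed input, and <em>dst_nbytes</em> when the result would
not fit. How the library treats a result that exactly fills the
buffer is not specified in this section, so the sketch makes no
assumption about it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the function type described above. */
typedef size_t (*H5Z_func_t)(unsigned int flags, size_t cd_size,
                             const void *client_data, size_t src_nbytes,
                             const void *src, size_t dst_nbytes,
                             void *dst/*out*/);

/* Hypothetical compressor: emits (count, byte) pairs, count in 1..255. */
size_t
rle_c (unsigned int flags, size_t cd_size, const void *client_data,
       size_t src_nbytes, const void *src, size_t dst_nbytes,
       void *dst/*out*/)
{
    const unsigned char *in = (const unsigned char *)src;
    unsigned char *out = (unsigned char *)dst;
    size_t i = 0, n = 0;

    (void)flags; (void)cd_size; (void)client_data;   /* unused here */

    while (i < src_nbytes) {
        unsigned char value = in[i];
        size_t run = 1;
        while (i + run < src_nbytes && run < 255 && in[i + run] == value)
            run++;
        if (dst_nbytes - n < 2) return dst_nbytes;   /* overrun */
        out[n++] = (unsigned char)run;
        out[n++] = value;
        i += run;
    }
    return n;                                        /* bytes written */
}

/* Hypothetical uncompressor: expands the (count, byte) pairs. */
size_t
rle_u (unsigned int flags, size_t cd_size, const void *client_data,
       size_t src_nbytes, const void *src, size_t dst_nbytes,
       void *dst/*out*/)
{
    const unsigned char *in = (const unsigned char *)src;
    unsigned char *out = (unsigned char *)dst;
    size_t i, n = 0;

    (void)flags; (void)cd_size; (void)client_data;   /* unused here */

    if (src_nbytes % 2 != 0) return 0;               /* malformed: error */
    for (i = 0; i < src_nbytes; i += 2) {
        size_t run = in[i];
        if (run == 0) return 0;                      /* malformed: error */
        if (run > dst_nbytes - n) return dst_nbytes; /* overrun: corrupt data */
        memset (out + n, in[i + 1], run);
        n += run;
    }
    return n;                                        /* bytes written */
}

/* Both halves have the same type, so they fit the same table. */
const H5Z_func_t rle_methods[2] = { rle_c, rle_u };
```

<p>Neither function uses <em>flags</em> or <em>client_data</em>; a
real method might use them to select a variant of the algorithm.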
|
||||
|
||||
<p>The application associates the pair of functions with a name
|
||||
and a method number by calling <code>H5Zregister()</code>. This
|
||||
function can also be used to remove a compression method from
|
||||
the library by supplying null pointers for the functions.
|
||||
|
||||
<dl>
|
||||
<dt><code>herr_t H5Zregister (H5Z_method_t <em>method</em>,
|
||||
const char *<em>name</em>, H5Z_func_t <em>method_c</em>,
|
||||
H5Z_func_t <em>method_u</em>)</code>
|
||||
<dd>The pair of functions to be used for compression
|
||||
(<em>method_c</em>) and uncompression (<em>method_u</em>) are
|
||||
associated with a short <em>name</em> used for debugging and a
|
||||
<em>method</em> number in the range 16 through 255. This
|
||||
function can be called as often as desired for a particular
|
||||
compression method with each call replacing the information
|
||||
stored by the previous call. Sometimes it's convenient to
|
||||
supply only one half of the compression, for instance in an
|
||||
application that opens files read-only. Compression
|
||||
statistics for the method are accumulated across calls to this
|
||||
function.
|
||||
</dl>
|
||||
|
||||
<p>
|
||||
<center>
|
||||
<table border align=center width="100%">
|
||||
<caption align=bottom><h4>Example: Registering an
|
||||
Application-Defined Compression Method</h4></caption>
|
||||
<tr>
|
||||
<td>
|
||||
<p>Here's a simple-minded "compression" method
|
||||
that just copies the input value to the output. It's
|
||||
similar to the <code>H5Z_NONE</code> method but
|
||||
slower. Compression and uncompression are performed
|
||||
by the same function.
|
||||
|
||||
<p><code><pre>
|
||||
size_t
|
||||
bogus (unsigned int flags,
|
||||
size_t cd_size, const void *client_data,
|
||||
size_t src_nbytes, const void *src,
|
||||
size_t dst_nbytes, void *dst/*out*/)
|
||||
{
|
||||
if (src_nbytes > dst_nbytes) return src_nbytes; /* would overrun dst */
|
||||
memcpy (dst, src, src_nbytes);
|
||||
return src_nbytes;
|
||||
}
|
||||
</pre></code>
|
||||
|
||||
<p>The function could be registered as method 250 as
|
||||
follows:
|
||||
|
||||
<p><code><pre>
|
||||
#define H5Z_BOGUS 250
|
||||
H5Zregister (H5Z_BOGUS, "bogus", bogus, bogus);
|
||||
</pre></code>
|
||||
|
||||
<p>The function can be unregistered by saying:
|
||||
|
||||
<p><code><pre>
|
||||
H5Zregister (H5Z_BOGUS, "bogus", NULL, NULL);
|
||||
</pre></code>
|
||||
|
||||
<p>Notice that we kept the name "bogus" even
|
||||
though we unregistered the functions that perform the
|
||||
compression and uncompression. This makes compression
|
||||
statistics more understandable when they're printed.
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
</center>
|
||||
|
||||
<h2>4. Enabling Compression for a Dataset</h2>
|
||||
|
||||
<p>If a dataset is to be compressed then the compression
|
||||
information must be specified when the dataset is created since
|
||||
once a dataset is created compression parameters cannot be
|
||||
adjusted. The compression is specified through the dataset
|
||||
creation property list (see <code>H5Pcreate()</code>).
|
||||
|
||||
<dl>
|
||||
<dt><code>herr_t H5Pset_deflate (hid_t <em>plist</em>, int
|
||||
<em>level</em>)</code>
|
||||
<dd>The compression method for dataset creation property list
|
||||
<em>plist</em> is set to <code>H5Z_DEFLATE</code> and the
|
||||
aggression level is set to <em>level</em>. The <em>level</em>
|
||||
must be a value between one and nine, inclusive, where one
|
||||
indicates minimal (but fast) compression and nine is aggressive
|
||||
compression.
|
||||
|
||||
<br><br>
|
||||
<dt><code>int H5Pget_deflate (hid_t <em>plist</em>)</code>
|
||||
<dd>If dataset creation property list <em>plist</em> is set to
|
||||
use <code>H5Z_DEFLATE</code> compression then this function
|
||||
will return the aggression level, an integer between one and
|
||||
nine inclusive. If <em>plist</em> isn't a valid dataset
|
||||
creation property list or it isn't set to use the deflate
|
||||
method then a negative value is returned.
|
||||
|
||||
<br><br>
|
||||
<dt><code>herr_t H5Pset_compression (hid_t <em>plist</em>,
|
||||
H5Z_method_t <em>method</em>, unsigned int <em>flags</em>,
|
||||
size_t <em>cd_size</em>, const void *<em>client_data</em>)</code>
|
||||
<dd>This is a catch-all function for defining compression methods
|
||||
and is intended to be called from a wrapper such as
|
||||
<code>H5Pset_deflate()</code>. The dataset creation property
|
||||
list <em>plist</em> is adjusted to use the specified
|
||||
compression method. The <em>flags</em> is an 8-bit vector
|
||||
which is stored in the file as part of the compression message
|
||||
and passed to the compress and uncompress functions. The
|
||||
<em>client_data</em> is a byte array of length
|
||||
<em>cd_size</em> which is copied to the file and passed to the
|
||||
compress and uncompress methods.
|
||||
|
||||
<br><br>
|
||||
<dt><code>H5Z_method_t H5Pget_compression (hid_t <em>plist</em>,
|
||||
unsigned int *<em>flags</em>, size_t *<em>cd_size</em>, void
|
||||
*<em>client_data</em>)</code>
|
||||
<dd>This is a catch-all function for querying the compression
|
||||
method associated with dataset creation property list
|
||||
<em>plist</em> and is intended to be called from a wrapper
|
||||
function such as <code>H5Pget_deflate()</code>. The
|
||||
compression method (or a negative value on error) is returned
|
||||
by value, and compression flags and client data are returned by
|
||||
argument. The application should allocate the
|
||||
<em>client_data</em> and pass its size as the
|
||||
<em>cd_size</em>. On return, <em>cd_size</em> will contain
|
||||
the actual size of the client data. If <em>client_data</em>
|
||||
is not large enough to hold the entire client data then
|
||||
<em>cd_size</em> bytes are copied into <em>client_data</em>
|
||||
and <em>cd_size</em> is set to the total size of the client
|
||||
data, a value larger than the original.
|
||||
</dl>
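<p>The <em>cd_size</em> convention described for
<code>H5Pget_compression()</code> lends itself to a
query-then-retry pattern on the caller's side. The sketch below
does not call HDF5 at all: <code>copy_client_data</code> is a
hypothetical stand-in that only mimics the documented truncation
behavior, and <code>fetch_client_data</code> is a hypothetical
caller that starts with an assumed 8-byte buffer and grows it when
the reported size is larger.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the client-data handling described above:
 * copies at most *cd_size bytes of the stored data into client_data,
 * then sets *cd_size to the full stored size. Not an HDF5 function. */
void
copy_client_data (const void *stored, size_t stored_size,
                  size_t *cd_size/*in,out*/, void *client_data/*out*/)
{
    size_t ncopy = *cd_size < stored_size ? *cd_size : stored_size;

    if (ncopy > 0) memcpy (client_data, stored, ncopy);
    *cd_size = stored_size;
}

/* Caller-side pattern: try a small buffer, and if the reported size
 * shows that the data was truncated, grow the buffer and ask again.
 * Returns a malloc'd buffer (caller frees) or NULL if allocation fails;
 * *size_out receives the full size of the client data. */
void *
fetch_client_data (const void *stored, size_t stored_size, size_t *size_out)
{
    size_t capacity = 8;              /* assumed initial guess */
    size_t cd_size = capacity;
    void *buf = malloc (capacity);

    if (!buf) return NULL;
    copy_client_data (stored, stored_size, &cd_size, buf);

    if (cd_size > capacity) {
        /* Truncated: cd_size now holds the total size, so retry. */
        void *bigger = realloc (buf, cd_size);
        if (!bigger) { free (buf); return NULL; }
        buf = bigger;
        copy_client_data (stored, stored_size, &cd_size, buf);
    }
    *size_out = cd_size;
    return buf;
}
```

<p>With the real function the same comparison applies: a returned
<em>cd_size</em> larger than the value passed in means the buffer
holds only a prefix of the client data.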
|
||||
|
||||
<p>It is possible to set the compression to a method which hasn't
|
||||
been defined with <code>H5Zregister()</code> and which isn't
|
||||
supported as a predefined method (for instance, setting the
|
||||
method to <code>H5Z_DEFLATE</code> when zlib isn't
|
||||
available). If that happens then data will be written to the
|
||||
file in its uncompressed form and the compression statistics
|
||||
will show failures for the compression.
|
||||
|
||||
<p>
|
||||
<center>
|
||||
<table border align=center width="100%">
|
||||
<caption align=bottom><h4>Example: Statistics for an
|
||||
Unsupported Compression Method</h4></caption>
|
||||
<tr>
|
||||
<td>
|
||||
<p>If an application attempts to use an unsupported
|
||||
method then the compression statistics will show large
|
||||
numbers of compression errors and no data
|
||||
uncompressed.
|
||||
|
||||
<p><code><pre>
|
||||
H5Z: compression statistics accumulated over life of library:
|
||||
Method Total Overrun Errors User System Elapsed Bandwidth
|
||||
------ ----- ------- ------ ---- ------ ------- ---------
|
||||
deflate-c 160000 0 160000 0.00 0.01 0.01 1.884e+07
|
||||
deflate-u 0 0 0 0.00 0.00 0.00 NaN
|
||||
</pre></code>
|
||||
|
||||
<p>This example is from a program that tried to use
|
||||
<code>H5Z_DEFLATE</code> on a system that didn't have
|
||||
zlib to write to a dataset and then read the
|
||||
result. The read and write both succeeded but the
|
||||
data was not compressed.
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
</center>
|
||||
|
||||
<h2>5. Compression Diagnostics</h2>
|
||||
|
||||
<p>If the library is compiled with debugging turned on for the H5Z
|
||||
layer (usually as a result of <code>configure --enable-debug=z</code>)
|
||||
then statistics about data compression are printed when the
|
||||
application exits normally or the library is closed. The
|
||||
statistics are written to the standard error stream and include
|
||||
two lines for each compression method that was used: the first
|
||||
line shows compression statistics while the second shows
|
||||
uncompression statistics. The following fields are displayed:
|
||||
|
||||
<p>
|
||||
<center>
|
||||
<table align=center width="80%">
|
||||
<tr>
|
||||
<th width="30%">Field Name</th>
|
||||
<th width="70%">Description</th>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td>Method</td>
|
||||
<td>This is the name of the method as defined with
|
||||
<code>H5Zregister()</code> with the letters
|
||||
"-c" or "-u" appended to indicate
|
||||
compression or uncompression.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td>Total</td>
|
||||
<td>The total number of bytes compressed or decompressed
|
||||
including buffer overruns and errors. Bytes of
|
||||
non-compressed data are counted.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td>Overrun</td>
|
||||
<td>During compression, if the algorithm causes the result
|
||||
to be at least as large as the input then a buffer
|
||||
overrun error occurs. This field shows the total number
|
||||
of bytes from the Total column which can be attributed to
|
||||
overruns. Overruns for decompression can only happen if
|
||||
the data has been corrupted in some way and will result
|
||||
in failure of <code>H5Dread()</code>.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td>Errors</td>
|
||||
<td>If an error occurs during compression the data is
|
||||
stored in its uncompressed form; and an error during
|
||||
uncompression causes <code>H5Dread()</code> to return
|
||||
failure. This field shows the number of bytes of the
|
||||
Total column which can be attributed to errors.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td>User, System, Elapsed</td>
|
||||
<td>These are the amounts of user time, system time, and
|
||||
elapsed time in seconds spent by the library to perform
|
||||
compression. Elapsed time is sensitive to system
|
||||
load. These times may be zero on operating systems that
|
||||
don't support the required operations.</td>
|
||||
</tr>
|
||||
|
||||
<tr valign=top>
|
||||
<td>Bandwidth</td>
|
||||
<td>This is the compression bandwidth which is the total
|
||||
number of bytes divided by elapsed time. Since elapsed
|
||||
time is subject to system load the bandwidth numbers
|
||||
cannot always be trusted. Furthermore, the bandwidth
|
||||
includes overrun and error bytes which may significantly
|
||||
taint the value.</td>
|
||||
</tr>
|
||||
</table>
|
||||
</center>
|
||||
|
||||
<p>
|
||||
<center>
|
||||
<table border align=center width="100%">
|
||||
<caption align=bottom><h4>Example: Compression
|
||||
Statistics</h4></caption>
|
||||
<tr>
|
||||
<td>
|
||||
<p><code><pre>
|
||||
H5Z: compression statistics accumulated over life of library:
|
||||
Method Total Overrun Errors User System Elapsed Bandwidth
|
||||
------ ----- ------- ------ ---- ------ ------- ---------
|
||||
deflate-c 160000 200 0 0.62 0.74 1.33 1.204e+05
|
||||
deflate-u 120000 0 0 0.11 0.00 0.12 9.885e+05
|
||||
</pre></code>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
</center>
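<p>The Bandwidth field can be recomputed from the Total and Elapsed
columns. The helper below is a hypothetical illustration of that
definition (total bytes divided by elapsed seconds) and is not the
library's own code. For the <code>deflate-c</code> line above,
160000 bytes over 1.33 seconds gives roughly 1.203e+05 bytes per
second; the printed value differs slightly, presumably because the
Elapsed column is rounded for display. On systems with IEEE 754
arithmetic, zero bytes over zero seconds yields NaN, which matches
the <code>deflate-u</code> line in the earlier example for an
unsupported method.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bandwidth in bytes per second, defined as the
 * total number of bytes divided by the elapsed time in seconds. */
double
compression_bandwidth (size_t total_bytes, double elapsed_seconds)
{
    return (double)total_bytes / elapsed_seconds;
}
```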
|
||||
|
||||
<hr>
|
||||
<address><a href="mailto:matzke@llnl.gov">Robb Matzke</a></address>
|
||||
<!-- Created: Fri Apr 17 13:39:35 EDT 1998 -->
|
||||
<!-- hhmts start -->
|
||||
Last modified: Fri Apr 17 16:15:21 EDT 1998
|
||||
<!-- hhmts end -->
|
||||
</body>
|
||||
</html>
|
||||