* Updates the mirror VFD tests so they compile and pass tests * Committing clang-format changes Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
497 lines
21 KiB
Markdown
497 lines
21 KiB
Markdown
# Welcome to VFD SWMR
|
|
|
|
Thank you for volunteering to test VFD SWMR.
|
|
|
|
SWMR, which stands for Single Writer/Multiple Reader, is a feature
|
|
of the HDF5 library that lets a process write data to an HDF5 file
|
|
while one or more processes read the file. Use cases range from
|
|
monitoring data collection and/or steering experiments in progress
|
|
to financial applications.
|
|
|
|
The following diagram illustrates the original version of SWMR.
|
|
|
|
<img src = SWMRdataflow.png width=400 />
|
|
|
|
The original version of SWMR functions by ordering metadata writes to
|
|
the HDF5 file so as to always maintain a consistent view of metadata
|
|
in the HDF5 file -- which requires SWMR specific modifications to
|
|
all code that maintains on disk metadata.
|
|
|
|
VFD SWMR is designed to be a more maintainable and more modular
|
|
replacement for the existing SWMR feature. It functions by taking
|
|
regular snapshots of HDF5 file metadata on the writer side, and using
|
|
a specialized virtual file driver (VFD) on the reader side to
|
|
intercept metadata read requests and satisfy them from the
|
|
snapshots where appropriate -- thus assuring that the readers
|
|
see a consistent view of HDF5 file metadata,
|
|
|
|
This design allowed us to implement VFD SWMR with only minor
|
|
modifications to the HDF5 library above metadata cache and page
|
|
buffer. As a result, not only is VFD SWMR more modular and
|
|
easier to maintain, it is also almost "full SWMR" -- that is it
|
|
allows use of almost all HDF5 capabilities by VFD SWMR writers,
|
|
with results that become visible to the VFD SWMR readers.
|
|
|
|
In particular, VFD SWMR allows the writer to create and delete
|
|
both groups and datasets, and to create and delete attributes on
|
|
both groups and datasets while operating in VFD SWMR mode --
|
|
which is not possible using the original SWMR implementation.
|
|
|
|
We say that VFD SWMR is almost "full SWMR" because there are a
|
|
few limitations -- most notably:
|
|
|
|
* The current implementation of variable length data in datasets
|
|
is fundamentally incompatible with VFD SWMR, as it stores variable
|
|
length data as metadata. This shouldn't be a major issue, as the
|
|
current implementation of variable length data has very poor performance,
|
|
and thus is not suitable for most SWMR applications. A new
|
|
implementation of variable length data is in the works, and should
|
|
offer both better performance and be compatible with VFD SWMR.
|
|
However, there is no ETA for delivery. Variable length attributes
|
|
on datasets and groups should work, but are currently un-tested.
|
|
|
|
* At present the Virtual Data Set (VDS) feature is not
|
|
well integrated with VFD SWMR. While we have a work around that
|
|
allowed us to test for more fundamental issues (sse below), a proper
|
|
solution is on hold pending the availability of the original developer.
|
|
|
|
* VFD SWMR is only tested with, and should only be used with
|
|
the latest HDF5 file format. Theoretically, there is no functional
|
|
reason why it will not work with earlier versions of the file format.
|
|
However, it is possible to construct very large pieces of metadata
|
|
in early versions of the HDF5 file format, which has the potential to
|
|
cause major performance issues.
|
|
|
|
Due to its regular snapshots of metadata, VFD SWMR provides guarantees
|
|
on the maximum time from write to visibility to the readers -- with
|
|
the provisos that the underlying file system is fast enough, that
|
|
the writer makes HDF5 library API calls with sufficient regularity, and
|
|
that both reader and writer avoid long running HDF5 API calls.
|
|
|
|
For further details on VFD SWMR design and implementation, see
|
|
`VFD_SWMR_RFC_200916.pdf` in the doc directory.
|
|
|
|
# Quick start
|
|
|
|
Follow these instructions to download, configure, and build the
|
|
VFD SWMR project, then install the HDF5 library and
|
|
utilities built by the VFD SWMR project.
|
|
|
|
## Download
|
|
|
|
Clone the HDF5 repository in a new directory, then switch to the
|
|
`feature/vfd_swmr_beta_1` branch as follows:
|
|
|
|
```
|
|
% git clone https://github.com/HDFGroup/hdf5 swmr
|
|
% cd swmr
|
|
% git checkout feature/vfd_swmr_beta_1
|
|
```
|
|
|
|
## Build
|
|
|
|
There are no special instructions for building VFD SWMR. Simply follow
|
|
the usual build procedure for CMake or the Autotools using the guides
|
|
in the `release_docs` directory.
|
|
|
|
IMPORTANT:
|
|
|
|
The VFD SWMR branches are maintenance branches and will default to a debug
|
|
build. They also do not come with generated files, so Perl will be required
|
|
when building with CMake and Perl and the Autotools (autoconf, etc.) will
|
|
be required when building with the Autotools.
|
|
|
|
Some notes:
|
|
|
|
- The VFD SWMR tests can take some time to run.
|
|
- The VFD SWMR acceptance tests will typically emit some output about "expected errors" that you can ignore. Real errors are clearly flagged.
|
|
- If the tests do not pass on your system, please let the developers know via the email address given at the end of this document.
|
|
|
|
# Sample programs
|
|
|
|
## Extensible datasets
|
|
|
|
For an example of a program that uses VFD SWMR to write/read many
|
|
extensible datasets, have a look at `test/vfd_swmr_bigset_writer.c`, the
|
|
"bigset" test. We compile two binaries from that source file, one that
|
|
operates in write mode, and a second that operates in read mode.
|
|
|
|
In write mode, "bigset" creates an HDF5 file containing one or more
|
|
datasets that are extensible in either one dimension or two. Then it
|
|
runs for several steps, increasing the size of each dataset in each
|
|
dimension once every step. The dimensions, number of datasets, the
|
|
step increase in dataset size, and the number of steps are configurable
|
|
using command-line options -d, -s, -r and -c, and -n, respectively---use
|
|
the -h option to get a usage message. Each dataset is written with a
|
|
predictable pattern.
|
|
|
|
In read mode, "bigset" reads each dataset from an HDF5 file created
|
|
by a "bigset" writer and verifies the patterns. It takes the same
|
|
command-line parameters as the "bigset" writer. The reader and writer
|
|
may run concurrently; the reader "polls" the content until it is just
|
|
shy of complete, given the number of steps expected.
|
|
|
|
To run a bigset test, open a couple of terminal windows, one for the
|
|
reader and one for the writer. cd to the `test` directory under
|
|
my build directory, and run the writer in one window:
|
|
|
|
```
|
|
% ./vfd_swmr_bigset_writer -n 50 -d 2
|
|
```
|
|
|
|
and in the other window, run the reader:
|
|
|
|
```
|
|
% ./vfd_swmr_bigset_reader -n 50 -d 2 -W
|
|
```
|
|
|
|
The writer will wait for a signal before it quits. You can tap CTRL-C to make
|
|
it quit.
|
|
|
|
The reader and writer programs support several command-line options:
|
|
|
|
```
|
|
usage: vfd_swmr_bigset_writer [-F] [-M] [-S] [-V] [-W] [-a steps] [-b] [-c cols]
|
|
[-d dims]
|
|
[-n iterations] [-r rows] [-s datasets]
|
|
[-u milliseconds]
|
|
|
|
-F: fixed maximal dimension for the chunked datasets
|
|
-M: use virtual datasets and many source files
|
|
-S: do not use VFD SWMR
|
|
-V: use virtual datasets and a single source file
|
|
-W: do not wait for a signal before exiting
|
|
-a steps: `steps` between adding attributes
|
|
-b: write data in big-endian byte order
|
|
-c cols: `cols` columns per chunk
|
|
-d 1|one|2|two|both: select dataset expansion in one or
|
|
both dimensions
|
|
-n iterations: how many times to expand each dataset
|
|
-r rows: `rows` rows per chunk
|
|
-s datasets: number of datasets to create
|
|
-u ms: milliseconds interval between updates
|
|
to vfd_swmr_bigset_writer.h5
|
|
```
|
|
|
|
## The VFD SWMR demos
|
|
|
|
The VFD SWMR demos are located in the `examples` directory of this source
|
|
tree. Instructions for building the example programs are given in the README
|
|
file in that directory. These programs are NOT installed via `make install`
|
|
and have to built by hand with h5cc as described in the README.
|
|
|
|
Two Gaussian programs are built, `wgaussians` and `rgaussians`. If you start
|
|
both from the same directory in different terminals, you should see the
|
|
"bouncing 2-D Gaussian distributions" in the `rgaussians` terminal. This demo
|
|
uses curses, so you may need to install the curses developers library to build
|
|
(and this is probably not going to be easy to build on Windows).
|
|
|
|
The creation-deletion (`credel`) demo is also run in two terminals.
|
|
The two command lines are given in the README. You need to use the `h5ls`
|
|
installed from the VFD SWMR branch, since only that version has the `--poll`
|
|
option. Be careful to not use a non-VFD-SWMR system h5ls here.
|
|
|
|
# Developer tips
|
|
|
|
## Configuring VFD SWMR
|
|
|
|
### File-creation properties
|
|
|
|
To use VFD SWMR, creating your HDF5 file with a paged allocation strategy
|
|
is mandatory. This call enables the paged allocation strategy:
|
|
|
|
```
|
|
ret = H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, false, 1);
|
|
```
|
|
|
|
Allocated storage that is smaller than the page size will
|
|
not overlap a page boundary, and allocated storage that is one page or
|
|
greater in size will start on a page boundary. VFD SWMR relies on that
|
|
allocation strategy.
|
|
|
|
### File-access properties
|
|
|
|
In this section we show how to configure your application to use VFD
|
|
SWMR.
|
|
|
|
1. Create a file access property list using `H5Pcreate(H5P_FILE_ACCESS)`.
|
|
2. Set the latest file format using `H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST)`.
|
|
3. Enable page buffering using `H5Pset_page_buffer_size()`.
|
|
4. Set any VFD SWMR configuration properties using `H5Pset_vfd_swmr_config()`. The struct is documented in H5Fpublic.h, with some additional documentation below. (In the near future, this struct will be documented in the library's Doxygen documentation.)
|
|
|
|
VFD SWMR relies on metadata reads and writes to go through the
|
|
page buffer. Note that the default page size is 4096 bytes. Finding good
|
|
values for `buf_size` may take some experimentation. We use 4096 (giving a
|
|
single page buffer) for `buf_size` in our test code.
|
|
|
|
*Note well*: when VFD SWMR is enabled, the meta-/raw-data pages proportion
|
|
set by `H5Pset_page_buffer_size()` does not actually control the
|
|
pages reserved for raw data. *All* pages are dedicated to buffering
|
|
metadata.
|
|
|
|
### `H5F_vfd_swmr_config_t` fields discussion
|
|
|
|
Example code:
|
|
|
|
```
|
|
memset(&config, 0, sizeof(config));
|
|
|
|
config.version = H5F__CURR_VFD_SWMR_CONFIG_VERSION;
|
|
config.tick_len = 4;
|
|
config.max_lag = 7;
|
|
config.writer = true;
|
|
config.md_pages_reserved = 128;
|
|
strcpy(config.md_file_path, "./my_md_file");
|
|
|
|
H5Pset_vfd_swmr_config(fapl, &config);
|
|
```
|
|
|
|
When VFD SWMR is enabled, changes to the HDF5 metadata accumulate in
|
|
RAM until a configurable unit of time known as a *tick* has passed.
|
|
At the end of each tick, a snapshot of the metadata at the end of
|
|
the tick is "published"---that is, made visible to the readers.
|
|
|
|
The length of a *tick* is configurable in units of 100 milliseconds
|
|
using the `tick_len` parameter. Here, `tick_len` is set to `4` to
|
|
select a tick length of 400ms.
|
|
|
|
A snapshot does not persist forever, but it expires after a number
|
|
of ticks, given by the *maximum lag*, has passed. Here, `max_lag`
|
|
is set to `7` to select a maximum lag of 7 ticks. After a snapshot
|
|
has expired, the writer may overwrite it.
|
|
|
|
When a reader first enters the API, it starts to use, or "selects,"
|
|
the metadata in the newest snapshot, and on every subsequent API
|
|
entry, if a tick has passed since the last selection, and if new
|
|
snapshots are available, then the reader selects the latest.
|
|
|
|
If a reader spends longer than `max_lag - 1` ticks (2400ms with
|
|
the example configuration) inside the HDF5 API, then its snapshot
|
|
may expire, resulting in undefined behavior. When a snapshot
|
|
expires while the reader is using it, we say that the writer has
|
|
"overrun" the reader. The writer cannot detect overruns.
|
|
Frequently the reader will detect an overrun and force the program
|
|
to exit with a diagnostic assertion failure.
|
|
|
|
The application tells VFD SWMR whether or not to configure for
|
|
reading or writing a file by setting the `writer` parameter to
|
|
`true` for writing or `false` for reading.
|
|
|
|
VFD SWMR snapshots are stored in a "metadata file" that is shared
|
|
between writer and readers. On a POSIX system, the metadata file
|
|
may be placed on any *local* filesystem that the reader and writer
|
|
share. The `md_file_path` parameter tells where to put the metadata
|
|
file.
|
|
|
|
The `md_pages_reserved` parameter tells how many pages to reserve
|
|
at the beginning of the metadata file for the metadata-file header
|
|
and the metadata index. The header has an entire page to itself.
|
|
The remaining `md_pages_reserved - 1` pages are reserved for the
|
|
metadata index. If the index grows larger than its initial
|
|
allocation, then it will move to a new location in the metadata file,
|
|
and the initial allocation will be reclaimed. `md_pages_reserved`
|
|
must be at least 2.
|
|
|
|
The `version` parameter tells what version of VFD SWMR configuration
|
|
the parameter struct `config` contains. For now, it should be
|
|
initialized to `H5F__CURR_VFD_SWMR_CONFIG_VERSION`.
|
|
|
|
## Using virtual datasets (VDS)
|
|
|
|
An application may want to use VFD SWMR to create, read, or write
|
|
a virtual dataset. Unfortunately, VDS does not work properly with
|
|
VFD SWMR at this time. In this section, we describe some workarounds
|
|
that can be used with great care to make VDS and VFD SWMR cooperate.
|
|
|
|
A virtual dataset, when it is read or written, will open files on
|
|
an application's behalf in order to access the source datasets
|
|
inside. If a virtual dataset resides on file `v.h5`, and one of
|
|
its source datasets resides on a second file, `s1.h5`, then the
|
|
virtual dataset will try to open `s1.h5` using the same file-access
|
|
properties as `v.h5`. Thus, if `v.h5` is open with VFD SWMR with
|
|
metadata file `v.shadow`, then the virtual dataset will try to open
|
|
`s1.h5` with the same metadata file, which will fail.
|
|
|
|
Suppose that `v.h5` is *not* open with VFD SWMR, but it was opened
|
|
with default file-access properties. Then the virtual dataset will
|
|
open the source dataset on `s1.h5` with default file-access
|
|
properties, too. This default virtual-dataset behavior is not
|
|
helpful to the application that wants to use VFD SWMR to read or
|
|
write source datasets.
|
|
|
|
To use VFD SWMR with VDS, an application should *pre-open* each file
|
|
using its preferred file-access properties, including independent metadata
|
|
filenames for each source file. As long as the virtual dataset remains
|
|
in use, the application should leave each of the pre-opened files open.
|
|
In this way the library, when it tries to open the source files, will
|
|
always find them already open and re-use the already-open files with the
|
|
file-access properties established on first open.
|
|
|
|
## Pushing HDF5 raw data to reader visibility
|
|
|
|
At present, VFD SWMR is hard coded to flush raw data at the end of
|
|
each tick. While this imposes additional overhead, it simplifies testing,
|
|
and is probably desirable for applications that do not require the best
|
|
possible raw data throughput. We plan to upgrade our tests and make this
|
|
user configurable in the first production release.
|
|
|
|
With the currently hard coded flush of raw data at the end of each tick,
|
|
it should not be necessary to call H5Fflush(). In fact, when VFD SWMR is
|
|
active, H5Fflush() may require up to `max_lag` ticks to complete due to
|
|
metadata consistency issues.
|
|
|
|
Instead, a writer can make its last changes to HDF5 file visible to all
|
|
readers immediately using the new call, `H5Fvfd_swmr_end_tick()`. Note
|
|
that this call should be used sparingly, as it terminates the current
|
|
tick early, thus effectively reducing `max_lag`. Repeated calls in
|
|
quick succession can force a reader to overrun `max_lag`, and
|
|
read stale metadata.
|
|
|
|
When the flush of raw data at end of tick is disabled (not possible at present),
|
|
the `H5Fvfd_swmr_end_tick()` call will make the writers current view of metadata
|
|
visible to the reader -- which may refer to raw data that hasn't been written to
|
|
the HDF5 file yet.
|
|
|
|
## Reading up-to-date content
|
|
|
|
One expected use case for VFD SWMR involves an experiment in which instruments
|
|
continuously generate 2-dimensional data frames. These data frames are recorded
|
|
in datasets in a HDF5 file that has been opened in VFD SWMR writer mode. In this
|
|
use case, the HDF5 file is opened in VFD SWMR reader mode by a second program
|
|
that generates a real time display of the data as it is being collected -- thus
|
|
allowing the experimenters to steer the experiment.
|
|
|
|
THG developed a demonstration program for class of application,
|
|
and we have some advice based on that experience.
|
|
|
|
The writer typically will increase a dataset's dimensions by a
|
|
frame, using `H5Dset_extent()`, before it writes the data of that
|
|
frame with `H5Dwrite()`. It's possible that a snapshot of the HDF5
|
|
file will propagate to the reader between the `H5Dset_extent()`
|
|
call and the `H5Dwrite()`. Values `H5Dread()` from the last frame
|
|
at that juncture will not reflect the actual experimental data.
|
|
Instead, the reader will see arbitrary values or the fill value.
|
|
To display those values would be distracting and misleading to
|
|
the experimenter.
|
|
|
|
On the reader, a strategy for displaying the most current, bonafide application
|
|
data is to read the dimensions of the frames dataset, `d`, compute
|
|
the number `n` of full frames contained in `d`, and read the
|
|
next-to-last frame, `n - 2`. THG uses a variant of this strategy
|
|
in its `gaussians` demo.
|
|
|
|
On the writer, a strategy for protecting against snapshots between
|
|
the `H5Dset_extent()` and `H5Dwrite()` calls is to suspend VFD
|
|
SWMR's clock across both of the calls. The
|
|
`H5Fvfd_swmr_disable_end_of_tick()` call takes a file identifier
|
|
and stops new snapshots from being taken on the given file until
|
|
`H5Fvfd_swmr_enable_end_of_tick()` is called on the same file.
|
|
Needless to say, end of tick processing should only be disabled
|
|
briefly.
|
|
|
|
# Known issues
|
|
|
|
## Variable-length data
|
|
|
|
A VFD SWMR reader cannot reliably read back a variable-length dataset
|
|
written by VFD SWMR. For example, a variable-length string
|
|
created and written as follows
|
|
|
|
```
|
|
hid_t dset, space, type;
|
|
char data[] = "content";
|
|
|
|
type = H5Tcopy(H5T_C_S1);
|
|
|
|
H5Tset_size(type, H5T_VARIABLE);
|
|
|
|
space = H5Screate(H5S_SCALAR);
|
|
|
|
dset = H5Dcreate2(..., "string", type, space, H5P_DEFAULT, H5P_DEFAULT,
|
|
H5P_DEFAULT);
|
|
|
|
H5Dwrite(dset, type, space, space, H5P_DEFAULT, &data);
|
|
```
|
|
|
|
and read back like this,
|
|
|
|
```
|
|
char *data;
|
|
herr_t ret;
|
|
|
|
ret = H5Dread(..., ..., H5S_ALL, H5S_ALL, H5P_DEFAULT, &data);
|
|
```
|
|
|
|
may produce either an error return from `H5Dread` (`ret < 0`) or
|
|
a `NULL` pointer (`data == NULL`).
|
|
|
|
As discussed above, this is caused by a fundamental incompatibility
|
|
between the current variable length data implementation in HDF5, which
|
|
stores variable length data as metadata. It is possible we may be able
|
|
to mitigate the issue, but the most likely solution is the planned
|
|
re-implementation of variable length data that is currently in the planning
|
|
stage. Unfortunately, we have no ETA for this re-implementation.
|
|
|
|
## Iteration
|
|
|
|
An application that reads in VFD SWMR mode should take care to avoid
|
|
HDF5 iteration APIs, especially when iterating large numbers of objects
|
|
or using long-running application callbacks. While the library is in an
|
|
iteration routine, it cannot track changes made by the writer. If the
|
|
library spends more than `max_lag` ticks in the routine, then its view
|
|
of the HDF5 file will become stale. Under those circumstances, HDF5
|
|
content could be mis-read, or the library could crash with a diagnostic
|
|
assertion.
|
|
|
|
## Object handles
|
|
|
|
At the present level of development, the writer cannot invalidate
|
|
a reader's HDF5 object handles (`hid_t`s). If a reader holds an
|
|
object open---that is, it has a valid handle (`hid_t`) for the
|
|
object---while the writer deletes it, then reading content through
|
|
the handle may yield corrupted data or the data from some other
|
|
object, or the library may crash.
|
|
|
|
## Supported filesystems
|
|
|
|
A VFD SWMR writer and readers share a couple of files, the HDF5 (`.h5`)
|
|
file and the metadata file -- which is used to communicate snapshots of
|
|
the HDF5 file metadata from the writer to the readers. VFD SWMR relies
|
|
on writes to the metadata file to take effect in the order described in
|
|
the POSIX documentation for `read(2)` and `write(2)` system calls. If
|
|
the VFD SWMR readers and the writer run on the same POSIX host, this
|
|
ordering should take effect, regardless of the underlying filesystem.
|
|
|
|
If the VFD SWMR reader and the writer run on *different* hosts, then
|
|
the write-ordering rules depend on the shared filesystem. VFD SWMR is
|
|
not generally expected to work with NFS at this time. Parallel file systems
|
|
like GPFS and Lustre should order writes according to POSIX convention, so we
|
|
expect VFD SWMR to work on those file systems but we have not tested this.
|
|
|
|
The HDF Group plans to add support for networked file systems like NFS and
|
|
Windows SMB to VFD SWMR in the future.
|
|
|
|
## Microsoft Windows
|
|
|
|
VFD SWMR is not officially supported on Microsoft Windows at this time. The
|
|
feature should in theory work on Windows and NTFS, however it has not been
|
|
tested as the existing VFD SWMR tests rely on shell scripts. Note that Windows
|
|
file shares are not supported as there is no write ordering guarantee (as with
|
|
NFS, et al.).
|
|
|
|
## File-opening order
|
|
|
|
If an application tries to open a file in VFD SWMR reader mode, and the
|
|
file is not already open by a VFD SWMR writer, then the application will
|
|
sleep in the `H5Fopen()` call until either the writer opens the same
|
|
file (using the same shadow file) or the reader times out after several
|
|
seconds.
|
|
|
|
# Reporting bugs
|
|
|
|
VFD SWMR is still under development, so it is possible that you will encounter
|
|
bugs. Please report them, along with any performance or design issues you
|
|
encounter.
|
|
|
|
To contact the VFD SWMR developers, email vfdswmr@hdfgroup.org.
|