Universal Environmental Data file format.

Roger Nelson
rnelson@wsu.edu

Abstract

A number of environmental (weather) file formats are in use. Environmental data are typically stored either as text or in a binary format. Relational database systems are another alternative and are becoming popular (although I would discourage this due to their inherent inefficiencies). The advantages of using text or binary file formats are discussed in this article.

Many text and binary file formats are already available. For example, it seems each weather data service has its own format, but a number of problems and deficiencies exist with these formats.

This article discusses a general binary weather file format which attempts to avoid the problems with the existing binary and text file formats.

UED is currently implemented as a binary file format in deference to the advantages afforded by the structuring of the binary data. There are some instances where the advantages of a text-based file format (particularly portability and the ability to view/edit the data with a text editor, as mentioned below) may outweigh its disadvantages (particularly storage requirements). It is feasible to create a text-based format with similar features for annotation and efficient data structuring. Please contact rnelson@wsu.edu if you are considering creating a text-based format with the features of UED.

Text versus Binary versus Relational databases versus spreadsheets.

Weather data files are formatted either as plain text or as binary data. Text files are often used because they are easy for people to read, can usually be read into application programs with little effort, and can usually be edited easily with text editing software found on any computer.

Binary data files are not human readable: in order to view the data, a specific program must be written to allow people to view/edit the data.

Relational databases store data in tabular data structures within the database system. Data is viewed/edited using specific database software.

One might first ask: why favour one format over the other if the data can be equally represented in any of the formats? The following table illustrates the advantages and disadvantages of using text or binary file formats:

Feature: Readability by humans
Text: + Easy to read.
Binary: - Effectively impossible to read.
Rationale as applies to UED: While it would certainly be advantageous to be able to easily read data files directly without the need for special software, environmental time series data often involves large data sets; the analysis of the data will usually already require specialized visualization software.

Feature: Viewing software
Text: + Requires no special program.
Binary: - Requires a program specifically designed to read the data.

Feature: Editability
Text: + Can usually edit with a text editor.
Text: - Users can introduce errors in editing.
Binary: - A special program is needed for editing.
Binary: + The editing software can help prevent errors.
Binary: + The editing software can provide features to help simplify editing data.
Rationale as applies to UED: There are few situations where end users will need or even want to edit data values, since the initial data recording is usually done in some computerized manner (exceptions arise from recording equipment failure or false readings). Data editing programs are preferable to editing data with a text editor. In cases where it is necessary to enter data from text, conversion programs can be provided for the user.

Feature: Security
Text: - There are often situations where data providers don't necessarily want users to be able to directly view and/or edit data.
Binary: + Users are less likely to attempt to directly modify binary data when they can't easily read it.
Rationale as applies to UED: For example, weather data sets might be considered "Read only" for end users.

Feature: Parsability
Text: - Depending on the layout of the data, text files can be difficult to read since users might introduce various spacing or use different delimiters. These different possibilities must be accounted for when reading the data file.
Binary: + Because binary files are always written by programs, there will always be a consistent format, so it is much easier for the program to read the data.

Feature: Annotation
Text: + It is easy to add comments to text.
Text: - The data parser must be able to accommodate annotations.
Binary: - While provisions for annotations can always be added to binary data file formats, annotations of variable length can disrupt the simplicity of the binary file format, especially when computing relative record addresses.
Rationale as applies to UED: The binary file format presented here will provide options for marking data as valid, estimated, generated, etc. in a consistent and efficient way.

Feature: Identifying the format when no documentation is available
Text: + A user with some knowledge of the data might be able to reconstruct the format of the file by viewing the data.
Binary: - With no knowledge of the format of the data it is virtually impossible to identify the format of the data.
Rationale as applies to UED: The binary format presented here will use a self-documenting format encoding that will allow future extension and modification of the format as needs arise.

Feature: Encoding the format in the file itself
Text: - Somehow encoding the format of the file in the data file itself may disrupt the simplicity of the text file format, potentially making the file more difficult to read by people, and also more difficult for programs to parse.
Binary: + The format of the file can be easily encoded in the file itself without significantly adding to the complexity of parsing the data.

Feature: Portability
Text: + Text data is usually ASCII text readable by most computer software.
Binary: - There are a number of floating point representations and conversions may need to be done. Also, computer processors are either big endian or little endian, so data values wider than 1 byte may also require conversion.
Rationale as applies to UED: While a good text file format allows weather data to be imported to a variety of application programs such as spreadsheets and statistical software, a good binary file format with the many advantages that binary format provides will be adopted by both data providers and software that uses weather data. Binary file formats are generally easier to program, especially in cases where text data might potentially be improperly formatted.

Feature: I/O speed
Text: - Slow: converting text to/from internal binary representation is a rather time consuming process.
Binary: + Orders of magnitude faster than reading/writing text.
Rationale as applies to UED: The binary format presented here attempts to utilize, to the fullest, the advantages of fast I/O speed and access and efficiency of space, avoiding superfluous and redundant data.

Feature: I/O access
Text: - Access to individual records usually requires sequential I/O. Since text file formats tend to have free form layouts, it is often more difficult to make changes to data in place, requiring a second completely new file to be produced in order to store any changes.
Binary: + Binary data is usually formatted such that record addresses can be easily computed for fast file pointer relative file access.

Feature: Floating point precision
Text: - Providing additional precision requires additional file size.
Binary: + The binary floating point representation is optimized to allow the most precision possible for any given floating point type of a given size.

Feature: File size
Text: - Text representation of data always requires much more file space than binary data.
Text: - Additional space is required for greater precision of floating point data.
Text: - Additional space is required to make the presentation of the data more appealing and easier to read.
Binary: + Binary data is the most compact representation of data. This, on average, means binary files are less than half the size of text files and most relational database tables.
Binary: + Compression algorithms can be applied to binary data. This can often further cut file size requirements in half.
Rationale as applies to UED: The current UED format does not provide specifications for data compression, but this could be added later if needed.

Overview

The file format should have the following properties:
Simple
The file format should be fairly simple requiring minimal coding to implement.
Fairly compact with minimal repetition of data.
For example, a number of existing data formats repeat things such as elevation for each record of data.
Day of year based data organization
Many annual data formats are organized by month and do stupid things like providing space for 31 days in every month, filling in days 29, 30, and 31 with filler values for months that don't have those days. This is a waste of space and causes data formatting headaches.
Data all valid and usable
Most file formats provide limited or no means to identify the individual data values as actual, estimated, generated, etc., and rather than providing a usable value for missing data, will enter a marker value (e.g. -999). For most applications, this makes the data set unusable without additional preprocessing, or support for missing data in the application program itself. We would like to provide usable values for missing days, while preserving the integrity of the original, possibly incomplete, dataset.
Extensible format
The format should be such that provisions can be made for future additions of data types, and older applications should be able to use the data available without having to be updated to read the new formats. We allow individual applications to use this format, adding their own data types as needed.
Flexible format
The format should allow data in various units and data types, e.g. precipitation stored in cm or mm.
Concatability
The format should allow appending additional records or even new data sets without having to completely rebuild the file.
Compact
The flexible format should be compact so that various time step recording levels can be used without having to reserve too much space for the case of more frequent timesteps.

Codification

UED uses a methodically derived hierarchical set of binary codes to identify and/or represent various data, including record types, variable codes, units codes, and attribute flags.

Existing weather file formats

NCDC
Advantages:
Disadvantages:
Comparison:

Data records

In order for the format to be extensible, this data format uses a technique for encoding data sets similar to that used by spreadsheet programs and graphic image files.

Each record is prefixed by a 6 byte header:

  1. The first byte is the record type code used to identify the data in the record.
  2. The second byte is reserved for future use (to extend the available record codes). At this time this byte should be 0.
  3. The next 4 bytes are a 32-bit number denoting the size of the body of the record in bytes. (The size does not include the 6 bytes that comprise the header.)
For integer values the most significant bytes are stored first (big endian, in computer programming parlance). Integer words are reconstructed by reading the first byte, shifting it left 8 bits, and bitwise OR-ing it with the second byte, and so on for any remaining bytes.
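The header layout above can be sketched in Python as follows. This is a minimal illustration rather than the UED reference implementation: the function names are hypothetical, and the record type 0x58 (binary 01011000) is the general comment code listed in the record type section.

```python
import struct

# Header layout from the text: 1-byte record type, 1 reserved byte (0),
# then a 4-byte body size, most significant byte first (">" = big endian).
HEADER = struct.Struct(">BBI")

def pack_header(record_type, body_size):
    """Build the 6-byte record header (hypothetical helper name)."""
    return HEADER.pack(record_type, 0, body_size)

def unpack_header(raw):
    """Split a 6-byte header into (record type, reserved byte, body size)."""
    return HEADER.unpack(raw)

def be_uint(raw):
    """Rebuild an unsigned integer by shifting left 8 bits and OR-ing in
    the next byte, as described in the text."""
    value = 0
    for byte in raw:
        value = (value << 8) | byte
    return value

header = pack_header(0x58, 300)   # general comment record, 300-byte body
```

The size field of `header` is the four bytes 00 00 01 2C, so `be_uint(header[2:])` and `struct` recover the same value.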

The record header allows old programs to still read and use data files even if new record types have been added to the file. This satisfies the requirement of extensibility: programs may skip over records of unknown type.
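As a sketch of how a reader might skip records it does not recognize, the following Python loop reads each 6-byte header and either keeps the body or seeks past it. The function name and the type code 0x83 (an arbitrary application-specific code with bit 7 set) are illustrative assumptions, not part of the UED specification.

```python
import io
import struct

HEADER = struct.Struct(">BBI")  # record type, reserved byte, 32-bit body size

def read_known_records(stream, known_types):
    """Return (type, body) for records whose type is in known_types;
    records of any other type are skipped using the size in the header."""
    records = []
    while True:
        raw = stream.read(HEADER.size)
        if len(raw) < HEADER.size:
            break  # end of stream
        record_type, _reserved, body_size = HEADER.unpack(raw)
        if record_type in known_types:
            records.append((record_type, stream.read(body_size)))
        else:
            stream.seek(body_size, io.SEEK_CUR)  # skip the unknown body
    return records

def make_record(record_type, body):
    """Prefix a body with its header (helper for the demonstration)."""
    return HEADER.pack(record_type, 0, len(body)) + body

data = (make_record(0x58, b"hello\x00")          # general comment
        + make_record(0x83, b"\x01\x02\x03\x04")  # type unknown to this reader
        + make_record(0x58, b"bye\x00"))
found = read_known_records(io.BytesIO(data), {0x58})
```

Because the size is stored in every header, the reader never needs to understand the body of a record it skips.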

Record type

The record type value itself has a specific format that may allow programs to recognize even new types. This section has some very specific information as to how record type numbers are derived. [You may wish to skip over this section.]

The binary format of the record type word.

         msb           lsb
         +-+-+-+-+-+-+-+-+
         | | | | | | | | |
         +-+-+-+-+-+-+-+-+
          7 6 5 4 3 2 1 0

          |
          +-Bit 7 is always 0 for a standard record type.
            Programmers desiring to create new record types for a specific application
            must use bit 7 set to 1 (See below).
            If programmers intend to make their record type part of the standard,
            they must follow the record type encoding that follows.

            |
            +-
           0|0 Bit 6 = 0 always indicates the record contains a data set or value
               regardless of whether the record type is standard or an application
               specific record type.

              |
              +-
           0 0|0 Bit 5 = 0
                       Indicates a single value or group of values for a time period.
                       

                |
                +-
                |0 Bit 4 = 0 indicates a single scalar variable
                |1 Bit 4 = 1 indicates an array or set of values (I.e. layering record)

                   Bits 0-3 apply to all data sets or data values regardless of the value of bits 4 and 5, as described below.
              |
              +-
           0 0|1 Bit 5 = 1 Indicates a set of values (time series) for a time period
                       The data record will contain a time step indicator
                |
                +-
           0 0 0|0 Bit 4 = 0 is currently always 0
           0 0 0|1 Bit 4 = 1 is currently undefined (reserved for future use)
                             (records are now always time stamped)

                  Bits 0-3 apply to all data sets or data values regardless of the value of bits 4 and 5.
                  | |
                  +-+  Bits 2-3 Denote the time stamp indicators used
                       (applies to bit 3 either 1 or 0).
                       The following time stamp indicators have been defined:

           0 0 D D 0 0 Data stamped with year only.
           0 0 D D 0 1 Data stamped with time only.
           0 0 D D 1 0 Data stamped with a date only.
           0 0 D D 1 1 Data stamped with a date and time.

                      |
                      +-
           0 0 D D D D|0 Bit 1 = 0 indicates the record contains a variable code
           0 0 D D D D|1 Bit 1 = 1 indicates the variable code has been omitted
                            The variable code will only be omitted if a global
                              variable control code has been previously applied.

                        |
                        +-
           0 0 D D D D D|0 Bit 0 = 0 indicates the record contains a variable units code
           0 0 D D D D D|1 Bit 0 = 1 indicates the variable units code has been omitted.
                               The units code will only be omitted if a global variable
                               units control code has been previously applied.
            |
            +-
           0|1 Bit 6 = 1 always indicates the record contains information
               that is not a data set. (I.e comments or control)

              |
              +-
           0 1|0 Bit 5 = 0 indicates the record type is a control record type

                |
                +-
           0 1 0|0 Bit 4 = 0 indicates control applies until over ridden
                  |
                  +-
           0 1 0 0|0 Bit 3 = 0 indicates a marker
                    Bits 0-2 denote the 8 marker codes
           0 1 0 0 0|0 0 0 Beginning of file 16 bit standard revision number follows size
           0 1 0 0 0|0 0 1
           0 1 0 0 0|0 1 0
           0 1 0 0 0|0 1 1
           0 1 0 0 0|1 0 0
           0 1 0 0 0|1 0 1
           0 1 0 0 0|1 1 0
           0 1 0 0 0|1 1 1 End of file

                  |
                  +-
           0 1 0 0|1  Bit 3 = 1 Not yet defined
           0 1 0 0 1 x x x
                |
                +-
           0 1 0|1 Bit 4 = 1 indicates a global control (applies to whole database)

                  |
                  +-
           0 1 0 1|1 Bit 3 = 1 Commentary record
                    Bits 0-2 denote the type of commentary information

           0 1 0 1 1|0 0 0 A General comment (01011000) (followed by null terminated string)
           0 1 0 1 1|0 0 1 Description of the database (followed by a null terminated string)
           0 1 0 1 1|0 1 0 Information about the generating application (followed by a 16-bit integer and a null terminated string)
           0 1 0 1 1|0 1 1 reserved for future use
           0 1 0 1 1|1 0 0 Location information (Followed by location information record)
           0 1 0 1 1|1 0 1 reserved for future use
           0 1 0 1 1|1 1 0 reserved for future use
           0 1 0 1 1|1 1 1 reserved for future use
                  |
                  +- Bit 3 = 0 Not yet defined
           0 1 0 1|0 x x x
              +-
           0 1|1 Bit 5 = 1 Indicates a definition

                +-
           0 1 1|0 Bit 4 = 0 indicates a specification

                   Bits 0-3 denote the type of definition

           0 1 1 0|0 0 0 0   Variable format 0
           0 1 1 0|0 0 0 1   Variable format 1 (reserved for future use)
           0 1 1 0|0 0 1 0   Variable format 2 (reserved for future use)
           0 1 1 0|0 0 1 1   Variable format 3 (reserved for future use)

           0 1 1 0|0 1 0 0   Units format 0
           0 1 1 0|0 1 0 1   Units format 1 (reserved for future use)
           0 1 1 0|0 1 1 0   Units format 2 (reserved for future use)
           0 1 1 0|0 1 1 1   Units format 3 (reserved for future use)

           0 1 1 0|1 0 0 0   Conversion function format 0
           0 1 1 0|1 0 0 1   Conversion function format 1 (reserved for future use)
           0 1 1 0|1 0 1 0   Conversion function format 2 (reserved for future use)
           0 1 1 0|1 0 1 1   Conversion function format 3 (reserved for future use)

           0 1 1 0|1 1 0 0   (reserved for future use)
           0 1 1 0|1 1 0 1   (reserved for future use)
           0 1 1 0|1 1 1 0   (reserved for future use)
           0 1 1 0|1 1 1 1   (reserved for future use)

                +-
           0 1 1|1 Bit 4 = 1 indicates a data format definition
           0 1 1 1|x x x x  The bit layout for this format has not yet been defined.

          |
          +-
             Bit 7 Set indicates a non-standard record type

             If bit 7 is set, bits 0-6 can be defined in any way by an application;
             however, if the user wishes to later have the data type standardized,
             the bit format should follow the encoding above as if it were a standard type.
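The bit layout above can be read back with a few masks. The following Python sketch decodes only the fields described in the diagram; the function name and the dictionary keys are illustrative choices, and for non-standard types (bit 7 set) the lower bits are only meaningful if the application followed the standard encoding.

```python
TIME_STAMPS = ("year", "time", "date", "date and time")  # bits 3-2

def describe_record_type(code):
    """Decode a record type byte following the bit diagram above."""
    info = {"standard": (code & 0x80) == 0}          # bit 7
    if (code & 0x40) == 0:                           # bit 6 = 0: data
        info["class"] = "data"
        info["time_series"] = bool(code & 0x20)      # bit 5
        info["array"] = bool(code & 0x10)            # bit 4 (used when bit 5 = 0)
        info["time_stamp"] = TIME_STAMPS[(code >> 2) & 0x03]
        info["has_variable_code"] = (code & 0x02) == 0   # bit 1 = 0: present
        info["has_units_code"] = (code & 0x01) == 0      # bit 0 = 0: present
    elif (code & 0x20) == 0:                         # bit 6 = 1, bit 5 = 0
        info["class"] = "control"
    else:                                            # bit 6 = 1, bit 5 = 1
        info["class"] = "definition"
    return info

series = describe_record_type(0b00101100)
```

Here `series` describes a standard time series record stamped with a date and time that carries both its variable code and its units code.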


Example record types

This section gives some examples of frequently used data record types, part of a more complete list of currently defined record type codes.

Data records

Yearly data set record format

Yearly data set records would have the following record type bit pattern (bits are numbered 8 to 1 here, most significant first):
   8 7 6 5 4 3 2 1
   X 0 0 0 0 1 X X
Thus the following format:
int16 year The year of the data set
uint32 variable code May be omitted if bit 2 set
uint32 value units code The units of the values. May be omitted if bit 1 set
byte attribute This summarizes the quality of the data for the year. Note that multiple attribute bits can be set.
byte time step units code I.e. Year (Month not implemented), Day, Hour, Second
int16 num_values The number of values
float values[num_values]
byte attributes[num_values] (present only if attribute not = 0)
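To make the field order concrete, here is a sketch that packs and unpacks the body of a yearly data set record in Python. It rests on several assumptions not stated in the text: floats are 32-bit IEEE values, all fields are big endian like the header integers, and both the variable code and the units code are present. The codes 1, 2 and 3 are placeholders, not real UED codes.

```python
import struct

# Fixed part: int16 year, uint32 variable code, uint32 units code,
# byte attribute, byte time step units code, int16 num_values.
FIXED = struct.Struct(">hIIBBh")

def pack_yearly(year, variable_code, units_code, attribute,
                time_step_code, values, value_attributes=b""):
    """Build a yearly data set record body (hypothetical helper)."""
    body = FIXED.pack(year, variable_code, units_code, attribute,
                      time_step_code, len(values))
    body += struct.pack(">%df" % len(values), *values)
    if attribute != 0:
        # Per-value attributes are present only when attribute is not 0.
        body += bytes(value_attributes)
    return body

def unpack_yearly(body):
    """Inverse of pack_yearly under the same assumptions."""
    (year, variable_code, units_code, attribute,
     time_step_code, count) = FIXED.unpack_from(body, 0)
    values = struct.unpack_from(">%df" % count, body, FIXED.size)
    start = FIXED.size + 4 * count
    value_attributes = body[start:start + count] if attribute != 0 else b""
    return (year, variable_code, units_code, attribute,
            time_step_code, list(values), value_attributes)

body = pack_yearly(1999, 1, 2, 0, 3, [1.5, 2.25, 0.0, 10.0])
flagged = pack_yearly(2000, 1, 2, 2, 3, [1.0, 2.0], bytes([2, 0]))
```

With the record attribute at 0 the body is 14 bytes of fixed fields plus 4 bytes per value; a non-zero attribute adds one attribute byte per value.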

Daily data set record format

Daily data set records have the following format:

Date date
uint32 variable_code
uint32 value units code The units of the values
byte attribute This summarizes the quality of the data in the record. Note that multiple attribute bits can be set.
byte time step units code I.e. Day, Hour, Second
uint16 num_values The number of values
float values[num_values]
byte attributes[num_values] (present only if attribute not = 0)

Scalar (individual value) dated record format

Scalar Data value records are used to specify very sparse data, when individual records would be more space efficient than data set records.

They have the following format:
Date date
Time time Will be 0 if daily value
uint32 variable code
uint32 units code
byte attribute
float value

Vectors (a set of values for a given time step) dated record format

Vector data value records are used to specify a group of values for a specific condition for a given time step (for example, for a given year, the thickness of soil horizons). All values share the same attribute, so there will never be an attributes array.

They have the following format:
Date date
Time time Will be 0 if daily value
uint32 variable code
uint32 units code
byte attribute
uint16 num_values The number of values
float values[num_values]

Time step

The time step code is one of the standard time units codes.

For example, a year record with time step unit Year and 4 values is a quarterly set of values. This is equivalent to a year record with time step unit Month and 4 values.

Control records

Another class of record types are control records.
Markers
A set of markers is reserved for identifying sections of the file. Currently "Beginning of file" and "End of file" are defined.
Location identification
Because of the size of the files involved, the preferred method for storing data is one file for each station; however, in some situations, it might be preferable to store datasets for multiple stations in a single file. This makes it necessary to separate data from different stations. Data will be sorted by station, and the station identification record is placed at the start of the block of data records for that station. The station identification record contains the following information:
Apply global time stamp
Apply global time step
Apply global variable
Apply global variable units
Index

Currently defined variable codes


These variable codes can be qualified for the time step, e.g. Actual, Potential, Maximum, Minimum, Average, etc.

Quality Attributes byte

Bits 0-2 of the attributes byte identify the quality of the data, giving the following values:
0 = The value is actual data.
1 = The value was calculated from real data (e.g. dew point temperature or potential solar radiation).
2 = The value was estimated from real data (e.g. estimated solar radiation).
3 = A forecast value (bits 0 and 1 set).
4 = A generated value based on actual data.
5 = A generated value based on calculated data.
6 = A generated value based on estimated data.
7 = A generated value based on forecasted data.

Bits 3-6 are not defined at this time.

Bit 7 = 0 indicates the data is valid, 1 indicates the data is invalid or missing.
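A reader can split the attribute byte into its quality code and validity flag with two masks. The following Python sketch assumes nothing beyond the bit assignments listed above; the function name and the label strings are illustrative.

```python
QUALITY = (
    "actual",                          # 0
    "calculated from real data",       # 1
    "estimated from real data",        # 2
    "forecast",                        # 3
    "generated from actual data",      # 4
    "generated from calculated data",  # 5
    "generated from estimated data",   # 6
    "generated from forecasted data",  # 7
)

def decode_attribute(attribute):
    """Return the quality label (bits 0-2) and validity (bit 7 = 0)."""
    return {
        "quality": QUALITY[attribute & 0x07],
        "valid": (attribute & 0x80) == 0,
    }

measured = decode_attribute(0x00)
missing_estimate = decode_attribute(0x82)
```

Bits 3-6 are ignored here because they are not defined at this time.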

Units code derivation

UED allows for a variety of units codification schemes, including several standardized units systems, or the option to create a user-definable system.

The standard coding scheme used by all the records in the database is specified using the Coding scheme record.

Irrespective of the codification scheme, the following will always be true:

Derivation of standard units system.

The following standard units codification systems are available:
UED physical property system
This is the original UED units coding. It is based on the idea that units generally are associated with the same physical property characteristics as the variables they are attributed to, and thus have a very similar codification derivation. Consequently the units tend to be unique for a given variable. For example, the units code for temperature of air is not the same as the units code for water temperature or soil temperature. Similarly, the depth of water is not the same as the depth of soil.

This system incorporates both Metric and English units code derivation as standard, allowing both systems to be used in the same file.

UED generalized units system
This system is based on the International System of Units (SI). It is based on the idea that a unit is the same no matter what variable it is applied to, i.e. degrees Celsius is the same for all temperatures using Celsius as the units of temperature.

This system allows for simplification of automatic units conversion done by UED C++ classes when accessing UED data record values. Indeed, the UED metric physical property system units are converted to the generalized units codes to perform the conversions in that system.

This system also facilitates composition of derived units codes. For example, the units m/s are composed of a numerator in meters and a denominator in seconds. This simplifies creating many derived units without having to explicitly create a definition for each combination of units.

This system only recognizes the SI system as standard. If you wish to include records with English units, you will have to define the English units using non-standard codification.