Many text and binary file formats are already available. For example, it seems each weather data service has its own format. However, a number of problems and deficiencies exist with these formats.
This article discusses a general binary weather file format which attempts to avoid the problems with the existing binary and text file formats.
UED is currently implemented as a binary file format in deference to the advantages afforded by the structuring of the binary data. There are some instances where the advantages of a text-based file format (particularly portability and the ability to view/edit the data with a text editor, as mentioned below) may outweigh its disadvantages (particularly storage requirements). It is feasible to create a text-based format with similar features for annotation and efficient data structuring. Please contact rnelson@wsu.edu if you are considering creating a text-based format with the features of UED.
Binary data files are not human readable: in order to view the data, a specific program must be written to allow people to view/edit the data.
Relational databases store data in tabular data structures within the database system. Data is viewed/edited using specific database software.
One might first ask: why favour one format over the other if the data can be equally represented in any of the formats? The following table illustrates the advantages and disadvantages of using text or binary file formats:
Feature | Text | Binary | Rationale as applies to UED |
Readability by humans | + Easy to read | - Effectively impossible to read | While it would certainly be advantageous to be able to easily read data files directly without the need for special software, environmental time series data often involves large data sets; the analysis of the data will usually already require specialized visualization software. |
Viewing software | + Requires no special program | - Requires a program specifically designed to read the data | |
Editability | + Can usually edit with a text editor. - Users can introduce errors in editing. | - A special program is needed for editing. + The editing software can help prevent errors. + The editing software can provide features to help simplify editing data. | There are few situations where end users will need or even want to edit data values, because the initial data recording is usually done in some computerized manner; the exceptions arise from recording equipment failure or false readings. Data editing programs are preferable to editing data with a text editor. In cases where it is necessary to enter data from text, conversion programs can be provided for the user. |
Security | - There are often situations where data providers don't necessarily want users to be able to directly view and/or edit data. | + Users are less likely to attempt to directly modify binary data when they can't easily read it. | For example, weather data sets might be considered "Read only" for end users. |
Parsability | - Depending on the layout of the data, text files can be difficult to read since users might introduce various spacing or use different delimiters. These different possibilities must be accounted for when reading the data file. | + Because binary files are always written by programs, there will always be a consistent format, so it is much easier for the program to read the data. | |
Annotation | + It is easy to add comments to text. - The data parser must be able to accommodate annotations. | - While provisions for annotations can always be added to binary data file formats, annotations of variable length can disrupt the simplicity of the binary file format, especially when computing relative record addresses. | The binary file format presented here will provide options for marking data as valid, estimated, generated, etc. in a consistent and efficient way. |
Identifying the format when no documentation is available | + A user with some knowledge of the data might be able to reconstruct the format of the file by viewing the data. | - With no knowledge of the format of the data it is virtually impossible to identify the format of the data. | The binary format presented here will use a self-documenting format encoding that will allow future extension and modification of the format as needs arise. |
Encoding the format in the file itself | - Somehow encoding the format of the file in the data file itself may disrupt the simplicity of the text file format, potentially making the file more difficult for people to read and also more difficult for programs to parse. | + The format of the file can be easily encoded in the file itself without significantly adding to the complexity of parsing the data. | |
Portability | + Text data is usually ASCII text readable by most computer software. | - There are a number of floating point representations, and conversions may need to be done. Also, computer processors are either big endian or little endian, so data values wider than 1 byte may also require conversion. | While a good text file format allows weather data to be imported into a variety of application programs such as spreadsheets and statistical software, a good binary file format with the many advantages that binary format provides will be adopted by both data providers and software that uses weather data. Binary file formats are generally easier to program, especially in cases where text data might potentially be improperly formatted. |
I/O speed | - Slow : Converting text to/from internal binary representation is a rather time consuming process | + Orders of magnitude faster than reading/writing text. | The binary format presented here attempts to utilize, to the fullest, the advantages of fast I/O speed and access and efficiency of space avoiding superfluous and redundant data. |
I/O Access | - Access to individual records usually requires sequential I/O. Since text file formats tend to have free-form layouts, it is often more difficult to make changes to data in place, requiring a second, completely new file to be produced in order to store any changes. | + Binary data is usually formatted such that record addresses can be easily computed for fast file pointer relative file access. | |
Floating point precision | - Providing additional precision requires additional file size. | + The binary floating point representation is optimized to allow the most precision possible for any given floating point type of a given size. | |
File size | - Text representation of data always requires much more file space than binary data. - Additional space is required for greater precision of floating point data. - Additional space is required to make the presentation of the data more appealing and easier to read. | + Binary data is the most compact representation of data. This, on average, means binary files are less than half the size of text files and most relational database tables. + Compression algorithms can be applied to binary data. | This can often further cut file size requirements in half. The current UED format does not provide specifications for data compression, but this could be added later if needed. |
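The Portability row above notes that values wider than 1 byte may need conversion between big endian and little endian processors. The following is a minimal sketch of one way a reader could make this independent of the host processor. It assumes, purely for illustration, that multi-byte values are stored least significant byte first and that floats are 32 bit IEEE 754 values; this article does not state the byte order used by UED, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assemble a 32 bit unsigned integer from 4 bytes stored least significant
// byte first. Shifting the bytes into place gives the same result on big
// endian and little endian hosts. (Little endian storage is an assumption.)
inline std::uint32_t load_u32_le(const unsigned char* p) {
    assert(p != nullptr);
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Write a 32 bit unsigned integer as 4 bytes, least significant byte first.
inline void store_u32_le(std::uint32_t v, unsigned char* p) {
    assert(p != nullptr);
    for (int i = 0; i < 4; ++i) {
        p[i] = static_cast<unsigned char>((v >> (8 * i)) & 0xFFu);
    }
}

// Read a float by loading its 32 bit pattern and copying the bits.
// Assumes the host float is a 32 bit IEEE 754 value whose byte order
// matches the host integer byte order, as on common platforms.
inline float load_f32_le(const unsigned char* p) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32 bit float");
    const std::uint32_t bits = load_u32_le(p);
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

A reader built this way never casts the file buffer directly to a wider type, so no separate byte swapping step is needed on either kind of processor.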
Each record is prefixed by a 6 byte header:
The record header allows old programs to still read and use data files even if new record types have been added to the file; this satisfies the requirement of extensibility. Programs may skip over records of unknown type.
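Skipping unknown records can be sketched as follows. The field layout of the 6 byte header is not reproduced in this section, so the sketch assumes, purely for illustration, a 16 bit record type code followed by a 32 bit body length, both least significant byte first. `RecordHeader`, `is_known_type` and `known_record_offsets` are hypothetical names, not part of the UED library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMED header layout (not given in the text above):
// bytes 0-1 = record type code, bytes 2-5 = length of the record body.
struct RecordHeader {
    std::uint16_t type;
    std::uint32_t body_length;
};
constexpr std::size_t kHeaderSize = 6;

inline RecordHeader read_header(const unsigned char* p) {
    RecordHeader h;
    h.type = static_cast<std::uint16_t>(p[0] | (p[1] << 8));
    h.body_length =  static_cast<std::uint32_t>(p[2])
                  | (static_cast<std::uint32_t>(p[3]) << 8)
                  | (static_cast<std::uint32_t>(p[4]) << 16)
                  | (static_cast<std::uint32_t>(p[5]) << 24);
    return h;
}

// Hypothetical "old program" that only understands the two marker records
// 01000000 (beginning of file) and 01000111 (end of file).
inline bool is_known_type(std::uint16_t type) {
    return type == 0x40 || type == 0x47;
}

// Walk the file image and return the offsets of the records this program
// understands. Records of unknown type are stepped over using the length
// stored in their header, so their contents are never interpreted.
inline std::vector<std::size_t>
known_record_offsets(const std::vector<unsigned char>& file) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while (file.size() - pos >= kHeaderSize) {
        const RecordHeader h = read_header(&file[pos]);
        const std::size_t body_start = pos + kHeaderSize;
        if (h.body_length > file.size() - body_start) break;  // truncated file
        if (is_known_type(h.type)) offsets.push_back(pos);
        pos = body_start + h.body_length;  // next record, known or not
    }
    return offsets;
}
```

Because the step to the next record depends only on the header, a program written before a new record type existed keeps working on files that contain it.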
The binary format of the record type word:

 msb           lsb
+-+-+-+-+-+-+-+-+
| | | | | | | | |
+-+-+-+-+-+-+-+-+
 7 6 5 4 3 2 1 0

Bit 7 is always 0 for a standard record type. Programmers desiring to create new record types for a specific application must use bit 7 set to 1 (see below). If programmers intend to make their record type part of the standard, they must follow the record type encoding that follows.

  0|0  Bit 6 = 0 always indicates the record contains a data set or value, regardless of whether the record type is standard or an application-specific record type.

    0 0|0  Bit 5 = 0 indicates a single value or group of values for a time period.
      |0  Bit 4 = 0 indicates a single scalar variable.
      |1  Bit 4 = 1 indicates an array or set of values (i.e. layering record).
      Bits 0-3 apply to all data sets or data values regardless of the value of bits 4 and 5, as described below.

    0 0|1  Bit 5 = 1 indicates a set of values (time series) for a time period. The data record will contain a time step indicator.
      0 0 1|0  Bit 4 = 0 is currently always 0.
      0 0 1|1  Bit 4 = 1 is currently undefined (reserved for future use).
      (Records are now always time stamped.)
      Bits 0-3 apply to all data sets or data values regardless of the value of bits 4 and 5.

    Bits 2-3 denote the time stamp indicators used (applies to bit 3 either 1 or 0). The following time indicators have been defined:
      0 0 D D 0 0  Data stamped with year only.
      0 0 D D 0 1  Data stamped with time only.
      0 0 D D 1 0  Data stamped with a date only.
      0 0 D D 1 1  Data stamped with a date and time.

    0 0 D D D D|0  Bit 1 = 0 indicates the record contains a variable code.
    0 0 D D D D|1  Bit 1 = 1 indicates the variable code has been omitted.
      The variable code will only be omitted if a global variable control code has been previously applied.

    0 0 D D D D D|0  Bit 0 = 0 indicates the record contains a variable units code.
    0 0 D D D D D|1  Bit 0 = 1 indicates the variable units code has been omitted.
      The units code will only be omitted if a global variable units control code has been previously applied.

  0|1  Bit 6 = 1 always indicates the record contains information that is not a data set (i.e. comments or control).

    0 1|0  Bit 5 = 0 indicates the record type is a control record type.

      0 1 0|0  Bit 4 = 0 indicates the control applies until overridden.

        0 1 0 0|0  Bit 3 = 0 indicates a marker. Bits 0-2 denote the marker codes:
          0 1 0 0 0|0 0 0  Beginning of file (16 bit standard revision number follows size)
          0 1 0 0 0|0 0 1
          0 1 0 0 0|0 1 0
          0 1 0 0 0|0 1 1
          0 1 0 0 0|1 0 0
          0 1 0 0 0|1 0 1
          0 1 0 0 0|1 1 0
          0 1 0 0 0|1 1 1  End of file

        0 1 0 0|1  Bit 3 = 1 Not yet defined.
          0 1 0 0 1 x x x

      0 1 0|1  Bit 4 = 1 indicates a global control (applies to whole database).

        0 1 0 1|1  Bit 3 = 1 Commentary record. Bits 0-2 denote commentary information:
          0 1 0 1 1|0 0 0  A general comment (01011000) (followed by a null terminated string)
          0 1 0 1 1|0 0 1  Description of the database (followed by a null terminated string)
          0 1 0 1 1|0 1 0  Information about the generating application (followed by a 16 bit integer and a null terminated string)
          0 1 0 1 1|0 1 1  reserved for future use
          0 1 0 1 1|1 0 0  Location information (followed by location information record)
          0 1 0 1 1|1 0 1  reserved for future use
          0 1 0 1 1|1 1 0  reserved for future use
          0 1 0 1 1|1 1 1  reserved for future use

        0 1 0 1|0  Bit 3 = 0 Not yet defined.
          0 1 0 1 0 x x x

    0 1|1  Bit 5 = 1 indicates a definition.

      0 1 1|0  Bit 4 = 0 indicates a specification. Bits 0-3 denote the type of definition:
        0 1 1 0|0 0 0 0  Variable format 0
        0 1 1 0|0 0 0 1  Variable format 1 (reserved for future use)
        0 1 1 0|0 0 1 0  Variable format 2 (reserved for future use)
        0 1 1 0|0 0 1 1  Variable format 3 (reserved for future use)
        0 1 1 0|0 1 0 0  Units format 0
        0 1 1 0|0 1 0 1  Units format 1 (reserved for future use)
        0 1 1 0|0 1 1 0  Units format 2 (reserved for future use)
        0 1 1 0|0 1 1 1  Units format 3 (reserved for future use)
        0 1 1 0|1 0 0 0  Conversion function format 0
        0 1 1 0|1 0 0 1  Conversion function format 1 (reserved for future use)
        0 1 1 0|1 0 1 0  Conversion function format 2 (reserved for future use)
        0 1 1 0|1 0 1 1  Conversion function format 3 (reserved for future use)
        0 1 1 0|1 1 0 0  (reserved for future use)
        0 1 1 0|1 1 0 1  (reserved for future use)
        0 1 1 0|1 1 1 0  (reserved for future use)
        0 1 1 0|1 1 1 1  (reserved for future use)

      0 1 1|1  Bit 4 = 1 indicates a data format definition.
        0 1 1 1|x x x x  The bit layout for this format has not been defined yet.

Bit 7 set indicates a non-standard record type. If bit 7 is set, bits 0-6 can be defined in any way by an application; however, if the user wishes to later have the data type standardized, the bit format should follow the encoding above as if the type were a standard type.
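The bit assignments for data records described above can be tested with simple masks. The sketch below is only an illustration of that decoding, not the code used by the UED C++ classes; the enum and function names are hypothetical. It covers the bits whose meaning is defined for data records (bit 6 = 0).

```cpp
#include <cassert>
#include <cstdint>

// Time stamp indicator held in bits 2-3 of a data record type code.
enum class TimeStamp {
    YearOnly    = 0,  // bits 3,2 = 0 0
    TimeOnly    = 1,  // bits 3,2 = 0 1
    DateOnly    = 2,  // bits 3,2 = 1 0
    DateAndTime = 3   // bits 3,2 = 1 1
};

// Bit 7 set: application-specific (non-standard) record type.
inline bool is_non_standard(std::uint8_t code) { return (code & 0x80) != 0; }

// Bit 6 clear: the record contains a data set or value.
inline bool is_data_record(std::uint8_t code) { return (code & 0x40) == 0; }

// The helpers below are only meaningful when is_data_record() is true.

// Bit 5 set: a set of values (time series) for a time period.
inline bool is_time_series(std::uint8_t code) { return (code & 0x20) != 0; }

// Bit 4 set (with bit 5 clear): an array or set of values.
inline bool is_array(std::uint8_t code) { return (code & 0x10) != 0; }

// Bits 2-3: which time stamp fields accompany the data.
inline TimeStamp time_stamp_of(std::uint8_t code) {
    return static_cast<TimeStamp>((code >> 2) & 0x03);
}

// Bit 1 clear: the record carries its own variable code.
inline bool has_variable_code(std::uint8_t code) { return (code & 0x02) == 0; }

// Bit 0 clear: the record carries its own units code.
inline bool has_units_code(std::uint8_t code) { return (code & 0x01) == 0; }
```

For example, the code 00101100 decodes as a standard time series record stamped with a date and time that carries both its variable code and its units code.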
8 7 6 5 4 3 2 1
X 0 0 0 0 1 X X
int16 | year | The year of the data set | |
uint32 | variable code | May be omitted if bit 2 set | |
uint32 | value units code | The units of the values | May be omitted if bit 1 set |
byte | attribute | This summarizes the quality of the data for the year. Note that multiple attribute bits can be set. | |
byte | time step units code | I.e. Year, (Month not implemented), Day, Hour, Second | |
int16 | num_values | The number of values | |
float | values[num_values] | ||
byte | attributes[num_values] | (present only if attribute not = 0) |
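The tail of the record layout above (attribute, time step units code, num_values, values, and the optional per-value attributes) could be read as in the following sketch. It assumes the earlier fields of the record have already been consumed, that multi-byte values are stored least significant byte first, and that `float` is a 32 bit IEEE 754 value; none of these is stated in the table. `DataSetValues` and `parse_values` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct DataSetValues {
    std::uint8_t attribute = 0;            // quality summary for the whole set
    std::uint8_t time_step_units = 0;      // time step units code
    std::vector<float> values;             // values[num_values]
    std::vector<std::uint8_t> attributes;  // empty when attribute == 0
};

// Parse the fields starting at the attribute byte. Returns false if the
// buffer is too short or the value count is negative.
inline bool parse_values(const unsigned char* p, std::size_t size,
                         DataSetValues& out) {
    if (size < 4) return false;
    out.attribute = p[0];
    out.time_step_units = p[1];
    const std::int16_t n = static_cast<std::int16_t>(
        static_cast<std::uint16_t>(p[2] | (p[3] << 8)));
    if (n < 0) return false;
    const std::size_t count = static_cast<std::size_t>(n);

    // Per-value attributes are present only if the summary attribute is not 0.
    const std::size_t attr_count = (out.attribute != 0) ? count : 0;
    if (size < 4 + count * 4 + attr_count) return false;

    out.values.resize(count);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char* v = p + 4 + i * 4;
        const std::uint32_t bits =  static_cast<std::uint32_t>(v[0])
                                 | (static_cast<std::uint32_t>(v[1]) << 8)
                                 | (static_cast<std::uint32_t>(v[2]) << 16)
                                 | (static_cast<std::uint32_t>(v[3]) << 24);
        static_assert(sizeof(float) == 4, "this sketch assumes a 32 bit float");
        std::memcpy(&out.values[i], &bits, 4);
    }
    const unsigned char* a = p + 4 + count * 4;
    out.attributes.assign(a, a + attr_count);
    return true;
}
```

Omitting the per-value attribute array when the summary attribute is 0 is what keeps records of uniformly valid data compact.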
Date | date | |
uint32 | variable_code | |
uint32 | value units code | The units of the values |
byte | attribute | This summarizes the quality of the data for the year. Note that multiple attribute bits can be set. |
byte | time step units code | I.e. Day, Hour, Second |
uint16 | num_values | The number of values |
float | values[num_values] | |
byte | attributes[num_values] | (present only if attribute not = 0) |
They have the following format:
Date | date | |
Time | time | will be 0 if daily value |
uint32 | variable code | |
uint32 | units code | |
byte | attribute | |
float | value |
They have the following format:
Date | date | |
Time | time | will be 0 if daily value |
uint32 | variable code | |
uint32 | units code | |
byte | attribute | |
uint16 | num_values | The number of values |
float | values[num_values] |
For example, a year record with time step unit Year with 4 values is a quarterly set of values. This is equivalent to a year record with time step unit Month with 4 values.
Bits 3-6 are not defined at this time.
Bit 7 = 0 indicates the data is valid, 1 indicates the data is invalid or missing.
0 indicates that the values all have the same attributes as set in bits 1-4. In this case either one or none of the bits 1-4 will be set.
The standard coding scheme used by all the records in the database is specified using the Coding scheme record.
Irrespective of the codification scheme, the following will always be true:
This system incorporates both Metric and English units code derivation as standard, allowing both systems to be used in the same file.
This system allows for simplification of automatic units conversion done by UED C++ classes when accessing UED data record values. Indeed, the UED metric physical property system units are converted to the generalized units codes to perform the conversions in that system.
This system also facilitates the composition of derived units codes. For example, the units m/s is composed of a numerator in meters and a denominator in seconds. This simplifies creating many derived units without having to explicitly create a definition for each combination of units.
This system only recognizes the SI system as standard. If you wish to include records with English units, you will have to define the English units using non-standard codification.