This was a brief technical correspondence describing why not to store large simulation model input/output data in relational databases.

There are three issues with respect to storing simulation model input/output in commercial relational database systems.

  1. The first issue is how to handle generated data: do you generate it as you go along ("on the fly"), or generate it "up front" and store it in a database for use?

    Before ClimGen, CropSyst "generated" (or rather "estimated") solar radiation "on the fly".

    We have decided that it is better to move all the data generation out of CropSyst, so the user is expected to have all the data prepared "up front" before running the model.

    We will be adding a check: if the user has specified dates for which no data is available, ClimGen will be run to generate the required data. This is still technically done "up front", since the check is performed by the user interface, not the model itself.

    Now, it may seem reasonable to feed generated data to the model on the fly, since then you apparently would not need to store it in a database. But this has the following disadvantages:

    1. Each time you run the model, you will likely get different generated data (due to the random number generator), so if you need to rerun with exactly the same weather data, you have to store that data (see the sketch after this list). Consequently, in any case you are always going to end up storing your data in a database of some form.
    2. If you want to know what the input data was (which is pretty important, since it is the basis of our models), you still need to store the data.
    3. Further to point 2: suppose you are not regenerating data for each run (for example, using the same set of daily weather data by disabling the RANDOMIZE function). For each run you would then be outputting the same unchanging data over and over again. This takes time; it may not be much, but it does slow down the run (and it can be a real issue if you are using SQL, which I'll discuss below).

    So it is apparent to me that no matter what you do, at some point you are going to need to store the data in a database, whether that happens up front or as output.
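
    To illustrate point 1: a generated weather series is only repeatable if you pin the random number generator's seed, which is itself just another way of recording the data. A minimal sketch (this is not CropSyst or ClimGen code; the seed and distribution are invented):

    ```cpp
    #include <iostream>
    #include <random>

    int main()
    {
       // With a fixed seed, every run reproduces the identical "generated"
       // series; with the usual time-based seed, every run differs.
       std::mt19937 rng(12345);                                // fixed seed => reproducible
       std::normal_distribution<double> max_temp(25.0, 5.0);   // invented distribution

       for (int day = 0; day < 5; ++day)
          std::cout << "day " << day << "  Tmax " << max_temp(rng) << '\n';
       return 0;
    }
    ```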

    In all my programming, I use the concept of "data sources". The datasource is analogous to a data provider in a database system. The process of communicating data between files, databases, or whatever is divorced from the code of the model itself. Essentially you can plug in any data source as the data provider. So, for example, the weather data provider can be a UED database datasource, a text file datasource, or (if you want) an "on the fly" generator serving as a datasource. The program works by telling the datasource "Here is a data record where I expect precip, max temp, min temp, etc.", then for each record saying "update the data record". This is extremely fast, because the datasource simply updates the record in the most expedient manner it is capable of. In the case of the CropSyst text file weather datasource, it creates input buffers to mitigate file access. In the case of UED, it uses a combination of very fast binary file access, in-memory buffers, and indexes.

    My datasource classes are all based on a virtual abstract class. In COM terminology this is similar to a COM interface, but in C++ it is more generalized and inheritable, and thus easier to use and maintain. This allows datasources to be virtually anything: CropSyst INI format parameter files, commercial databases (like Oracle, Access, Paradox, dBase), individual tables (such as my simple dBase compatible input/output object), my UED database class, and COM servers such as Excel. It is relatively easy to write datasource subclasses for other formats; a sketch of the general shape follows below. I am eventually hoping to have all CropSyst input and output go through datasources (the input is done, and the output is partially done, since it was needed for LADSS and is currently used for fast graph output). I am almost finished with the report generating datasource, which will replace the current Excel spreadsheet generation and run at the end of the simulation rather than during it. This should make the CropSyst run seem almost instantaneous in comparison to now.
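
    To make this concrete, here is a minimal sketch of the pattern (the class, member, and file names are mine for illustration, not the actual CropSyst classes): an abstract base class declares the record update operation, and each subclass fills in the record in whatever way it can do fastest.

    ```cpp
    #include <cstdio>

    // The record the model registers with the datasource; the datasource
    // fills in these fields for each timestep. (Illustrative layout.)
    struct Weather_record
    {
       float precipitation;
       float max_temperature;
       float min_temperature;
    };

    // Abstract base class: anything that can fill in a Weather_record is a
    // weather datasource, whether it reads a text file, a UED database,
    // or generates values on the fly.
    class Weather_data_source
    {
    public:
       virtual ~Weather_data_source() {}
       virtual bool update(Weather_record &record) = 0;  // advance one day
    };

    // One possible subclass: a text file reader.
    class Text_file_weather_source : public Weather_data_source
    {
       FILE *file;
    public:
       Text_file_weather_source(const char *filename)
          : file(fopen(filename, "r")) {}
       ~Text_file_weather_source() { if (file) fclose(file); }

       virtual bool update(Weather_record &record)
       {  // one line per day: precip, max temp, min temp
          return file && fscanf(file, "%f %f %f",
                                &record.precipitation,
                                &record.max_temperature,
                                &record.min_temperature) == 3;
       }
    };

    // The model only ever sees the base class, so any source can be plugged in.
    void run_simulation(Weather_data_source &weather)
    {
       Weather_record today;
       while (weather.update(today))
       {
          // ... simulate one day using today's values ...
       }
    }

    int main()
    {
       Text_file_weather_source source("weather.dat");  // hypothetical file
       run_simulation(source);
       return 0;
    }
    ```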

    Anyway, to get back to the point: you won't be able to implement datasources exactly as I have done in C++, because VB and COM don't truly support inheritance, but you can get a similar effect using interfaces. I think you will find that as you work out your design in UML, you will come to think of input and output in terms of data sources abstracted from the model itself. Which brings me to the second issue.

  2. The second issue is getting the generated data to the application (simulation model).

    The problem with using a database system such as Access via an SQL interface is that the communication between a program and the database requires a lot of processing just to get the data into the format for storage. There is also additional overhead for data security, interfacing with data source drivers, etc., provided by the database system. These things are great for business-type applications, and some would even be nice for modeling: it would be really convenient to store the ClimGen generated data in a centralized database system so that you could easily use the data directly in a spreadsheet or join it with other data in a database query. Indeed, I would like to provide an ODBC driver for my UED database format so you could do just that.

    For simulation runs, however, this overhead adds up quickly, because at each timestep of the simulation you need to output a lot of data. Consider CropSyst: the daily output has about 80 scalar variables, and each soil profile can have at least a dozen sublayers. You can very easily be outputting a couple hundred variables per timestep.
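
    To put that in concrete terms, the shape of one timestep's output might look something like this (the field names and per-layer variable count are invented; the totals are the figures above):

    ```cpp
    // Rough shape of one timestep's output (names invented for illustration).
    const int SCALAR_COUNT   = 80;  // ~80 scalar daily variables
    const int SUBLAYER_COUNT = 12;  // at least a dozen sublayers per profile

    struct Soil_layer_output
    {
       float water_content;
       float temperature;
       // ... say, ten variables per sublayer ...
    };

    struct Daily_output
    {
       float scalars[SCALAR_COUNT];               // the ~80 daily scalars
       Soil_layer_output layers[SUBLAYER_COUNT];  // 12 layers x ~10 variables
    };
    // 80 + (12 x 10) = ~200 values crossing the database interface per timestep.
    ```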

    Consider what you have to do to send each variable to the database system with SQL. This is what we did with LADSS, and something similar would be done no matter what system or interface you are using. First you have to construct the SQL statement. In Oracle, using OCI, for example, we compose the SQL statement (which is a character string) with intrinsic markers for where the output variables will be "printed". This is done using string concatenation, a relatively time-expensive operation.
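
    Roughly, the per-timestep work looks like the following sketch (this is not the actual LADSS/OCI code; the table name and statement layout are invented). Every value must be converted to text and appended, and the finished string still has to be parsed and executed by the database:

    ```cpp
    #include <cstdio>
    #include <string>

    // Build the INSERT statement for one timestep by string concatenation.
    // This happens every simulated day, for every output variable.
    std::string compose_insert(int year, int day, const float *values, int count)
    {
       char buffer[32];
       std::string sql = "INSERT INTO daily_output VALUES (";
       sprintf(buffer, "%d, %d", year, day);
       sql += buffer;
       for (int i = 0; i < count; ++i)
       {
          sprintf(buffer, ", %g", values[i]);  // float -> text conversion
          sql += buffer;                       // concatenation may reallocate
       }
       sql += ")";
       return sql;  // the database must still parse and execute this string
    }

    int main()
    {
       float values[3] = {1.5f, 27.0f, 12.0f};
       std::puts(compose_insert(1999, 152, values, 3).c_str());
       return 0;
    }
    ```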

    All this format conversion and IPC is very expensive. I think Mike and Kevin are going to find their LADSS system running through Oracle to be very slow.

    Chris Robinson is now doing the programming for Dr. Saxton's SPAW model. The SPAW model took weather data from text files and output text file reports. Chris moved all of this into Access, which seems reasonable, because you get all the nice goodies:

    1. You don't have to worry about file formats; you just set up your data sources (this is a good thing).
    2. You get all the nice data-aware controls, so it is easy to link the data up to the user interface.
    3. Easy report generation and graphics using Crystal Reports, etc.
    4. Easy data sharing with Excel, Word, etc.

    But the program ran incredibly slowly, and he ended up returning the output to the text file format (which is already pretty slow in comparison to something like UED).

    I used to work in the "automatic data processing" department for the Navy, where I programmed using a database system. These systems work great in business settings because business applications tend to be transaction oriented, where you do one of three types of task:

      1. Data input. This does not require fast access, because humans are pretty slow.
      2. Interactive data queries and data display. Here you want fast response times, so much of the processing is done in the background with fast indexes, data buffers, etc., to filter out only the data the user needs to display, and the user can only read the displayed data at a snail's pace anyway. (Unlike a simulation model, which would be waiting impatiently for the database to return all available data as quickly as possible.)
      3. Batch processing. When large amounts of data are processed, the jobs are usually done in the background and run unattended, overnight or whenever.

    Consequently, business applications don't need the speed from the database system that you will want to have for the simulation model.

    You mentioned doing the data queries using threading. Threading will not solve the speed issue; as a matter of fact, it will only compound the problem. While the program (climate generator, simulation, or whatever) may be freed from actually performing the bulk of the data I/O, it must still do the setup for interacting with the database (i.e., the Oracle SQL OCI calls I described above). This is actually more expensive than simply reading or writing a line in a text file! In addition, the computer is still going to be running the thread doing the query.

    Threading has its own overhead. Suppose each transaction (i.e., getting the data for such and such a day) requires a query, so you set up a thread for each query. While the thread can be created in a few lines of code, there is actually a significant amount of processing that goes on behind the scenes to create it: memory is allocated, a process control block is created, the thread is put in the process schedule, virtual memory tables are created. Creating a thread is VERY expensive. Then, once the thread is created, every few milliseconds each thread is swapped out from being the current process. This involves saving the thread's CPU context, virtual memory management, possibly virtual memory reallocation, etc., all of which must be swapped back in when the thread gets its turn at the CPU again.

    Finally, the application making the request is really still waiting for the thread to finish delivering the data before it can continue, so threading does not buy you any performance gain. (Except possibly if you have multiple CPUs, but that is a whole different issue, and you don't want to get into parallel programming.)
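
    A sketch of the thread-per-query pattern makes the point (the query function is a hypothetical stand-in for the whole database round trip):

    ```cpp
    #include <string>
    #include <thread>

    // Hypothetical stand-in for composing, sending, and parsing one query.
    void run_query(const std::string &sql, float *result)
    {
       *result = 0.0f;  // imagine the database round trip happening here
    }

    float fetch_one_day(const std::string &sql)
    {
       float result = 0.0f;
       // Creating the thread costs more than the query setup it wraps...
       std::thread worker(run_query, sql, &result);
       // ...and the simulation cannot take its next timestep until the
       // data arrives, so it blocks here anyway: pure added overhead.
       worker.join();
       return result;
    }

    int main()
    {
       float value = fetch_one_day("SELECT ... FROM daily_output ...");
       (void)value;
       return 0;
    }
    ```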

  3. The third issue is organizing the input and output. I won't go into a long discussion of this, but it is very important. I find having the data (output in particular) stored in data tables in a single database system to be an organizational mess once you are working with lots of projects and lots of simulation runs; a hierarchical structure is more suitable than a relational one. Database systems are designed for more or less static data sets, so weather data is well suited to storage in a database system, but simulation outputs are not.
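
    For example, a hierarchical layout might look like the following sketch (the directory names are purely illustrative). Each run is self-contained, so deleting, archiving, or copying it to another machine is just a directory operation:

    ```cpp
    #include <filesystem>
    #include <string>
    namespace fs = std::filesystem;

    // Illustrative layout only:
    //
    //    projects/
    //       wheat_study/
    //          run_001/  scenario parameters, weather/, output/
    //          run_002/  ...
    //
    void create_run(const std::string &project, const std::string &run)
    {
       fs::path base = fs::path("projects") / project / run;
       fs::create_directories(base / "weather");
       fs::create_directories(base / "output");
    }

    int main()
    {
       create_run("wheat_study", "run_001");  // hypothetical names
       return 0;
    }
    ```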

    For example, suppose you have a simulation run covering a certain number of years, and then you run it again for a different period of years. You would need to make sure to erase the period that you didn't run in the second run; the database would have to run the query for the records to delete, update its indexes, perform "garbage collection", etc. Then you have the whole issue of clearing out the tables for simulations you are no longer using (whose data is whose, and when to delete it, etc.). Then there is the issue of making the data portable (i.e., using the data on another machine).