View Single Post
  #1 (permalink)  
Old 09-21-2007, 02:59 PM
TeraTask's Avatar
TeraTask TeraTask is offline
PT Admin
Awards Showcase
Quality Tutorial Quality Tutorial Quality Tutorial 
Total Awards: 3
Join Date: Jun 2007
Location: Reno, NV
Posts: 442
iTrader: (0)
TeraTask is a splendid one to beholdTeraTask is a splendid one to beholdTeraTask is a splendid one to beholdTeraTask is a splendid one to beholdTeraTask is a splendid one to beholdTeraTask is a splendid one to behold
CSV Database Class

Hey all. I've got a project on which I'm working and for which I need to be able to store non-binary data in a scalable, efficient fashion without the use of a database such as SQL lite or SQL Server. To do this, I've come up with the following methodology which I'm implementing. As I implement it, I'll post the code here (should be only over the next couple of days). Your feedback and suggestions are welcomed and encouraged.

Let's start with the thought process:

Definitions used below:

database: A collection of CSV files.
table: A single CSV file.

Assumptions:
I do not need all functionality of a database. I will have multiple databases of information with the same structure and different values. While I could store all the information in a single database and just add in a field to indicate the row's database, this would subsequently make the CSVs grow far too large over a 10 year period for efficient data access, so subdividing is preferable. Such subdivision applied to very large tables may be warranted, but I'm rather doubting that it'd be necessary or productive while it would make the coding more challenging, but, again, your feedback would be very much appreciated.

Now, each table could grow fairly large over time (some will be very small, but some will be very large), so some form of indexing is appropriate.

Additionally, each row will need a primary key (i.e. an autoincremented value) which shall always be the first value in the row.

No support for joins or compound keys will be added as this project will need to be finished fairly soon and the advantages provided by such constructs don't warrant implementation for this particular project at this time.

Row data shall be returned as an ArrayList. The header names will be available separately as an ArrayList where the index in the header ArrayList matches the index in the row data's ArrayList.

Row data which contains newlines will not be allowed to wrap to the next line. Instead, newlines will be replaced with the XHTML entity <br />.

Proposal:

1) Each database, being a collection of tables will actually be a directory.
2) Within each database directory there shall be 2 system subdirectories: _indexes and _table_summary.
3) In the _indexes directory there will be a file for each table with an index named as table_name_column_name.
4) Each index file will have rows of a fixed length which contains the following information in this order: column_value,primary_key,row_start_byte. Each primary key will be of type uint and the start byte shall be of type uint. The type for column_value will be determined by the column's type. In the case where the index is for the autoincrement value (i.e. the primary key), this file will not contain the column_value field.
5) In the _table_summary directory there will be a file for each table named the same as the table itself.
6) Each table summary file shall contain the following information in this order: next_auto_increment,row_length,header_row,indexed_columns. Here's the type of each of those values:

next_auto_increment: uint
row_length: uint
header_row: csv
Each header row will have the following characteristics stored here: name,column_type,max_length. name shall be of type uint,column_type shall be of type string (for use with Type.GetType()), and max_length shall be of type uint with 0 indicating no fixed length or that fixed length is not applicable. To store in a CSV format, these fields will be joined by a colon. For example,
Code:
  autoincrement:uint:0,name:string:50,street_address:string:255
indexed_columns:csv
7) In the root of each database directory shall be a file for each table without the header row (note that the header information is stored in the summary file and would subsequently be redundant).
8) Unless absolutely necessary all tables will be composed of fixed length records to speed data access. It is, of course, not a requirement of C# that strings be of a maximal length -- this is done only for speed of data access.

Unresolved Issue:

In inserting values into an index, I would prefer to have the index ordered by the column value. However, I'm not sure if C# allows random file access (going forward and backward) to facilitate a binary search algo and I'm not sure about the calculational complexity (think "Big-O" here) of inserting into the middle of a file versus appending and dropping the binary search. Your ideas and suggestions are very much appreciated. For the tables themselves, I plan on appending data and then storing the start byte which will subsequently be available in the index.

Stubbed Code:

This C# code provides the structure of a class designed to facilitate such interaction. In addition to the above, a static class has been added which will be responsible for parsing a single line of CSV data and returning such data in an ArrayList. I have some of the code for the class below constructed for a separate, less structured class which will be replaced by this methodology, but have not ported it into this class.

Code:
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace MyCustomNamespace {
    class csv_db {
        private string database_path = "";
        private string current_database = "";
        private ArrayList current_rows = new ArrayList();
        private ArrayList current_indexed_columns = new ArrayList();
        private csv_table_summary current_table_summary;

        public csv_db(string databasePath) {
            this.database_path = databasePath;
        }

        public bool connect(string databaseName) {
            //returns true if database exists, false otherwise
            return false;
        }
        public uint insert(Hashtable rowValues) {
            //returns autoincrement value of new row
            return 0;
        }
        public bool update(uint primaryKey, Hashtable rowValues) {
            //returns true if row updated, false otherwise
            return false;
        }
        public uint update(Hashtable whereValues, Hashtable rowValues) {
            //returns number of rows updated
            return 0;
        }
        public uint query(Hashtable whereValues) {
            //returns number of rows matching query
            return 0;
        }
        public bool delete(uint primaryKey) {
            //returns true if row deleted
            return false;
        }
        public uint delete(Hashtable whereValues) {
            //returns number of rows deleted
            return 0;
        }
        public bool create_db(string databaseName) {
            //returns true if db created
            return false;
        }
        public bool create_db(string databaseName, string sourceDatabaseName) {
            //returns true if db created
            return false;
        }
        public bool create_table(string tableName, ArrayList columns) {
            //returns true if table created
            return false;
        }
        public bool create_table(string tableName, ArrayList columns, uint rowLength) {
            //returns true if table created
            return false;
        }
        public bool index_insert(string columnName) {
            //returns true if index created
            return false;
        }
        public static ArrayList parse_csv_row(string csvRow) {
            //parses a line of CSV data and returns as an ArrayList
            return new ArrayList();
        }
    }
    class csv_column {
        private string name = "";
        private Type column_type;
        private uint max_length = 0;

        public csv_column(string columnDetails) {
        }

        public string columnName {
            get {
                return this.name;
            }
        }
        public Type columnType {
            get {
                return this.column_type;
            }
        }
        public uint maxLength {
            get {
                return this.max_length;
            }
        }
    }

    class csv_table_summary {
        private uint row_length = 0;
        private uint next_auto_increment = 0;
        private ArrayList headers = new ArrayList();
        private ArrayList indexed_columns = new ArrayList();

        public csv_table_summary(string summaryFileName) {
            //Parse file and populate variables
        }

        public uint rowLength {
            get {
                return this.row_length;
            }
        }
        public uint nextAutoIncrement {
            get {
                return this.next_auto_increment;
            }
        }
        public ArrayList tableHeaders {
            get {
                return this.headers;
            }
        }
        public ArrayList indexedColumns {
            get {
                return this.indexed_columns;
            }
        }
        public bool isIndexed(string columnName) {
            if (this.indexed_columns.Contains(columnName)) {
                return true;
            } else {
                return false;
            }
        }
    }
}

__________________
Jeremy Miller
Content Farmer - Optimized Automated Blog Posting

Digg this Post! Del.Icio.Us this Post! Technorati this Post! Furl this Post! Mister Wong this Post! Newsvine this Post! Spurl this Post! Reddit this Post! Netscape this Post!

Last edited by TeraTask : 09-21-2007 at 09:14 PM. Reason: Updated code to use some Hashtables
Reply With Quote