![]() |
|
|
|
| ||||||
|
Welcome to the The ProgrammersTalk Community forums. You are currently viewing our boards as a guest which gives you limited access to view most discussions and access our other features. By joining our free community you will have access to post topics, communicate privately with other members (PM), respond to polls, upload content and access many other special features. Registration is fast, simple and absolutely free so please, join our community today! If you have any problems with the registration process or your account login, please contact contact us. |
![]() |
![]() | | LinkBack | Thread Tools | Display Modes | ![]() |
| ||||
| CSV Database Class Hey all. I've got a project on which I'm working and for which I need to be able to store non-binary data in a scalable, efficient fashion without the use of a database such as SQL lite or SQL Server. To do this, I've come up with the following methodology which I'm implementing. As I implement it, I'll post the code here (should be only over the next couple of days). Your feedback and suggestions are welcomed and encouraged. Let's start with the thought process: Definitions used below: database: A collection of CSV files. table: A single CSV file. Assumptions: I do not need all functionality of a database. I will have multiple databases of information with the same structure and different values. While I could store all the information in a single database and just add in a field to indicate the row's database, this would subsequently make the CSVs grow far too large over a 10 year period for efficient data access, so subdividing is preferable. Such subdivision applied to very large tables may be warranted, but I'm rather doubting that it'd be necessary or productive while it would make the coding more challenging, but, again, your feedback would be very much appreciated. Now, each table could grow fairly large over time (some will be very small, but some will be very large), so some form of indexing is appropriate. Additionally, each row will need a primary key (i.e. an autoincremented value) which shall always be the first value in the row. No support for joins or compound keys will be added as this project will need to be finished fairly soon and the advantages provided by such constructs don't warrant implementation for this particular project at this time. Row data shall be returned as an ArrayList. The header names will be available separately as an ArrayList where the index in the header ArrayList matches the index in the row data's ArrayList. Row data which contains newlines will not be allowed to wrap to the next line. Instead, newlines will be replaced with the XHTML entity <br />. Proposal: 1) Each database, being a collection of tables will actually be a directory. 2) Within each database directory there shall be 2 system subdirectories: _indexes and _table_summary. 3) In the _indexes directory there will be a file for each table with an index named as table_name_column_name. 4) Each index file will have rows of a fixed length which contains the following information in this order: column_value,primary_key,row_start_byte. Each primary key will be of type uint and the start byte shall be of type uint. The type for column_value will be determined by the column's type. In the case where the index is for the autoincrement value (i.e. the primary key), this file will not contain the column_value field. 5) In the _table_summary directory there will be a file for each table named the same as the table itself. 6) Each table summary file shall contain the following information in this order: next_auto_increment,row_length,header_row,indexed_columns. Here's the type of each of those values: next_auto_increment: uint row_length: uint header_row: csv Each header row will have the following characteristics stored here: name,column_type,max_length. name shall be of type uint,column_type shall be of type string (for use with Type.GetType()), and max_length shall be of type uint with 0 indicating no fixed length or that fixed length is not applicable. To store in a CSV format, these fields will be joined by a colon. For example,indexed_columns:csv 7) In the root of each database directory shall be a file for each table without the header row (note that the header information is stored in the summary file and would subsequently be redundant). 8) Unless absolutely necessary all tables will be composed of fixed length records to speed data access. It is, of course, not a requirement of C# that strings be of a maximal length -- this is done only for speed of data access. Unresolved Issue: In inserting values into an index, I would prefer to have the index ordered by the column value. However, I'm not sure if C# allows random file access (going forward and backward) to facilitate a binary search algo and I'm not sure about the calculational complexity (think "Big-O" here) of inserting into the middle of a file versus appending and dropping the binary search. Your ideas and suggestions are very much appreciated. For the tables themselves, I plan on appending data and then storing the start byte which will subsequently be available in the index. Stubbed Code: This C# code provides the structure of a class designed to facilitate such interaction. In addition to the above, a static class has been added which will be responsible for parsing a single line of CSV data and returning such data in an ArrayList. I have some of the code for the class below constructed for a separate, less structured class which will be replaced by this methodology, but have not ported it into this class. Code: using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;
using System.IO;
namespace MyCustomNamespace {
class csv_db {
private string database_path = "";
private string current_database = "";
private ArrayList current_rows = new ArrayList();
private ArrayList current_indexed_columns = new ArrayList();
private csv_table_summary current_table_summary;
public csv_db(string databasePath) {
this.database_path = databasePath;
}
public bool connect(string databaseName) {
//returns true if database exists, false otherwise
return false;
}
public uint insert(Hashtable rowValues) {
//returns autoincrement value of new row
return 0;
}
public bool update(uint primaryKey, Hashtable rowValues) {
//returns true if row updated, false otherwise
return false;
}
public uint update(Hashtable whereValues, Hashtable rowValues) {
//returns number of rows updated
return 0;
}
public uint query(Hashtable whereValues) {
//returns number of rows matching query
return 0;
}
public bool delete(uint primaryKey) {
//returns true if row deleted
return false;
}
public uint delete(Hashtable whereValues) {
//returns number of rows deleted
return 0;
}
public bool create_db(string databaseName) {
//returns true if db created
return false;
}
public bool create_db(string databaseName, string sourceDatabaseName) {
//returns true if db created
return false;
}
public bool create_table(string tableName, ArrayList columns) {
//returns true if table created
return false;
}
public bool create_table(string tableName, ArrayList columns, uint rowLength) {
//returns true if table created
return false;
}
public bool index_insert(string columnName) {
//returns true if index created
return false;
}
public static ArrayList parse_csv_row(string csvRow) {
//parses a line of CSV data and returns as an ArrayList
return new ArrayList();
}
}
class csv_column {
private string name = "";
private Type column_type;
private uint max_length = 0;
public csv_column(string columnDetails) {
}
public string columnName {
get {
return this.name;
}
}
public Type columnType {
get {
return this.column_type;
}
}
public uint maxLength {
get {
return this.max_length;
}
}
}
class csv_table_summary {
private uint row_length = 0;
private uint next_auto_increment = 0;
private ArrayList headers = new ArrayList();
private ArrayList indexed_columns = new ArrayList();
public csv_table_summary(string summaryFileName) {
//Parse file and populate variables
}
public uint rowLength {
get {
return this.row_length;
}
}
public uint nextAutoIncrement {
get {
return this.next_auto_increment;
}
}
public ArrayList tableHeaders {
get {
return this.headers;
}
}
public ArrayList indexedColumns {
get {
return this.indexed_columns;
}
}
public bool isIndexed(string columnName) {
if (this.indexed_columns.Contains(columnName)) {
return true;
} else {
return false;
}
}
}
} Last edited by TeraTask : 09-21-2007 at 09:14 PM. Reason: Updated code to use some Hashtables |
| |
| |||
| I like the idea of this: public uint nextAutoIncrement { get { return this.next_auto_increment; } You get props from me ![]() } __________________ Day Cares | Golf Courses | Disc Golf Courses | Campgrounds | Ice Rinks | Paintball Fields | Dentists | Plastic Surgeons | Aging Jokes Catholic Churches | Lutheran Churches | Methodist Churches | Episcopal Churches | Clean Jokes |
| ||||
| Here's the usage code: Code: using System;
using System.Collections;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
//uint is 32 bits = 8 bytes
namespace MyCustomNamespace {
public partial class MainScreen : Form {
public MainScreen() {
InitializeComponent();
}
private void MainScreen_Load(object sender, EventArgs e) {
//For testing, create a database called _system
csv_db loadDb = new csv_db(Environment.GetFolderPath(Environment.SpecialFolder.CommonApplicationData)+"\\SMARTSystemsPro\\");
loadDb.create_db("_system");
loadDb.connect("_system");
ArrayList columnHeaders = new ArrayList();
columnHeaders.Add(new csv_table_header("my_table_id","uint",0));
columnHeaders.Add(new csv_table_header("name","string",100));
loadDb.create_table("my_table", columnHeaders, 8 + 100 + 2);
}
}
} |
| ||||
| I think I found a problem and would like your feedback on it. If I access fields by their byte position in the file, then records need to be stored in byte form. When working with numbers, this could mean that a newline (byte 13) is stored to represent 1101 which would happen, for example, on record 13. This could be a real pain as the records would not then be CSVs. Subsequently, I'm thinking of not storing by byte and just storing strings and then reading in one line at a time, comparing the first field to see if it contains the primary key and looping through the whole file that way. Does anyone see a more efficient means of accessing data randomly while addressing problems of bytes? Am I missing something? Am I making it too complicated? Your feedback and suggestions are very much appreciated. |
![]() |
| Thread Tools | |
| Display Modes | |
| |