I'm nibbling around the edges of kicking off a new project. Still doing the research and due diligence parts, but it's starting to solidify enough that I'm more or less down to picking out specific tools to at least start actually planning (how much planning winds up happening up front depends on a lot of different factors...at this point, I'm still not sure whether this will wind up being commercial, open source, or a beautiful combination of the two).
Whichever way I wind up going, two of the most important considerations are "cheap" and "simple."
One of my first tentative steps involved choosing a persistence engine. I started out dithering between PostgreSQL and Firebird. MySQL dropped off my radar when Oracle started messing with it and Java. As far as open source RDBMSs go, I've honestly always preferred Firebird, if only because its feature list is jaw-dropping (in case you haven't realized this yet, I'm far from having any meaningful qualifications as a DBA. I can sling me some SQL, but that's about it). But PostgreSQL is so ubiquitous that I wound up installing it first to kick its tires yet again (I haven't actually messed with any RDBMS in ages).
Then I logged into the console to start defining schema, and realized it was completely and totally the wrong tool for the job.
This project is really a total experiment in exploratory programming. The requirements are nothing but a bunch of vague ideas swirling around inside my mind. Up-front planning and defining database schema just do not fit at all, yet.
Maybe I've been corrupted by all the time I spent mucking around with Google App Engine and its interface over BigTable. All you do there is define classes that inherit from Model, create instances of them, then save them. Query, delete, and update as you like. Change Model properties (AKA column definitions) on the fly. The only real restrictions are keeping queries simple and pretty much forgetting about normalization.
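To make that programming model concrete, here's a minimal Python sketch of the style. This is not the GAE API: the `Model` base class, its `put`/`delete`/`query` methods, and the in-memory `_store` are all hypothetical stand-ins I made up for illustration.

```python
import itertools


class Model:
    """Hypothetical schema-less base class (an illustration, NOT the GAE API)."""

    _store = {}                # (kind, id) -> dict of properties, all in memory
    _ids = itertools.count(1)  # simple id generator shared by every kind

    def __init__(self, **props):
        self.id = None
        self.props = dict(props)

    @classmethod
    def kind(cls):
        # The "table" name is just the class name.
        return cls.__name__

    def put(self):
        # Save (or overwrite) this instance; no schema is checked anywhere.
        if self.id is None:
            self.id = next(Model._ids)
        Model._store[(self.kind(), self.id)] = dict(self.props)
        return self.id

    def delete(self):
        Model._store.pop((self.kind(), self.id), None)

    @classmethod
    def query(cls, **filters):
        # Simple equality filters only: no joins, no normalization.
        results = []
        for (kind, id_), props in Model._store.items():
            if kind != cls.kind():
                continue
            if all(props.get(k) == v for k, v in filters.items()):
                obj = cls(**props)
                obj.id = id_
                results.append(obj)
        return results


class Task(Model):
    pass


first = Task(title="pick a database")
first.put()

# A later instance grows a new property; the earlier one is left alone.
second = Task(title="write the TODO app", priority=1)
second.put()

urgent = Task.query(priority=1)
```

The point is the second `Task`: it picks up a `priority` property without anyone touching the first one or running a migration.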
Yeah, it's completely and totally a different mind-set from the RDBMS approach.
I may very well wind up hosting the web server portion on GAE. Like everything else, it depends on a lot of different factors. But I'll really need some sort of client-side persistence layer. If nothing else, different aspects need to work when the client isn't connected to the internet.
I vaguely remembered a project I ran across a few years ago, called CouchDB. It was all about simplicity, and seemed to fit well with my requirements. So I looked it up. And that led to a couple of weeks' worth of [spare time] research into the different NoSQL offerings (what a horribly misleading name...the "No" reportedly stands for "Not Only").
Project Voldemort is extremely tempting. If only it weren't written in Java (that's one of the few decisions that I've actually made so far...no Java! Well, unless something drastic pops up and convinces me that Clojure is mature/stable enough for my needs).
Cassandra probably deserved more attention than she got from me. But it seemed like no one's giving her any respect/attention. Hmm...googling directly turned up a ton of recent results. Shows just how flawed/spotty my research has been. Doesn't really matter...like most of the NoSQL offerings, Cassandra seems focused on "Big Data." I'm looking for a simple programming model that makes it easy for a project to evolve.
So I wound up trying to pick and choose between CouchDB and MongoDB. Tons of comparisons have been written between the two. They basically seem to boil down to "Querying CouchDB is weird, because you pretty much have to come to grips with Map/Reduce. MongoDB can trash your database if, say, the power fails." I'm fine with Map/Reduce, and I take an extremely dim view of losing data.
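For anyone who hasn't come to grips with Map/Reduce yet, here's a rough sketch of the idea in plain Python. As far as I know, real CouchDB views are JavaScript functions stored in the database and indexed incrementally, so this only shows the shape of the thing. The `map_reduce` helper, the sample documents, and their field names are all made up for illustration.

```python
from collections import defaultdict


def map_reduce(docs, map_fn, reduce_fn):
    """Run map_fn over every document, group emitted values by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        # map_fn yields zero or more (key, value) pairs per document.
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Hypothetical documents; note that they don't all share the same fields.
docs = [
    {"type": "task", "owner": "me", "done": False},
    {"type": "task", "owner": "me", "done": True},
    {"type": "task", "owner": "you", "done": False},
    {"type": "note", "text": "no owner field at all"},
]


def open_tasks_by_owner(doc):
    # "Map" step: emit one row per unfinished task, keyed by owner.
    if doc.get("type") == "task" and not doc.get("done"):
        yield doc["owner"], 1


def count(key, values):
    # "Reduce" step: collapse every value emitted for a key into a total.
    return sum(values)


view = map_reduce(docs, open_tasks_by_owner, count)
```

The "weird" part is that you never write a query as such. You describe which rows each document contributes, and the view is whatever falls out.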
So I got set to install CouchDB. Downloaded it and read the Linux installation instructions. No freakin' way. This thing has more dependencies than that hooker who offered me a $7 blowjob. Out of curiosity, I read the Windows installation instructions. Cygwin, curl (but *not* the version that comes with Cygwin), VS 2008...for this project, I have to care about Windows. Which totally destroyed CouchDB as an option.
So I took a step back and tried to get a look at the bigger picture. All the buzz around NoSQL, exabyte-scale databases, high performance, sharding, replication...none of that really mattered all that much. Well, replication is pretty important (another of the huge points that really keep MongoDB from being feasible here). But none of the rest really matters at this point. If and when it does, it'll be a good problem to have.
Right now, I really just want some stupid-simple data persistence layer that lets me modify my data models arbitrarily, without having to update every other model of the same "kind" or in the same "table" in the database. Preferably that doesn't involve starting some sort of server. I hated the years I spent using Access as the database backend, but that was pretty much exactly the sort of programming model I want now. Not Access's front-end GUI pieces. Just a single database stashed in a file that I can manipulate programmatically using some sort of simple library.
I'd pretty much decided to just roll my own and started thinking about implementation details. How hard can it be? I'm not worried about big data sets at this point. Just have a file that I can keep in memory and save to disk periodically. If the data size gets noticeable, and memory even looks like it might start being an issue, I can always break it into a B-tree. And then optionally add indexes for something like CouchDB's Map/Reduce to create views. Then there's replication to consider. And avoiding data corruption in case there's a crash. What about cases where the data is more suitable for a hash table than a B-tree? Or a crazy self-balancing self-ordered tree that I ran across a couple of gurus discussing on Usenet a few days ago (sorry about not having a link. If enough people care, I'll try to look it up. It was on comp.lang.lisp).
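Just to show how small the naive version is, here's a minimal Python sketch of the "keep it in memory, save it to a file" idea, with one cheap defense against corruption: write to a temporary file, then rename it over the real one. The `FileStore` class and the `todo.json` file name are hypothetical, and this ignores everything else on that list (B-trees, indexes, replication).

```python
import json
import os
import tempfile


class FileStore:
    """Hypothetical single-file store: a dict in memory, saved as JSON."""

    def __init__(self, path):
        self.path = path
        self.records = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.records = json.load(f)

    def put(self, key, record):
        # Keys must be strings, since JSON object keys always are.
        self.records[key] = record

    def get(self, key, default=None):
        return self.records.get(key, default)

    def save(self):
        # Write a temp file in the same directory, then swap it into place.
        # A crash mid-write leaves the previous file untouched.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(self.records, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp_path, self.path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "todo.json")
    store = FileStore(path)
    store.put("1", {"title": "pick a database"})
    store.put("2", {"title": "ship something", "tags": ["someday"]})
    store.save()
    reloaded = FileStore(path).get("2")
```

Records are plain dicts, so any one of them can grow a new field without the others caring. Every save rewrites the whole file, though, which is exactly where the B-tree daydreaming starts.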
Those were just the first issues that popped into my head immediately. This is, obviously, a hard problem that tons of really smart people have spent decades trying to solve. Something like flat files would work fine for this particular use case (at this point it's just a TODO list...I really am just experimenting and selecting technologies right now). But the point of investing time up-front is to pick technologies now that will make life simpler in the future.
Common Lisp's Elephant looks very promising. Except that Quicklisp can't install it. Quicklisp has changed the way I approach and think about programming. If I wind up using Common Lisp on this project (right now, it's got a strong lead over the other options), Quicklisp compatibility is pretty much a requirement.
Hmm. One of Elephant's back-end possibilities is Berkeley DB. I almost started an Access (the GUI parts) clone over Berkeley DB several years back (feel honored...I rarely admit something that embarrassing in public). It's one of the earliest "NoSQL" databases, from back before the phrase was cool. It's been around for decades. Its capabilities read like a shopping list for my requirements.
It's owned by freaking Oracle. GAAHHH! They've pretty much always been more evil than Microsoft, and they're looking ever more like the Borg of the Open Source world.
So I'm still pretty much exactly where I started. Simple open source data persistence options pretty much suck. The closed source/proprietary world really doesn't look that much better. Well, except for the eye candy...so I suppose the options do "look" better. But that's just on the surface.
EDIT: I think I found the solution to evolutionary data persistence with Common Lisp. It isn't perfect...I'll probably swap it out for Elephant with a PostgreSQL back-end when my requirements firm up, but it seems to be a nice fit for what I need right now.