Welcome to the “Obscure” World of Databases

How would we picture a non-doctor entering an operating room and performing surgery? Probably pretty badly. Setting aside the scale of the consequences, to me this is very similar to the image of a non-database person designing a database.

For several reasons, people who have not been formally trained as computer scientists find themselves performing IT tasks, some of them very sophisticated. I don’t see anything wrong with people who were not formally trained embracing an IT career (actually I am one of them), as long as… they do embrace it. That is: they study or research the things they don’t know, and they care about standards of quality. If they don’t, they are just bringing “shame” on the name of everyone who works in the field, by dragging the level right down.

All these thoughts, which are actually recurrent in my life, were triggered by analysing what I thought was a relational database. By going through it and refactoring it, I gathered a pretty good collection of things that you should not do when designing a database. These are some ideas that I would like to clarify; if you think they are obvious, you would be surprised by what I saw.

Choosing a relational database management system (RDBMS) does not automatically mean that you have a relational database. A relational database is a database that is organized in terms of the relational model, as formulated by Edgar F. Codd. According to Wikipedia[1], the purpose of this model is to “provide a declarative method for specifying data and queries: users directly state what information the database contains and what information they want from it, and let the database management system software take care of describing data structures for storing the data and retrieval procedures for answering queries”. It is a requirement of the model that users state exactly what the database contains and describe properly how it is organized; only in this way is it possible to query the database and obtain exact answers.
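As a minimal sketch of what “stating what the database contains” and “stating what you want from it” can look like, here is an example using Python’s built-in sqlite3 module. The tables (species, observation), their columns and the sample rows are invented purely for illustration; they do not come from the database I was refactoring.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# State what the database contains: entities, keys and relationships.
conn.executescript("""
    CREATE TABLE species (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE observation (
        id         INTEGER PRIMARY KEY,
        species_id INTEGER NOT NULL REFERENCES species(id),
        seen_on    TEXT NOT NULL
    );
""")

conn.execute("INSERT INTO species (id, name) VALUES (1, 'Lynx pardinus')")
conn.execute("INSERT INTO observation (species_id, seen_on) VALUES (1, '2012-05-01')")
conn.execute("INSERT INTO observation (species_id, seen_on) VALUES (1, '2012-05-03')")

# State what we want; the engine works out how to retrieve it.
rows = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM observation AS o
    JOIN species AS s ON s.id = o.species_id
    GROUP BY s.name
""").fetchall()

# Because the model is declared, data that contradicts it is rejected.
try:
    conn.execute("INSERT INTO observation (species_id, seen_on) VALUES (99, '2012-05-04')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The point of the sketch is the last step: an observation of a species that does not exist is refused by the engine itself, not by whoever happens to be writing the query that day.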

If the database is a repository of information, some of it unknown or unnecessary, and not really organized, then we are not modelling the data, only storing information. The outcome is that we cannot query the database and get any useful answers about our data (there are no miracles!). We can store the information, but not much better than if we were storing it in a filesystem. Users of this repository may implement their own ways of dealing with the data by creating relations on the fly, either using the SQL engine or pulling the data out and doing it somewhere else. However, they gain nothing from the data model, as they have to figure out for themselves how the data is organized, and there is no guarantee that any two people will do it the same way.

My first bullet point is that there is absolutely no point in using a relational database engine and *not* implementing a relational model; it is probably a worse solution than using a filesystem, because it may give people the wrong idea: that there is a relational model.

If people are not implementing the relational model because they do not know the data, or because the data is not organized in a relational manner, this is an explanation I can accept. In fact, I think that in many cases (some of which are being approached relationally) the rigid structures of a top-down design do not provide a good solution, because knowing everything (or at least a lot) about the data is a shaky assumption. For these cases there is NoSQL[2], and the document-oriented databases in particular provide a very flexible model, close to what people implement with filesystems. In other words: this is ok, but then don’t use a relational database engine.
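To give a rough feeling for that flexibility, here is a small sketch in plain Python. It does not use any real document database: the documents list and the find helper are hypothetical stand-ins, and the field names are made up. It only shows that documents in the same collection need not share a schema.

```python
import json

# Two "documents" in the same collection; they do not share the same fields.
documents = [
    {"_id": 1, "type": "sample", "site": "A", "ph": 7.2},
    {"_id": 2, "type": "sample", "site": "B", "notes": ["cloudy", "resampled"]},
]

def find(docs, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

site_b = find(documents, site="B")

# Each document is self-describing, much like a JSON file in a directory.
serialized = json.dumps(documents[1], sort_keys=True)
```

Nothing stops the second document from carrying notes and no ph value, which is exactly the freedom (and the responsibility) that a relational schema would not give you.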

On the other hand, if people are not implementing the relational model because they don’t know it, then they should learn about it. They should learn about it at least enough to be able to decide not to use it and to go for a NoSQL database instead. Finally, if they do not want to learn about databases or relational models because they are “just” biologists or economists, that is also fine, but then please don’t let them design a database, especially one where you want to store valuable data. And I am afraid of the institutions that let this sort of thing happen.


[1] http://en.wikipedia.org/wiki/Relational_database

[2] http://en.wikipedia.org/wiki/NoSQL

Post-Conference Notes

The conference was really inspiring and brought lots of good ideas. Unfortunately I could not absorb everything, which I guess is normal at this kind of event, but it clarified a lot of the ideas from the previous post, and it left me “wondering” about some concepts, which could be the “starting point” for more research (and probably a few more posts!).

The tension between consistency and time was strongly emphasized during the conference; it translates into two different models: transactional vs eventual consistency. Transactional consistency is the only one implemented in “traditional” relational databases, while eventual consistency is implemented in (some) NoSQL databases. Basically, NoSQL does not force you to be “eventually consistent”; it gives you the freedom to choose to be. And “why would you do such a thing?”

There is a theorem called CAP, which basically states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

    – Consistency (all nodes see the same data at the same time)
    – Availability (a guarantee that every request receives a response about whether it was successful or failed)
    – Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

So if you go for a NoSQL solution, you can basically decide which two of these matter most to you.
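The trade-off can be sketched with a toy model of two replicas. This is only an illustration under strong simplifying assumptions: Replica, Cluster and the "CP"/"AP" mode flag are hypothetical names, and no real database works this simply.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None


class Cluster:
    """Two replicas; mode is 'CP' or 'AP'. A toy model, not a real protocol."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False       # refuse the write: stay consistent, lose availability
            replica.value = value  # accept locally: stay available, replicas diverge
            return True
        replica.value = other.value = value  # no partition: both replicas are updated
        return True


def run(mode):
    cluster = Cluster(mode)
    cluster.write(cluster.a, "v1")
    cluster.partitioned = True               # the network splits
    accepted = cluster.write(cluster.a, "v2")
    consistent = cluster.a.value == cluster.b.value
    return accepted, consistent


cp_result = run("CP")  # (False, True): write refused, replicas still agree
ap_result = run("AP")  # (True, False): write accepted, replicas disagree
```

While the network is split, the system either answers and lets the replicas disagree, or keeps them in agreement and refuses to answer.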

As we discussed in the previous post, eventual consistency means that “given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and all the replicas will be consistent” (if you are interested in some metrics on “how long” this period of time may be, you may want to have a look at Probabilistically Bounded Staleness).
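Here is a small simulation of that definition, assuming a very simple scheme: each replica holds a (timestamp, value) pair, replicas exchange state with a random peer, and the newer timestamp wins. The merge and gossip_round functions, the five replicas and the seed are all invented for this sketch; real systems use more elaborate mechanisms.

```python
import random


def merge(local, remote):
    """Last-write-wins: keep the pair with the higher timestamp."""
    return remote if remote[0] > local[0] else local


def gossip_round(replicas, rng):
    """Each replica syncs with one randomly chosen peer, in both directions."""
    for i in range(len(replicas)):
        j = rng.choice([k for k in range(len(replicas)) if k != i])
        newest = merge(replicas[i], replicas[j])
        replicas[i] = replicas[j] = newest


rng = random.Random(42)  # fixed seed so the run is repeatable

# Only replica 0 has seen the latest write; no further writes arrive.
replicas = [(2, "new"), (1, "old"), (1, "old"), (1, "old"), (1, "old")]

rounds = 0
while len(set(replicas)) > 1 and rounds < 1000:  # the bound is just a safety net
    gossip_round(replicas, rng)
    rounds += 1
```

Once the writes stop, the newest value can only spread, never retreat, so the replicas end up agreeing; the rounds counter is a crude stand-in for the “how long” question.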

To put it really simply, the NoSQL paradigm is:

“It is better to ask for forgiveness than for permission.”

And the “traditional” model, instead, is:

“Better safe than sorry.”

Bear in mind, nevertheless, that “hiding” inside this “safety” are longer response times or intolerance to faults.

Apart from this important/major/core concept, there were some things that “caught my eye” in this conference. Namely:

– Cassandra: the Apache open-source, eventually-consistent key-value store.

– MongoDB: it seems to have a lot of success stories, and even has some geospatial indexing.

– Apache Hadoop, the open-source software framework that supports data-intensive distributed applications such as those run at Facebook; and Storm, sometimes described as a real-time counterpart to Hadoop, which Twitter now uses and has open-sourced (although I could not quite follow its complicated technical achievements!).

– Basho and their highly distributed open-source database (Riak).

– Dynamo, a proprietary, highly available key-value storage system developed by Amazon (Cassandra borrows its distributed design).

– The Singapore live project, which is an “almost” real-time data visualization project.

– A very interesting project, at the Barcelona Digital Technological Centre, looking at the mobility network in the city.

OMG! And I discovered that Twitter is storing people’s geolocations (including mine) by default, and providing them to third-party applications through its API. You can read more here.

So the immediate consequences of this conference in my life (apart from the fact that I drank lots of coffee! :-)) were:

– I installed MongoDB on my laptop;

– I removed the location stamp from my tweets by changing my Twitter settings.

– I got to know the BDigital 😉