Data Analytics Bootcamp

I have always dreamed about making some contribution towards improving the gender balance in technology, which, as you may know, is far from ideal.

Fortunately, the opportunity arose when Katrina Walker invited me to teach the “Data Analytics” bootcamp at CodeOp, an international code school for women and TGNC individuals.

Over the 6-month course, I will share my hands-on experience with the various stages of the data analysis pipeline, specifically how to apply different technologies to ingest, model and visualize data.

Rather than focusing on a specific technology, I will take a “best tool for the job” approach, which is what I do whenever I want to analyse data. This means learning different tools, such as Python, R, SQL or QGIS, and often combining them.

For me, “data analytics” is like a journey: we start with a high-level problem, translate it into data and algorithms, and finally extract a high-level idea. At the start and the end of the journey, we should always be able to communicate with people who are not “data geeks”, and this is one idea that I would like to pass on in the course.

I will not add anything else, apart from saying that I am really excited to get started!

[image: codeops2]

Importing Large Spatial Datasets into PostGIS

Is PostgreSQL/PostGIS suitable for Big Data? This is a question I have been asking myself for a while, and I now have the opportunity to test it with a PostgreSQL instance from Amazon Web Services (AWS); after all, if such a service cannot deal with “Big Data”, who can?
I have a dataset of Tweets in the city of Barcelona, over a period of time. The entire dataset amounts to about 3 million points (82 MB), and I have some subsets: a smaller collection of about 2 million points (55 MB) and another of 300,000 points (8.2 MB).

[image: big_data]

We can argue whether this dataset is Big Data or not; but IMHO, a table with more than 1 million records starts to get “heavy” for any relational database. Now, before doing any computations with spatial algorithms (e.g. convex hull, point in polygon — the kind of queries sketched below), the very first step is to actually load this dataset into the database. And this is where I started to struggle.
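For reference, this is roughly the kind of query I have in mind once the data is loaded (a minimal sketch; the table and column names here are hypothetical — a “tweets” point table and a “districts” polygon table):

-- convex hull of the whole point cloud
SELECT ST_ConvexHull(ST_Collect(geom)) AS hull
FROM tweets;

-- point in polygon: how many tweets fall inside each district
SELECT d.name, count(*) AS n_tweets
FROM districts d
JOIN tweets t ON ST_Contains(d.geom, t.geom)
GROUP BY d.name;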

QGIS has a plugin called SPIT, which allows you to import a Shapefile into PostGIS. Unfortunately, it seems to be extremely slow with any of the three files mentioned above (including the smallest one).

[image: spit]

Switching to the command line, where I have more control over (and knowledge about) the processes that are running, I tried shp2pgsql, a tool for importing Shapefiles that comes with the PostGIS installation.

On Linux it is possible to convert and import the Shapefile in one go, redirecting the output with a pipe:

shp2pgsql -s <SRID> -c -D -I <path to shapefile> <schema>.<table> | \
psql -d <databasename> -h <hostname> -U <username>

Since this command was really slow (unresponsive?) I decided to do it in two steps, to see where it was hanging:

  • Create an sql file with the insert statements.
  • Load this sql file into the database.

Unsurprisingly, the SQL file with the statements was generated quite quickly, and the processing effort was being spent on the insert queries. I verified this with all three Shapefiles, even when I included a spatial index (GiST). After thinking a bit, I came to the conclusion that the spatial index is created within the “CREATE TABLE” step, so it should actually speed up the queries a bit (but I did not confirm this guess).

My next attempt was to use ogr2ogr, the GDAL tool for converting between vector formats.

ogr2ogr -f "PostgreSQL" PG:"host=host user=user dbname=db password=pass" data.shp

I have to say that I had read some suggestions to prefer shp2pgsql over this tool, and in fact it was quite slow as well (at this point I was only testing the smallest of the Shapefiles).

My conclusion up to this moment is that it is quite expensive, in terms of time (and processor), to import a large dataset into PostGIS. Some ideas that could be explored:

  • Upgrade the PostGIS server (would scaling vertically make any difference?).
  • Upload the data to the Amazon server and run the import tool from there. I am not sure whether it is possible to do that, but it seems like a clear improvement over doing this operation locally, using the power of the Amazon server rather than my desktop computer.
  • Split the Shapefile into chunks of data. I am not entirely sure how to do this, and any suggestions would be welcome.

Next steps: once I manage to import a large Shapefile into my PostGIS instance, I will post some feedback about the computation of spatial operations.

Update: apparently I underestimated the fact that I am importing data over a network connection, with significant latency. The best thing would be to run the import locally, an option that I unfortunately don’t have, since Amazon RDS does not give me SSH access (it is a service, not a machine!). I decided to switch to a new approach: import a plain CSV file into a database table using the \copy command, and that was actually really fast (a few seconds).

psql database -U user -h host.eu-west-1.rds.amazonaws.com -c "\copy new_table from 'data.csv' with DELIMITER ','"

Of course, I then had to run a query to populate the point geometry from the float (x,y) values, and to create a spatial index (for efficiency, it is recommended that indexes are created after, and not during, the import!).

UPDATE new_table SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);

CREATE INDEX [indexname] ON [tablename] USING GIST ( [geometryfield] );

VACUUM ANALYZE [table_name] [column_name];

That was still considerably fast (although not as much as the CSV import!), and my guess is that this is because we are now running queries on the remote machine (and not uploading data over an internet connection).

Following some very useful advice on Stack Overflow, I decided to retry my previous approaches, tuning the configuration parameters.

First I tried the OGR command-line tool, but this time with the config flag “PG_USE_COPY YES”, to force a dump of the data:

ogr2ogr -progress -f PostgreSQL PG:"dbname='database' user='user' host='host.eu-west-1.rds.amazonaws.com' port='5432'" data.shp --config PG_USE_COPY YES -nlt POINT -nln import_ogr

Although the speed improved massively, it still took 1:03:00 to run this import.

Then I tried the shp2pgsql tool with the “-D” flag, which also forces a dump of the data:

shp2pgsql -D -s 4326 data.shp import_shp2pgsql | psql

This was remarkably fast: 00:01:04

[image: compare_imports]

Comparing the two imports, we can see that shp2pgsql did not create any index (which, as noted above, is a costly operation), while ogr2ogr created a GiST index on the spatial field. The extra step of creating an index has to be added to the shp2pgsql import but, as we have seen before, that operation is not a problem since we are then operating on the server.

I conclude this review by stating that shp2pgsql performs very well at loading a Big Data spatial table into a remote PostGIS server (e.g. Amazon RDS). It is recommended to pass the “-D” flag to the tool, and not to create any indexes during this operation.

Below we can see the PostGIS geometry of the imported table, with more than 3 million points:

[image: layer_shp2pgsql]

Encrypting Passwords on the Client Side

I recently implemented an authentication system, supported by a table in the database. In this table, I store the usernames and passwords.
The PostgreSQL “pgcrypto” module supports encryption of specific columns, but I wanted to avoid “bulking up” my system with more add-ons; also, I was not sure about trusting the administrator of the database. For these reasons, I decided to go for client-side encryption.

As I am using the Qt API, my first guess was to look for encryption-related functionality in this framework.

Since Qt 4.3 there is a class called QCryptographicHash, which provides a way to generate cryptographic hashes from binary or text data. It supports a couple of “popular” message-digest algorithms, such as MD5, MD4 or SHA-1. This is interesting, but not usable in my case, since hashing is a “one-way process”: you can create a hash from data, but you cannot “decrypt” it back to the original value.

For the purpose of encrypting and decrypting data, there is the QCA library, which is based on SSL and therefore supports asymmetric cryptography. I had a quick look at this library and it looks pretty powerful; however, I was on a “quest” to avoid installing libraries, so I decided to look for another alternative. I also have to add that I do not need “bullet-proof” encryption, but only to keep my data hidden from the “curious eyes” of users who don’t know much about programming or cryptography. *If you want strong data encryption, then QCA is the way to go*.

Finally, I stumbled upon SimpleCrypt (Simple Encryption), a class developed by a Qt user to protect against “casual observation”: just what I was looking for.
SimpleCrypt provides symmetric encryption (the same key is used for encoding and decoding the data). This key is a 64-bit quint64, built into the program, in the SimpleCrypt class.

SimpleCrypt crypto(Q_UINT64_C(0x5ebed0982db74774)); //some random 64-bit key

It can be used to encode strings, as well as streams of binary data. More details about the encryption algorithm can be found here.

In my application, I use a QDataWidgetMapper to map different widgets, such as line edits, to fields in the database. In the case of the “password” field, the mapping won’t work as-is, since the user will type and see a password that is different from the ciphertext stored in the database table. To handle this, I need a “proxy” between the database and the widget; this proxy is the QItemDelegate.
By reimplementing this class and applying it to my widgets, I can “force” the application to transform the values that it writes to / reads from the database, by applying the encryption algorithm.
This is my implementation of the item delegate:

#include "passwddelegate.h"
#include "simplecrypt.h"

SimpleCrypt crypto(Q_UINT64_C(0x5ebed0982db74774)); //some random 64-bit key

PasswdDelegate::PasswdDelegate(QList<int> colsOthers, QList<int> colsText, QList<int> colsPass,QObject *parent):
    NullRelationalDelegate(colsOthers,colsText,parent), m_colsPass(colsPass)
{

}
void PasswdDelegate::setModelData(QWidget *editor, QAbstractItemModel *model, const QModelIndex &index) const
{
    if (m_colsOthers.contains(index.column()) || m_colsText.contains(index.column()) ){
        NullRelationalDelegate::setModelData(editor,model,index);
    }else if (m_colsPass.contains(index.column())){

            model->setData(index, crypto.encryptToString(editor->property("text").toString()));

    }
    else QSqlRelationalDelegate::setModelData(editor,model,index);
}

void PasswdDelegate::setEditorData(QWidget *editor, const QModelIndex &index) const
{
    if (m_colsOthers.contains(index.column()) || m_colsText.contains(index.column())){
        NullRelationalDelegate::setEditorData(editor,index);
    }else if (m_colsPass.contains(index.column())){

           editor->setProperty("text", crypto.decryptToString(index.data().toString()));

    }
    else QSqlRelationalDelegate::setEditorData(editor,index);
}

The method “setModelData” is called each time we want to write values to the encrypted field, and it transforms the plain-text password into ciphertext by applying the compiled-in key. The method “setEditorData”, on the other hand, transforms the ciphertext from the database into plain text, by applying the same key.

This is how the password looks to the user:

[image: cypher1]

And this is what is actually stored on the database:

[image: cypher2]

SimpleCrypt provides a simple method for “cyphering” user passwords in the database; it is not “unbreakable”, but it adds one extra level of security, and it is pretty easy to use, not requiring the installation of any extra libraries (you may download simplecrypt.h and simplecrypt.cpp from here). The QItemDelegate is a class that allows us to control what the user “sees”, while still benefiting from the advantages of using a QDataWidgetMapper. In this case, it works like a charm.

Quick guide to Multi-master Replication in PostgreSQL

A while ago, I wrote a post about multi-master replication using SymmetricDS. My scenario consists of a system with multiple nodes, all of them writing to their own copies of the database. Sometimes the nodes may be offline, but I would like the system to be *eventually consistent*.

SymmetricDS is a Java-based framework that supports a number of RDBMS, including PostgreSQL, the one that I use. I don’t particularly like the fact that it uses Java: the UI, especially, seems slow and unresponsive. However, the fact that the application is cross-platform is quite “handy”, as we can have the databases running on a number of different systems and “talking to each other”.
SymmetricDS itself is free and Open Source (GPL). However, if you want to use the configuration GUI, there is a commercial product called “SymmetricDS Pro”. I could not find out how much the Pro version costs (they are quite secretive on the website), but since I was in a bit of a rush to set up the synchronization system, I decided to try it out.
Previously, I evaluated the FOSS version and was able to synchronize two databases on Ubuntu systems: what they call a “Standard 2 Tier Configuration”. This time, I went for a slightly more complicated scenario: synchronizing three different databases, all on different hosts, with a mix of Windows and Linux systems. With the help of the “Pro” GUI and the quick-start manual, it took me less than two days to do it, which I think is ok.

DISCLAIMER:
Before you start reading this post, please note that database replication is a complicated issue. Multi-master asynchronous replication is *definitely* complicated, with many things involved, so don’t expect the configuration to be a simple wizard. To be able to use it, you need to understand a series of concepts well, and that won’t take you five minutes. Having said this, “SymmetricDS Pro” does a pretty good job of helping a person *who has these concepts* to perform that task.

My case study is a real-world scenario where I have three different hosts running copies of my application and database. However, it may be over-simplified, since I am only doing simple operations with the application (inserting/updating data with all the nodes online). Asynchronous multi-master replication leaves room for conflicts, and although SymmetricDS does provide some support for dealing with them, this is a highly sensitive topic that must be dealt with on a case-by-case basis, by a person with good knowledge of the domain. In my case study, I did not run into any conflicts, so I won’t evaluate how SymmetricDS deals with them. Please keep this issue in mind if you decide to adopt SymmetricDS.

SymmetricDS Pro is not free, but you may download it and evaluate it for 30 days:

http://www.jumpmind.com/products/symmetricds/download

You have to give your email address, to which they will send the key to “unlock” the full functionality. I found it very easy to install, following the instructions in the quick-start guide:

Click to access SymmetricDSPro-QuickStart-3.5.pdf

The only dependency is the Java Runtime Environment (JRE), which you very likely already have running on your system anyway.
The guide mentions a “single-homed” scenario, where you have a single instance of SymmetricDS running, and a “multi-homed” scenario, where you install a copy of SymmetricDS for each host/database. Since I wanted to approach a “deployment scenario” with remote computers, I went straight to “multi-homed”. However, if you just want to test it, you may try the “single-homed” scenario (which is covered in the manual).

Although SymmetricDS enables a distributed system, you need to create a node that acts as a “registration server”. This node has to exist, even though you can make the other nodes “talk” to each other. Although it is ok if this node is offline for a while, I would pick a host that is mostly online (like an actual server).

I started by installing SymmetricDS on my “server” node. The installation is exactly the same on any node; when you finish, you start the daemon (by running something like “/symmetricDS/bin/sym”), and then run the node setup.

If you installed SymmetricDS on port 31415 (the default non-secure port), the configuration console can be reached by pointing your browser to this address:

http://localhost:31415

Since I was on the server host, I chose to set up a “server” node. SymmetricDS presents you with two “ready-made” configurations, and an option to create your own, called “I’ll configure things myself”. This is actually a very important step of the configuration, since it defines the architecture of the system (how many nodes you have, how they connect to each other, etc.); later you may refine the configuration options, but the first decision is made here, so it is important to think it through. Since I was a bit intimidated by the “I’ll configure things myself” option, and the “Standard 2 Tier Configuration” is the only one supported in the manual, I decided to go for that one first. If you are looking for a sort of tutorial, I would recommend this one, in order to check that everything is working on your system.

Although the manual “claims” that the “client” group may correspond to many nodes connected to one server, I found that I could only make each client talk to the server (and vice versa); I could not make the client nodes talk to each other. It was as if they were subscribing to the “news” from the server, but the “news” arriving at the server via other nodes was not actually considered “news”.

After that, I decided to try the “Multiple Sources to One Target Configuration”, which is also described as “Data Warehousing”. This was not exactly what I was looking for, but I was able to modify the architecture until I arrived at something that suited me (and that I will describe later). The next screens let you define the database connection string and the URL for communicating with the SymmetricDS instance; in my case:

http://invislaptop:31415/sync/regsvr

(where invislaptop resolves to my server’s IP address)

After this, you are taken to the configuration dashboard, which should be “unlocked” using the key provided by email. The next thing you want to do is go to the “configuration” section. This section is very powerful, if somewhat complicated, and it allows you to tune and refine every aspect of the synchronization, with the aid of some tools for “bulk” tasks. It is certainly possible to do all this (in the FOSS version) by editing the configuration files, but I found this GUI very useful, at least as a “newbie”.

The “Data Warehousing” “pre-cooked” configuration generates a series of node groups:

  • regsrvr: registration server
  • target: target data source
  • source1: group of nodes that provide data to the target
  • source2: group of nodes that provide data to the target
  • sourceN: …

In my scenario I kept only three node groups: the registration server, a target and a source (“source1”), and removed the other ones. The names are not so important, and I could just as well have called them “regsvr”, “node1” and “node2” (for instance).

[image: sym1]

The “group links” section establishes the dynamics between all these groups of nodes, whatever names you gave them. In my case, the registration server “waits for pulls” from both node groups (“target” and “source1”). The “target” group pushes changes both to the registration server and to the “source1” group. The “source1” group pushes changes both to the registration server and to the “target” group.

[image: sym2]

The system could be described by something like this:

[image: architecture]

On the “routers” tab, you can define the details of these connections between nodes, through triggers (one for each action):

[image: sym4]

The triggers for each table are defined on the “table triggers” tab.

[image: sym3]

You may define them individually for the tables you are interested in, or do a “bulk” definition by choosing “auto-create”. Then you have the option to connect the routers to the triggers on this tab, or on the “routers” tab.
When this is done, you should have a trigger for each table, for each insert/update/delete action (according to what you have defined).

The server setup is actually the most complex and time-consuming configuration step (and I did not cover it exhaustively!). After this, I went to each of my clients and ran the installation and setup again.
This time, I chose to add a “client” node instead. The “client” nodes will attempt to register during the setup, by contacting the server at the address you provide; in my case:

http://invislaptop:31415/sync/regsvr

Unless you open the registration on the server for that particular node (by inputting its ID and group), the registration will fail. This is ok, and you can go through the entire process of creating the client without registering the node.
When you have finished the setup, if you go to the server console and open “Manage nodes”, you will see one URL under the server entry. This should be the client node that contacted the server in order to register. If you right-click this entry and choose “allow”, the server should register the node. If you want, you may re-load the data on the client by choosing “Send initial load to” (this should not actually be necessary, as the server should send an initial load when allowing the node).
After registering both nodes, my setup looked like this:

[image: sym5]

After successfully registering all clients on the server, the system should be up and running. Note that you should have the SymmetricDS daemons running on all three nodes to have a fully functional scenario.
I edited a record on the server, and it got replicated to the Ubuntu and Windows clients.

[image: server]

[image: target]

[image: source1]

Then I tried to edit a record on each of the other nodes (“target” and “source1”), and watched the changes being pushed to the other nodes. It seems that the daemon listens for changes at very small intervals, since the changes were propagated through the system almost immediately. However, I did not test it with more complex changes, such as batches of data.

From this experience I would say SymmetricDS performs quite well and, with the aid of the GUI in “SymmetricDS Pro”, it is not too hard to set up, once you are clear about what you are looking for and understand where to configure things. This is good, because I did not find much documentation on the web apart from the simpler scenario (the “Standard 2 Tier Configuration”), nor did I find forum posts discussing this.

Furthermore, it would be interesting to test this system in a “tougher” scenario: larger and more complex batches of changes, more nodes, and some (or all) of them offline at times. This would obviously trigger the “conflict” situation, which is also the one that “scares” me most.

Quick guide to Auditing a (postgreSQL) Database: putting it all together

In my previous post, I suggested how to create a schema, a table and a trigger function in order to audit a PostgreSQL database.
To audit a table, you have to create a trigger for that table, calling the code from the generic trigger function.
In my case, I want to audit every table in the database, and I think most people will likely want to audit every table, or at least most tables, in the database.
To escape the tedious task of writing that code n times, I put together a script that will generate an audit trigger for each table in the database. If you want to apply it to a restricted number of tables instead, you could easily change it to read the table names from a list.

CREATE OR REPLACE FUNCTION create_audit_triggers()
  RETURNS void AS
$BODY$  
 DECLARE 
 r RECORD; 
 _string varchar ( 1000 );	
  BEGIN


FOR r IN SELECT distinct tablename FROM pg_catalog.pg_tables where schemaname='public'  LOOP

	IF NOT EXISTS(SELECT *
			     FROM information_schema.triggers
			     WHERE event_object_table = r.tablename
			     AND trigger_name = r.tablename || '_audit'
			     )
			    THEN

				--raise info '%' , r.tablename;
				_string :=' CREATE TRIGGER ' || r.tablename || '_audit ' ||
				' AFTER INSERT OR UPDATE OR DELETE ON ' || r.tablename ||
				' FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();';
				raise info '%', _string; 
				EXECUTE ( _string ) ; 	

	END IF ; 

end loop;

  END;
  $BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;
ALTER FUNCTION create_audit_triggers()
  OWNER TO postgres;

This will check whether the trigger already exists (in which case an error would otherwise be raised), and generate the triggers in the blink of an eye (depending on the size of your database!). Thus you can also use it to update the triggers after you have added a couple of tables to the database.
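Once the function is in place, generating (or refreshing) the triggers boils down to a single call:

SELECT create_audit_triggers();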

Quick guide to Auditing a (postgreSQL) Database

According to Wikipedia, ‘database auditing’ involves observing a database so as to be aware of the actions of database users. A bit like ‘spying on the user activity’.
This, of course, can be useful if you have a database with multiple users and store some sort of ‘confidential’ information.

[image: spy]

Previously, I implemented some code to do this in SQL Server, which basically involved:

  • creating a table to store the audit information;
  • creating an ‘encoding scheme’ for this table (i.e.: how to distinguish ‘edits’ from ‘removes’, etc.);
  • creating triggers for every table that I wanted to audit, to do the ‘house-work’ (these were, of course, generated from a script).

Moreover, these ‘encoded changes’ were exported to JSON, as they were the basis for a synchronization system that I implemented.

This was working fine, until I had to port the database to another RDBMS, which gave me the opportunity to rethink the structure I had in place.
Instead of directly porting the code from T-SQL to PL/pgSQL, I did a little research and (gladly!) found out that PostgreSQL has built-in support for this kind of auditing, via trigger functions. It took me about 10 minutes to get to grips with it, which is what I describe next.

The first thing is to create a table to store the ‘changes’. The PostgreSQL wiki actually advises using a separate schema for this, which I agree is a good idea.

-- create a schema named "audit"
CREATE schema audit;
REVOKE CREATE ON schema audit FROM public;

CREATE TABLE audit.logged_actions (
schema_name text NOT NULL,
table_name text NOT NULL,
user_name text,
action_tstamp timestamp WITH time zone NOT NULL DEFAULT current_timestamp,
action TEXT NOT NULL CHECK (action IN ('I','D','U')),
original_data text,
new_data text,
query text
) WITH (fillfactor=100);

REVOKE ALL ON audit.logged_actions FROM public;

The stored information is: the relevant schema and table names, the username (so it is important to enforce a user policy here) and a timestamp. Then we have the ‘type’ of action, which is one of the following: insert, delete or update. Other statements, such as SELECT, are not covered, which I guess is ok. Then we have the previous value and the current value. For inserts, we have a change from nothing to something; for deletes, a change from something to nothing; and for updates, a change from something to something. In a way, you could figure out the change type just by looking at these values, so the ‘action’ field is perhaps a bit redundant; but then you would have to represent ‘nothing’ as a sort of special value (or keyword), and not as an empty space (which could be mistaken for an empty string).
Finally, we have the ‘query’, which is the exact query that triggered the audit. Although this is also a bit redundant, since it could be reconstructed from the other values, it is not a bad idea, since it allows you to quickly see/reproduce exactly what happened.

Then we can create some indexes:

CREATE INDEX logged_actions_schema_table_idx
ON audit.logged_actions(((schema_name||'.'||table_name)::TEXT));

CREATE INDEX logged_actions_action_tstamp_idx
ON audit.logged_actions(action_tstamp);

CREATE INDEX logged_actions_action_idx
ON audit.logged_actions(action);
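These indexes make it cheap to inspect the log later on. For example, a query along these lines (a sketch, using the ‘fr_frame’ table that is audited further below) fetches the latest recorded changes for a single table, and can take advantage of both the expression index and the timestamp index:

-- the most recent audited changes for one table
SELECT action_tstamp, user_name, action, query
FROM audit.logged_actions
WHERE (schema_name||'.'||table_name)::TEXT = 'public.fr_frame'
ORDER BY action_tstamp DESC
LIMIT 10;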

The next step is to create the trigger to ‘fill’ this table, on relevant actions:

CREATE OR REPLACE FUNCTION audit.if_modified_func() RETURNS TRIGGER AS $body$
DECLARE
v_old_data TEXT;
v_new_data TEXT;
BEGIN
/*  If this actually for real auditing (where you need to log EVERY action),
then you would need to use something like dblink or plperl that could log outside the transaction,
regardless of whether the transaction committed or rolled back.
*/

/* This dance with casting the NEW and OLD values to a ROW is not necessary in pg 9.0+ */

IF (TG_OP = 'UPDATE') THEN
v_old_data := ROW(OLD.*);
v_new_data := ROW(NEW.*);
INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,original_data,new_data,query)
VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),v_old_data,v_new_data, current_query());
RETURN NEW;
ELSIF (TG_OP = 'DELETE') THEN
v_old_data := ROW(OLD.*);
INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,original_data,query)
VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),v_old_data, current_query());
RETURN OLD;
ELSIF (TG_OP = 'INSERT') THEN
v_new_data := ROW(NEW.*);
INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,new_data,query)
VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),v_new_data, current_query());
RETURN NEW;
ELSE
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - Other action occurred: %, at %',TG_OP,now();
RETURN NULL;
END IF;

EXCEPTION
WHEN data_exception THEN
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [DATA EXCEPTION] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
RETURN NULL;
WHEN unique_violation THEN
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [UNIQUE] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
RETURN NULL;
WHEN OTHERS THEN
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [OTHER] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
RETURN NULL;
END;
$body$
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = pg_catalog, audit;

The ‘TG_OP’ variable inside a trigger is a string that tells us which operation fired the trigger (INSERT, UPDATE or DELETE). The rest is fiddling around with the ‘old’ and ‘new’ values.

Actually, this is all there is to it, as far as having the audit ‘structure’ in place for PostgreSQL goes. To put it ‘in action’, auditing a table is as simple as this:

CREATE TRIGGER fr_frame_audit
AFTER INSERT OR UPDATE OR DELETE ON fr_frame
FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();

In the example above, I created a trigger for auditing table ‘fr_frame’ (of course, you can create a stored procedure to generate these statements, if you want to generate a trigger for auditing every table in the database…).
Then I went to table ‘fr_frame’ and deleted a row. It got stored like this:

"public";"fr_frame";"postgres";"2014-02-11 12:36:52.063113+00";"D";"(67,"bin frame",missing,missing,1,1,missing)";"";"DELETE FROM public.fr_frame WHERE id = '67'::integer"

Then I modified and added a row; it got stored like this:

"public";"fr_frame";"postgres";"2014-02-11 12:38:54.793902+00";"U";"(59,"Sampling Frame",missing,"Initial Sampling frame",1,2,missing)";"(59,"Sampling Frame","Sampling Frame","Initial Sampling frame",1,2,missing)";"UPDATE public.fr_frame SET nameeng='Sampling Frame'::text WHERE id = '59'::integer"
"public";"fr_frame";"postgres";"2014-02-11 12:39:14.290898+00";"I";"";"(68,test,test,test,1,2,test)";"INSERT INTO public.fr_frame(name, nameeng, description, id_cloned_previous_frame, id_source, comments) VALUES ('test'::text, 'test'::text, 'test'::text, '1'::integer, '2'::smallint, 'test'::text)"

As I said before, encoding non-existing values (nothing) as an empty string is not the most accurate approach, but since we have more information to complement it, it works. Also, the fact that it is row-based rather than field-based (as in my previous implementation) means that serialized rows are stored, which is not exactly normalized… on the other hand, I can accept it from the point of view of storage.

Overall, weighing all the pros and cons, I accept it as a good solution, and high fives for being so simple to implement.

You can find this information (and more), in the PostgreSQL wiki.

“Obscure” Databases in Detail

Carrying on with a more detailed analysis from my previous post, I will try to point out some flaws that I saw in databases stored within a Relational Database Management System (RDBMS), and discuss some solutions.

To start with the basics: why use a relational model at all to represent the data?

Skipping the obvious answer that the database is already stored using a relational engine, “normalization” (Codd, 1970) is the adoption of “normal forms of relations” with the goal of avoiding certain data anomalies that can occur in unnormalized tables. In other words, being very strict about the way we represent data protects the consistency and quality of the dataset we are building. On the other hand, not being strict with the data model (or not enforcing a data model at all) may create a scenario with poor-quality data, where we cannot decode the information we have, which technically means it is “lost” (even if it is physically stored in the database). Think of it as an unbreakable safe box for which you have lost the only existing key in the entire world.

The normal forms are an attempt to make sure that you do not destroy true data or create false data in your database, so it is always a good idea to go through them when designing (or redesigning) a schema.

1. Representation of arrays

One of the “bottlenecks” of the relational model, especially for people coming from a programming background, is the representation of lists or groups. Sometimes people fake the representation of arrays by repeating a group of columns, or by “flattening out” the array. Apart from the data storage overhead, this does not conform to the first normal form (1NF). Let us look at an example.

Table “Students”

dprt        student
Geography   John
Geography   Martin
Geography   Maria

In this example, we cannot remove a department (or update it) without touching a whole series of rows.

Sometimes arrays are also represented by encoding a list inside a text field, for instance separated by commas. This also does not conform to normalization, since it breaks the principle that each row/column intersection in a table should hold a single value. The result is a lot of string-handling procedures to work around this kludge.

Table “Students”

dprt        students
Geography   John, Martin, Maria

In this example, to delete (or update) a student name we need to parse and modify the students string, which is so inconvenient that I don’t need to explain further why this approach is not a good idea.
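Just to make the inconvenience concrete, this is roughly what working against that column looks like (a sketch, assuming the hypothetical table above and PostgreSQL string functions):

-- finding a student's department relies on pattern matching
SELECT dprt
FROM Students
WHERE students LIKE '%Martin%';   -- fragile: this would also match 'Martina'

-- "removing" a student means rewriting the whole list
UPDATE Students
SET students = replace(students, ', Martin', '')
WHERE dprt = 'Geography';         -- and it breaks if the name happens to be first in the list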

The proper way to represent arrays in this case is to split the student information into two tables, and relate them with a key.

Table “Departments”

id   name
1    Geography

Table “Students”

id   id_dprt   name
1    1         John
2    1         Martin
3    1         Maria

The two tables above are in normalized form, and we can operate on them using the relational model. Generally, when things involve complicated procedures to work, they are prone to errors, so we should keep everything as simple as possible. With the table structure above, I can implement a cascade delete that will remove all the students in the Geography Department once I close this department (sketched below). This is the same as saying that once I close the Department, there will be no students enrolled in it lying around the database (I call them “ghost” records).
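A minimal sketch of that structure and of the cascade delete, reusing the hypothetical table and column names from the example (PostgreSQL syntax assumed):

CREATE TABLE Departments (
    id         integer,
    name       varchar(40),
    CONSTRAINT pk_dprt PRIMARY KEY(id)
);

CREATE TABLE Students (
    id         integer,
    id_dprt    integer REFERENCES Departments(id) ON DELETE CASCADE,
    name       varchar(40),
    CONSTRAINT pk_students PRIMARY KEY(id)
);

-- closing the department removes its students as well: no "ghost" records
DELETE FROM Departments WHERE name = 'Geography';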

2. Use of Natural Keys

Natural keys in themselves are not discouraged. They have a meaning to the user, which is generally a good idea, and their use avoids generating extra columns just to serve as keys. Automatic keys (like identity fields) do not really conform to the normalized model, since they introduce a dependence on the physical model (and may be implemented differently by different vendors), and they bring replication issues. With the computing power we have nowadays, it is also not such a problem to have a key involving multiple columns.

However, these natural keys may be “hiding” other problems that are themselves a “threat” to a normalized database.

2.1 Are these keys “preventing” us from implementing the relations between tables?

In my University database, I can represent the entities universities, departments, courses and students as different tables. Since I know there are one-to-many relationships in Universities->Departments, Departments->Courses and Courses->Students, I can enforce this idea by using foreign key constraints.

CREATE TABLE Universities (
    id        integer,
    name       varchar(40),
    CONSTRAINT pk_univ PRIMARY KEY(id)
);

CREATE TABLE Departments (
    id        integer,
    id_universities integer references Universities(id),
    name       varchar(40),
    CONSTRAINT pk_dprt PRIMARY KEY(id)
);

CREATE TABLE Courses (
    id        integer,
    id_dprt integer references Departments(id),
    name       varchar(40),
    CONSTRAINT pk_courses PRIMARY KEY(id)
);

CREATE TABLE Students (
    id        integer,
    id_courses integer references Courses(id),
    name       varchar(40),
    CONSTRAINT pk_students PRIMARY KEY(id)
);

If I want to know the University of a student named ‘Martin’, I just have to build my query knowing the “path” from Student to University.

SELECT Universities.name, Students.name
FROM Universities INNER JOIN Departments ON Universities.id=Departments.id_universities
INNER JOIN Courses ON Departments.id=Courses.id_dprt
INNER JOIN Students ON Courses.id=Students.id_courses
WHERE Students.name LIKE 'Martin';

(DISCLAIMER: please forgive me if there is some error in this query: I did not actually test it, but I think this is sufficient to get the general idea…)

On the other hand, if I don’t implement this structure in the database, and I still want to know the University of a given student, or of a given course, I can do something like this:

CREATE TABLE Universities (
    name       varchar(40),
    CONSTRAINT pk_univ PRIMARY KEY(name)
);

CREATE TABLE Departments (
    university_name varchar(40),
    name       varchar(40),
    CONSTRAINT pk_dprt PRIMARY KEY(name)
);

CREATE TABLE Courses (
    university_name varchar(40),
    name       varchar(40),
    CONSTRAINT pk_course PRIMARY KEY(name)
);

CREATE TABLE Students (
    university_name varchar(40),
    name       varchar(40),
    CONSTRAINT pk_students PRIMARY KEY(name)
);

In this case, I can find out the University of any student (or of any course) by querying a single table. The “natural key” university_name will provide the answer to my question. This design may run into problems, for instance if I have two students with the same name at the same University, but enrolled in different departments. That problem could be “solved” by bloating the Students table with a composite key:

CREATE TABLE Students (
    university_name varchar(40),
    course_name varchar(40),
    name       varchar(40),
    CONSTRAINT pk_students PRIMARY KEY(university_name, course_name, name)
);

It is easy to imagine other violations of uniqueness, or other requests for information, that could lead us to add more columns to the primary key. My whole point here is that this is a strategy to avoid implementing a relational structure in the database, and therefore it is conceptually wrong. If we know the relations between entities, we should implement them exactly as they are, and not approach the database as a set of self-contained tables.

Actually, this behaviour can be traced back to people who first worked with file systems, using tapes. They tended to design one huge file and do all the work against those records; this made sense at the time, since there was no reasonable way of joining a number of small files together without having the computer mount and dismount lots of tapes. With an RDBMS there is absolutely no reason to avoid joins between tables. A normalized database tends to have a lot of tables with a small number of columns per table.
If we cannot produce this sort of design because we don’t know the relations between the entities we are modelling, then I would actually question the use of a database at all (or at least, a relational database).

2.2 Do we have information represented in more than one place?

In the example above, we have the natural key “university_name” repeated all over the database. Note that there are no foreign keys, and therefore there is no guarantee that the university names in tables “Courses” and “Students” match the names in table “Universities”. It is easy to imagine why we may run into problems with this approach. The key that determines an attribute should be in only one table, and therefore the attribute should be with it. All the other tables that “mention” this attribute should refer to that table.
Apart from the fact that we are not “repeating” information all over the database and “bloating” the storage, this is the only way of guaranteeing consistency across the dataset. If a fact appears more than once, one of the instances is likely to be in error.

The database is only good at “safeguarding” things that we implement. Since the table “Students” is not formally connected to table “Courses”, nothing prevents me from having a course listed in “Students” that is not listed at all in table “Courses”. This may be either because I did not introduce it there, or because I misspelled it (or some other reason that I can’t remember right now). If we use an application to interface with the database, we could implement strategies to prevent this kind of error, but really:
why should we put this responsibility on the “shoulders” of the front-end when it could be enforced by design in the first place?
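For reference, this is the kind of constraint that would push that responsibility back to the database (a sketch based on the natural-key tables from section 2.1):

-- a course mentioned in "Students" must exist in "Courses"
ALTER TABLE Students
    ADD CONSTRAINT fk_students_courses
    FOREIGN KEY (course_name) REFERENCES Courses(name);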

3. Table and Column Names

Although there are no strict conventions for naming tables and columns, we should stick to the idea that tables represent either “entities” or “relationships” and that columns represent attributes, and name them accordingly. Since they store sets of rows, it is expressive if the name is in the plural (like “Students” or “Universities”). This may seem like being over-careful, but there are not many things as bad, in terms of data loss, as having a database where we don’t know what the tables or the attributes represent (and not having any enforced constraints only makes it worse).
It is also a good idea to keep the names consistent across the namespace. For instance, if we have a column named “temperature” and another one called “watertemp”, and they both refer to the same thing (the temperature of the water), it is generally a good idea to use the same name. This, of course, goes against the paradigm of the database as a repository of data from different sources, where we don’t even change the field names when importing. But with that kind of database, there is very little you can do without a tremendous amount of work (and possibly even then…).
Regarding naming conventions, everything that makes the database easier to understand and to work with should be implemented.
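Bringing two such names into line is usually a one-off operation; a sketch with a hypothetical table name:

-- rename the inconsistently named column
ALTER TABLE water_samples RENAME COLUMN watertemp TO temperature;

-- documenting the meaning (and unit) in the schema itself also helps
COMMENT ON COLUMN water_samples.temperature IS 'Temperature of the water (degrees Celsius)';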

4. Use of Nulls

The use of nulls brings a series of issues and should be avoided as much as possible, for instance by providing a default value or a meaningful code (“missing”, “invalid”, etc.). Having many nulls in a table may be a sign that it has not been “normalized” properly (for instance, the “flattening” of an array). Nulls should *only* be used for a value that is missing now, but may be resolved later.
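A minimal sketch of what that can look like in practice (hypothetical table and column names): a NOT NULL constraint with a meaningful default for codes, and NULL reserved for values that may genuinely be resolved later.

CREATE TABLE measurements (
    id          serial,
    station     varchar(40) NOT NULL,
    status      text NOT NULL DEFAULT 'missing',  -- meaningful code instead of NULL
    temperature numeric,                          -- NULL only while the reading is still pending
    CONSTRAINT pk_measurements PRIMARY KEY(id)
);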

5. Information that should not be stored in the database

My final thoughts go to the “temptation” of mixing the contents of the database with the mechanisms used to manage this information (for instance, the front-end that we use to visualize and populate it). A database models a certain reality (for instance a University, or an Address Book). Introducing “auxiliary” tables that are tied to the front-end is possible, but not without “polluting” the dataset with information about how we deal with the dataset, which is not really part of the dataset itself. Front-ends may change, and ways of visualizing data may change, but our data layer should be kept conveniently “abstracted” from all these things.
This is a reflection that may justify the use of n databases instead of one, where we can relate them to each other. The extra complexity required by the connection mechanism can be justified by the clarity of the system, and the simplicity of dealing only with the parts that we need.

Welcome to the “Obscure” World of Databases

How would we picture a non-doctor entering an operating room and performing surgery? Probably pretty badly. Setting aside the scale of the consequences, to me this is very similar to the image of a non-database person designing a database.

For several reasons, people who have not been formally trained as computer scientists find themselves performing IT tasks, some of them very sophisticated. I don’t see anything wrong with people who were not formally trained embracing an IT career (actually, I am one of them), as long as… they do embrace it. That is: study or research the things they don’t know, and care about standards of quality. If they don’t, then they are just bringing “shame” on the name of everyone working in the field, by bringing the level down.

All these thoughts – which are actually recurrent in my life – were triggered by analysing what I thought was a relational database. By going through it and refactoring it, I got a pretty good collection of the things that you should not do when designing a database. These are some ideas that I would like to clarify; if you think they are obvious, you would be surprised by what I saw.

Choosing a relational database management system (RDBMS) does not automatically mean that you have a relational database. A relational database is a database that is organized in terms of the relational model, as formulated by Edgar F. Codd. According to Wikipedia [1], the purpose of this model is to “provide a declarative method for specifying data and queries: users directly state what information the database contains and what information they want from it, and let the database management system software take care of describing data structures for storing the data and retrieval procedures for answering queries”. It is a requirement of the model that the users state exactly what the database contains, and describe properly how it is organized; only in this way is it possible to query the database and obtain exact answers to our queries.

If the database is a repository of information, some of it unknown or unnecessary, and not really organized, that means we are not modelling the data, but only storing information. The outcome is that we cannot query the database and produce any useful answers about our data (there are no miracles!). We can store the information, but not in a much better way than if we were storing it in a filesystem. Users of this repository may implement their own ways of dealing with this data, by creating relations on-the-fly, either using the SQL engine, or by pulling the data out and doing it somewhere else. However, they are not gaining anything from the data model, as they have to figure out themselves how the data is organized, and there is no guarantee that any two people would do it in the same way.

My first bullet point is that there is absolutely no point in using a relational database engine and *not* implementing a relational model; it is probably a worse solution than using a filesystem, because it may give people the wrong idea: that there is a relational model.

If people are not implementing the relational model because they do not know the data, or because the data is not organized in a relational manner, that is an explanation I can accept. In fact, I think that for many cases (including some that are currently being approached relationally) the rigid structures of a top-down design do not provide a good solution, because knowing everything (or at least a lot) about the data is a weak assumption. For these cases there is NoSQL [2], and document-driven databases in particular provide a very flexible model, close to what people implement with filesystems. In other words: this is ok, but then don’t use a relational database engine.

On the other hand, if people are not implementing the relational model because they don’t know it, then they should learn about it; at least to be able to decide against it and go for a NoSQL database. Finally, if they do not want to learn about databases or relational models, because they are “just” biologists or economists, that is also fine, but then please don’t let them design a database, especially one where you want to store valuable data. And I am afraid of the institutions that let this sort of thing happen.


[1] http://en.wikipedia.org/wiki/Relational_database

[2] http://en.wikipedia.org/wiki/NoSQL