Parametrizing and interpreting OPTICSxi Clustering

For a while now, I have been working on applying OPTICS clustering to user-generated data in cities.
OPTICS is a density-based algorithm that attempts to overcome some of the “weaknesses” of its most famous counterpart: DBSCAN. The major weaknesses of DBSCAN are:

  • the inability to detect clusters in zones of varying density;
  • its sensitivity to the choice of parameter values.

I am using an implementation from the ELKI Data Mining library, which is one of the few in existence. Unlike DBSCAN, the OPTICS algorithm does not produce a strict cluster partition, but an augmented ordering of the database. To produce the cluster partition, I use OPTICSxi, another algorithm that produces a classification based on the output of OPTICS. There are even fewer libraries capable of extracting a cluster partition from the output of OPTICS, and ELKI’s OPTICSxi implementation is one of them.

In a nutshell, the parameters needed for the OPTICS algorithm are:

  • epsilon: this parameter is meant as a “maximum” distance to consider, rather than the single specific distance that DBSCAN uses; a whole range of distances, up to epsilon, is considered by the OPTICS algorithm; we can choose an “extremely” large epsilon, but that will increase the time the algorithm takes to run;
  • minpts: the minimum number of points required to form a cluster (the same as in DBSCAN);

The OPTICSxi algorithm adds an “extra” parameter:

  • xi: a contrast parameter that establishes the relative decrease in density; it directly controls the number of classes we will obtain (an example run follows below).
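
For reference, a run with parameters like the ones used later in this post can be launched from ELKI’s command line roughly as follows. The exact application, class and parameter names below are written from memory and vary between ELKI releases, so treat them as an assumption and check the -help output (or the MiniGUI) of your version:

java -jar elki.jar KDDCLIApplication \
  -dbc.in points.csv \
  -algorithm clustering.optics.OPTICSXi \
  -optics.epsilon 500 -optics.minpts 150 -opticsxi.xi 0.03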

Taking into consideration what I wrote above, if we “replace” DBSCAN with OPTICSxi, we still have to choose minpts, epsilon becomes less important (but we still need to set it), and suddenly we have a “new” parameter to set: xi. I am not completely sure this is a gain in terms of easier parametrization…

It is very clear to me how to interpret the results of DBSCAN (although it is not that easy to set “meaningful” global parameters): DBSCAN detects a “prototype” of a cluster, characterized by a density expressed as a number of points per area (minpts within an epsilon radius). The results of OPTICSxi seem a bit more difficult to interpret.

The clusters generated by OPTICSxi are hierarchical: “outer” clusters (lower in the hierarchy) contain “inner” (or “child”) clusters. Looking at the algorithm, it is clear that the idea of varying densities is implemented by considering a range of epsilon values that is actually driven by the data. When we set the “contrast” parameter, we actually decide which density variation we accept in order to consider a group of points a cluster; roughly speaking, with xi=0.03 the reachability has to drop by at least 3% from one point in the ordering to the next before a new, denser cluster can open. Because the algorithm is focused on density variation, rather than on a global value of density (or a “prototype”), it may well happen that areas with a very low density appear as a cluster, just because their density differs from that of their surroundings by more than the accepted threshold. Likewise, we may have areas that are extremely dense and are not detected as clusters, because the density varies smoothly with respect to their surroundings. In terms of interpretation, this may well become a bottleneck.

In the image below, we see clusters in parts of the city that are not particularly dense (like the West), which are not detected as clusters by DBSCAN. On the other hand, the centre of the city (which is denser) has only a few clusters.

Image

epsilon=500; xi=0.03; minpts=150;

It is interesting to note that, since OPTICS does not impose a strict density “prototype”, the clusters in denser areas (in the centre of the city) are smaller in area than the clusters in less dense areas (the surroundings).

There are two phenomena that I sometimes detect in the outputs of OPTICSxi, and that I am not able to explain. One is the appearance of “spike” clusters that link parts of the map. I cannot explain them, because they seem to be made of very few points, and I don’t understand how the algorithm decides to group them in the same cluster. Do they really represent a “corridor” of density variation? Looking at the underlying data, it does not seem so. You can see these “spikes” in the image below.

Image
epsilon=1000; xi=0.05; minpts=100;

The other phenomenon that I cannot explain is the fact that sometimes there are overlapping clusters at the same hierarchical level. OPTICSxi is based on the OPTICS ordering of the database (i.e. the reachability plot, which is closely related to a dendrogram), and there are no repeated points in that diagram.

Image
source: http://en.wikipedia.org/wiki/OPTICS_algorithm

Since this is a hierarchical clustering, we consider that clusters of a lower level contain clusters of a higher level, and that idea is enforced when building the convex hulls. However, I don’t see any justification for clusters that intersect other clusters at the same hierarchical level, which in practice would mean that some points have a double cluster “membership”. In the image below, we can see some intersecting clusters at the same hierarchical level (0).

Image

Finally, the most important thought/question that I want to leave you with is: what do we expect to see in an OPTICSxi cluster classification? This question is closely linked to the task of parametrizing OPTICSxi.
Since I see hardly any studies with runs of OPTICSxi for a particular clustering problem, I struggle to establish what an optimal clustering classification would be, i.e. one that provides some meaningful/useful results and adds some value to the DBSCAN clustering. To help me answer that question, I performed many runs of OPTICSxi with different combinations of parameters, and I selected three that I will discuss below.

Image
epsilon=2000; xi=0.025; minpts=100;

In this run I used a large value of epsilon (2 km); the meaning of that value is that we accept large clusters (up to 2 km); since the algorithm “merges” clusters, we end up with some very large clusters that almost certainly have a low density. I like this output because it exposes the hierarchical structure of the classification, and it actually reminds me of several runs of DBSCAN with different combinations of parameters (for different densities), which is the advertised “strength” of OPTICS. As mentioned before, smaller clusters correspond to higher levels in the hierarchy, and to higher densities.

Image
epsilon=250; xi=0.035; minpts=10;

In this run we see a large number of clusters, even though the “contrast” parameter is similar to the one in the previous run. That is mostly because I chose a low minpts, which means we accept clusters with a low number of points. Since the epsilon in this case is smaller, we don’t see those large clusters occupying a large part of the map. I find this output less interesting than the previous one, mostly because, even though we have a hierarchical structure, there are many clusters at the same level, and many of them intersect. In terms of interpretation, I can see an overall “shape” that is similar to the previous one, but it is actually discretized into lots of small clusters that are easily overlooked as “noise”.

Image

epsilon=250; xi=0.035; minpts=100;

This run has a parameter choice similar to the previous one, except that minpts is larger; the consequence is that not only do we find fewer clusters that overlap less, but also that they are mostly at the same level.

From the perspective of adding value to DBSCAN, I would opt for the first combination of parameters, since it provides a hierarchical picture of the data, clearly exposing which areas are denser. IMHO the last combination of parameters fails to provide an idea of the global distribution of density, since it finds similar clusters all over the study area.

Clustering Geospatial Data

Recently I have been looking into different algorithms for clustering geospatial data. The problem of finding “similar” regions in space is a very interesting one, since this type of classification enables a whole range of applications (e.g. urban development, transport planning, etc.).
When we are working with “Big Data”, such as data resulting from crowd-sourced streams (for instance), the amount of information makes it difficult to detect things visually; this is often where artificial intelligence can “give us a hand”.

Image

DBSCAN is a density-based clustering algorithm that can find an a priori unknown number of clusters of arbitrary shape. It starts with a certain “idea” of what a cluster is, based on a density defined by its parameters: epsilon and minpts. This “idea” is also the main weakness of DBSCAN, as the global parameters don’t allow it to capture different densities, if the dataset contains them; and often, in geographical datasets, it does.

Image

In the example above, we can see that a different value of epsilon, and therefore a different density threshold, yields very different results. With the larger epsilon we detect clusters across the whole area of interest (AOI), but the centre comes up as a single cluster. With the smaller epsilon, we have more detail in the centre, but we “lose” the less dense clusters in the outskirts. In a way, DBSCAN corresponds to looking at the data with “semi-closed” eyes, and this is why its results make so much sense to the user.

OPTICS is considered a generalization of DBSCAN that tackles exactly this difficulty in configuring the parameters, by becoming almost parameterless. Instead of using an epsilon to limit the search for neighbours, OPTICS considers a range of epsilons, up to a maximum that could be a radius including the entire AOI (although for practical purposes you don’t really want to do that, because it increases the running time of the algorithm). OPTICS is essentially an ordering of the database, such that points that are spatially closest become neighbours in the ordering. The output of OPTICS is a graphic plotting this ordering together with a special distance stored for each point, representing the density that needs to be accepted for a cluster so that two neighbouring points belong to the same cluster; this is called a reachability plot (closely related to a dendrogram).
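
To make this “special distance” concrete, here is a minimal, brute-force C++ sketch (my own illustration, not ELKI’s implementation) of the two quantities OPTICS stores for each point, under one common convention for the core distance:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { double x, y; };

double dist(const Point &a, const Point &b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Core distance: distance to the minpts-th nearest neighbour within epsilon,
// or "undefined" (infinity here) if there are not enough neighbours.
double coreDistance(const std::vector<Point> &pts, std::size_t p,
                    double epsilon, std::size_t minpts) {
    std::vector<double> d;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double dd = dist(pts[p], pts[i]);
        if (i != p && dd <= epsilon) d.push_back(dd);
    }
    if (d.size() < minpts) return std::numeric_limits<double>::infinity();
    std::nth_element(d.begin(), d.begin() + (minpts - 1), d.end());
    return d[minpts - 1];
}

// Reachability distance of o from p: the true distance, but never smaller
// than p's core distance; this is the value plotted for each point in the
// cluster ordering.
double reachabilityDistance(const std::vector<Point> &pts, std::size_t o,
                            std::size_t p, double epsilon, std::size_t minpts) {
    return std::max(coreDistance(pts, p, epsilon, minpts), dist(pts[p], pts[o]));
}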

Image

source: wikipedia

OPTICS does not produce a strict partitioning of the data, but one can be extracted using algorithms that try to identify the “peaks” and “valleys” in the plot, or by selecting a range on the x axis and a steepness threshold on the y axis (xi). Generally speaking, these algorithms produce a hierarchical clustering, which is more difficult to interpret than the “flat” partitioning produced by DBSCAN. Conceptually, this OPTICS output corresponds to many “runs” of DBSCAN.

Image

There are implementations of DBSCAN and OPTICS in the WEKA and ELKI libraries. WEKA’s implementation of DBSCAN has been repeatedly reported to be slow compared to ELKI’s implementation. This benchmarking illustrates well the difference between the two. Although some improvements were added in the latest versions of WEKA, the difference was confirmed by invoking the two algorithms in some test cases. The OPTICS implementation in WEKA does not produce a classification. In ELKI, the OPTICS algorithm is separated from the classification algorithm, following ELKI’s modular philosophy; ELKI’s OPTICSxi produces a clustering classification using a partitioning algorithm selected (or implemented) by the user.

DBSCAN is very sensitive to its two parameters, which are quite hard to set up; the parameters also “influence” each other in the result. They are hard to set up because they depend largely on the particular phenomenon we are studying and on which type of clusters we want to detect. If we want to identify “meaningful” things, we need to have a pretty good idea of what we are looking for. Just as good domain knowledge is strongly advised, it is also advisable to inspect the results of the clustering algorithm by plotting them on a map. For all these reasons, I think DBSCAN is very good as an exploratory tool for producing meaningful, scientific analysis of a dataset, but I don’t see it as very “realistic” to implement it as part of a “service” that automatically analyses any dataset “on demand”.

OPTICS seems a bit more “obscure” to me. The algorithm is more sophisticated than DBSCAN, in the sense that it has a more “flexible” idea of a cluster, but it produces a more complex result, which is more difficult for the user to interpret. OPTICS itself does not produce any classification, so the quality of the cluster “results” actually depends heavily on the algorithm that we choose to perform the partitioning.

In a way, I don’t see OPTICS as a “replacement” for DBSCAN, but as a complement. It allows us to detect some “patterns” that are not “visible” to DBSCAN, but the lack of global parameters does not allow us to capture the “global” structure of the dataset.

Encrypting Passwords on the Client Side

I recently implemented an authentication system, backed by a table in the database. In this table, I store the usernames and passwords.
The PostgreSQL “pgcrypto” module supports encryption of specific columns, but I wanted to avoid “bulking” my system with more add-ons; also, I was not sure about trusting the administrator of the database. For all these reasons, I decided to go for client-side encryption.

As I am using the Qt API, my first guess was to look for encryption-related functionality in this framework.

Since Qt 4.3 there is a class called QCryptographicHash that provides a way to generate cryptographic hashes from binary or text data. It supports a couple of “popular” message-digest algorithms such as MD5, MD4 and SHA-1. This is interesting, but not usable in my case, since hashing is a one-way process: you can create a hash from data, but not “decrypt” it back to the original value.
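
For illustration, hashing data with this class is essentially a one-liner (assuming a QString named password):

#include <QCryptographicHash>

// One-way: you can compute the digest, but not recover the password from it.
QByteArray digest = QCryptographicHash::hash(password.toUtf8(),
                                             QCryptographicHash::Sha1);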

For the purpose of encrypting and decrypting data, there is the QCA library (Qt Cryptographic Architecture), which supports, among other things, asymmetric cryptography via backends such as OpenSSL. I had a quick look at this library and it looks pretty powerful; however, I was on a “quest” to avoid installing libraries, so I decided to look for another alternative. I should also add that I do not need “bullet proof” encryption, but solely want to keep my data hidden from the “curious eyes” of users who don’t know much about programming or cryptography. *If you want strong data encryption, then QCA is the way to go*.

Finally I stumbled upon SimpleCrypt (Simple Encryption), a class developed by a Qt user to protect against “casual observation”: just what I was looking for.
SimpleCrypt provides symmetric encryption (the same key is used for encoding and decoding the data). This key is a 64-bit integer (quint64), compiled into the program in the SimpleCrypt class.

SimpleCrypt crypto(Q_UINT64_C(0x5ebed0982db74774)); // some random 64-bit number

It can be used to encode strings, as well as streams of binary data. More details about the encryption algorithm can be found here.
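
A minimal round trip, using the same two methods that the delegate below relies on, looks like this:

QString cipher = crypto.encryptToString(QString("my secret")); // ciphertext, safe to store
QString plain  = crypto.decryptToString(cipher);               // back to "my secret"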

In my application I use a QDataWidgetMapper to map widgets, such as line edits, to fields in the database. In the case of the “password” field, a direct mapping won’t work, since the user will type and see a password that is different from the ciphertext stored in the database table. To deal with this, I need a “proxy” between the database and the widget; this proxy is the QItemDelegate.
By reimplementing this class and applying it to my widgets, I can “force” the application to transform the values that it writes to/reads from the database, by applying the encryption algorithm.
This is my implementation of the item delegate:

#include "passwddelegate.h"
#include "simplecrypt.h"

SimpleCrypt crypto(Q_UINT64_C(5ebed0982db747741f47)); //some random number

PasswdDelegate::PasswdDelegate(QList<int> colsOthers, QList<int> colsText, QList<int> colsPass,QObject *parent):
    NullRelationalDelegate(colsOthers,colsText,parent), m_colsPass(colsPass)
{

}
void PasswdDelegate::setModelData(QWidget *editor, QAbstractItemModel *model, const QModelIndex &index) const
{
    if (m_colsOthers.contains(index.column()) || m_colsText.contains(index.column()) ){
        NullRelationalDelegate::setModelData(editor,model,index);
    }else if (m_colsPass.contains(index.column())){

            model->setData(index, crypto.encryptToString(editor->property("text").toString()));

    }
    else QSqlRelationalDelegate::setModelData(editor,model,index);
}

void PasswdDelegate::setEditorData(QWidget *editor, const QModelIndex &index) const
{
    if (m_colsOthers.contains(index.column()) || m_colsText.contains(index.column())){
        NullRelationalDelegate::setEditorData(editor,index);
    }else if (m_colsPass.contains(index.column())){

           editor->setProperty("text", crypto.decryptToString(index.data().toString()));

    }
    else QSqlRelationalDelegate::setEditorData(editor,index);
}

The method “setModelData” is called each time we want to write values to the encrypted field, and it transforms the plain-text password into ciphertext by applying the compiled-in key. The method “setEditorData”, on the other hand, transforms the ciphertext from the database into plain text by applying the same key.

This is how the password looks to the user:

cypher1

And this is what is actually stored on the database:

cypher2

SimpleCrypt provides a simple method for “ciphering” user passwords in the database; it is not “unbreakable”, but it is one extra level of security, and it is pretty easy to use, not requiring the installation of any extra libraries (you may download simplecrypt.h and simplecrypt.cpp from here). The QItemDelegate is a class that allows us to control what the user “sees”, while still benefiting from the advantages of using a QDataWidgetMapper. In this case, it works like a charm.

Quick guide to Multi-master Replication in PostgreSQL

A while ago, I wrote a post about multi-master replication using SymmetricDS. My scenario consists of a system with multiple nodes, all of them writing to their own copies of the database. Sometimes the nodes may be offline, but I would like the system to be *eventually consistent*.

SymmetricDS is a Java-based framework that supports a number of RDBMSs, including PostgreSQL, the one that I use. I don’t particularly like the fact that it uses Java: the UI, especially, seems slow and unresponsive. However, the fact that the application is cross-platform is quite “handy”, as we can have the databases running on a number of different systems and “talking to each other”.
SymmetricDS itself is free and open source (GPL). However, if you want to use the configuration GUI, there is a commercial product called “SymmetricDS Pro”. I could not find out how much the Pro version costs (they are quite secretive on the website), but since I was in a bit of a rush to set up the synchronization system, I decided to try it out.
Previously, I evaluated the FOSS version and was able to synchronize two databases on Ubuntu systems: what they call a “Standard 2 Tier Configuration”. This time, I went for a slightly more complicated scenario: synchronizing three different databases, all on different hosts, with a mix of Windows and Linux systems. With the help of the “Pro” GUI and the quick-start manual, it took me less than two days, which I think is OK.

DISCLAIMER:
Before you start reading this post, please note that database replication is a complicated issue. Multi-master asynchronous replication is *definitely* complicated, with many things involved, so don’t expect the configuration to be a simple wizard. To be able to use it you need to understand a series of concepts well, and that won’t take you five minutes. Having said this, “SymmetricDS Pro” does a pretty good job of helping a person who *has these concepts* perform that task.

My case study is a real-world scenario where I have three different hosts running copies of my application and database. However, it may be over-simplified, since I am doing simple operations with the application (inserting/updating data with all the nodes online). Asynchronous multi-master replication opens space for conflicts to arise, and although SymmetricDS does provide some support for dealing with conflicts, this is a highly sensitive topic that must be dealt with on a case-by-case basis, by a person with good knowledge of the domain. In my case study I did not run into any conflicts, so I won’t evaluate how SymmetricDS deals with them. Please keep this issue in mind if you decide to adopt SymmetricDS.

SymmetricDS Pro is not free, but you may download it and evaluate it for 30 days:

http://www.jumpmind.com/products/symmetricds/download

You do need to give your email address, to which they will send the key to “unlock” the full functionality. I found it very easy to install, following the instructions in the quick-start guide:

http://www.jumpmind.com/downloads/symmetricds/doc/SymmetricDSPro-QuickStart-3.5.pdf

The only dependency is the Java Runtime Environment (JRE), which you very likely already have running on your system anyway.
In the guide they mention a “single-homed” scenario, where you have a single instance of SymmetricDS running, and a “multi-homed” scenario, where you install a copy of SymmetricDS for each host/database. Since I wanted to approach a “deployment scenario” with remote computers, I went straight to “multi-homed”. However, if you just want to test it, you may try the “single-homed” scenario (which is covered in the manual).

Although SymmetricDS enables a distributed system, you need to create a node that acts as a “registration server”. This node has to exist, even though you can make the other nodes “talk” to each other. Although it is OK if this node is offline for a while, I would pick a host that is mostly online (like an actual server).

I started by installing SymmetricDS on my “server” node. The installation is exactly the same on any node; when you finish, you start the daemon (by running something like “/symmetricDS/bin/sym”) and then run the node setup.

If you have installed SymmetricDS on port 31415 (the default non-secure port), the configuration console can be reached by pointing your browser at this address:

http://localhost:31415

Since I was on the server host, I chose to set up a “server” node. SymmetricDS presents you with two “ready-made” configurations, and an option to create your own, called “I’ll configure things myself”. This is actually a very important step of the configuration, since it defines the architecture of the system (how many nodes you have, how they connect to each other, etc.); later you may refine the configuration options, but the first decision is made here, so it is important to think it through. Since I was a bit intimidated by the “I’ll configure things myself” option, and the “Standard 2 Tier Configuration” is the only one covered in the manual, I decided to go for the latter first. If you are looking for a sort of tutorial, I would recommend this one, in order to check that everything is working on your system.

Although the manual “claims” that the “client” group may correspond to many nodes connected to one server, I found out that I could only make each client talk to the server (and vice versa); I could not make the client nodes talk to each other. It was as if they were subscribing to the “news” from the server, but the “news” arriving at the server via other nodes was not actually considered “news”.

After that, I decided to try the “Multiple Sources to One Target Configuration”, which is also described as “Data Warehousing”. This was not exactly what I was looking for, but I was able to modify the architecture until I arrived at something that suited me (which I will describe later). The next screens let you define the database connection string, and the URL for communicating with the SymmetricDS instance; in my case:

http://invislaptop:31415/sync/regsvr

(where invislaptop resolves to my server’s IP address)

After this, you are taken to the configuration dashboard, which should be “unlocked” using the key provided by email. The next thing you want to do is go to the “configuration” section. This section is as powerful as it is complicated: it allows you to tune and refine every aspect of the synchronization, with the aid of some tools for “bulk” tasks. It is certainly possible to do all of this (in the FOSS version) by editing the configuration files, but I found the GUI very useful, at least for a “newbie”.

The “Data Warehousing” “pre-cooked” configuration generates a series of node groups:

  • regsrvr: registration server
  • target: target data source
  • source1: group of nodes that provide data to the target
  • source2: group of nodes that provide data to the target
  • sourceN: …

In my scenario I “left” only three nodes: the registration server, a target and a source (“source1”), and removed the other ones. The names are not so important, and I could have just called them “regsvr”, “node1” and “node2” (for instance).

sym1

The “group links” section actually establishes the dynamics between all these groups of nodes, whatever names you gave them. In my case, the registration server “waits for pulls” from both node groups (“target” and “source1”). The “target” group pushes changes to both the registration server and the “source1” group. The “source1” group pushes changes to both the registration server and the “target” group.

sym2

The system could be described by something like this:

architecture

On the “routers” tab, you can define the details of these connections between nodes, through triggers (one for each action):

sym4

The triggers for each table are defined on the “table triggers” tab.

sym3

You may define them individually for the tables you are interested in, or do a “bulk” definition by choosing “auto-create”. Then you have the option to connect the routers to the triggers on this tab, or on the “routers” tab.
When this is done, you should have a trigger for each table, for each insert/update/delete action (according to what you have defined).

The server setup is actually the most complex and time-consuming configuration step (and I did not cover it exhaustively!). After this, I went to each of my clients and ran the installation and setup again.
This time, I chose to add a “client” node instead. The “client” nodes will attempt to register during the setup, by contacting the server at the address you provide; in my case:

http://invislaptop:31415/sync/regsvr

Unless you open registration on the server for that particular node (by inputting its ID and group), the registration will fail. This is OK, and you can go through the entire process of creating the client without registering the node.
When you have finished the setup, if you go to the server console and open “Manage nodes”, you will see one URL under the server entry. This should be the client node that contacted the server in order to register. If you right-click this entry and choose “allow”, the server should register the node. If you want, you may re-load the data on the client by choosing “Send initial load to” (this should not actually be necessary, as the server should send an initial load when allowing the node).
After registering both nodes, my setup looked like this:

sym5

After successfully registering all clients on the server, the system should be up and running. Note that you need the SymmetricDS daemons running on all three nodes to have a fully functional scenario.
I edited a record on the server, and it was replicated to the Ubuntu and Windows clients.

server

target

source1

Then I tried to edit a record on each of the other nodes (“target” and “source1”), and watched the changes being pushed to the other nodes. It seems that the daemon listens for changes at very short intervals, since the changes were propagated through the system almost immediately. However, I did not test it with more complex changes, including batches of data.

From this experience I would say SymmetricDS performs quite well, and with the aid of the GUI in “SymmetricDS Pro” it is not too hard to set up, once you are clear about what you are looking for and understand where to set things up. This is good, because I did not find much documentation on the web apart from the simpler scenario (“Standard 2 Tier Configuration”), nor did I find forum posts discussing this.

Furthermore, it would be interesting to test this system in a “tougher” scenario: larger and more complex batches of changes, more nodes, and sometimes some (or all) of them offline. This would obviously trigger the “conflict” situation, which is also the one that “scares” me most.

Ubuntu 4 Beginners

After installing Ubuntu three times in the past few months, and after having many requests to do it again, I have finally decided to put it all together in a workshop. It is going to be next Saturday, in Barcelona, in my favourite co-working space. And it is “free” as in beer, and as in GNU/Linux 🙂

The “official” announcement will be tomorrow, I think, but you can be the first to read it here 😉

UPDATE: THIS WORKSHOP HAS BEEN POSTPONED!

Ubuntu 4 Beginners

Did you ever think about installing Ubuntu, but never actually had the “courage” to do it alone? Then this workshop is for you.

tux1

In the first part I will introduce the GNU/Linux Operating System, by explaining some basic concepts and showing some applications.
The second part will focus mainly on the installation process of Ubuntu, and I will install it “live”, on a virtual machine.
At the end of the session I can help people who are interested to perform the installation on their own computers. Note that this will be *at their own risk*!

Target Audience:
This workshop targets people with a limited knowledge of *Nix systems, although some proficiency in using computers would be nice.
If you are a proficient *nix user or developer, and are interested in specific parts of the OS (such as the kernel), you may be interested in a more advanced workshop. If you are wondering what a *nix user is, please come: this workshop is for you 🙂

If you have Ubuntu installed on your laptop, or you are planning to install it at the end of the workshop, you may bring it with you. Otherwise, laptops are not required.

ubuntu_banner

Practical Info:
The duration of the workshop is approximately 2 hours (11:30h-13:30h), including a 10-minute break. Note that this is a free workshop, but you do *need to register* in order to attend. Please do it by filling in this form: it should only take 2 minutes.
For practical reasons, I will limit the number of participants to 20, on a “first come, first served” basis.

This workshop is hosted by MOB/Made (Calle Bailen 11, Bajos. 08010 BCN) and all donations collected by the bitcoin wallet below will be given to Made, a non-profit organization.

1GcD6YZLMvV4cNv7WckS22FJKMNdtjQJPE

Bitcoin

Quick guide to Auditing a (postgreSQL) Database: putting it all together

In my previous post, I suggested how to create a schema, a table and a trigger function in order to audit a PostgreSQL database.
To audit a table, you have to create a trigger for that table, calling the code from the generic trigger function.
In my case I want to audit every table in the database, and I think most people will likely want to audit every table, or at least most tables.
To escape the tedious task of writing that code n times, I put together a script that generates an audit trigger for each table in the database. If you want to apply it to a restricted number of tables instead, you could easily change it to read the table names from a list.

CREATE OR REPLACE FUNCTION create_audit_triggers()
  RETURNS void AS
$BODY$
DECLARE
  r RECORD;
  _string varchar(1000);
BEGIN

  -- loop over every table in the public schema
  FOR r IN SELECT DISTINCT tablename FROM pg_catalog.pg_tables WHERE schemaname = 'public' LOOP

    -- skip tables that already have an audit trigger
    IF NOT EXISTS (SELECT *
                   FROM information_schema.triggers
                   WHERE event_object_table = r.tablename
                   AND trigger_name = r.tablename || '_audit')
    THEN

      --raise info '%', r.tablename;
      _string := ' CREATE TRIGGER ' || r.tablename || '_audit ' ||
                 ' AFTER INSERT OR UPDATE OR DELETE ON ' || r.tablename ||
                 ' FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();';
      RAISE INFO '%', _string;
      EXECUTE (_string);

    END IF;

  END LOOP;

END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION create_audit_triggers()
  OWNER TO postgres;

This checks whether the trigger already exists (otherwise an error would be raised!), and generates the triggers in the blink of an eye (depending on the size of your database!). Thus you can also use it to update the triggers after you add a couple of tables to the database.
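
With the function in place, (re)generating the audit triggers for the whole database is a single call:

SELECT create_audit_triggers();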

Quick guide to Auditing a (postgreSQL) Database

According to Wikipedia, ‘database auditing’ involves observing a database so as to be aware of the actions of database users; a bit like ‘spying on user activity’.
This, of course, can be useful if you have a database with multiple users and store some sort of ‘confidential’ information.

spy

Previously I implemented some code to do this in SQL Server, that basically involved:

  • creating a table to store the audit information;
  • creating an ‘encoding scheme’ for this table (i.e. how to distinguish ‘edits’ from ‘removes’, etc.);
  • creating triggers for every table that I wanted to audit, to do the ‘housework’ (these were, of course, generated from a script).

Moreover, these ‘encoded changes’ were exported to JSON, as they were the basis for a synchronization system that I implemented.

This was working fine, until I had to port the database to another RDBMS, which gave me the opportunity to rethink the structure I had in place.
Instead of porting the code directly from T-SQL to PL/pgSQL, I did a little research and (gladly!) found out that Postgres has ready-made support for auditing, in the form of a recipe in the PostgreSQL wiki. It took me about 10 minutes to ‘get to grips with it’, which is what I describe next.

The first thing is to create a table to store the ‘changes’. The PostgreSQL wiki actually advises using a separate schema, which I agree is a good idea.

-- create a schema named "audit"
CREATE schema audit;
REVOKE CREATE ON schema audit FROM public;

CREATE TABLE audit.logged_actions (
schema_name text NOT NULL,
table_name text NOT NULL,
user_name text,
action_tstamp timestamp WITH time zone NOT NULL DEFAULT current_timestamp,
action TEXT NOT NULL CHECK (action IN ('I','D','U')),
original_data text,
new_data text,
query text
) WITH (fillfactor=100);

REVOKE ALL ON audit.logged_actions FROM public;

The stored information is: the relevant schema and table names, the username (so it is important to enforce a user policy here) and a timestamp. Then we also have the ‘type’ of action, which is one of the following: insert, delete or update (other statement types, such as selects, are not covered, which I guess is OK). Then we have the previous value and the current value. For inserts, we have a change from nothing to something; for deletes, a change from something to nothing; and for updates, a change from something to something. In a way, you could figure out the change type by looking at these values, so the ‘action’ field is perhaps a bit redundant; but then you would have to represent ‘nothing’ as a sort of special value (or keyword), and not as an empty space (which could be mistaken for an empty string).
Finally, we have the ‘query’, which is the exact query that triggered this audit entry. Although this is also a bit redundant, since it could be reconstructed from the other values, it is not a bad idea, since it allows us to quickly see/reproduce exactly what happened.

Then we can create some indexes:

CREATE INDEX logged_actions_schema_table_idx
ON audit.logged_actions(((schema_name||'.'||table_name)::TEXT));

CREATE INDEX logged_actions_action_tstamp_idx
ON audit.logged_actions(action_tstamp);

CREATE INDEX logged_actions_action_idx
ON audit.logged_actions(action);
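
These indexes pay off as soon as you start querying the log. For example, a quick review of the last day of changes to one table (the table name is just the example used later in this post):

SELECT action_tstamp, user_name, action, query
FROM audit.logged_actions
WHERE (schema_name||'.'||table_name)::TEXT = 'public.fr_frame'
  AND action_tstamp > now() - interval '1 day'
ORDER BY action_tstamp;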

The next step is to create the trigger function that will ‘fill’ this table on the relevant actions:

CREATE OR REPLACE FUNCTION audit.if_modified_func() RETURNS TRIGGER AS $body$
DECLARE
v_old_data TEXT;
v_new_data TEXT;
BEGIN
/*  If this actually for real auditing (where you need to log EVERY action),
then you would need to use something like dblink or plperl that could log outside the transaction,
regardless of whether the transaction committed or rolled back.
*/

/* This dance with casting the NEW and OLD values to a ROW is not necessary in pg 9.0+ */

IF (TG_OP = 'UPDATE') THEN
v_old_data := ROW(OLD.*);
v_new_data := ROW(NEW.*);
INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,original_data,new_data,query)
VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),v_old_data,v_new_data, current_query());
RETURN NEW;
ELSIF (TG_OP = 'DELETE') THEN
v_old_data := ROW(OLD.*);
INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,original_data,query)
VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),v_old_data, current_query());
RETURN OLD;
ELSIF (TG_OP = 'INSERT') THEN
v_new_data := ROW(NEW.*);
INSERT INTO audit.logged_actions (schema_name,table_name,user_name,action,new_data,query)
VALUES (TG_TABLE_SCHEMA::TEXT,TG_TABLE_NAME::TEXT,session_user::TEXT,substring(TG_OP,1,1),v_new_data, current_query());
RETURN NEW;
ELSE
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - Other action occurred: %, at %',TG_OP,now();
RETURN NULL;
END IF;

EXCEPTION
WHEN data_exception THEN
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [DATA EXCEPTION] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
RETURN NULL;
WHEN unique_violation THEN
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [UNIQUE] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
RETURN NULL;
WHEN OTHERS THEN
RAISE WARNING '[AUDIT.IF_MODIFIED_FUNC] - UDF ERROR [OTHER] - SQLSTATE: %, SQLERRM: %',SQLSTATE,SQLERRM;
RETURN NULL;
END;
$body$
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = pg_catalog, audit;

The ‘TG_OP’ variable inside a trigger is a string that tells us for which operation the trigger was fired (INSERT, UPDATE or DELETE). The rest is fiddling around with the ‘old’ and ‘new’ values.

Actually, that is all there is to it regarding having the audit ‘structure’ in place for PostgreSQL. To put it ‘in action’, auditing a table is as simple as this:

CREATE TRIGGER fr_frame_audit
AFTER INSERT OR UPDATE OR DELETE ON fr_frame
FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func();

In the example above, I created a trigger for auditing table ‘fr_frame’ (of course, you can create a stored procedure to generate these statements if you want to audit every table in the database…).
Then I went to table ‘fr_frame’ and deleted a row. It was stored like this:

"public";"fr_frame";"postgres";"2014-02-11 12:36:52.063113+00";"D";"(67,"bin frame",missing,missing,1,1,missing)";"";"DELETE FROM public.fr_frame WHERE id = '67'::integer"

Then I modified a row and added another; they were stored like this:

"public";"fr_frame";"postgres";"2014-02-11 12:38:54.793902+00";"U";"(59,"Sampling Frame",missing,"Initial Sampling frame",1,2,missing)";"(59,"Sampling Frame","Sampling Frame","Initial Sampling frame",1,2,missing)";"UPDATE public.fr_frame SET nameeng='Sampling Frame'::text WHERE id = '59'::integer"
"public";"fr_frame";"postgres";"2014-02-11 12:39:14.290898+00";"I";"";"(68,test,test,test,1,2,test)";"INSERT INTO public.fr_frame(name, nameeng, description, id_cloned_previous_frame, id_source, comments) VALUES ('test'::text, 'test'::text, 'test'::text, '1'::integer, '2'::smallint, 'test'::text)"

As I said before, encoding non-existing values (nothing) as an empty string is not the most accurate approach, but since we have more information to complement it, it works. Also, the fact that it is row-based rather than field-based (as my implementation was) means that whole rows are serialized as strings, which is not exactly normalized… on the other hand, I can accept it from the point of view of storage.

Overall, weighing all the pros and cons, I think I accept it as a good solution; and high fives for being so simple to implement.

You can find this information (and more) in the PostgreSQL wiki.

Building Cross-platform Applications (for real!)

I have been writing many posts about the Qt library without making a proper introduction.
Qt is a cross-platform C++ framework that provides support for a lot of things, such as UI, multi-threading, graphics, etc. Since C++ itself does not have as many libraries built in as you would find in other languages, it is a good idea to use something like this. There are other frameworks for specific things such as graphics (for instance GTK+), but I don’t know of any as complete as Qt. The .NET framework is quite complete, but unlike Qt it is not cross-platform, nor is it available under GPL/LGPL licenses, so if you care about these two things it is clearly *not* an option.

Having said this, I am quite happy with the Qt libraries in the long run, even if things are not as easy as they might seem.

The Windows and Linux environments are different enough, even if you stick to C++ and Qt.
In my Linux environment, I use g++ and make (the default toolchain), and let the system (a.k.a. the package manager) decide what the most appropriate version is. It actually takes care of all the environment variables, and I do not have to worry too much about setting up a build environment, whether I use Qt Creator (the “native” Qt IDE) or just the command line.
On the other hand, Windows has a couple of options regarding compilers:

  • there is the MinGW compiler, which people use for portability (but I don’t particularly like it, since you need to set up a whole set of tools that are not native on Windows);
  • there is the native Microsoft Visual Studio toolchain, which is, par excellence, a “Windows compiler”;
  • there is the “jom” tool that ships with Qt Creator (not a compiler, but an nmake replacement that allows using multiple cores); somehow I am a bit reluctant to use it, because I don’t see anybody using it outside Qt Creator.

The thoughts and “suspicions” above are nothing but that: thoughts and suspicions. Since I have been programming on Windows for quite a while using Microsoft Visual Studio, and I am quite happy with it, I simply decided to use it for my Windows projects.

If you want to stick to basic configurations, the differences between Windows and Linux projects may be small, but as soon as you start to complicate them a bit, it is not so simple anymore. Recently I decided to link my application statically, in order to ease deployment in a completely “user-proof” scenario. This required rebuilding Qt statically on both OSs.
After successfully compiling Qt, I tried to build my project on Linux using Qt Creator and it “worked like a charm”. I did not have to change any environment variables; I only had to point to the version of Qt I wanted to use inside Qt Creator. There were only a few things that I had to keep in mind:

  • My application uses a plugin, so I also had to link it statically (creating a .a static library).
  • There were some minor changes to the qmake project files of *both* the plugin and the application, in order to compile them statically.
  • It is better to clean/rebuild the projects, so that we don’t run the risk of having files *left over* from a previous dynamic build. From my experience, “make clean” is not so tidy, so I would recommend going inside the directories and removing the object files, or any other intermediate files, by hand…

This is the only change that I had to make in the project file of my designer plugin:

dynamic linking:

CONFIG += debug_and_release

static linking:

CONFIG += release staticlib

(with static linking, there is generally no point in debugging, so I am generating only a release build)

Then I could compile my application, and link against the static versions of Qt and of this plugin, by slightly modifying the project file:

dynamic linking:

CONFIG += debug_and_release

static linking:

CONFIG += release static

And that was about it: I got a 26.9 MB binary that I can take with me to any Ubuntu system 🙂

Now, the Windows part was a bit more painful. First, I could not get Qt Creator to work, since it did not correctly pick up the VC++ settings, and it complained that the static version of Qt had not been built with the same compiler I was trying to use. To speed things up, I decided to compile the application on the command line, more specifically inside the Visual Studio shell, which is the same place where I compiled Qt.

I had some persistent linking errors regarding DLL linkage with my plugin. I read somewhere that, by default on Windows, qmake will attempt DLL linkage unless it is explicitly told otherwise; thus I added this flag to the DEFINES of my project files:

DEFINES += QT_NODLL

On Linux I did not run into such a problem. In any case, it did not solve the linking errors. After researching a bit more, I found out that to link a plugin statically, you have to invoke a specific macro in the “main” of the application, *and* include QtPlugin; another thing that was not necessary on Linux…

#include <QtPlugin>

Q_IMPORT_PLUGIN(catchinputctrl)

And finally, the plugin has to be *explicitly* added to the project file, in this way:

QTPLUGIN += catchinputctrl

I still had linking errors, this time about *not* finding the plugin library; the directive that I had on Linux did not work, and I messed around a bit with the “-L” and “-l” options in LIBS, but without success, so I ended up copying the .lib file to my project directory (a quick fix). After that, I could generate the binary, though not without some linker warnings:

Creating library ..\release\faocas.lib and object ..\release\faocas.exp
frmcatch.obj : warning LNK4217: locally defined symbol ??0CatchInputCtrl@@QAE@PA
VQWidget@@@Z (public: __thiscall CatchInputCtrl::CatchInputCtrl(class QWidget *)
) imported in function "public: void __thiscall Ui_FrmCatch::setupUi(class QWidg
et *)" (?setupUi@Ui_FrmCatch@@QAEXPAVQWidget@@@Z)
frmoperation.obj : warning LNK4049: locally defined symbol ??0CatchInputCtrl@@QA
E@PAVQWidget@@@Z (public: __thiscall CatchInputCtrl::CatchInputCtrl(class QWidge
t *)) imported
frmtrip.obj : warning LNK4049: locally defined symbol ??0CatchInputCtrl@@QAE@PAV
QWidget@@@Z (public: __thiscall CatchInputCtrl::CatchInputCtrl(class QWidget *))
imported
frmcatch.obj : warning LNK4217: locally defined symbol ??1CatchInputCtrl@@UAE@XZ
(public: virtual __thiscall CatchInputCtrl::~CatchInputCtrl(void)) imported in
function "public: virtual void * __thiscall CatchInputCtrl::`scalar deleting des
tructor'(unsigned int)" (??_GCatchInputCtrl@@UAEPAXI@Z)
frmoperation.obj : warning LNK4049: locally defined symbol ??1CatchInputCtrl@@UA
E@XZ (public: virtual __thiscall CatchInputCtrl::~CatchInputCtrl(void)) imported

frmtrip.obj : warning LNK4049: locally defined symbol ??1CatchInputCtrl@@UAE@XZ
(public: virtual __thiscall CatchInputCtrl::~CatchInputCtrl(void)) imported
mt.exe -nologo -manifest "release\faocas.intermediate.manifest" -outputr

The output dir contained my executable, as well as an export library file (.exp) and a copy of the .lib. These are unnecessary, and I can happily copy my 12.2 MB binary around without having to ship anything else.

Some questions that remain in my mind:

  • Why are the binaries generated on Windows and Linux so different in size? (one is almost *double* the size of the other)
  • Why are static and dynamic linking on Windows so different?
  • Why is static linking on Windows so different from static linking on Linux, and why is it so poorly documented? (does anybody use it at all??)
Static linked app in Windows

Static linked app in Linux

Static Linking?

From time to time I have these moments when I cannot deploy my application properly and decide that I want to link it statically (then I generally give up, because it requires linking the Qt libraries statically…). But is it really better to prefer static over dynamic linking?

As in so many other cases, it depends on what you want to do. I read that in terms of performance there are trade-offs in both approaches, so in the end it does not really matter so much. From my point of view, the biggest advantage of static linking is the fact that you can ship a single file with your application, removing the risk of “broken” dependencies. That is, in terms of deployment, quite an advantage!

On the other hand, if everybody linked statically, we would literally have “thousands” of libraries “repeated” across our systems, packed inside “huge” binaries. It does not make much sense, does it?

Dynamic libraries are also “cool” because we can (to a certain extent) replace them with newer (improved) versions without having to recompile our application. That is a huge benefit, in terms of “bug fixing” of third-party libraries.

After removing the performance issue, my verdict would be:

  • For myself, I would like to minimize resource consumption by using shared libraries (dynamic linking) as much as possible.
  • For “bullet proof” systems, where users are not experienced in installing software and are likely to “mess up” the system by removing parts of it, I would instead consider providing statically compiled versions of the software. The software will likely be “bigger” (although there are tools to minimize this, such as UPX), and a bit more “hungry” for resources, but this is also the only way to prevent DLL hell.

Finally, it is important to mention that the type of linking may be constrained by licensing issues. For instance, due to the “nature” of the license, GPL libraries “contaminate” any software linked with them, while for LGPL libraries the linking type matters: dynamic linking is generally acceptable for closed software, whereas static linking brings extra obligations.

Cross-compiling, using Qt

In a previous post I described a simple example of how to cross-compile a Windows application in a Linux environment.
Although this worked fine and raised my motivation, it is not a very useful example, since the real world is much more complex. One of the complexities that I want to introduce is the use of the Qt libraries, which I use extensively in my projects. I will now go, in detail, through a slightly more complicated “Hello World” that makes use of these libraries.

A key thing to understand when cross-compiling with Qt is that you need to set up the development environment both for the host and for the target OS (in this case: Ubuntu and Windows).

I started by downloading the Qt distribution compiled for MinGW (of course, you may also decide to download the source code and compile it yourself, which will add another layer of complexity to this task…):

wget http://download.qt-project.org/official_releases/qt/4.8/4.8.5/qt-win-opensource-4.8.5-mingw.exe

Then I ran this with Wine, and followed the installation instructions:

wine qt-win-opensource-4.8.5-mingw.exe

Qt got installed under my “wine drive”, in the $HOME directory:

~/.wine/drive_c/Qt/4.8.5

As I mentioned before, you also need a working Qt environment for Linux. I had Qt 4.8.7 installed on my machine, and I naively thought I could use that; it turns out I cannot. The Qt versions on Windows and Linux need to match! Since a MinGW build of 4.8.7 had not been released yet, I decided to download the Linux version of Qt 4.8.5 and install it alongside my current version (let us see what problems this may bring me in the future, if I keep developing with 4.8.7!).
If you did not understand my grumbling in the last paragraph, the only thing you need to know is that I downloaded Qt 4.8.5 for Linux, and installed it in:

/usr/local/Trolltech/Qt-4.8.5

The next step is to configure the mkspecs for cross-compiling; for that you need to navigate to /usr/local/Trolltech/Qt-4.8.5/mkspecs and create a new directory there. I called it “win32-x-g++”.

/usr/local/Trolltech/Qt-4.8.5/mkspecs/win32-x-g++

Inside this directory you need to place a version of qmake.conf with all the necessary information for cross-compiling; mainly, you need to pay attention to the paths of the Qt installed under Wine, and to the MinGW paths on Linux. Here is a version that works for me; please amend it to reflect your paths.

Then I wrote a very simple Qt “Hello World”:

// Hello World in C++ for the Qt framework
#include <QApplication>
#include <QLabel>

int main(int argc, char *argv[])
{
    QApplication a(argc, argv);
    QLabel l("Hello World!", 0);
    l.setText("Test");
    l.setAlignment(Qt::AlignCenter);
    l.resize(300, 200);
    l.show();
    return a.exec();
}

This is just a window with “Hello World” written on it, but it is calling the Qt libraries.
My Qt project file looks like this (I generated it with qmake, and then added the relevant modules to the QT variable):

######################################################################
# Automatically generated by qmake (2.01a) Fri Dec 20 13:18:06 2013
######################################################################

TEMPLATE = app
TARGET = hello
QT += core gui
CONFIG += qt
DEPENDPATH += .
INCLUDEPATH += .

# Input
SOURCES += hello.cpp

A small aside here, to note that the qmake version should be the one in:

/usr/local/Trolltech/Qt-4.8.5

If you have more than one Qt version (like I do!), it is worth setting the $QTDIR variable to this directory while cross-compiling. You can check whether this is working as expected by checking the path of qmake:

which qmake

If this points to /usr/local/Trolltech/Qt-4.8.5, then everything is OK; otherwise, please take a moment to export your $QTDIR.
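
Something like this (adjusting the path to your installation) does the trick:

export QTDIR=/usr/local/Trolltech/Qt-4.8.5
export PATH=$QTDIR/bin:$PATH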

To generate the makefiles using the mkspec created above, you can use the following syntax:

qmake -spec win32-x-g++

Then you can compile the application using make. If the compilation was successful, you should be able to run the executable “hello.exe” using Wine. When I attempted to do this, I had some errors related to libraries not being found. As in my previous example, I copied “mingwm10.dll” to the current directory, but I still had errors related to Wine not finding the Qt libraries.

I edited the QTDIR and PATH variables in Wine (Windows) by calling regedit.

regedit

Unfortunately, that still did not do it for me. I thought a quick fix for forcing the application to pick up the libraries would be to copy the required DLLs to the application path. Unfortunately, that did not work either 😦

err:module:import_dll Library libgcc_s_dw2-1.dll (which is needed by L"Z:\\home\\joana\\projects\\cross-compiling\\hello\\release\\QtCore4.dll") not found
err:module:import_dll Library QtCore4.dll (which is needed by L"Z:\\home\\joana\\projects\\cross-compiling\\hello\\release\\hello.exe") not found
err:module:import_dll Library libgcc_s_dw2-1.dll (which is needed by L"Z:\\home\\joana\\projects\\cross-compiling\\hello\\release\\QtCore4.dll") not found
err:module:import_dll Library QtCore4.dll (which is needed by L"Z:\\home\\joana\\projects\\cross-compiling\\hello\\release\\QtGui4.dll") not found
err:module:import_dll Library libgcc_s_dw2-1.dll (which is needed by L"Z:\\home\\joana\\projects\\cross-compiling\\hello\\release\\QtGui4.dll") not found
err:module:import_dll Library QtGui4.dll (which is needed by L"Z:\\home\\joana\\projects\\cross-compiling\\hello\\release\\hello.exe") not found
err:module:LdrInitializeThunk Main exe initialization for L"Z:\\home\\joana\\projects\\cross-compiling\\hello\\release\\hello.exe" failed, status c0000135

Then I thought that, even if I cannot run the executable with Wine, it does not necessarily mean that I did not generate it correctly (which is, ultimately, my purpose with all of this!). So I went to a Windows guest in VirtualBox, and I managed to run the application successfully! 🙂

hello1

This was a nice outcome; however, it is still necessary to do this for the entire project, tracking all the required Qt DLLs, Boost, and any other needed libraries. It would probably be a good idea to consider static linking, since I would not have to ship the DLLs, and it would also avoid digging into the problem of “why Wine is not finding the shared libraries”. However, from my experience, static linking of the Qt framework is quite an onerous task, and in this case I would have to do it twice (for Linux and for Windows).