Cluster Explorer Demo

Cluster explorer is a piece of software I was working on, which blends machine learning and GIS to produce a cluster visualization on top of a virtual globe.
The virtual globe uses NasaWorldWind, a Java framework for 3D geo-visualization based on OpenGL, while the clustering algorithms use the ELKI Data mining framework for unsupervised learning.
The tool allows to cluster a bunch of points (for instance geolocated Tweets) using one of two algorithms (or both), and explore the contextual information provided by the geographic layers (e.g.: OSM, Bing).


Driving Spatial Analysis Beyond the Limits of Traditional Storage

My presentation on the “Conference on Advanced Spatial Modelling and Analysis” consisted in some thoughts regarding Big Spatial Data, and how to handle them in terms of modern technologies.

It was great to see such a motivated crowd from all generations, and to get to know the research developed by CEG, in topics such as Agent Based Modelling and Neural Networks. It was also great to talk again to Arnaud Banos, from the Complex System Institute of Paris Ile-de-France (ISC-PIF).



Recently I had another challenge, which I believe has the characteristics to be a common problem. I have a table with attributes, in CSV format, one of which is geospatial.

CSV is a structured format for storing tabular data (text and numbers), where each row corresponds to a record, and each field is separated by a known character(generally a comma). It is probably one of the most common formats to distribute that, probably because it is a standard output from relational databases.

Since people hand me data often in this format, and for a number of reasons it is more convenient for me to to use JSON data, I thought it would be handy to have a method to translating CSV into JSON, and this was the first milestone of this challenge.

The second milestone of this challenge, is that there is some geospatial information within this data, serialized in a non standard format, and I would like to convert it into a standard JSON format for spatial data; e.g.: GeoJSON. So the second milestone has actually two parts:

  • parse a GeoJSON geometry from the CSV fields
  • pack the geometry and the properties into GeoJSON field

To convert CSV (or XML) to JSON, I found this really nice website. It lets you upload a file, and save the results into another file,so I could transform this:


into this:

"TMC": "E17+02412",
"StartLatitude": "41.5368273",
"StartLongitude": "0.4387071",
"EndLatitude": "41.5388396",
"EndLongitude": "0.4638462"

This gave me a nicely formatted JSON output (the first milestone!), but as you can notice the geometry is not conform with any OGC standards. It is actually a linestring, which is defined by a start point (StartLongitude, StartLatitude) and an end point (EndLongitude, EndLatitude).

According to the JSON spec, a linestring is defined by an array of coordinates:

So the goal would be to transform the geometry above into:

"coordinates": [
[0.4387071, 41.5368273], [0.4638462, 41.5388396]

Once more, jq comes really handy to this task.

The JSON can be transformed into a feature using this syntax:

cat tramos.json | jq -c '[.[] | { type: "Feature", "geometry": {"type": "LineString","coordinates": [ [.StartLongitude, .StartLatitude| tonumber], [ .EndLongitude, .EndLatitude | tonumber] ] }, properties: {tmc: .TMC, roadnumber: .ROADNUMBER, dir: .DIR, prov: .PROV, ccaa: .CCAA}}]' > tramos.geojson

Since the JSON converser parse all the variables into strings, it is important to pass a filter (tonumber) to make sure that the coordinate numbers are converted back into numbers.

"properties": {
"ccaa": "CATALUNYA",
"prov": "LLEIDA",
"roadnumber": "A-2",
"tmc": "E17+02413"
"geometry": {
"coordinates": [
"type": "LineString"
"type": "Feature"

Since we are creating an array of features (or “Feature Collection”), to be conform with GeoJSON, it is important to declare the root element too, by adding this outer element:

{ "type": "FeatureCollection","features": [ ]}

The result should be a valid GeoJSON, that you can view and manipulate in your favourite GIS (for instance QGIS!) 🙂


JSON to GeoJSON with jq

A lot of people and institutions have already made the jump of providing data in JSON, which is great, since it is an inter-operable standard and a semi-structured form of data. However when it comes down to geographic data, standards don’t seem to be so common. I have seen many different ways of encoding geospatial information within JSON, normally involving listing an array of coordinates, with or without name fields for lat and long. Rarely there is any CRS associated to this data (which could be ok, for the case that it uses WGS84), or any mention of the geometry type.

This information is more or less useless, without some pre-processing to convert it into a “GIS-friendly” format, that we could use in QGIS, GeoServer, or even R.

Since we are dealing with JSON, the natural thing would be to convert it into GeoJSON, a structured format for geographic data. And the perfect tool for doing this is jq, a tool that I mentioned in a previous post. To make it simpler to understand I will explain what I did a specific JSON dataset, but with some knowledge of jq (and GeoJSON), you could literally apply it to any JSON dataset with geographic information within it.

My test dataset was the description of a set of roads, from the city of zaragoza,.

The description of the dataset says that it is in “Google” format, which one could erroneous interpret as spherical mercator, but the name of the file suggests WGS84, and a quick look at the coordinates can confirm that too. This is literally a list of tracks, each one containing a list of coordinates that define the geometry. Let us look at an example:

  "points": [                                                                                                                                        
      "lon": -0.8437921499884775,                                                                                                                    
      "lat": 41.6710232246183
      "lon": -0.8439686263507937,                                                                                                                    
      "lat": 41.67098172145761
      "lon": -0.8442926556112658,                                                                                                                    
      "lat": 41.670866465890654
      "lon": -0.8448464412455035,                                                                                                                    
      "lat": 41.67062949885585
      "lon": -0.8453763659750164,                                                                                                                    
      "lat": 41.67040130061031
      "lon": -0.8474617762602581,
      "lat": 41.669528132440355
      "lon": -0.8535340031154578,
      "lat": 41.66696540067222
  "id": 5

So the task here would be to convert this into a GeoJSON geometry (a linestring). For instance:

  { "type": "LineString",
    "coordinates": [ [100.0, 0.0], [101.0, 1.0] ]

In jq, we want to loop through the array of roads, and parse the lat, long coordinates of each road object. This coordinates are themselves another array. If we do something like this:

cat tramoswgs84.json | jq  '.tramos[2]| .points[].lon,.points[].lat'

We are asking for the longitude and latitude coordinates of track 2, but since jq evaluates expressions from left to right, it will gives us back the array of longitude coordinates and the array of latitude coordinates, not the pairs.

The key thing is to use map, that will run the filter for each element of the array:

cat tramoswgs84.json | jq  '.tramos[2].points| map([.lon,.lat])'

The complete jq syntax for generating one linestring object, would be:

cat tramoswgs84.json | jq  -c '.tramos[1]|  {"type": "LineString", "coordinates": .points | map([.lon,.lat])}'

The next step would be to create a GeoJSON containing the entire collection of linestrings. Since we would like to attach attributes to them (“name” and “id”), we rather generate a “feature collection“, than a “geometric collection”. The code for generating each feature would be:

cat $1 | jq  -c '.tramos[]| {"type": "Feature","geometry": {"type": "LineString", "coordinates": .points | map([.lon,.lat])},"properties":{"name": .name, "id": .id}}' >> $2

And then we need to do a few text manipulation operations, that I could not find a way of performing with jq. Basically we need to add the opening tags for the feature collection, commas between each object, and then add closing tags for the feature collection.

I did this text manipulating tasks with sed, and put everything inside a shell script, that will transform the JSON file (in the format that I described) directly into a valid GeoJSON. If you are interested, you can get it from github. The resulting file can be fed to QGIS, in order to produce pretty maps, like the one bellow 🙂


Piping an API into R: a Data Science Workflow

Inspired by @jeroenhjanssens, author of the Data Science Toolbox, I decided to give a go to one of the most unfriendly data sources: An XML API.
Apart from its rich syntax with query capabilities, I tend think XML is highly verbose and human unfriendly, which is quite a discouraging if you don’t want to take advantage of all its capabilities. And in my case I didn’t: I just wanted to grab a data stream, in order to be able to build some analysis in R. APIs are generally a pain for data scientists, because they tend to want to have “a look at things” and get a general feeling of the dataset, before start building code. Normally, this is not possible with an API, unless you use these high-end drag-and-drop interfaces, that are generally costly. But following this approach I was able to setup a chain of tools that enable me to reproduce this AGILE workflow, where you can have a feel of the dataset in R, without having to write a Python client.

The first step was to pipe the xml output of the query into a file, and that is easy enough to do with curl

curl -s '' > out.xml

Now, if you are an XML wiz you can follow a different approach, but I personally feel more comfortable with JSON, so the next step for me was to convert the XML dump into some nice JSON, and fortunately there is another free tool for that too: xml2json

xml2json < out.xml > out.json

Having the JSON, it is possible to query it using jq, a command line JSON parser that I find really intuitive. With this command, I am able to narrow the dataset to the fields I am interested, and pipe the results into another text file. In this case I am skipping all the “headers”, and grabbing an array of elements, which is what I want to analyse.

cat out.json | jq '[.Root.ResultSet.Entity[] | {color: .color, width: .with, average: .average, reference: .reference, Time: .Time}]' > test.json

Now here I could add another step, to convert the JSON results into csv, but actually R has interfaces to JSON, so why not use those to import the data directly. There is actually more than one package that can do this, but I had some nice results with jsonlite.

data1 <- fromJSON("test.json")

And with these two lines of code, I have a data frame that I can use for running ML algorithms.


Data Mining| Machine Learning

Together with a colleague, I have been involved in the “hard” task of drafting a diagram (or a “mindmap”) that would connect logically, some of the “buzz words” regarding “data science”; e.g.: artificial intelligence, machine learning, data mining, recommenders. Moreover we wanted to provide a classification that would organize the different “algorithmic families” into some sort of typology. Hard task, I know, mostly because there are many classifications, based on the approaches we want to take; e.g.: by learning method, by task. We ended up not with one diagram, but with two, separating “data mining” and “machine learning”, in order to explain them better.

In the “Data mining” diagram, we include a general distinction between “descriptive” and “predictive” data mining, and within these two, we follow with sub divisions that finish in data mining techniques that may or not belong to machine learning (e.g.: statistics). On the bottom of the diagram, we represent the generic data mining applications, that make use of these techniques. One key difficulty in drafting this diagram is the fact that some techniques can include other techniques, and it is not easy to reflect that in the diagram. For instance, machine learning techniques typically make use descriptive statistics such as dispersion or central tendency.

Data Mining

In the “Machine learning” diagram we went for a more “scientific” view (less problem oriented), and tried to show how machine learning fits into the broader field “Artificial Intelligence”. Then we took the “learning approach”, as a way of classifying ML techniques. At the leafs of this tree, as well as at the leafs of the “Data Mining tree”, there are examples of techniques/algorithms relevant for the specific types; it is not an *extensive* list of algorithms, neither it claims to select the most *important* algorithm (if there is such thing…); sometimes the criteria for choosing the algorithm is greatly *subjective*: because we worked or read about it, or even because it was the only example we could find…

Machine Learning

Clearly there is some degree of overlap between the two diagrams. Machine learning is part of Data Mining, and therefore some algorithmic “families” are presented in both diagrams. However we believe that in this way, it becomes easier to describe what “machine learning” is, as a scientific discipline, and how it “fits & mixes”, within the “wide umbrella” of data mining.

This diagram was based on a lot of reads (mostly blogs), on our own knowledge and a lot of discussion. It is not “written on stone”, and I don’t even know if it is possible to have such a thing, regarding a topic that is so difficult to classify, either because it is evolving so fast or because it is often very “fuzzy”. In any case, any (constructive) critics or commentaries regarding ways of improving these diagrams, or even just some thoughts would be greatly appreciated.

Importing Large Spatial Datasets into PostGIS

Is PostgreSQL/PostGIS suitable for Big Data? This is a question I have been asking myself for a while, and I now have the opportunity to test it with a PostgreSQL instance from Amazon Web Services (AWS); after all, if such a service cannot deal with “Big Data”, who can?
I have a dataset with Tweets in the city of Barcelona, over a time period. The entire dataset amounts to about 3 million points (82 MB), and I have some subsets: a smaller collection of about 2 million points (55 MB) and 300 000 points (8.2. MB).


We can argue whether this dataset is, or not, Big Data; but IMHO, a table with more than 1 million records starts to get “heavy” for any relational database. Now before doing any computations of spatial algorithms (e.g.: convex hull, point in polygon), the very first step is to actually load this dataset into the database. And this is where I started to struggle.

QGIS has a plugin called SPIT, that allows to import a Shapefile into PostGIS. Unfortunately it seems to be extremely slow any of the three files that I mentioned above (including the smaller one).


Switching to the command line, where I have more control knowledge about the processes that are running, I tried shp2pgsql, a tool for importing Shapefiles that comes with the PostGIS installation.

On Linux it is possible to convert and import the Shapefile in one go, redirecting the output with pipe:

shp2pgsql -s <SRID> -c -D -I <path to shapefile> <schema>.<table> | \
psql -d <databasename> -h <hostname> -U <username>

Since this command was really slow (unresponsive?) I decided to do it in two steps, to see where it was hanging:

  • Create an sql file with the insert statements.
  • Load this sql file into the database.

Unsurprisingly, the SQL file with the statements was generated quite quickly, and the processing effort was being spent on the insert queries. This situation was verified with the three Shapefiles, even when I inserted a spatial index (Gist). After thinking a bit, I came to the conclusion that the spatial index should be created within the “CREATE TABLE” statement, so it should actually speed up the queries a bit (but I did not confirm this guess).

My next attempt, was to use ogr2ogr, the GDAL tool to convert between vector formats.

ogr2ogr -f "PostgreSQL" PG:"host=host user=user dbname=db password=pass" data.shp

I have to say that I read some suggestion to prefer shp2pgsql over this tool, and in fact it was quite slow as well (at this point I was only testing the smallest of the Shapefiles).

My conclusion up to this moment, is that it is quite expensive in terms of time (and processor) to import a large dataset into a PostGIS. Some ideas that could be explored:

  • Upgrade the PostGIS server (would scale vertically make any difference?).
  • Upload the data into the Amazon server and run the import tool from there. I am not sure whether it is possible to do that, but it seems like a good improvement to do this operation locally, and using the power of the Amazon Server, rather than my desktop computer.
  • Split the Shapefile into chunks of data. I am not entirely sure how to do this, and any suggestions would be welcome.

Next steps: once I manage to import a large Shapefile into my PostGIS instance, I will post some feedback about the computation of spatial operations.

Update: Apparently I underestimated the fact that I am importing data over a network connection, with a limited latency. The best thing would be to make the import locally, an option that I unfortunately don’t have, since the Amazon RDS does not give me SSH access (it is a service, not a machine!). I decided to switch to a new approach: import a plain csv file into a database table, using the /copy command, and that was actually really fast (a few seconds).

psql database -U user -h -c "\copy newt_table from 'data.csv' with DELIMITER ','"

Of course, I then had to run a query populate the point geometry from the float (x,y) values, and create an update a spatial index (for a matter efficiency it is recommended that indexes are created after, and not during the import!).

UPDATE new_table SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);

CREATE INDEX [indexname] ON [tablename] USING GIST ( [geometryfield] );

VACUUM ANALYZE [table_name] [column_name];

That was still considerably fast (although not as much as the csv import!), and my guess is: it is because I are now running queries on the remote machine (and not uploading data via an internet connection).

Following some very usefull on Stack Overflow, I decided to retry my previous approaches, tuning the configuration parameters.

First I tried the OGR command line tool, but this time with the config flag “PG_USE_COPY YES”, to force dumping of the data:

ogr2ogr -progress -f PostgreSQL PG:"dbname='database' user='user' host='' port='5432' data.shp --config PG_USE_COPY YES -nlt POINT -nln import_ogr

Although speed has improved massively, it still took me 1:03:00 to run this query.

Then I tried the shp2pgsql tool, with the “-D” flag, which also forces the dump of data:

shp2pgsql -D -s 4326 data.shp import_shp2pgsql | psql

This was remarkably fast: 00:01:04


Comparing the two imports, we can see that shp2pgsql did not create any index (which we already commented is a costly operation), while ogr2ogr created a gist index on the spatial field. The extra step of creating an index, has to be added to the shp2pgsql import, but as we have seen before, that operation is not a problem since we are then operating on the server.

I conclude this review, stating that shp2pgsql has a great performance in the task of  loading a Big Data spatial table into a remote PostGIS server (e.g.: Amazon RDS). It is recommended that the “-D” flag is passed to the tool, and that no indexes are created during this operation.

Bellow we can see the PostGIS geometry of the imported table, with more than 3 million points:


Post-processing OPTICSxi Clustering

On a previous post, I expressed my concerns regarding the results of OPTICSxi clustering. Namely, I mentioned an “annoying” spike effect, that turns out massively almost at any simulation (so massively that it is almost a “feature”).

A post in StackOverflow, originated an useful exchange of ideas with the author of the ELKI framework. Namely he pointed out a “weakness” of the algorithm, that basis its partition on the reachability distance, which is not always a synonymous of “spatial closeness”. Literally, outliers that are standing in the middle of clusters, could be erroneous misinterpretated as belonging to a cluster or the other.

Since this is a problem on the partition algorithm, the solution could pass through improving the partition algorithm, using a different partition algorithm, or using a different cluster algorithm all together.  As a “quick fix” I opted to some cluster “post-processing”, in order to remove the outliers.

So my research question was: how to identify a point that is a spatial outlier?

I tried a couple of approaches that I will discuss now, as I think they may be useful for someone or generate an useful discussion.


On the image above, the coloured points are outliers, according to our definition. One simple approach, would be to calculate the average path length, the average distance of one point, to all other points in the point cloud. Then we could test the distance of each point, and say if it is greater than a certain value, let us say 5,  we would consider it an outlier and remove it from the dataset. I found this approach actually yield good results, and was able to reduce the “spikes” that we see in this figure:


To this results:


The code that calculates the average distance for each point, is bellow:

public static double getAverageDistance(Coordinate coord, ModifiableDBIDs ids, HashMap dataMap){

double sum=0.0;

for (DBIDIter iter = ids.iter();
iter.valid(); iter.advance())
catch( Exception  exp) {
System.out.println( "Unexpected exception:" + exp.getMessage() );
return sum/ids.size();

From the point of view of “correctness” this algorithm suffers of the “bottleneck” of removing the outlier from the clusters (and thus converting it into “noise”). To avoid this, it would have to be tested if the point actually belongs to another cluster.

Apart from that, in terms of computation the algorithm is extremely “costly” , being the most costly part when it computes the distances from all points to all points, literally a matrix of NxN that can easily increase to huge numbers with a large cluster.

To avoid that, I tried a few different approaches. One was to calculate this distances for a part of the dataset. We know that by the way the algorithm is written, the outliers tend to “appear” either in the beginning or the end of the cluster. Having this in mind, I calculated the average based on the 60% “middle” values (n<20% and n>20%) and tested the condition for the first 20% and last 20% (this refinement was actually not needed as the “costly” part of the algorithm is the distance matrix and not the condition testing). I was not able to reach any reasonable results with this approach, either because the points were not ordered (which defeats the all purpose of my “slicing”) or because outliers were appearing outside these “classes” (i.e.: in the middle of the dataset).

The other approach that I tried was to work with the “final” polygon (the convex hull of the cluster), rather than the raw points. The polygon border has only a few points, when compared to the point cloud used to generate it. It is very easy and quick to identify the “outlier” in the polygon border; however removing it, will understandable result in a “strange polygon”, since the reality is: if that point would not be used to build the polygon, another point (that we don’t have right now!) would be used, and so the real geometry would not be this one. This is particularly noticeable when we see overlapping clusters (which don’t overlap anymore). After this experience, it became clear that the processing would have to be done in the cluster point dataset, before building the polygon.

I ended up with an algorithm that successfully removes the “spike” effects from OPTICSxi, but is rather costly in terms of time (more costly than OPTICSxi itself) and unfortunately this grows exponentially with the size of the dataset, which limits its application with big data.


The approach above was improved, by testing the distances against the 2 neighbours of each point (previous and next point) rather than against the entire matrix. With this hack, the running time of the algorithm was reduced to reasonable values, that don’t grow up so much with the size of the vector.