Get on Board with a GeoNetwork Container

GeoNetwork is a FOSS catalog for geospatial information. It is used around the world by organizations such as FAO, the Dutch Kadaster or Eurostat, just to mention a few.

Like any software service, it may not be trivial to install and configure, which may put people off giving it a try. This could change with Docker.


Docker, which could be described in a nutshell as infrastructure as code, automates the deployment of Linux applications inside software containers. It relies on a technology, LXC, which provides operating-system-level virtualization on Linux. In less than four years it has seen massive adoption by the software community, and it has already been taken to production in many use cases.

The Docker Hub is a massive repository of ready-to-use images. You can find anything from web servers to databases, or even actual operating systems. With a docker pull at the tip of your fingers, you can have them running on your computer in a matter of minutes (depending on your internet connection).
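
As an example, getting a stock web server up and running takes just a couple of commands (using the official nginx image, and assuming port 80 is free on your machine):

docker pull nginx
docker run -d --name some-nginx -p 80:80 nginx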

Anyone can upload their Docker images to Docker Hub, but some images are released “officially”.
The sources of official images live in the Docker repositories, and they are considered good to use (and reuse) because they implement Docker best practices, so their code can be read as an example. They are also thoroughly documented according to certain standards, and they go through a security audit.

Although there are a couple of GeoNetwork images in the Docker repositories, there is no official image yet, so I decided to create one. While that image goes through the approval process, I am publishing it anyway, so that anyone can benefit from it in the meantime.

These images provide the two latest releases of GeoNetwork (3.0.5 and 3.2.0), as well as the previous release (3.0.4). By default, GeoNetwork runs on a local H2 database, but I created a variant which can use a PostgreSQL database as backend, either running in a container or on a bare-metal server. This should make it more fit for production.

You can read more about these and other features, such as setting and persisting the data directory, on the Docker Hub page.

Once the official images get released I will make an announcement here. But in the meantime, there is no excuse not to start playing with GeoNetwork:

docker pull geocat/geonetwork
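
If you just want to try it out, a minimal run could look like the line below; I am assuming here that the image exposes GeoNetwork on port 8080 and that the default H2 database is fine for testing (the Docker Hub page documents the exact tags, ports and variables):

docker run --name geonetwork -d -p 8080:8080 geocat/geonetwork

With that port mapping, GeoNetwork should then be reachable at http://localhost:8080/geonetwork.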



Have fun with Docker & GeoNetwork! 🙂

Watching a Server through a Container

Lately I have been working a lot with Docker, the new kid on the block in cloud computing, which is winning the hearts of sysadmins as well as developers.

The main idea is to set up a Spatial Data Infrastructure, something that has been at the core of other projects, such as Georchestra.

Unfortunately, having something running on a server is normally not a completely smooth experience, and this sets the ground for the need for a monitoring service.

After searching a bit, I found NewRelic, which provides monitoring as a service. I really liked the advanced functionality and the completeness of the dashboards, so it was not hard to convince myself to try it.

NewRelic provides two types of monitoring: application monitoring and server monitoring, which is what I will cover in this post. The server monitor is basically a daemon that runs on the server and collects statistics about various metrics, such as memory usage, CPU usage, bandwidth, etc. But what really caught my eye about this solution was the ability to monitor the Docker daemon and the different containers that run within it.

Unfortunately this functionality appears to be broken for docker 1.11 (my current version), but with the help of the NewRelic engineers I was able to apply a workaround.

My next step was to dockerize this solution. After all, wouldn’t it be great to spin up another container in my SDI that would monitor the other containers AND the server?

The bad news is that the existing images of NewRelic’s server monitor on Docker Hub do not implement the workaround. So I went ahead and implemented my own image.

You can pull this image from the repository with:

docker pull doublebyte/newrelic_sysmond

Then you can run it with:

docker run -d \
  --privileged=true --name nrsysmond \
  --pid=host \
  --net=host \
  -v /sys:/sys \
  -v /dev:/dev \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/log:/var/log:rw \
  doublebyte/newrelic_sysmond

The privileged flag and the bindings to the host directories are necessary, because we need to be able to watch the Docker daemon and collect the Docker metrics.

Note that if you also want to collect memory stats from the containers, it is necessary to enable memory accounting in the kernel. The procedure is explained in the Docker documentation, but it really comes down to updating the bootloader and restarting. In the case of GRUB, you would need to add this line to /etc/default/grub:

GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

Then you need to update GRUB; on Debian-based systems this is typically done with:

sudo update-grub

After a restart of the server, the Docker memory statistics should be present on the server dashboard.
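
If the statistics do not show up, a quick way to verify on the host that memory accounting is actually active (a generic Linux check, nothing specific to NewRelic):

grep memory /proc/cgroups   # the last column (enabled) should be 1
docker info | tail -n 5     # warnings about missing memory/swap accounting typically show up at the end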


Spatial Data Mining

Social media streams may generate massive clouds of geolocated points, but how can we extract useful information from these, sometimes huge, datasets? I think machine learning and GIS can be helpful here.

My PechaKucha talk at DataBeers: “Visualizing Geolocated Tweets: a Spatial Data Mining Approach”.

Cluster Explorer Demo

Cluster Explorer is a piece of software I was working on, which blends machine learning and GIS to produce a cluster visualization on top of a virtual globe.
The virtual globe uses NASA WorldWind, a Java framework for 3D geo-visualization based on OpenGL, while the clustering algorithms use the ELKI data mining framework for unsupervised learning.
The tool allows you to cluster a bunch of points (for instance geolocated tweets) using one of two algorithms (or both), and to explore the contextual information provided by the geographic layers (e.g. OSM, Bing).

Driving Spatial Analysis Beyond the Limits of Traditional Storage

My presentation at the “Conference on Advanced Spatial Modelling and Analysis” consisted of some thoughts regarding Big Spatial Data, and how to handle it with modern technologies.

It was great to see such a motivated crowd from all generations, and to get to know the research developed by CEG, in topics such as Agent Based Modelling and Neural Networks. It was also great to talk again to Arnaud Banos, from the Complex System Institute of Paris Ile-de-France (ISC-PIF).



Recently I had another challenge, which I believe is a fairly common problem: I had a table of attributes in CSV format, one of which is geospatial.

CSV is a structured format for storing tabular data (text and numbers), where each row corresponds to a record and each field is separated by a known character (generally a comma). It is probably one of the most common formats for distributing data, not least because it is a standard output format of relational databases.

Since people often hand me data in this format, and for a number of reasons it is more convenient for me to use JSON, I thought it would be handy to have a method for translating CSV into JSON; this was the first milestone of this challenge.

The second milestone of this challenge is that there is some geospatial information within this data, serialized in a non-standard format, and I would like to convert it into a standard JSON format for spatial data, i.e. GeoJSON. So the second milestone actually has two parts:

  • parse a GeoJSON geometry from the CSV fields
  • pack the geometry and the properties into a GeoJSON feature

To convert CSV (or XML) to JSON, I found this really nice website. It lets you upload a file and save the results into another file, so I could transform this:
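
The original CSV is not reproduced here, but judging from the fields that show up below it would have looked roughly like this (a reconstructed example; the extra columns used later, such as ROADNUMBER, DIR, PROV and CCAA, are left out):

TMC,StartLatitude,StartLongitude,EndLatitude,EndLongitude
E17+02412,41.5368273,0.4387071,41.5388396,0.4638462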


into this:

"TMC": "E17+02412",
"StartLatitude": "41.5368273",
"StartLongitude": "0.4387071",
"EndLatitude": "41.5388396",
"EndLongitude": "0.4638462"

This gave me a nicely formatted JSON output (the first milestone!), but as you may notice the geometry does not conform to any OGC standard. It is actually a linestring, defined by a start point (StartLongitude, StartLatitude) and an end point (EndLongitude, EndLatitude).

According to the GeoJSON spec, a linestring is defined by an array of coordinate positions (longitude first, then latitude).
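
The canonical example from the specification looks like this:

{ "type": "LineString",
  "coordinates": [ [100.0, 0.0], [101.0, 1.0] ]
}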

So the goal would be to transform the geometry above into:

"coordinates": [
[0.4387071, 41.5368273], [0.4638462, 41.5388396]

Once more, jq comes in really handy for this task.

The JSON can be transformed into a feature using this syntax:

cat tramos.json | jq -c '[.[] | { type: "Feature", "geometry": {"type": "LineString","coordinates": [ [.StartLongitude, .StartLatitude| tonumber], [ .EndLongitude, .EndLatitude | tonumber] ] }, properties: {tmc: .TMC, roadnumber: .ROADNUMBER, dir: .DIR, prov: .PROV, ccaa: .CCAA}}]' > tramos.geojson

Since the JSON converter parses all the values into strings, it is important to apply a filter (tonumber) to make sure that the coordinates are converted back into numbers.
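
As a quick illustration of what the tonumber filter does on its own (a toy value, not taken from the dataset):

echo '"41.5368273"' | jq 'tonumber'   # prints 41.5368273 as a number, without the quotes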

"properties": {
"ccaa": "CATALUNYA",
"prov": "LLEIDA",
"roadnumber": "A-2",
"tmc": "E17+02413"
"geometry": {
"coordinates": [
"type": "LineString"
"type": "Feature"

Since we are creating an array of features (a “FeatureCollection”), to conform with GeoJSON it is important to declare the root element too, by wrapping everything in this outer element:

{ "type": "FeatureCollection","features": [ ]}

The result should be a valid GeoJSON file, which you can view and manipulate in your favourite GIS (for instance QGIS!) 🙂


JSON to GeoJSON with jq

A lot of people and institutions have already made the jump to providing data in JSON, which is great, since it is an interoperable standard and a semi-structured form of data. However, when it comes to geographic data, standards don’t seem to be so common. I have seen many different ways of encoding geospatial information within JSON, normally involving an array of coordinates, with or without named fields for lat and long. Rarely is there any CRS associated with the data (which could be OK, in case it uses WGS84), or any mention of the geometry type.

This information is more or less useless without some pre-processing to convert it into a “GIS-friendly” format that we could use in QGIS, GeoServer, or even R.

Since we are dealing with JSON, the natural thing would be to convert it into GeoJSON, a structured format for geographic data. And the perfect tool for doing this is jq, which I mentioned in a previous post. To make it simpler to understand, I will explain what I did with a specific JSON dataset, but with some knowledge of jq (and GeoJSON) you could apply this to literally any JSON dataset with geographic information within it.

My test dataset was the description of a set of roads from the city of Zaragoza.


The description of the dataset says that it is in “Google” format, which one could erroneously interpret as Spherical Mercator, but the name of the file suggests WGS84, and a quick look at the coordinates confirms that. This is literally a list of tracks, each one containing a list of coordinates that define the geometry. Let us look at an example:

  "points": [                                                                                                                                        
      "lon": -0.8437921499884775,                                                                                                                    
      "lat": 41.6710232246183
      "lon": -0.8439686263507937,                                                                                                                    
      "lat": 41.67098172145761
      "lon": -0.8442926556112658,                                                                                                                    
      "lat": 41.670866465890654
      "lon": -0.8448464412455035,                                                                                                                    
      "lat": 41.67062949885585
      "lon": -0.8453763659750164,                                                                                                                    
      "lat": 41.67040130061031
      "lon": -0.8474617762602581,
      "lat": 41.669528132440355
      "lon": -0.8535340031154578,
      "lat": 41.66696540067222
  "id": 5

So the task here would be to convert this into a GeoJSON geometry (a linestring). For instance:

  { "type": "LineString",
    "coordinates": [ [100.0, 0.0], [101.0, 1.0] ]

In jq, we want to loop through the array of roads and parse the lon, lat coordinates of each road object. These coordinates are themselves another array. If we do something like this:

cat tramoswgs84.json | jq  '.tramos[2]| .points[].lon,.points[].lat'

We are asking for the longitude and latitude coordinates of track 2, but since jq evaluates expressions from left to right, it will give us back the array of longitude coordinates followed by the array of latitude coordinates, not the pairs.

The key is to use map, which will run the filter for each element of the array:

cat tramoswgs84.json | jq  '.tramos[2].points| map([.lon,.lat])'
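
The difference is easy to see on a toy input (values made up just for this illustration):

echo '{"points":[{"lon":1,"lat":2},{"lon":3,"lat":4}]}' > toy.json
cat toy.json | jq -c '[.points[].lon, .points[].lat]'   # [1,3,2,4]: two flat lists, not pairs
cat toy.json | jq -c '.points | map([.lon,.lat])'       # [[1,2],[3,4]]: the coordinate pairs we want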

The complete jq syntax for generating one linestring object would be:

cat tramoswgs84.json | jq  -c '.tramos[1]|  {"type": "LineString", "coordinates": .points | map([.lon,.lat])}'

The next step would be to create a GeoJSON file containing the entire collection of linestrings. Since we would like to attach attributes to them (“name” and “id”), we would rather generate a “feature collection” than a “geometry collection”. The code for generating each feature would be:

cat $1 | jq  -c '.tramos[]| {"type": "Feature","geometry": {"type": "LineString", "coordinates": .points | map([.lon,.lat])},"properties":{"name": .name, "id": .id}}' >> $2

And then we need to do a few text-manipulation operations that I could not find a way of performing with jq. Basically, we need to add the opening tags for the feature collection, commas between the objects, and then the closing tags for the feature collection.

I did these text-manipulation tasks with sed, and put everything inside a shell script that transforms a JSON file (in the format that I described) directly into a valid GeoJSON. If you are interested, you can get it from GitHub. The resulting file can be fed to QGIS, in order to produce pretty maps like the one below 🙂
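
For illustration only, a minimal sketch of that wrapping step could look like this; it is my own simplification, not the script on GitHub, and it assumes the features were appended one per line to a file called features.json:

echo '{ "type": "FeatureCollection", "features": [' > roads.geojson
sed '$!s/$/,/' features.json >> roads.geojson   # append a comma to every feature line except the last
echo ']}' >> roads.geojson

Another option would be to read the one-feature-per-line file back with jq -s (slurp), which collects all inputs into a single array and avoids the sed step altogether.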


Piping an API into R: a Data Science Workflow

Inspired by @jeroenhjanssens, author of the Data Science Toolbox, I decided to give a go to one of the most unfriendly data sources: an XML API.
Apart from its rich syntax with query capabilities, I tend to think XML is highly verbose and human-unfriendly, which is quite discouraging if you don’t want to take advantage of all those capabilities. And in my case I didn’t: I just wanted to grab a data stream, in order to be able to build some analysis in R. APIs are generally a pain for data scientists, because they tend to want to have “a look at things” and get a general feeling for the dataset before starting to build code. Normally this is not possible with an API, unless you use those high-end drag-and-drop interfaces, which are generally costly. But following this approach I was able to set up a chain of tools that enables me to reproduce this AGILE workflow, where you can get a feel for the dataset in R, without having to write a Python client.

The first step was to pipe the XML output of the query into a file, and that is easy enough to do with curl:

curl -s 'http://someurl.com/Data/Entity.ashx?Action=GetStuff&Par=59&Resolution=250&&token=OxWDsixG6n5sometoken' > out.xml

Now, if you are an XML wiz you can follow a different approach, but I personally feel more comfortable with JSON, so the next step for me was to convert the XML dump into some nice JSON, and fortunately there is another free tool for that too: xml2json

xml2json < out.xml > out.json

Having the JSON, it is possible to query it using jq, a command-line JSON parser that I find really intuitive. With this command, I am able to narrow the dataset down to the fields I am interested in, and pipe the results into another text file. In this case I am skipping all the “headers” and grabbing an array of elements, which is what I want to analyse.

cat out.json | jq '[.Root.ResultSet.Entity[] | {color: .color, width: .with, average: .average, reference: .reference, Time: .Time}]' > test.json

Now here I could add another step to convert the JSON results into CSV, but R actually has interfaces to JSON, so why not use those to import the data directly? There is more than one package that can do this, but I had some nice results with jsonlite.

library(jsonlite)
data1 <- fromJSON("test.json")

And with these two lines of code, I have a data frame that I can use for running ML algorithms.