JSON to GeoJSON with jq

A lot of people and institutions have already made the jump to providing data in JSON, which is great, since it is an interoperable standard and a semi-structured form of data. However, when it comes to geographic data, standards don’t seem to be so common. I have seen many different ways of encoding geospatial information within JSON, normally involving an array of coordinates, with or without name fields for lat and long. Rarely is there any CRS associated with this data (which could be OK, in the case that it uses WGS84), or any mention of the geometry type.

This information is more or less useless without some pre-processing to convert it into a “GIS-friendly” format that we could use in QGIS, GeoServer, or even R.

Since we are dealing with JSON, the natural thing would be to convert it into GeoJSON, a structured format for geographic data. And the perfect tool for doing this is jq, a tool that I mentioned in a previous post. To make it simpler to understand, I will explain what I did with a specific JSON dataset, but with some knowledge of jq (and GeoJSON) you could apply this to virtually any JSON dataset with geographic information within it.

My test dataset was the description of a set of roads from the city of Zaragoza:

http://www.zaragoza.es/trafico/estado/tramoswgs84.json

The description of the dataset says that it is in “Google” format, which one could erroneously interpret as Spherical Mercator, but the name of the file suggests WGS84, and a quick look at the coordinates confirms that too. This is literally a list of tracks, each one containing a list of coordinates that define the geometry. Let us look at an example:

{
  "points": [
    {
      "lon": -0.8437921499884775,
      "lat": 41.6710232246183
    },
    {
      "lon": -0.8439686263507937,
      "lat": 41.67098172145761
    },
    {
      "lon": -0.8442926556112658,
      "lat": 41.670866465890654
    },
    {
      "lon": -0.8448464412455035,
      "lat": 41.67062949885585
    },
    {
      "lon": -0.8453763659750164,
      "lat": 41.67040130061031
    },
    {
      "lon": -0.8474617762602581,
      "lat": 41.669528132440355
    },
    {
      "lon": -0.8535340031154578,
      "lat": 41.66696540067222
    }
  ],
  "name": "AVDA. CATALUÑA 301 - RIO MATARRAÑA -> AVDA. CATALUÑA 226",
  "id": 5
}

So the task here would be to convert this into a GeoJSON geometry (a LineString). For instance:

  { "type": "LineString",
    "coordinates": [ [100.0, 0.0], [101.0, 1.0] ]
    }

In jq, we want to loop through the array of roads and parse the lat/long coordinates of each road object. These coordinates are themselves another array. If we do something like this:

cat tramoswgs84.json | jq  '.tramos[2]| .points[].lon,.points[].lat'

We are asking for the longitude and latitude coordinates of track 2, but since jq evaluates expressions from left to right, it will give us back the array of longitude coordinates followed by the array of latitude coordinates, not the pairs.

The key thing is to use map, which will run the filter for each element of the array:

cat tramoswgs84.json | jq  '.tramos[2].points| map([.lon,.lat])'
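Just to see the difference between the two filters, here is a quick check with a tiny, made-up track (the coordinates are illustrative, not from the dataset):

echo '{"points":[{"lon":1,"lat":2},{"lon":3,"lat":4}]}' | jq '.points[].lon, .points[].lat'
# 1
# 3
# 2
# 4    <- all the longitudes first, then all the latitudes
echo '{"points":[{"lon":1,"lat":2},{"lon":3,"lat":4}]}' | jq -c '.points | map([.lon,.lat])'
# [[1,2],[3,4]]    <- the coordinate pairs that we want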

The complete jq syntax for generating one LineString object would be:

cat tramoswgs84.json | jq  -c '.tramos[1]|  {"type": "LineString", "coordinates": .points | map([.lon,.lat])}'

The next step would be to create a GeoJSON containing the entire collection of linestrings. Since we would like to attach attributes to them (“name” and “id”), we’d rather generate a “feature collection” than a “geometry collection”. The code for generating each feature would be:

cat $1 | jq  -c '.tramos[]| {"type": "Feature","geometry": {"type": "LineString", "coordinates": .points | map([.lon,.lat])},"properties":{"name": .name, "id": .id}}' >> $2

And then we need to do a few text manipulation operations that I could not find a way of performing with jq. Basically, we need to add the opening tags for the feature collection, commas between each object, and then the closing tags for the feature collection.
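Something along these lines would do it (a simplified sketch of the idea, not the actual script; $1 is the input JSON and $2 the output file):

#!/bin/sh
# open the feature collection:
echo '{"type": "FeatureCollection", "features": [' > "$2"
# reuse the jq filter above; sed appends a comma to every line except the last:
jq -c '.tramos[]| {"type": "Feature","geometry": {"type": "LineString", "coordinates": .points | map([.lon,.lat])},"properties":{"name": .name, "id": .id}}' "$1" | sed '$!s/$/,/' >> "$2"
# close the feature collection:
echo ']}' >> "$2"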

I did these text manipulation tasks with sed, and put everything inside a shell script along those lines, which transforms the JSON file (in the format that I described) directly into a valid GeoJSON. If you are interested, you can get it from github. The resulting file can be fed to QGIS, in order to produce pretty maps, like the one below 🙂

[Image: roads_zgz]


Data Mining | Machine Learning

Together with a colleague, I have been involved in the “hard” task of drafting a diagram (or a “mindmap”) that would logically connect some of the “buzz words” around “data science”; e.g.: artificial intelligence, machine learning, data mining, recommenders. Moreover, we wanted to provide a classification that would organize the different “algorithmic families” into some sort of typology. A hard task, I know, mostly because there are many possible classifications, depending on the approach we want to take; e.g.: by learning method, or by task. We ended up not with one diagram, but with two, separating “data mining” and “machine learning”, in order to explain them better.

In the “Data Mining” diagram, we include a general distinction between “descriptive” and “predictive” data mining, and within these two, we follow with subdivisions that finish in data mining techniques that may or may not belong to machine learning (e.g.: statistics). At the bottom of the diagram, we represent the generic data mining applications that make use of these techniques. One key difficulty in drafting this diagram is the fact that some techniques can include other techniques, and it is not easy to reflect that in the diagram. For instance, machine learning techniques typically make use of descriptive statistics such as dispersion or central tendency.

[Image: Data Mining diagram]

In the “Machine Learning” diagram we went for a more “scientific” view (less problem-oriented), and tried to show how machine learning fits into the broader field of “Artificial Intelligence”. Then we took the “learning approach” as a way of classifying ML techniques. At the leaves of this tree, as well as at the leaves of the “Data Mining” tree, there are examples of techniques/algorithms relevant to the specific types; it is not an *exhaustive* list of algorithms, nor does it claim to select the most *important* algorithms (if there is such a thing…); sometimes the criteria for choosing an algorithm were greatly *subjective*: because we worked with or read about it, or even because it was the only example we could find…

[Image: Machine Learning diagram]

Clearly, there is some degree of overlap between the two diagrams. Machine learning is part of data mining, and therefore some algorithmic “families” are presented in both diagrams. However, we believe that in this way it becomes easier to describe what “machine learning” is, as a scientific discipline, and how it “fits & mixes” within the “wide umbrella” of data mining.

These diagrams were based on a lot of reading (mostly blogs), on our own knowledge, and a lot of discussion. They are not “written in stone”, and I don’t even know if it is possible to have such a thing for a topic that is so difficult to classify, either because it is evolving so fast or because it is often very “fuzzy”. In any case, any (constructive) criticism or commentary regarding ways of improving these diagrams, or even just some thoughts, would be greatly appreciated.

Ubuntu 4 Beginners

After installing Ubuntu three times in the past few months, and after having many requests to do it again, I have finally decided to put it all together in a workshop. It is going to be next Saturday, in Barcelona, in my favourite co-working space. And it is “free”, as in beer and as in GNU/Linux 🙂

The “official” announcement will be tomorrow, I think, but you can be the first to read it here 😉

UPDATE: THIS WORKSHOP HAS BEEN POSTPONED!

Ubuntu 4 Beginners

Did you ever think about installing Ubuntu, but never actually had the “courage” to do it alone? Then this workshop is for you.

[Image: tux1]

In the first part, I will introduce the GNU/Linux Operating System, explaining some basic concepts and showing some applications.
The second part will focus mainly on the installation process of Ubuntu, and I will install it “live” on a virtual machine.
At the end of the session I can help people who are interested to perform the installation on their own computers. Note that this will be *at their own risk*!

Target Audience:
This workshop targets people with a limited knowledge of *Nix systems, although some proficiency in using computers would be nice.
If you are a proficient *nix user or developer, and are interested in specific parts of the OS (such as the kernel), you may be interested in a more advanced workshop. If you are wondering what a *nix user is, please come: this workshop is for you 🙂

If you have Ubuntu installed on your laptop, or you are planning to install it at the end of the workshop, you may bring it with you. Otherwise, laptops are not required.

[Image: ubuntu_banner]

Practical Info:
The duration of the workshop is approximately 2 hours (11:30h-13:30h), including a 10-minute break. Note that this is a free workshop, but you do *need to register* in order to attend. Please do it by filling in this form: it should only take 2 minutes.
For practical reasons, I will limit the number of participants to 20, on a “first come, first served” basis.

This workshop is hosted by MOB/Made (Calle Bailen 11, Bajos. 08010 BCN), and all donations collected by the bitcoin wallet below will be given to Made, a non-profit organization.

1GcD6YZLMvV4cNv7WckS22FJKMNdtjQJPE

[Image: Bitcoin]

Static Linking?

From time to time, I have these moments when I cannot deploy my application properly and decide that I want to link it statically (then I generally give up, because it requires linking the Qt libraries statically…). But is it really better to prefer static over dynamic linking?

As in so many other cases, it depends on what you want to do. I have read that in terms of performance there are trade-offs in both approaches, so in the end it does not matter so much. From my point of view, the biggest advantage of static linking is that you can ship one single file with your application, removing the risk of “broken” dependencies. That is quite an advantage in terms of deployment!

On the other hand, if everybody linked statically, we would literally have “thousands” of libraries “repeated” across our systems, packed inside “huge” binaries. It does not make much sense, does it?

Dynamic libraries are also “cool”, because we can (to a certain extent) replace them with newer (improved) versions, without having to recompile our application. That is a huge benefit in terms of “bug fixing” of third-party libraries.
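To make the difference concrete, here is a minimal sketch with g++ (assuming a trivial main.cpp; the -static flag asks the linker to embed the libraries into the binary):

g++ main.cpp -o app_dynamic          # default behaviour: link against shared libraries
g++ -static main.cpp -o app_static   # embed everything into one self-contained binary
ldd app_dynamic                      # lists the shared libraries needed at runtime
ldd app_static                       # should report "not a dynamic executable"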

Setting the performance issue aside, my verdict would be:

  • For myself, I would like to minimize resource consumption by using shared libraries (dynamic linking) as much as possible.
  • For “bullet proof” systems, where users are not experienced in installing software and are likely to “mess up” the system by removing parts of it, I would instead consider providing statically compiled versions of the software. The binaries will likely be “bigger” (although there are tools to minimize this, such as UPX; see the sketch below) and a bit more “hungry” for resources, but this is also the only way to prevent DLL hell.
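About the size issue: compressing the statically linked binary from the sketch above with UPX is a one-liner (assuming the upx package is installed):

upx --best app_static   # compresses the executable in place; it still runs, unpacking itself in memory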

Finally, it is important to mention that the type of linking may be conditioned by licensing issues. For instance, due to the “nature” of the license, GPL libraries would “contaminate” any software linked with them, statically or dynamically; it is the LGPL that draws a distinction, being more permissive towards dynamic linking.

“Hello World” Cross-Compiling

I recently switched to work on a Linux environment, in a project where most of the potential users will be using Windows.
Theoretically that is not a problem, since I am sticking to standard C++, and a few other libraries that are fully cross-platform (e.g.: Qt[1], Boost[2]).
Of course in practice, things are always slightly more complicated 🙂

Since my original project was created in Windows, using nmake, I still have a functional environment that I can use. However, this solution requires me to:

  1. start Windows;
  2. checkout the latest source code;
  3. tweak the configuration for the Windows environment;

It is mainly 1. and 3. that bother me; I don’t start Windows that often, and there are quite a few things to tweak when you change from GNU g++ to MS nmake[3].

There are many options to make things a bit “smoother”. One of them would be to switch to Jom[4] or MinGW, so that the same toolchain could be used in both OSes, with minimal tweaks. Another option would be to create a repository only for the source code, and leave the configuration files unchanged, both in Windows and in Linux. These are all things that I want to explore, but for the moment I gave in to my “fascination” with cross-compiling, and decided to investigate it a bit further. Wouldn’t it be so cool to be able to generate a Windows binary straight from Linux? It seems like the perfect option, if only it were that easy!

To get a functional cross-compiling environment, you will basically need Wine[5] and MinGW[6]. In a nutshell, Wine is a compatibility layer capable of running Windows applications on several POSIX-compliant operating systems; the acronym says everything: “Wine Is Not an Emulator”. MinGW is a minimalist development environment for native Microsoft Windows applications. Installing this software on Ubuntu is as easy as:

sudo apt-get install wine mingw32 mingw32-binutils mingw32-runtime
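A quick sanity check that the cross-compiler landed on the PATH (the exact version string will vary with your release):

i586-mingw32msvc-g++ --version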

This should be it; so I started with a simple example that I grabbed from [7]:

#include <windows.h>

int APIENTRY WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,
    LPSTR lpCmdLine, int nCmdShow)
{
  MessageBox(NULL,
    "Cette fenêtre prouve que le cross-compilateur est fonctionnel !",
    "Hello World", MB_OK);
  return 0;
}

I compiled it like this, using MinGW:

i586-mingw32msvc-g++ -o essai.exe essai.cpp
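A handy way of confirming that we really produced a Windows binary is the file command, which should report a PE32 executable (rather than a Linux ELF one):

file essai.exe
# essai.exe: PE32 executable (GUI) Intel 80386, for MS Windows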

To execute the binary, you will need the MinGW runtime library, mingwm10.dll, which can be extracted to the current directory like this:

gunzip -c /usr/share/doc/mingw32-runtime/mingwm10.dll.gz > mingwm10.dll

Then the application can be launched with:

wine essai.exe

[Image: hello2]

References:
[1] http://qt-project.org/
[2] http://www.boost.org/
[3] http://msdn.microsoft.com/en-us/library/dd9y37ha.aspx
[4] http://qt-project.org/wiki/jom
[5] http://www.winehq.org/
[6] http://mingw.org/
[7] http://retroshare.sourceforge.net/wiki/index.php/Ubuntu_cross_compilation_for_Windows

Developing with Qt Creator

I recently decided to switch to developing on Linux, and to switch from MS Visual Studio to Qt’s official cross-platform IDE: Qt Creator. I have to say it is quite a responsibility to replace Visual Studio, because from my personal experience Visual Studio is one of the best IDEs out there (and also the best Windows application).

First things first: migrating a Qt-based project from Visual Studio to Qt Creator is not really a headache. First of all, generate the project file (.pro) in Visual Studio (if you don’t have it already), and be sure to “tune” it as you like! Next, you can import it into Qt Creator as a project file, and everything should run out-of-the-box.
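By the way, if the project file never existed, qmake itself can bootstrap one from the sources in the current directory (myproject.pro is of course a name of your choosing):

qmake -project -o myproject.pro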

One thing that deserves attention is the build configurations. You need to add at least one build configuration for the Qt version in use. You can do that by going to “Projects” on the left side pane. By default, Qt Creator will create a “debug” and a “release” mode, so you will only need to set up one configuration. “Shadow builds” are builds located outside the source directory, which as a matter of principle is a “tidy” idea; to enable them, you just have to check the “shadow build” checkbox under each mode. Regarding the run configuration, I would think it is a good idea to point it to the debug binary, if you ever want to debug your application (I bet you will). And this brings me to the last topic: debugging in Qt Creator.

The fact that there is not much documentation about Qt Creator (apart from the fancy “welcome wizards”), and the lack of profiling and analysis tools compared to MS Visual C++, does not bother me too much. But what really bothered me was getting the debugger to work, and since I almost spent a morning on that, I will write it down here and hope it will be useful to someone.

In a nutshell, Qt Creator uses the Linux “official” compiler, g++, and the official debugger, gdb. As a general concept I think that is great: why develop new tools, when you have working (great) ones?
Now, getting gdb to work correctly with Qt Creator is another story. I installed the latest version from the Ubuntu repositories:

sudo apt-get install gdb

It did not really work very well. Then I read that Qt Creator uses Python for debugging and that gdb had to be compiled with Python enabled, so I found a version of gdb released by the Qt Creator project, and switched to that one. It did not work very well either. Finally, after many combinations, I discovered that it was actually Python that had to be “gdb-enabled”. So I installed this version of Python:

sudo apt-get install python2.7-dbg

And “asked” my system to use that one as a default:

sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.7-dbg 30
sudo update-alternatives --config python

(the exact commands may differ on your system).
Finally, I tested the Python support in gdb by calling it from the command line and entering a Python command:

GNU gdb (GDB) 7.5-ubuntu
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
(gdb) python print 123
123
(gdb)

If you see the “123” echoed at the gdb prompt, everything should be OK. Finally, I tested gdb with my application, running it from the command line:

gdb faocasd

If all is running OK, then you can go back to Qt Creator and start the application in debug mode (by pressing F5). I encountered a few other small problems, but I was able to debug. It is important to mention that the debugger does not always pick up the changes, for which I advise you to always recompile your project! (I know, it may be annoying for a large project.)

All is well, when it finishes well: here is my “hello world” Qt Creator debug! 🙂

[Image: Hello Debug! :-)]

#%^~***!~@_@?

Recently I contacted a hacker group, and I have been excited about joining their meetings (I’ll disclose the name later).

When I say “hacker”, I mean it in the good and original sense of the word, mostly “white hat”.

I have been thinking about a couple of topics that I would like to bring up in a meeting:

  • Backing up data: what is the best way to back up data? A RAID configuration, an external USB drive, the cloud, a computer “hotel”? I am looking for the characteristics of different solutions: price, efficiency, safety, privacy level, etc.
  • Online Privacy: Google spies on you, Twitter spies on you, Facebook spies on you. I basically would like to assemble a set of tools (preferably FOSS) that would allow a normal person to have an online presence, safeguarding their privacy; e.g.: use DuckDuckGo instead of Google Search, use Opera instead of GMail, etc. I think these tools/recommendations should cover at least one case in each of the following categories: blog, microblog, social network, webmail, POP mail, search engine (more recommendations are welcome!)

I am looking forward to hearing comments and ideas about this. Please leave your comments here, or tweet me @doublebyte 🙂

>REGEX!W\

For a long time, I have been thinking about learning a bit of regex, and today I just found the perfect excuse.

I have 277 strings like this:

ALTER TABLE Sampled_Strata_Vessels ADD CONSTRAINT DF_Sampled_Strata_Vessels_description DEFAULT(('missing')) FOR description

And I want to turn them into something like this:

ALTER TABLE Sampled_Strata_Vessels ALTER description SET DEFAULT ('missing');

(yes, I am converting constraints from SQL Server to PostgreSQL!)

I had a couple of options here:

– change them by hand (that is: 277 times!)

– write a C++ program to parse the strings and generate the results (or even better: a Python program!)

Somehow I did not feel like doing either of those things, so I decided to do something that took even longer, but hopefully made me gain something: learn how to write a regular expression and apply it in Notepad++.

Regular expressions look both scary and magical; it took me a couple of hours to understand what groups are, how to search for literals, etc.

At the end of the day I ended up with a smile on my face, and these two expressions:

search:

(ADD)(.*)(DEFAULT)(.*)(FOR)(.*)

replace:

ALTER \6 SET DEFAULT \4;

It looks stupidly easy, but I cannot express how useful these expressions were for me, allowing me to modify 277 lines of text in a millisecond!

The first expression parses my sentence and separates it into groups: the first group is the word “ADD”; then comes anything up to “DEFAULT”; the literal “DEFAULT” itself; anything up to “FOR”; “FOR” itself; and finally the rest of the sentence (the last word, which is the column name). With these, I have all the blocks that I need to build my string.

So the replace expression kicks in right at “ADD” (it does not touch the beginning of the sentence), and substitutes the match with “ALTER”, the last group (\6) – the column name – then “SET DEFAULT” and the 4th group (\4), the default value.
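For the record, the same search-and-replace can also be scripted outside Notepad++; here is a sketch with sed, assuming the 277 statements live in a file called constraints.sql (a made-up name):

sed -E 's/(ADD)(.*)(DEFAULT)(.*)(FOR)(.*)/ALTER\6 SET DEFAULT\4;/' constraints.sql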

That’s it 🙂

Replication with PostgreSQL

PostgreSQL does not offer any built-in support for asynchronous multi-master replication. According to this official page, some providers implemented this functionality, so I decided to have a go with the only two that advertise lazy multi-master replication: rubyrep and Bucardo.
Rubyrep offers two versions: one using Ruby and another using Ruby and Java. It is advertised as being “really easy to use and database independent”, which are undoubtedly some cool features 🙂 I started with the simplest one (Ruby only) and sadly I was unable to run it under Linux 😦
I followed the instructions on this page, and in fact the installation is really easy; but when I tried to scan the databases for differences, with this command:

rubyrep scan -c myrubyrep.conf

I kept running into this exception (and various warnings):

Exception caught: PG::ConnectionBad: connection is closed: SELECT tablename
FROM pg_tables
WHERE schemaname IN ('"$user"','public')

This happened even when I reduced the conditions to the simplest case: two databases on the same server, with the same user!

After reading a bit, I realized there are some issues with the versions of Ruby (and other tools), and apparently the project has been inactive for a while (discontinued?). It is a shame, because it looks really simple, but I don’t think it is really worth putting a lot of effort into software that has been discontinued 😦

Bucardo is Perl-based, and it is an absolute nightmare to install. I followed the instructions on this page, and as they warn, the install script does not work, because it tries to create a “bucardo” user on the fly, but does not ask you to set a password! 😦
After fiddling with it for a while, I found a website that suggests that you edit the script (bucardo_ctl) and change the default password.
Some things that I would recommend to get Bucardo running:

  • set “permissive” authentication for the bucardo user in pg_hba.conf
  • add a bucardo system user, and set the credentials in .pgpass

After that, you can add the databases that you want to synchronize:

bucardo_ctl --dbhost=192.168.1.11 --dbpass=password add database test
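The target database of the sync has to be registered in the same way (mirroring the command above, with the same flags):

bucardo_ctl --dbhost=192.168.1.11 --dbpass=password add database test2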

Note that I was only able to run bucardo_ctl when passing the entire set of arguments (host, password).
The next step is to add the sync; Bucardo supports three types of sync algorithms: fullcopy, pushdelta and swap.

bucardo_ctl add sync sync-delta source=test targetdb=test2 type=pushdelta tables=distributors

After that, you can start the Bucardo daemon, and wait for the synchronization:

bucardo_ctl --dbhost=192.168.1.11 --dbpass=password start
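To check whether the daemon actually picked up the sync, there is also a status command (assuming, again, that the connection flags have to be passed explicitly):

bucardo_ctl --dbhost=192.168.1.11 --dbpass=password status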

The only method that supports circular (multi-master) replication is “swap”, and it has some conflict detection/resolution. Sadly for me, it only works with exactly two databases 😦

My conclusion about the add-ons for supporting lazy replication in PostgreSQL is that they are both hard to install/configure, and Bucardo (which seems the more mature and lively project) is still limited to two masters. I did not attempt more complicated scenarios, such as running databases on different hosts.