Piping an API into R: a Data Science Workflow

Inspired by @jeroenhjanssens, author of the Data Science Toolbox, I decided to give a go to one of the most unfriendly data sources: An XML API.
Apart from its rich syntax with query capabilities, I tend think XML is highly verbose and human unfriendly, which is quite a discouraging if you don’t want to take advantage of all its capabilities. And in my case I didn’t: I just wanted to grab a data stream, in order to be able to build some analysis in R. APIs are generally a pain for data scientists, because they tend to want to have “a look at things” and get a general feeling of the dataset, before start building code. Normally, this is not possible with an API, unless you use these high-end drag-and-drop interfaces, that are generally costly. But following this approach I was able to setup a chain of tools that enable me to reproduce this AGILE workflow, where you can have a feel of the dataset in R, without having to write a Python client.

The first step was to pipe the xml output of the query into a file, and that is easy enough to do with curl

curl -s 'http://someurl.com/Data/Entity.ashx?Action=GetStuff&Par=59&Resolution=250&&token=OxWDsixG6n5sometoken' > out.xml

Now, if you are an XML wiz you can follow a different approach, but I personally feel more comfortable with JSON, so the next step for me was to convert the XML dump into some nice JSON, and fortunately there is another free tool for that too: xml2json

xml2json < out.xml > out.json

Having the JSON, it is possible to query it using jq, a command line JSON parser that I find really intuitive. With this command, I am able to narrow the dataset to the fields I am interested, and pipe the results into another text file. In this case I am skipping all the “headers”, and grabbing an array of elements, which is what I want to analyse.

cat out.json | jq '[.Root.ResultSet.Entity[] | {color: .color, width: .with, average: .average, reference: .reference, Time: .Time}]' > test.json

Now here I could add another step, to convert the JSON results into csv, but actually R has interfaces to JSON, so why not use those to import the data directly. There is actually more than one package that can do this, but I had some nice results with jsonlite.

data1 <- fromJSON("test.json")

And with these two lines of code, I have a data frame that I can use for running ML algorithms.



Programming Languages

Programming languages have been for long a matter of “taste”, usually with a strong “emotional” content tied to the choice (I am just remembering the “old” “Java vs C++” battle… 8-])

The truth is, if you are an efficient programmer, more concerned about “getting the job done” than in “fundamentalist passions”, you are more likely to use many languages, in different contexts; I guess that is what most of us do, anyway. There are no “right” or “wrong” choices, but there are more appropriated choices for a particularly task, and often, more than one.

The new trend now, seems to be “mix & match”, which for rapid application development and prototyping of scientific applications, may “work as a charm”. It is now possible to script many toolkits (R, MatLab, etc) using a language of your choice, so why make a redundant effort, when there are tools that are specialized on it?

This article approaches this question very well, I think.


Moreover I think Python can have a deserved protagonist role on this approach. It is a very flexible language, that is of course high-level, but with a low-level “feeling” (maybe because of the similar syntax to C?)

Inline with this approach I see a lot of potential on the IPython shell, and hope to blog a bit about it in the near future 🙂

P.S.: for GIS geeks, there is already a plugin for IPython in QGIS…