Piping an API into R: a Data Science Workflow

Inspired by @jeroenhjanssens, author of the Data Science Toolbox, I decided to give a go to one of the most unfriendly data sources: An XML API.
Apart from its rich syntax with query capabilities, I tend think XML is highly verbose and human unfriendly, which is quite a discouraging if you don’t want to take advantage of all its capabilities. And in my case I didn’t: I just wanted to grab a data stream, in order to be able to build some analysis in R. APIs are generally a pain for data scientists, because they tend to want to have “a look at things” and get a general feeling of the dataset, before start building code. Normally, this is not possible with an API, unless you use these high-end drag-and-drop interfaces, that are generally costly. But following this approach I was able to setup a chain of tools that enable me to reproduce this AGILE workflow, where you can have a feel of the dataset in R, without having to write a Python client.

The first step was to pipe the xml output of the query into a file, and that is easy enough to do with curl

curl -s 'http://someurl.com/Data/Entity.ashx?Action=GetStuff&Par=59&Resolution=250&&token=OxWDsixG6n5sometoken' > out.xml

Now, if you are an XML wiz you can follow a different approach, but I personally feel more comfortable with JSON, so the next step for me was to convert the XML dump into some nice JSON, and fortunately there is another free tool for that too: xml2json

xml2json < out.xml > out.json

Having the JSON, it is possible to query it using jq, a command line JSON parser that I find really intuitive. With this command, I am able to narrow the dataset to the fields I am interested, and pipe the results into another text file. In this case I am skipping all the “headers”, and grabbing an array of elements, which is what I want to analyse.

cat out.json | jq '[.Root.ResultSet.Entity[] | {color: .color, width: .with, average: .average, reference: .reference, Time: .Time}]' > test.json

Now here I could add another step, to convert the JSON results into csv, but actually R has interfaces to JSON, so why not use those to import the data directly. There is actually more than one package that can do this, but I had some nice results with jsonlite.

library("jsonlite")
data1 <- fromJSON("test.json")

And with these two lines of code, I have a data frame that I can use for running ML algorithms.

df

Advertisements

4 thoughts on “Piping an API into R: a Data Science Workflow

  1. I ‘d like to make 2 comments:

    – For xml parsing in the command line, instead of transforming to json and then use jq, i would rather use xmlstarlet (http://xmlstar.sourceforge.net) directly. In fact what you do with json and the jq tool corresponds to an XPath query in the original xml document

    – You can use jq and produce a CSV (which is more ..R friendly) as follows:

    out.csv

    cheers
    Stelios

  2. Hi Stelios, Thanks a lot for your comment: the reason why I did parse the content in JSON, rather than XML, its because I am more comfortable with the first, and being comfortable with a tool sometimes represents more efficiency, even when you have to do more things. However this may useful for other readers, and I def think it is good to choose from a wide range of options. Regarding the csv, R as drivers for reading JSON and I suspect also XML, but being a more “universal” format its good to have this option, also for other programs. Once again, thanks for contributiing with your suggestions! 🙂

  3. Pingback: JSON to GeoJSON with jq | heartcode

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s