DevRel – What is that?

Almost a year ago, I heard the term DevRel for the first time when Sara Safavi, from Planet, gave a talk at CodeOp and used that word to describe her new role. I knew Sara as a developer, like myself, so I was curious to learn what this role entailed and understand how it could attract someone with a strong technical background.

It turns out that DevRel – Developer Relations – is as close as you can get to the developer world without actually writing code. All the things I used to do in my spare time, like participating in hackathons, writing blog posts, joining conversations on Twitter and speaking at events, are now the core of my job. I did them because they are fun, and also because I believe that, ultimately, writing code has an impact on society, and in order to run that last mile we need to get out of our compilers and reach out to the world. Technology is like a piece of art – it only fulfills its mission when it leaves the artist’s basement and reaches a museum, or at least the living room of someone who appreciates it.

I am happy to say that I am now the DevRel at the Open Geospatial Consortium. It is a bit ironic that I ended up taking this role in an organization whose main output is not actually software. But in a way OGC is the ultimate software facilitator, producing the standards that developers use to build their interoperable, geospatial-aware products and services. If you are reading this and you are not a geogeek, you may think of the W3C as a somewhat similar organization: it produces the HTML specification, which is not itself software, but how could we build all these front-end applications using React, Vue and so many other frameworks without HTML? It is that important. Now you may be thinking, “so tell me an OGC standard that I use, or at least know”, and, again, if you are not a geogeek, maybe you won’t know any of the standards I could mention, even if you use, or have used at some point, location data. And this is part of the reason why I am at OGC.

Location data is increasingly part of the mainstream. We all carry devices in our pockets that produce georeferenced data with an accuracy that was undreamed of ten years ago. Getting hold of these data opens a world of possibilities for data scientists and data engineers, but in order for all these applications to understand each other we need sound, well-articulated standards in place. My main goal as DevRel at OGC will be to bring the OGC standards closer to the developer community, by making them easier to use and by making sure that they are actually used. And maybe, just maybe, I will also get to write some code along the way.


Interactive Maps within React.js

Recently, I have been teaching a Full-stack development bootcamp at CodeOp (great experience!).

When the students reached the project phase, I was very pleased to see a lot of interest in using maps. And that is easy to understand, right? Geospatial information is associated with most activities these days (e.g. travel, home exchange, volunteering), and interactive maps are the backbone of any application that uses geospatial information.

This made me think of a nice way of introducing the students to interactive mapping. I realized that most of them wanted to do one thing: read an address and display it on the map, which also requires the use of a geocoder. To demonstrate how to put all these things together within a React application, which is the framework they are using, I created a small demo on GitHub. This was also an opportunity to practice and improve my front-end skills! 🙂

Following a good GitHub tradition, I started by forking an existing project that I thought was similar to what I wanted to achieve. Although that project is extremely cool, I realized that I wanted to move in quite a different direction, so I ended up diverging a lot from the original code base.

To implement the map, I used my favourite library for interactive maps, Leaflet. It is also available wrapped as React components (the react-leaflet package), so it is really easy to incorporate into a React application.
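
As a rough sketch of what that looks like (assuming react-leaflet version 2, where the top-level component is still called Map; the component name and the Barcelona-ish coordinates are my own choices):

    import React from 'react';
    import { Map, TileLayer, Marker } from 'react-leaflet';
    // Leaflet's own CSS also needs to be loaded, e.g. import 'leaflet/dist/leaflet.css';

    // Renders an OpenStreetMap base layer plus one marker per position in props.markers
    const SimpleMap = ({ markers }) => (
      <Map center={[41.3851, 2.1734]} zoom={12} style={{ height: '400px' }}>
        <TileLayer
          url="https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png"
          attribution="© OpenStreetMap contributors"
        />
        {markers.map((position, i) => (
          <Marker key={i} position={position} />
        ))}
      </Map>
    );

    export default SimpleMap;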

Of course, maps only understand coordinates, and most of the time people have nominal locations such as street names, cities, or even postcodes. This was also the case with my students. Translating strings with addresses into a pair of coordinates is not a trivial task, so the best thing is to leave it to the experts. I used the OpenCage geocoder, an API to convert coordinates to and from places. Why? It has a much more generous free tier than the Google Maps API, and it is built on open-source software and open data. And although it is built on top of OSM’s Nominatim, it contains several improvements.

The good news is that OpenCage also has a package for JavaScript and Node, and it is really easy to use. This is the piece of code that retrieves the coordinates from a given string:

    // Assumes: import opencage from 'opencage-api-client';
    // OCD_API_KEY holds the OpenCage API key (e.g. read from .env, see below).
    // Geocodes the address in this.state.input, adds a marker for the first
    // result and flies to it with an animation.
    addLocation = () => {
      opencage
        .geocode({ q: this.state.input, key: OCD_API_KEY })
        .then(data => {
          if (data.results.length > 0) {
            // Found at least one result: keep the best match
            console.log('Found: ' + data.results[0].formatted);
            const latlng = data.results[0].geometry;
            console.log(latlng);
            // Add the new marker without mutating the existing state array
            this.setState({ markers: [...this.state.markers, latlng] });
            // Grab the underlying Leaflet map instance and animate towards the result
            const mapInst = this.refs.map.leafletElement;
            mapInst.flyTo(latlng, 12);
          } else {
            alert('No results found!');
          }
        })
        .catch(error => {
          console.log('error', error.message);
        });
    };

In order to do this, you need to sign up for a free API key first, and store it within a secrets file (.env).
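
With create-react-app, for instance, the variable has to be prefixed with REACT_APP_ to be exposed to the client code; a minimal sketch (the variable name is my own choice):

    // .env (kept out of version control)
    // REACT_APP_OCD_API_KEY=your-opencage-key

    // In the component, read the key from the environment:
    const OCD_API_KEY = process.env.REACT_APP_OCD_API_KEY;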

The application allows the user to type any address, and it will fly to it with an animation, adding a marker on the map.

You can check out the final result at https://leaflet-react.herokuapp.com/


Spatial Data Mining

Social media streams may generate massive clouds of geolocated points, but how can we extract useful information from these, sometimes huge, datasets? I think machine learning and GIS can be helpful here.

My PechaKucha talk at DataBeers: “Visualizing Geolocated Tweets: a Spatial Data Mining Approach”.

Cluster Explorer Demo

Cluster Explorer is a piece of software I was working on, which blends machine learning and GIS to produce a cluster visualization on top of a virtual globe.
The virtual globe uses NASA WorldWind, a Java framework for 3D geo-visualization based on OpenGL, while the clustering algorithms use the ELKI data mining framework for unsupervised learning.
The tool lets you cluster a set of points (for instance, geolocated tweets) using one of two algorithms (or both), and explore the contextual information provided by the geographic layers (e.g. OSM, Bing).

Importing Large Spatial Datasets into PostGIS

Is PostgreSQL/PostGIS suitable for Big Data? This is a question I have been asking myself for a while, and I now have the opportunity to test it with a PostgreSQL instance from Amazon Web Services (AWS); after all, if such a service cannot deal with “Big Data”, who can?
I have a dataset of tweets in the city of Barcelona, collected over a period of time. The entire dataset amounts to about 3 million points (82 MB), and I have some subsets: a smaller collection of about 2 million points (55 MB) and one of 300,000 points (8.2 MB).


We can argue about whether or not this dataset is Big Data; but IMHO, a table with more than 1 million records starts to get “heavy” for any relational database. Now, before running any spatial algorithms (e.g. convex hull, point in polygon), the very first step is to actually load this dataset into the database. And this is where I started to struggle.
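
Just to make the goal concrete, these are the kinds of queries I would like to run once the data is loaded (the table and column names below are hypothetical):

-- Convex hull of all the points
SELECT ST_ConvexHull(ST_Collect(geom)) FROM tweets;

-- Point in polygon: how many tweets fall inside a given area
SELECT count(*) FROM tweets
WHERE ST_Contains(ST_GeomFromText('POLYGON((2.1 41.3, 2.2 41.3, 2.2 41.45, 2.1 41.45, 2.1 41.3))', 4326), geom);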

QGIS has a plugin called SPIT, which allows you to import a Shapefile into PostGIS. Unfortunately, it turned out to be extremely slow with any of the three files that I mentioned above (including the smallest one).


Switching to the command line, where I have more control over (and knowledge about) the processes that are running, I tried shp2pgsql, a tool for importing Shapefiles that comes with the PostGIS installation.

On Linux it is possible to convert and import the Shapefile in one go, redirecting the output with a pipe:

shp2pgsql -s <SRID> -c -D -I <path to shapefile> <schema>.<table> | \
psql -d <databasename> -h <hostname> -U <username>

Since this command was really slow (unresponsive, even?), I decided to split it into two steps (sketched below), to see where it was hanging:

  • Create an SQL file with the insert statements.
  • Load this SQL file into the database.
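
A rough version of that two-step workflow, reusing the placeholders from the one-liner above (import.sql is just a name I picked):

shp2pgsql -s <SRID> -c -D -I <path to shapefile> <schema>.<table> > import.sql
psql -d <databasename> -h <hostname> -U <username> -f import.sql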

Unsurprisingly, the SQL file with the statements was generated quite quickly, and the processing effort was being spent on the insert queries. I verified this with all three Shapefiles, even when I included a spatial index (GiST). After thinking a bit, I came to the conclusion that since the spatial index is created together with the “CREATE TABLE” statement, it should actually speed up the queries a bit (but I did not confirm this guess).

My next attempt was to use ogr2ogr, the GDAL tool for converting between vector formats.

ogr2ogr -f "PostgreSQL" PG:"host=host user=user dbname=db password=pass" data.shp

I have to say that I had read some suggestions to prefer shp2pgsql over this tool, and in fact it was quite slow as well (at this point I was only testing the smallest of the Shapefiles).

My conclusion up to this point is that it is quite expensive, in terms of time (and processor), to import a large dataset into PostGIS. Some ideas that could be explored:

  • Upgrade the PostGIS server (would scaling vertically make any difference?).
  • Upload the data to the Amazon server and run the import tool from there. I am not sure whether it is possible to do that, but it seems like a clear improvement to run this operation locally on the Amazon server, using its power, rather than from my desktop computer.
  • Split the Shapefile into chunks of data. I am not entirely sure how to do this, and any suggestions would be welcome.

Next steps: once I manage to import a large Shapefile into my PostGIS instance, I will post some feedback about the computation of spatial operations.

Update: Apparently I underestimated the fact that I am importing data over a network connection, with non-negligible latency. The best thing would be to run the import locally, an option that I unfortunately don’t have, since Amazon RDS does not give me SSH access (it is a service, not a machine!). I decided to switch to a new approach: import a plain CSV file into a database table using the \copy command, and that was actually really fast (a few seconds).
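
The target table has to exist before running \copy; a minimal sketch of what it could look like (the column names are hypothetical, chosen to match the update query further down; the real table carries more tweet attributes):

-- Target table for the CSV load
CREATE TABLE new_table (
  id bigint,
  lon double precision,
  lat double precision
);

-- Geometry column that the UPDATE below will populate; adding it after the
-- CSV load keeps the \copy column list identical to the file
ALTER TABLE new_table ADD COLUMN geom geometry(Point, 4326);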

psql database -U user -h host.eu-west-1.rds.amazonaws.com -c "\copy new_table from 'data.csv' with DELIMITER ','"

Of course, I then had to run a query to populate the point geometry from the float (lon, lat) values, create a spatial index, and update the table statistics (for efficiency reasons, it is recommended that indexes are created after, and not during, the import!).

UPDATE new_table SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);

CREATE INDEX [indexname] ON [tablename] USING GIST ( [geometryfield] );

VACUUM ANALYZE [table_name] ([column_name]);

That was still considerably fast (although not as fast as the CSV import!), and my guess is that this is because the queries are now running on the remote machine (and not uploading data over an internet connection).

Following some very useful advice on Stack Overflow, I decided to retry my previous approaches, tuning the configuration parameters.

First I tried the OGR command line tool, but this time with the config flag “PG_USE_COPY YES”, to force dumping of the data:

ogr2ogr -progress -f PostgreSQL PG:"dbname='database' user='user' host='host.eu-west-1.rds.amazonaws.com' port='5432'" data.shp --config PG_USE_COPY YES -nlt POINT -nln import_ogr

Although the speed improved massively, it still took 1:03:00 (just over an hour) to run this import.

Then I tried the shp2pgsql tool, with the “-D” flag, which also forces the dump of data:

shp2pgsql -D -s 4326 data.shp import_shp2pgsql | psql

This was remarkably fast: 00:01:04 (just over a minute).

[Screenshot: comparing the two imported tables, import_ogr and import_shp2pgsql]

Comparing the two imports, we can see that shp2pgsql did not create any index (which, as already mentioned, is a costly operation), while ogr2ogr created a GiST index on the spatial field. The extra step of creating an index has to be added to the shp2pgsql import, but as we have seen before, that operation is not a problem since it then runs on the server.
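
For completeness, that extra step is just the index creation shown earlier, applied to the newly imported table (assuming the default geometry column name, geom):

CREATE INDEX import_shp2pgsql_geom_idx ON import_shp2pgsql USING GIST (geom);
VACUUM ANALYZE import_shp2pgsql;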

I conclude this review by stating that shp2pgsql performs really well at loading a large (“Big Data”) spatial table into a remote PostGIS server (e.g. Amazon RDS). It is recommended to pass the “-D” flag to the tool, and not to create any indexes during this operation.

Below we can see the PostGIS geometry of the imported table, with more than 3 million points:

[Map: the imported table rendered as a PostGIS layer]