Using the Socrata Open Data API and Gem to Make Complex Queries
24 Dec 2014Over the past few weeks, I’ve been working on my capstone project for Metis. My project, SaferNYC, explores data from NYC Open Data, visualizing New York City car collisions involving cyclists and pedestrians. You can view my project at safernyc.com.
The first step of my project was of course pulling in data from the dataset on NYC Open Data. NYC Open Data is maintained by Socrata, so the API connected with the site is the Socrata Open Data API, or SODA. At its most basic, SODA can be queried through a url including parameters writing in SoQL, Socrata’s version of SQL.
My query, however, quickly grew out of control for this approach. I only wanted data that was recent, had location information, and involved a cyclist or pedestrian death or injury. This created a url hundreds of characters long, and it just kept growing.
As my query got increasingly unmanageable, I discovered that there is also a gem for the API, soda-ruby. While the gem was very helpful for cleaing up some aspects of my query, the gem does not address the unwieldy SoQL parameters.
The soda-ruby docs contain only one simplistic example and little explanation. It’s initially unclear how the gem is supposed to work, but the reality is that a complex query requires injecting strings of pure SoQL - almost exactly what would be used in the url approach.
The main benefit of using the gem is in the setup. Using the gem, environmental variables can be easily included in the Ruby code. Also, clauses such as $select
(the data points requested), $where
(requirements the data should meet), and $order
(how the data will be sorted and returned) can be emphasized as keys pointing to the corresponding SoQL parameters rather than getting lost in a string.
It’s a bit downhill from here. I used a slightly-better-looking array of symbols rather than a string to create the $select
call, but as you may notice above, the array is simply joined with commas, creating a string of SoQL, before being passed in as a parameter. The symbols correspond to columns in the dataset from NYC Open Data (thus using their naming) and ask that the data returned includes each of these columns for each data point.
Tricks like the array of symbols above don’t really work for the rest of the parameters because they’re more complex. Using strings of SoQL instead gets very ugly very fast, so I broke mine up into numerous small methods. While this is better than one very long string, the methods are quite difficult to follow.
The methods below (which make up the rest of the class show above) contain SoQL giving the following paramaters:
-
Data is from dates later than the last incident saved in the database (or, if the database is empty, is from six months ago)
-
Data has either latitude and longitude or cross streets and zip code
-
Data includes a cyclist or pedestrian fatality or injury.
This is all necessary because some points in the dataset are missing essential information, such as location, rendering them useless to me. More importantly, there is such a huge volume of data (and an API limit of 1000 data points per call - no more than a month of relevant data), that I can’t waste time or space pulling in information that is old, irrelevant, or useless.
These parameters were as annoying to write as they look, but they’re relatively straightforward, and I end up with data that includes the information I need but no more. Socrata’s documentation on SoQL is much better than their documentation on soda-ruby, and it made the content of this query quite simple to write.
Next, I’ll cover the data this query returns and how I clean and store it.