Open Data: .CSV and SQL Queries

For the open data task we have been asked to find a publicly available data set and convert it into a useful format that can be viewed on Google Earth. I decided on a data set of Rail Crossing Incidents over the last few years (Link). This mostly involved incidents where a vehicle hit the crossing boom.

OpenDataOriginal

My first attempt involved simply throwing the data set into a CSV to KML converter. The result created a number of pins on each crossing, however it was unable to process all the data into the description field. This meant that the data was fairly useless. My second attempt was to write a converter myself in C++ and specifically tailor it to the data set. While I was able to pull the data in just fine I ran into some issues when trying to manipulate it to make it more useful. It was suggested that we use SQL queries to try and clean up the data and make it more useful when converted to KML. After running into some issues trying to not only understand the format of an SQL Query, but also the specific formatting required to use it inside a Google Sheets document, Greg introduced me to SQLiteStudio. After importing the data set into a new database I started experimenting with different queries to try and get the data into a better format.

To start with I tried removing some of the less useful information (collision code, type, incident level, etc.) and condense the data so it only shows one entry for each crossing.

select “Crossing Road Name”, “Nearest Station”, Latitude, Longitude from RailCrossingDataOriginal group by “Crossing Road Name”

I decided to change this slightly by adding a new column that shows the total number of entries for each crossing.

select “Crossing Road Name”, count(“Level Crossing ID”) as ‘Frequency’, Latitude, Longitude from RailCrossingDataOriginal group by “Crossing Road Name”

Since I was able to get the frequency for each crossing I decided to try making a combined “wieght” value for each crossing based on number of occurrences, incident level, injuries, and fatalities.

select “Crossing Road Name”, “Nearest Station”, Latitude, Longitude, (4 / “Incident Level”) * count(“Level Crossing ID”) + (1 * count(“Minor Injuries”)) + (2 * count(“Serious Injuries”)) + (3 * count(Fatalities)) as Weighting from RailCrossingDataOriginal group by “Crossing Road Name”

OpenDataWeighted

Unfortunately I can’t get Google Earth to install on my PC at the moment so I will need to wait till my class tonight to see if this version is better than the previous one. My current plan, based on the range of the weighting values, is to use them as altitude in the positional data and see how the result looks.

Moving forward from this I want to look into two alternatives. First is to write a custom converter and turn the weighting value into a colour for the pin to better show the “danger” level of that crossing. The other is to condense the data as I have above but include the individual entries for each crossing within a single pin.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s