AWS Athena helps to find the worst place to park your car in Portland.

YriUZjU.png!web

After visiting Portland, OR last weekend I’ve decided to explore some publicly available datasets about the city. In this post, we are going to calculate the number of incidents related to vehicles (theft from or theft of a vehicle) and the number of parking spots in each Portland neighborhood using Athena geo queries. After that, we will calculate the number of incidents per a parking spot to identify dangerous neighborhoods.

We are going to use three datasets:

A police report with all the incidents that occurred in Portland in 2019. You can download this dataset here (click on “Download Open Data” tab)
Portland parking spots with coordinates, you can download this dataset here
Portland neighbourhoods boundaries in JSON, you can download it here (select GeoJSON format)

A quick and easy way to start exploring a dataset with SQL is to use AWS Athena database and S3. Create an S3 bucket (I called it portland-crime-score).

For incidents file, create a folder “crime_data” in the bucket. Import the downloaded CSV file into the folder. After that go to AWS console, open Athena service and run the following query to create a table (replace S3 bucket name with yours):

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.incidents (
  `Address` string,
  `CaseNumber` string,
  `CrimeAgainst` string,
  `Neighborhood` string,
  `OccurDate` string,
  `OccurTime` string,
  `OffenseCategory` string,
  `OffenseType` string,
  `OpenDataLat` float,
  `OpenDataLon` float,
  `OpenDataX` float,
  `OpenDataY` float,
  `ReportDate` string,
  `OffenseCount` int 
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 's3://portland-crime-score/crime_data/'
TBLPROPERTIES ('skip.header.line.count'='1');

For parking spots create a folder called “parking_data” and import corresponding file. Run the following query to create a table:

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.parking (
 `X` float,
 `Y` float,
 `OBJECTID` string,
 `ModifiedBy` string,
 `ModifiedOn` string,
 `ModifiedUsing` string,
 `Comments` string,
 `NonAssetID` string,
 `Status` string,
 `Owner` string,
 `MaintResp` string,
 `LocationID` string,
 `ImagePath` string,
 `Metered` string,
 `APPZone` string,
 `UseCode` string,
 `ParkingDuration` string,
 `Rotation` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 's3://portland-crime-score/parking_data/'
TBLPROPERTIES ('skip.header.line.count'='1');

For the neighborhood boundaries file, create a folder called “neighborhoods_data” and import corresponding file. Neighborhoods data is stored in nested JSON file, that’s why the table schema looks different this time:

CREATE EXTERNAL TABLE neighborhoods (
    type string,
    features array<
        struct<type: string,
            properties: struct<OBJECTID: string,
                  NAME: string, 
                  COMMPLAN: string,
                  SHARED: string,
                  COALIT: string,
                  HORZ_VERT: string,
                  MAPLABEL: string,
                  ID: string,
                  Shape_Length: string,
                  Shape_Area: string
            >,
            geometry: struct<type: string,
              coordinates: array<array<array<string>>>
            >
        >
    >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://portland-crime-score/neighborhoods_data/';

Let’s look at a table incidents and run a few SQL queries to explore the data. First of all, let’s see how many records we’ve got:

select count(*) 
from incidents;

z2QBbiU.png!web

select * 
from incidents 
limit 10;

BBNvAbU.png!web

As we can see, some of the records don’t have coordinates of an incident. If we check the metadata of the dataset on the website, it says that they are removed on purpose:

Let’s see how many records we have without coordinates:

select count(*) as count 
from incidents 
where OpenDataLat is Null or OpenDataLon is Null;

vYnyYjb.png!web

We are going to exclude those records from further exploration. Now let’s see what types of incidents we have in the file:

select distinct OffenseType 
from incidents;

YfUJVfi.png!web

There is a total of 46 incident types. Let’s focus on those related to a vehicle:

select distinct OffenseType 
from incidents 
where OffenseType like '%Vehicle%'

RRbEBzV.png!web

Now let’s create a table that contains only Athena geo points of vehicle incidents and see how many rows we have after applying all these filters:

incident_points as (
select ST_POINT(opendatalon, opendatalat) 
from incidents 
where OffenseType in (
  'Motor Vehicle Theft', 
  'Theft From Motor Vehicle', 
  'Theft of Motor Vehicle Parts or Accessories') 
  and OpenDataLat is not NULL 
  and OpenDataLon is not NULL
)
select count(*) as incident_count 
from incident_points;

Now let’s create a table with geo points of all parking spots — this time we don’t have to filter anything:

parking_points as (
  select ST_POINT(x, y) as point 
from parking
)
select count(*) as parking_count
from parking_points;

ZBrAF3R.png!web

Now, let’s look at a neighborhood dataset. Let’s parse JSON to extract boundaries coordinates and create objects of type Polygon that are supported by Athena. I could not find an easy way to parse GeoJSON in Athena. Documentation said nothing about supporting it. That’s why I had to perform a series of manipulations with json to extract the required coordinates.

First of all, let’s see what we’ve got in the table neighborhoods:

select * 
from neighborhoods;

The whole file was parsed into a single row. We need to extract data from features columns that has a type of array. Let’s unnest the array like this:

select type, feature 
from neighborhoods
CROSS JOIN UNNEST(features) AS t(feature)

zUNz22a.png!web

Each feature represents a neighborhood. Now we can access its values like this:

select feature.properties.Name as name, feature.geometry.coordinates[1] as shape 
from neighborhoods
CROSS JOIN UNNEST(features) AS t(feature)

VBNfQnv.png!web

Here we have a neighborhood name and an array of arrays with boundary coordinates. Our aim is to convert these coordinates to a varchar format that Athena can parse into Polygon object like that: SELECT ST_POLYGON(‘polygon ((1 1, 1 4, 4 4, 4 1))’). To achieve this we should concatenate coordinates into a string. Also, two neighborhoods with names ‘MC UNCLAIMED #13’ and ‘CRESTWOOD’ had strange coordinates format that were not parsable so I excluded them from the list. Let’s run the following query:

with neighborhood_coords as (
  select feature.properties.Name as name,
  feature.geometry.coordinates[1] as shape 
  from neighborhoods
  CROSS JOIN UNNEST(features) AS t(feature)
  where feature.properties.Name != 'MC UNCLAIMED #13' AND    feature.properties.Name != 'CRESTWOOD'
),
unnest_nodes as (
  select name, array_join(node, ' ') as node_str 
  from neighborhood_coords
  CROSS JOIN UNNEST(shape) AS t(node)
)
select * from unnest_nodes;

UbQ7Fz2.png!web

We extracted neighborhood coordinates into multiple columns. Now all we need to do is aggregate them back with string concatenation:

suburbs as (
  select name as suburb_name, ST_POLYGON('polygon ((' || array_join(array_agg(node_str), ', ') || '))') as poly from unnest_nodes group by name 
)
select * from suburbs;

ua6FreZ.png!web

Now let’s join our neighborhoods with parking spots to count the number of spots in each neighborhood. We are going to use an Athena function ST_CONTAINS to check if a parking spot Point lies inside a boundary Polygon:

suburb_parking as (
  select suburb_name, count(*) as parking_count from suburbs 
  join parking_points on ST_CONTAINS(poly, point)
  group by suburb_name order by parking_count desc
)
select * from suburb_parking limit 10;

22yEz2Z.png!web

As we can see, downtown has the highest number of parking spots.

Now let’s do the neighborhoods with the incidents to count the number of incidents in each neighborhood:

suburb_incident as (
  select suburb_name, count(*) as incident_count from suburbs 
  join incident_points on ST_CONTAINS(poly, incident_point)
  group by suburb_name order by incident_count desc
)
select * from suburb_incident limit 10;

Ef6b6vn.png!web

The highest number of incidents is recorded in Hazelwood, and Downtown is in second place.

Finally, let’s calculate a ratio between the number of incidents and the number of parking spots:

select suburb_parking.suburb_name, CAST(incident_count as double)/parking_count as crime_score 
from suburb_incident
join suburb_parking on suburb_incident.suburb_name = suburb_parking.suburb_name
order by incident_count/parking_count desc;

faaUvmV.png!web

And to our surprise, the neighborhood with such a nice name as Richmond has the highest number of incidents per parking spot, when Northwest district is the safest place with 0,056 incidents per parking spot.

The whole query looks like this:

with regions_coords as (
  select feature.properties.Name as name, feature.geometry.coordinates[1] as shape from neighborhoods
  CROSS JOIN UNNEST(features) AS t(feature)
  where feature.properties.Name != 'MC UNCLAIMED #13' AND feature.properties.Name != 'CRESTWOOD'
),
unnest_nodes as (
  select name, array_join(node, ' ') as node_str from regions_coords
  CROSS JOIN UNNEST(shape) AS t(node)
),
suburbs as (
  select name as suburb_name, ST_POLYGON('polygon ((' || array_join(array_agg(node_str), ', ') || '))') as poly from unnest_nodes group by name 
),
parking_points as (
  select ST_POINT(x, y) as point from parking
),
incident_points as (
  select ST_POINT(opendatalon, opendatalat) as incident_point from incidents
  where OffenseType in (
    'Motor Vehicle Theft', 
    'Theft From Motor Vehicle', 
    'Theft of Motor Vehicle Parts or Accessories') 
  and OpenDataLat is not NULL 
  and OpenDataLon is not NULL
),
suburb_parking as (
  select suburb_name, count(*) as parking_count from suburbs 
  join parking_points on ST_CONTAINS(poly, point)
  group by suburb_name order by parking_count desc
),
suburb_incident as (
  select suburb_name, count(*) as incident_count from suburbs 
  join incident_points on ST_CONTAINS(poly, incident_point)
  group by suburb_name order by incident_count desc
)
select suburb_parking.suburb_name, CAST(incident_count as double)/parking_count as crime_score 
  from suburb_incident
  join suburb_parking on suburb_incident.suburb_name = suburb_parking.suburb_name
  order by incident_count/parking_count desc;

Next time when I go to Portland I will check this information to find the best place to stay there.

Recommend

Zero-with-Dot (Oleg Żero): Colaboratory + Drive + Github -> the workflow made...

Activity 的 36 大难点，你会几个？「建议收藏」

Machine and Deep Learning with OCaml Natively

老而不死的三种编程语言

认识 ALBERT：来自 Google 和 Toyota 的全新轻量级 BERT，性能不变、参数减少 18 倍

Nested Model in Google Firestore - FireO

Unfork()

Google 的 Code Review 实践经验

Web 性能优化、文档及代码编辑器相关的新提案

闲鱼前端基于serverless的一种多端开发解决方案

About Joyk