
Network Analysis and Community Clustering using Chicago Ride-Share Data

 5 years ago
source link: https://www.tuicool.com/articles/vuqaa2U


Jul 28 ·4min read

The inspiration for this project came from an interest in the evolution of “smart cities” as well as the semi-recent release of ride-share data on the city of Chicago’s online data portal. Smart cities that collect, centralize, and publish their data are part of a movement to enable in-depth analysis of the state of their communities and resources. As city populations grow, it’s becoming more important to learn how to use this data to inform decisions and programs that improve city life.

The goal of this project was to use Chicago’s public ride-share data to better understand where these rides were occurring and how specific geographic areas may be connected via ride-sharing. I accomplished this goal through the following steps:

1. Captured, cleaned and analyzed publicly available ride-share data taken from Chicago’s open data platform

2. Performed and visualized a network analysis of Chicago census tracts using ride-share data

3. Used clustering techniques to identify underlying communities based on ride-sharing

Data Cleansing and Exploratory Analysis involved the following steps:

  1. Accessing the data-set
  2. Removing rides which started/ended outside of Chicago
  3. Counting the number of rides which occurred at each pickup/drop-off location (unique routes)
  4. Creating a map of lat-long coordinates for each individual census tract

The ride-share data-set used can be accessed here: https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p

The full data-set can be downloaded as a flat file or accessed by calling the Chicago Data Portal API as shown below:
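The original request code did not survive in this copy of the article, so here is a minimal sketch of what such a call might look like. It assumes the portal exposes the data-set through the usual Socrata JSON endpoint pattern (the `/resource/m6dm-c72p.json` path and the `$limit`/`$offset` paging parameters are assumptions based on that pattern, and the helper names are hypothetical).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Socrata-style JSON endpoint for data-set id m6dm-c72p.
BASE_URL = "https://data.cityofchicago.org/resource/m6dm-c72p.json"


def build_trips_url(limit=1000, offset=0):
    """Build a paged request URL using the $limit and $offset parameters."""
    params = {"$limit": limit, "$offset": offset}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_trips(limit=1000, offset=0):
    """Download one page of rides as a list of dicts (one dict per ride)."""
    with urlopen(build_trips_url(limit, offset)) as response:
        return json.load(response)


# Example (performs a real network request, so it is left commented out):
# rides = fetch_trips(limit=5)
```

Paging through the data in chunks like this avoids pulling the entire data-set in a single request.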

Example of the API request result

According to Chicago’s data portal, a null value in any location field signifies that the ride started or ended outside of Chicago. Since these locations cannot be imputed, every ride (row) with a null value in its location fields is dropped.
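As a rough illustration of this cleaning step and the route counting described earlier, the sketch below uses pandas on a tiny made-up table. The column names `pickup_census_tract` and `dropoff_census_tract` are assumptions about the portal's schema, and `count_routes` is a hypothetical helper, not the author's code.

```python
import pandas as pd

# Assumed names of the location columns used to identify a route.
TRACT_COLS = ["pickup_census_tract", "dropoff_census_tract"]


def count_routes(trips: pd.DataFrame) -> pd.DataFrame:
    """Drop rides with a missing tract, then count rides per unique route."""
    # A missing tract means the ride left the city, so the row is removed.
    in_city = trips.dropna(subset=TRACT_COLS)
    # One output row per (pickup, drop-off) pair with its number of rides.
    return in_city.groupby(TRACT_COLS).size().reset_index(name="rides")


# Made-up sample: the last ride has no pickup tract and is dropped.
sample = pd.DataFrame({
    "pickup_census_tract": ["A", "A", "B", None],
    "dropoff_census_tract": ["B", "B", "A", "A"],
})
routes = count_routes(sample)
```

The resulting table has one row per route, which is the edge list a network analysis needs.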

Example of the network analysis data-set created using the above code

In order to later plot the network over a map of Chicago, we need a list of each unique census tract and corresponding lat/long coordinates. The method to create this coordinate map is shown below.
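The original snippet is missing from this copy, so the following is only a sketch of one way to build such a coordinate map with pandas. The centroid column names are assumptions about the portal's schema, and `tract_coordinates` is a hypothetical helper.

```python
import pandas as pd


def tract_coordinates(trips: pd.DataFrame) -> dict:
    """Return {census tract: (longitude, latitude)} for every tract seen."""
    wanted = ["tract", "lat", "lon"]
    pickups = trips.rename(columns={
        "pickup_census_tract": "tract",
        "pickup_centroid_latitude": "lat",
        "pickup_centroid_longitude": "lon",
    })[wanted]
    dropoffs = trips.rename(columns={
        "dropoff_census_tract": "tract",
        "dropoff_centroid_latitude": "lat",
        "dropoff_centroid_longitude": "lon",
    })[wanted]
    # Stack both ends of every ride and keep one coordinate pair per tract.
    points = pd.concat([pickups, dropoffs]).dropna().drop_duplicates("tract")
    return {row.tract: (row.lon, row.lat) for row in points.itertuples(index=False)}


# Made-up sample with approximate, illustrative coordinates.
sample = pd.DataFrame({
    "pickup_census_tract": ["A", "B"],
    "pickup_centroid_latitude": [41.88, 41.98],
    "pickup_centroid_longitude": [-87.63, -87.90],
    "dropoff_census_tract": ["B", "C"],
    "dropoff_centroid_latitude": [41.98, 41.79],
    "dropoff_centroid_longitude": [-87.90, -87.75],
})
coords = tract_coordinates(sample)
```

Storing each pair as (longitude, latitude) means it can be used directly as an (x, y) plotting position.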

Network Analysis Using Networkx

With the data set cleaned and ready, the Networkx package was used to visualize each census tract location as a node in a graph, with connections between nodes (edges) symbolized by black lines.
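A minimal sketch of that construction, assuming the route counts are available as (pickup tract, drop-off tract, rides) triples. The tract labels and counts are made up, and the `rides` edge attribute is a name chosen for this example.

```python
import networkx as nx

# Hypothetical route counts: (pickup tract, drop-off tract, number of rides).
routes = [("A", "B", 120), ("B", "A", 80), ("B", "C", 15)]

G = nx.Graph()  # undirected: A -> B and B -> A share a single edge
for pickup, dropoff, rides in routes:
    if G.has_edge(pickup, dropoff):
        # Both directions of a route add to the same undirected edge.
        G[pickup][dropoff]["rides"] += rides
    else:
        G.add_edge(pickup, dropoff, rides=rides)
```

Summing both directions onto one edge is a modelling choice that suits an undirected graph; a directed graph (`nx.DiGraph`) would keep them separate.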


This is an undirected graph representation of how Chicago census tracts are connected via ride-sharing. Not pretty, right?

The raw graph representation gives almost no insight. By using the latitudes/longitudes of each census tract, we can plot the nodes of the network in a way that more resembles Chicago geographically.
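One way to do this, sketched under the assumption that a tract-to-coordinate dictionary already exists, is to pass the coordinates to NetworkX as node positions. The node names and coordinates below are approximate and purely illustrative.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("loop", "ohare", rides=50)

# Illustrative (longitude, latitude) pairs, not taken from the data-set.
coords = {"loop": (-87.63, 41.88), "ohare": (-87.90, 41.98)}

# NetworkX drawing functions accept a {node: (x, y)} dict as `pos`,
# so longitude serves as x and latitude as y.
pos = {node: coords[node] for node in G.nodes if node in coords}

# Drawing requires matplotlib, so the call is left commented out:
# nx.draw(G, pos=pos, node_size=20)
```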


Great! At least now we can see the graph taking shape. Notice the concentration of rides overlapping in the center of the city.

By reading in a publicly available shape-file of Chicago as well as the coordinate map created earlier, I was able to visualize this graph over an actual map of the city. I also added in a weighting scheme based on the total ride volume in each census tract to give a better idea of how many rides were starting or ending in each area.
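The exact weighting scheme is not shown in this copy, so the sketch below is one plausible version: it sizes each node by its total ride volume, computed as the weighted degree. The node names, ride counts, and size range are made up for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from(
    [("loop", "ohare", 300), ("loop", "midway", 150), ("uptown", "loop", 50)],
    weight="rides",
)

# Total rides touching each node: the sum of its edge weights.
volume = dict(G.degree(weight="rides"))

# Scale marker sizes between 20 and 500 in proportion to ride volume.
max_volume = max(volume.values())
node_sizes = [20 + 480 * volume[node] / max_volume for node in G.nodes]

# With matplotlib installed, the sizes can be passed straight to the plot:
# nx.draw(G, node_size=node_sizes)
```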


With this we can clearly see that the majority of the rides take place in or around the Loop, Chicago’s central business district. Also notice the two large red circles in the top left and mid left, which represent Chicago’s two airports.

Community Clustering using Louvain Modularity

With the network built, algorithms can be used to cluster the network in order to identify locations which form communities based on the ride-share data. While a native of Chicago might be able to answer this question, I was curious to see how quickly I could learn about how transportation worked in a city I’d never been to with just ride-share data found publicly on the internet.

The method chosen comes from a paper called “Fast Unfolding of Communities in Large Networks.”¹ According to this paper, the method searches for the partition that maximizes modularity, a score that is high when links are dense within communities and sparse between them. It does this by first placing each node in its own community, then moving nodes into neighboring communities whenever the move increases modularity. Once no move helps, each community is collapsed into a single node and the process repeats on the smaller network. This creates communities whose nodes are more closely linked to each other than to nodes in other communities. Luckily for me, this method comes built in with Networkx.
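A small sketch of Louvain clustering on a toy network, assuming NetworkX 2.8 or later, where `louvain_communities` is available. The graph is made up (two tightly linked triangles joined by one weak edge), and the article's actual call may have differed.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from(
    [
        # First tight group of tracts.
        ("a1", "a2", 10), ("a2", "a3", 10), ("a1", "a3", 10),
        # Second tight group of tracts.
        ("b1", "b2", 10), ("b2", "b3", 10), ("b1", "b3", 10),
        # A single weak link between the two groups.
        ("a3", "b1", 1),
    ],
    weight="rides",
)

# Returns a list of sets of nodes; the seed makes the run repeatable.
communities = nx.community.louvain_communities(G, weight="rides", seed=42)

# Modularity of the partition that was found (higher means cleaner split).
score = nx.community.modularity(G, communities, weight="rides")
```

Passing the ride counts as edge weights means heavily travelled routes pull their endpoints into the same community more strongly than rarely used ones.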


The clustering seen here intuitively made sense. There are clear distinctions between the Loop and surrounding nodes, downtown and uptown Chicago. It was interesting to see that both airports were clustered along with the high-volume nodes in the Loop. I assume that’s because many travelers are heading to downtown hotels from these airports.

What did I learn?

Data collected by cities can be a powerful tool in the right hands. Without ever setting foot in Chicago, I was able to see how ride-sharing ties certain communities together, as well as where most of the rides were occurring. It would be interesting to see if the results of the community clustering change with the time of day or year! One thing to note is that changes to ride-share policy would affect each part of the city differently. While some people have been talking about the benefits of ride-sharing as a supplement to aging public transportation infrastructure, the network clusters here suggest that some communities experience minimal benefit from ride-sharing.

The next thing that I want to do is apply time series forecasting methods to each of these community clusters in order to see how well I can predict the volume of rides. Hopefully other cities are learning from Chicago and are investing in their own open data platforms in order to make informed, data-backed decisions on urban planning and policy development.

The full code for all of my analysis to date using this data-set can be found on my GitHub.

[1] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech. (2008) P10008

