
Plotting Spatial Data in R


I recently got an opportunity to work on spatial data and wanted to share my analysis on one such dataset.

The data consisted of registered businesses in the San Francisco Bay Area, which can be found here. An updated version can be found here.

Spatial data is data associated with locations. Typically it is described by a coordinate reference system, most commonly latitude and longitude.

The goal of this exercise was to find pockets of neighborhoods in San Francisco with a high concentration of businesses. You will need a key from Google’s Geolocation API to use their maps. I used the ggmap package in R to plot the data, then narrowed my analysis to one high-concentration neighborhood to see how businesses were dispersed within it.

First…Quick scan of the dataset

str(biz)       # structure: column names and types
head(biz, 25)  # first 25 rows
summary(biz)   # summary statistics per column

For the purpose of this exercise I was only concerned with the neighborhood, address, and date columns and, most importantly, the location column, which contained latitude and longitude data for each business. The names of the businesses and the codes the city assigns to registered businesses were not considered for now.

After basic data cleaning activities, such as eliminating duplicates and nulls, I extracted only the records pertaining to the city of San Francisco, dropping records for adjoining cities in the Bay Area.
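A minimal sketch of that cleaning step, assuming the dplyr package (used throughout this post) is loaded and that rows with a missing Business.Location count as nulls:

library(dplyr)

# Drop exact duplicate rows, then rows with no location information
biz <- biz %>%
  distinct() %>%
  filter(!is.na(Business.Location), Business.Location != "")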

Identify data pertaining to San Francisco only

There were a few ways to achieve this: filter the dataset by City, by Business.Location, or by zip code. I chose the zip code logic, as the other two fields had inconsistent spellings of the San Francisco city name that could easily be missed. I have, however, included commands for all three filtering methods.

By zip

sf_biz_zip <- biz %>% filter(grepl(pattern = "94016|94105|94110|94115|94119|94123|94127|94132|94139|94143|94147|94156|94161|94171|94102|94107|94108|94109|94111|94112|94114|94116|94117|94118|94120|94121|94122|94124|94125|94126|94129|94130|94131|94133|94134|94137|94140|94141|94142|94144|94145|94146|94151|94153|94154|94158|94159|94160|94162|94163|94164|94172|94177|94188", Business.Location))

By city

sf_biz_city <- biz %>% filter((grepl(".*San Francisco.*|.*SAN FRANCISCO.*|.*SF.*|.*S SAN FRAN.*|.*Sf.*|.*San+francisco.*|.*S+san+fran.*", City)))

By Business.Location

sf_biz_loc <- biz %>% filter((grepl(".*San Francisco.*|.*SAN FRANCISCO.*|.*SF.*|.*S SAN FRAN.*|.*Sf.*|.*San+francisco.*|.*S+san+fran.*", Business.Location)))

Converting date objects

Next I wanted to eliminate businesses which had ceased to exist, using the end dates recorded for each location. The date fields, however, were stored as factors, so I converted them to POSIXct, which generally makes further date-based analysis easier.

sf_biz_zip$Business.Start.Date <- as.POSIXct(sf_biz_zip$Business.Start.Date, format = "%m/%d/%Y")
sf_biz_zip$Business.End.Date <- as.POSIXct(sf_biz_zip$Business.End.Date, format = "%m/%d/%Y")
sf_biz_zip$Location.Start.Date <- as.POSIXct(sf_biz_zip$Location.Start.Date, format = "%m/%d/%Y")
sf_biz_zip$Location.End.Date <- as.POSIXct(sf_biz_zip$Location.End.Date, format = "%m/%d/%Y")

Filter out inactive businesses

Businesses with a recorded location end date (i.e., those that had ceased to exist) were eliminated, along with locations that only started on or after December 1, 2018.

sf_biz_active_zip <- sf_biz_zip %>% filter(is.na(Location.End.Date))
sf_biz_active_zip <- sf_biz_active_zip %>% filter(Location.Start.Date < as.POSIXct("2018-12-01"))

Stripping out coordinates from the Business Location field

The Business Location column contained addresses along with the coordinates information. So the latitude and longitude information needed to be extracted.

sf_biz_active_zip <- sf_biz_active_zip %>% separate(Business.Location, c("Address", "Location"), sep = "[(]")
sf_biz_active_zip <- sf_biz_active_zip %>% filter(!(is.na(Location)))
sf_biz_active_zip <- separate(data = sf_biz_active_zip, col = Location, into = c("Latitude", "Longitude"), sep = ",")

The closing parenthesis left over from the split also needed to be removed.

sf_biz_active_zip$Longitude <- gsub(sf_biz_active_zip$Longitude, pattern = "[)]", replacement = "")

I then converted the latitude and longitude columns from character to numeric, since numeric coordinates are what plotting functions expect, and this avoids type errors later.

sf_biz_active_zip$Latitude <- as.numeric(sf_biz_active_zip$Latitude)
sf_biz_active_zip$Longitude <- as.numeric(sf_biz_active_zip$Longitude)

Now the fun part…

Visualizing the data


The resultant dataset had 88,785 records which needed to be plotted on a Google map. Interpreting that many records on a map would be overwhelming, to say the least! Sampling would be one way to proceed (a quick sketch follows below), but I instead found the top 10 neighborhoods with the largest number of businesses and plotted one of them on the map.
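For reference, a minimal sketch of the sampling alternative; the sample size of 5,000 is an arbitrary illustration:

set.seed(42)  # make the sample reproducible
biz_sample <- sf_biz_active_zip %>% sample_n(5000)

Returning to the top-10 count: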

viz <- sf_biz_active_zip %>% group_by(Neighborhoods...Analysis.Boundaries) %>% tally() %>% arrange(desc(n))
colnames(viz) <- c("Neighborhood", "Total_Businesses")  # rename for plotting
viz <- viz[1:10, ]

I then created a histogram of these top 10 neighborhoods.

fin_plot <- ggplot(viz, aes(x = Neighborhood, y = Total_Businesses)) + geom_bar(stat = "identity", fill = "#00bc6c")
fin_plot <- fin_plot + geom_text(aes(label = Total_Businesses), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1), plot.title = element_text(hjust = 0.5))
fin_plot <- fin_plot + ggtitle("Top 10 neighborhoods by business count")

Let’s look at the Financial District/South Beach neighborhood in more detail, since it has the largest number of active businesses.
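The map commands below draw from a fin_dis data frame containing just this neighborhood. Its construction isn’t shown here, but a sketch might look like this, assuming the neighborhood column holds the exact value "Financial District/South Beach":

# Subset the active businesses to the chosen neighborhood (assumed column value)
fin_dis <- sf_biz_active_zip %>%
  filter(Neighborhoods...Analysis.Boundaries == "Financial District/South Beach")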

Registering Google Maps key

I installed the “ggmap”, “digest” and “glue” packages, loaded ggmap, and registered with the Google API to obtain the Geolocation API key.

install.packages(c("ggmap", "digest", "glue"))
library(ggmap)
register_google(key = "<google maps key>")

Google provides terrain, satellite, and hybrid maps, among other types; I chose the terrain map. A simple Google search gives you the center coordinates for San Francisco.

sf <- c(lon = -122.3999, lat = 37.7846)
map <- get_map(location = sf, zoom = 14, scale = 2)

By adjusting the zoom you can get a closer look. The two images below use different zoom levels.

fin_map <- ggmap(map) + geom_point(aes(Longitude, Latitude), data = fin_dis)
fin_map <- fin_map + ggtitle("Concentration of businesses in Fin. District and South Beach") + xlab("Longitude") + ylab("Latitude") + theme(plot.title = element_text(hjust = 0.5))
[Figure: zoom-out view]
[Figure: zoom-in view]

A better visualization

A heatmap will probably make the visualization more intuitive.

fin_heatmap <- ggmap(map) + stat_density2d(data = fin_dis, aes(x = Longitude, y = Latitude, fill = ..density..), geom = 'tile', contour = F, alpha = .5)
fin_heatmap <- fin_heatmap + ggtitle("Concentration of businesses in Fin. District and South Beach") + xlab("Longitude") + ylab("Latitude") + theme(plot.title = element_text(hjust = 0.5))

Conclusion

Areas around the Powell Street BART station, Union Square, and the Embarcadero BART station have a relatively large number of businesses, while areas around South Beach and Rincon Hill are sparsely populated.

Similarly, other neighborhoods can be plotted to understand the distribution of businesses there; a small helper along those lines is sketched below.
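A minimal sketch of such a helper, reusing the objects defined above; the function name plot_neighborhood and the example neighborhood value are hypothetical:

# Hypothetical helper: heatmap of business density for any neighborhood by name
plot_neighborhood <- function(name) {
  nbhd <- sf_biz_active_zip %>%
    filter(Neighborhoods...Analysis.Boundaries == name)
  # Center the map on the neighborhood's mean coordinates
  center <- c(lon = mean(nbhd$Longitude, na.rm = TRUE),
              lat = mean(nbhd$Latitude, na.rm = TRUE))
  nbhd_map <- get_map(location = center, zoom = 14, scale = 2)
  ggmap(nbhd_map) +
    stat_density2d(data = nbhd, aes(x = Longitude, y = Latitude, fill = ..density..),
                   geom = "tile", contour = FALSE, alpha = 0.5) +
    ggtitle(paste("Concentration of businesses in", name))
}

For example, plot_neighborhood("Mission") would draw the same style of heatmap for the Mission, assuming that value appears in the neighborhood column.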

This was a fairly straightforward way of visualizing spatial data. I welcome any feedback and constructive criticism.

Thank you for reading!

