Using IO objects in Python to read data

Just a quick post on a technique I’ve used a few times recently, in particular when reading web data.

First, a very quick example. When reading data with pandas, you typically pass a filename on disk, e.g. pd.read_csv('my_file.csv'). But if you already have the contents of the CSV as a string in memory, you can wrap it in io.StringIO and read that object instead.

import pandas as pd
from io import StringIO

# Example csv file inline
examp_csv = """a,b
1,x
2,z"""

pd.read_csv(StringIO(examp_csv))
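The reason this works is that StringIO wraps the string in a file-like object, so anything that expects an open text file can read from it. Here is a minimal sketch of that idea using the standard library csv module in place of pandas (my choice, just so it runs without any extra packages):

```python
import csv
from io import StringIO

# Same inline csv as above
examp_csv = """a,b
1,x
2,z"""

buf = StringIO(examp_csv)

# It behaves like an open text file: you can read a line and rewind
first_line = buf.readline()
buf.seek(0)

# Any reader that accepts a file handle accepts the StringIO object
rows = list(csv.DictReader(buf))
print(first_line.strip())
print(rows)
```

Nothing here touches the disk, which is the whole point of the trick.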


Where this has come up for me recently is reading data from web servers. For example, here is the Town of Cary’s API for crime data. You can batch download the whole thing at the URL below, but when I pass that URL directly to pandas I currently get an SSL error:

# Town of Cary CSV for crimes
cary_url = 'https://data.townofcary.org/explore/dataset/cpd-incidents/download/?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true&csv_separator=%2C'

# Returns SSL Error for me
cary_df = pd.read_csv(cary_url)


Note I don’t know what difference in web server tech causes this, as sometimes you can just grab a CSV via URL (here is an example I have grabbing PPP loan data, or with the NIJ recidivism data).

But we can grab the data via requests, and use the same StringIO trick I just showed to get this data:

# Using string IO for reading text
import requests
res_cary = requests.get(cary_url)
cary_df = pd.read_csv(StringIO(res_cary.text))
cary_df


Again, I don’t know why some servers require this approach, but it works for me with the Socrata and CartoDB APIs for different cities’ open data. I also used it recently for converting geojson in ESRI’s API.

The second example I want to show is downloading zipfiles. For this, we will use io.BytesIO instead of StringIO. The census stores various data in zipfiles on their FTP server:

# Example 2, grabbing zipped contents
import zipfile
from io import BytesIO

census_url = 'https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/2019_5yr_Summary_FileTemplates.zip'
req = requests.get(census_url)

# Can use BytesIO for this content
zf = zipfile.ZipFile(BytesIO(req.content))

Working with zf here is equivalent to using the zipfile library to read/extract a zipfile already on disk. But when downloading, there is no need to save the file to disk and then deal with that file. BytesIO cuts out the middleman.

Then we can either grab a specific file inside of our zf object, or extract all the contents one-by-one:

# Now can loop through the list
# or grab specific file
zf.filelist[0]
temp_geo = pd.read_excel(zf.open('2019_SFGeoFileTemplate.xlsx'))
temp_geo.T
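The snippet above grabs one specific file. For the loop-through-everything option, here is a rough sketch that runs without a network connection. Everything in it is made up for illustration: the archive is built in memory, zip_bytes stands in for req.content, the member names first.csv and second.csv are hypothetical, and the standard library csv module stands in for pandas.

```python
import csv
import zipfile
from io import BytesIO, StringIO

# Build a tiny zip archive in memory; zip_bytes is a stand-in for req.content
buf = BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf_out:
    zf_out.writestr("first.csv", "a,b\n1,x\n2,z\n")
    zf_out.writestr("second.csv", "a,b\n3,y\n")
zip_bytes = buf.getvalue()

# Open the archive straight from memory and parse every member in turn
tables = {}
with zipfile.ZipFile(BytesIO(zip_bytes)) as zf_demo:
    for name in zf_demo.namelist():
        text = zf_demo.read(name).decode("utf-8")
        tables[name] = list(csv.DictReader(StringIO(text)))

for name, rows in tables.items():
    print(name, len(rows))
```

With a real download, the loop body could instead hand zf.open(name) to the matching pandas reader, the same way the read_excel call above does for a single file.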

