Wikipedia Scraping In Python – Easy Ways

Filed Under: Python

As a data analyst or scientist, you need to collect data before any analysis. Sometimes you will get the data directly from your company's database, but that is not always the case. You may need to scrape the web to get data for a particular analysis. Here is one solution: Wikipedia scraping using Python. It is relatively simple to scrape Wikipedia data with Python, and in this tutorial we will see how to do it in under 5 minutes with less than 10 lines of code.

Sounds interesting?

Let’s dive deep.

More read: Pandas Pivot table in Python – A brief Introduction


1. About the Source

I will provide some information regarding the source of our data. It is very important to first identify where the required data is located.

In our case, the data we want, which relates to the NBA Finals, is located on a Wikipedia page. There are two tables on that page. Let's see how we can scrape one of them using some basic HTML knowledge.

You can find the source at the NBA Finals Wikipedia page (https://en.wikipedia.org/wiki/NBA_Finals).

  • Table 1 – Finals appearances. This table includes data on various teams and the years of their Finals appearances, along with other attributes.

2. Import the Libraries

First, we need to import the required libraries for web scraping in Python. We need four libraries:

  • Pandas
  • Numpy
  • Matplotlib
  • unicodedata
#Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize

We will be using the pandas read_html() function to parse the HTML tables on the page and pull out the desired data. I hope you have imported all these libraries. If yes, we are good to go.
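As a quick illustration of what read_html() returns, here is a minimal, self-contained sketch that parses an inline HTML string instead of a live page. The table contents are made up for demonstration, and note that read_html() needs an HTML parser library such as lxml or html5lib installed:

#Sketch: read_html() returns a list of DataFrames, one per parsed <table>
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>Team</th><th>Titles</th></tr>
  <tr><td>Boston Celtics</td><td>17</td></tr>
  <tr><td>Los Angeles Lakers</td><td>17</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))   #parse every <table> in the HTML
print(len(tables))                      #1 table found
print(tables[0])                        #the table as a pandas DataFrame

The same idea applies when the argument is a URL: pandas fetches the page, finds every table, and hands each one back as a DataFrame.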


3. Read the Data

To read the HTML data, we create a table object using the read_html() function as shown below. Call the function with the link to the web page and pass the match argument with text contained in the table to be scraped.

#scraping
NBA_data_scraped = pd.read_html('https://en.wikipedia.org/wiki/NBA_Finals', match='Finals appearances')

read_html() returns a list of the matching tables as DataFrames. In this code, we select the table we want from that list by its index and display it.

#View data
data = NBA_data_scraped[1]
data
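Once the table is in a DataFrame, you can treat it like any other pandas data. Here is a small follow-up sketch; the inspection steps and the output file name are my own additions, not part of the original tutorial:

#Sketch: inspect the scraped table and save it for later analysis
print(data.shape)     #number of rows and columns scraped
print(data.head())    #first few rows of the table
data.to_csv('nba_finals_appearances.csv', index=False)   #file name is just an example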

That’s awesome 😛

This is just a simple illustration of web scraping in Python. There are more advanced scraping libraries available, such as Scrapy.

But you do need to get the hang of basic HTML tags and of parsing data from public sites such as a wiki.
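If you are curious what that HTML-level parsing looks like when pandas is not doing it for you, here is a rough sketch using requests and BeautifulSoup. Both are separate installs, and the 'wikitable' class and the row/column handling are assumptions about the page's markup, so treat it as illustrative rather than a drop-in solution:

#Sketch: locating a wiki table by hand with requests + BeautifulSoup
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/wiki/NBA_Finals')
soup = BeautifulSoup(page.text, 'html.parser')

#Wikipedia data tables usually carry the 'wikitable' class (an assumption here)
table = soup.find('table', class_='wikitable')

#Collect every header cell (this grabs row headers too, if the table has them)
headers = [th.get_text(strip=True) for th in table.find_all('th')]

#Collect the cell text of the first few data rows
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')[1:6]]

print(headers)
print(rows)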

This simple approach, using basic Python modules, can serve many purposes effectively. Whenever you need a simple dataset from a public source, you can use this method to get the data in minutes.

I hope you found this tutorial helpful.


Wrapping Up – Web scraping in Python

Web scraping in Python is a fascinating area. Whenever you need to collect data that is not available in your databases, it is the go-to approach. As mentioned before, Python has more advanced web scraping libraries such as Scrapy, but this was a simple tutorial using basic Python modules.

In the next tutorial, we will see how we can scrape more complex datasets from the web using some advanced libraries.

That’s all for now. Happy Python!!!

More read: Beautiful Soup

