How POST Requests with Python Make Web Scraping Easier

 3 years ago
source link: https://hackernoon.com/how-post-requests-with-python-make-web-scraping-easier-9i203511

Otávio Simões (@otavioss)

Economist and data enthusiast

When scraping a website with Python, it’s common to use the urllib or the Requests libraries to send GET requests to the server in order to receive its information.

However, you’ll eventually need to send some information to the website yourself before receiving the data you want, maybe because it’s necessary to perform a log-in or to interact somehow with the page.

To execute such interactions, Selenium is a frequently used tool. However, it comes with some downsides: it’s a bit slow and can be quite unstable at times. The alternative is to send a POST request containing the information the website needs using the Requests library.

In fact, compared to Requests, Selenium is a very slow approach, since it actually opens a browser to navigate through the websites you’ll collect data from. Of course, depending on the problem, you’ll eventually need to use it, but in other situations a POST request may be your best option, which makes it an important tool for your web scraping toolbox.

In this article, we’ll see a brief introduction to the POST method and how it can be implemented to improve your web scraping routines.

Web Scraping 

Although POST requests are commonly used to interact with APIs, they are also useful to fill HTML forms in a website or perform other actions automatically.

Being able to perform such tasks is an important ability while web scraping, as it’s not rare to have to interact with the web page before reaching the data you’re aiming to scrape.

Identifying an HTML Form

Before you start sending information to the website, you first need to understand how it will receive such information. Let’s say the idea is to log in to your account. If the site receives the username and password through an HTML form, it will probably look like a pair of text fields, one for the user name and one for the password, followed by a submit button.

If that’s the case, all you have to do is send the username and the password within your POST request.

But how do you identify the HTML form and see what it looks like? For this, we can go back to our old friend: the GET request. With a GET and using BeautifulSoup to parse the HTML, it’s easy to see all the HTML forms on the page and what each of them looks like.

Here is a simple snippet for this task:

import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
page = requests.get('http://website.com').text
soup = BeautifulSoup(page, 'html.parser')

# Print every form found on the page
forms = soup.find_all('form')
for form in forms:
    print(form)

And this is how our simple login form would appear in the output of the code above:

<form action="login.html" method="post"> 
User Name: <input name="username" type="text"/><br/> 
Password: <input name="password" type="text"/><br/> 
<input id="submit" type="submit" value="Submit"/>
</form>

In a form like this, the “action” is where in the website you should send your request, and the “username” and “password” are the fields you want to fill. You can also notice the type for these values is specified as text.
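
As a rough sketch of how these pieces can be read programmatically, the snippet below pulls the action, the method, and the field names out of the login form shown above, and resolves the relative action against the page address. It assumes the page lives at http://website.com/ and uses only Python’s standard library (html.parser and urllib.parse) so it runs without extra installs; FormFieldParser is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFieldParser(HTMLParser):
    """Collect the action, method, and named inputs of the first form (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.fields = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and not self._done:
            self._in_form = True
            self.action = attrs.get('action', '')
            self.method = attrs.get('method', 'get').lower()
        elif tag == 'input' and self._in_form and 'name' in attrs:
            # Inputs without a name (like the submit button here) are skipped
            self.fields.append(attrs['name'])

    def handle_endtag(self, tag):
        if tag == 'form' and self._in_form:
            self._in_form = False
            self._done = True


html = '''<form action="login.html" method="post">
User Name: <input name="username" type="text"/><br/>
Password: <input name="password" type="text"/><br/>
<input id="submit" type="submit" value="Submit"/>
</form>'''

parser = FormFieldParser()
parser.feed(html)

# Assumed address of the page that contains the form
page_url = 'http://website.com/'
post_url = urljoin(page_url, parser.action)

print(parser.method)  # post
print(parser.fields)  # ['username', 'password']
print(post_url)       # http://website.com/login.html
```

With BeautifulSoup, the same values should be available through form.get('action') and form.find_all('input'); the standard-library parser is used here only to keep the sketch self-contained.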

Submitting Your First POST 

Now it’s time to send your first POST request. A basic request will contain only two arguments: the URL that will receive the request and the data that you’re sending.

The data is usually a dictionary where the keys are the names of the fields you intend to fill, and the values are what you’re going to fill the fields with. The data can also be passed in other ways, but that’s a more complex approach that’s out of scope for this article.

The code is pretty simple. Once Requests is imported, you can get it done with only two lines of code:

payload = {'username': 'user', 'password': '1234'}
r = requests.post('http://website.com/login.html', data=payload)
print(r.status_code)

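The third line of code is just so you can see the status code of your request. You want to see a status code of 200, which means the server processed the request successfully. Keep in mind that a 200 only tells you the request was handled, not necessarily that the login itself succeeded.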
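
To see what the dictionary turns into before it reaches the server, one option is to build the request without sending it. The sketch below uses the Request object and its prepare() method from the Requests library to inspect the method, the Content-Type header, and the encoded body; the URL and credentials are the same placeholders used above.

```python
import requests

payload = {'username': 'user', 'password': '1234'}

# Build the request without sending it, so no network access is needed
prepared = requests.Request(
    'POST', 'http://website.com/login.html', data=payload
).prepare()

print(prepared.method)                   # POST
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # username=user&password=1234
```

This is the same form encoding a browser would use when the form is submitted, which is why the keys have to match the name attributes of the inputs.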

We can now make this process more sophisticated by implementing the POST request we just created into a function. Here’s how it’ll work:

1. The post_request function will receive two arguments: the URL and the payload to send with the request.

2. Inside the function, we’ll use a try and an except clause to have our code ready to handle a possible error.

3. If the code doesn’t crash and we receive a response from the server, we’ll then check if this response is the one we’re expecting. If so, the function will return it.

4. If we get a different status code, nothing will be returned, and the status will be printed.

5. If the code raises an exception, we’ll want to see what happened, and so the function will print this exception.

And this is the code for all this:

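import requests

def post_request(url, payload):
    """Send a POST request and return the response only if the status is 200."""
    try:
        r = requests.post(url, data=payload)
        if r.status_code == 200:
            return r
        else:
            # Unexpected status: print it and return nothing
            print(r.status_code)
    except requests.exceptions.RequestException as e:
        # Connection errors, timeouts, invalid URLs, and similar failures
        print(e)
    return None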

Depending on the website, however, you’ll need to deal with other issues in order to actually perform a login. The good news is that the Requests library provides resources to deal with cookies, HTTP authentication, and more, so it will have you covered. The goal here was just to use a common type of form as an easy-to-understand example for someone who has never used a POST request before.
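
One of those resources is the Session object, which keeps cookies between requests. As a minimal sketch, assuming the site marks you as logged in with a cookie set by the login response, the hypothetical login_and_fetch function below posts the credentials and then requests a second page with the same session. The function name, the data_url argument, and the 10-second timeout are illustrative choices, not something this article prescribes.

```python
import requests


def login_and_fetch(login_url, payload, data_url, session=None):
    """Log in with a POST, then GET another page reusing the same cookies."""
    if session is None:
        session = requests.Session()

    # Cookies set by the login response are stored on the session
    login = session.post(login_url, data=payload, timeout=10)
    login.raise_for_status()

    # The stored cookies are sent automatically with this second request
    page = session.get(data_url, timeout=10)
    page.raise_for_status()
    return page.text


# Example call (placeholder URLs, not executed here):
# html = login_and_fetch('http://website.com/login.html',
#                        {'username': 'user', 'password': '1234'},
#                        'http://website.com/account.html')
```

Whether this is enough depends on the site: some logins also expect hidden form fields or extra headers, which would have to be added to the payload or the session.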

Final Considerations

Especially if you’re sending a lot of requests to a particular website, you might want to insert some random pauses in your code in order not to overload the server. It’s also worth using more try and except clauses throughout your code, and not only in the post_request function, to make sure it’s prepared to handle other exceptions it may find along the way.
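
The two suggestions above can be combined in a small loop. The following is only a sketch: post_all is a hypothetical helper, the 1 to 3 second pause range is an arbitrary assumption, and send stands for any function that takes a URL and a payload, such as the post_request function from this article.

```python
import random
import time


def post_all(jobs, send, min_pause=1.0, max_pause=3.0, sleep=time.sleep):
    """Call send(url, payload) for each job, pausing a random time in between."""
    results = []
    for i, (url, payload) in enumerate(jobs):
        if i > 0:
            # Wait a random amount of time so requests are not sent in a burst
            sleep(random.uniform(min_pause, max_pause))
        try:
            results.append(send(url, payload))
        except Exception as e:
            # Report the failure and keep going with the remaining jobs
            print(f'{url}: {e}')
            results.append(None)
    return results


# Example call (not executed here):
# responses = post_all([('http://website.com/login.html', payload)], post_request)
```

The sleep argument is there only so the pause can be swapped out while testing; in normal use the default time.sleep is what you want.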

Of course, it’s also a good practice to take advantage of a proxy provider, such as Infatica, to make sure your code will keep running as long as there are requests left to submit and data to be collected, and that you and your connection are protected.
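
With Requests, a proxy is usually configured through a proxies mapping. The sketch below sets it on a Session so that every request made through that session would use it. The proxy address is a made-up placeholder; the real host, port, and credentials come from whichever provider you use.

```python
import requests

# Placeholder address: replace with the endpoint given by your proxy provider
PROXY_URL = 'http://user:password@proxy.example.com:8080'

session = requests.Session()
session.proxies.update({'http': PROXY_URL, 'https': PROXY_URL})

# Requests sent through this session would now be routed via the proxy, e.g.:
# r = session.post('http://website.com/login.html', data=payload)
```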

The idea of this article is to be only an introduction to POST requests and how they can be useful for collecting data on the web. We basically went through how to fill out a form automatically and even how to log in to a website, but there are also other possibilities, such as marking a check box or selecting items from a dropdown list, which could be the subject of an entirely new article.
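
As a small taste of those other possibilities, check boxes and dropdown lists are submitted as ordinary name and value pairs, so they fit in the same kind of dictionary. The field names below (newsletter, country) are hypothetical; the real ones have to be read from the form’s HTML. The sketch uses the standard library’s urlencode just to show the resulting form-encoded body.

```python
from urllib.parse import urlencode

# Hypothetical field names; read the real ones from the form's HTML
payload = {
    'username': 'user',
    'newsletter': 'on',  # a ticked check box with no value attribute is sent as "on"
    'country': 'br',     # the value attribute of the chosen <option>
}

body = urlencode(payload)
print(body)  # username=user&newsletter=on&country=br
```

An unticked check box is simply left out of the payload, since browsers do not send it at all.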

I hope you’ve enjoyed this and that it can be useful somehow. If you have a question, a suggestion, or just want to be in touch, feel free to reach out.
