Data Scraping Google Search Results Using Python and Scrapy

by @scraperapi

Python and Scrapy combine into a powerful duo that we can use to scrape almost any website. Scraping Google SERPs (search engine result pages) is as straightforward or as complicated as the tools we use. For this tutorial, we’ll be using Scrapy, a web scraping framework designed for Python. We can use it to understand our positions in Google better and benchmark ourselves against the competition. As a running example, we’ll build our Google web scraper to collect competitors’ reviews.

Scraping Google SERPs (search engine result pages) is as straightforward or as complicated as the tools we use. For this tutorial, we’ll be using Scrapy, a web scraping framework designed for Python. Python and Scrapy combine to create a powerful duo that we can use to scrape almost any website.

Scrapy has many useful built-in features that will make scraping Google a walk in the park without compromising any data we would like to scrape.

For example, with Scrapy all it takes is a single command to format our data as CSV or JSON files – a process we would have to code ourselves otherwise.

Before jumping into the code itself, let’s first explore a few reasons a Google scraper can be useful.

Why Scrape Google?

There’s no dispute: Google is the king of search engines. That means there’s a lot of data available in its search results for a savvy scraper to take advantage of.

Here are a few applications for a Google scraper:

Collecting Customer Feedback Data to Inform Your Marketing

In the modern shopping experience, it is common for consumers to look for product reviews before deciding on a purchase.

With this in mind, a powerful application for a Google SERPs scraper is to collect reviews and customer feedback from your competitor’s products to understand what’s working and what’s not working for them.

You can use this insight to improve your product, find a way to differentiate yourself from the competition, or decide which features and experiences to highlight in your marketing.

Keep this in mind because we’ll be building our scraper around this issue exactly.

Inform Your SEO and PPC Strategy

According to Oberlo, “Google has 92.18 percent of the market share as of July 2019” and it “has been visited 62.19 billion times this year”. With that many eyes on the SERPs, getting your business to the top of these pages for relevant keywords means a lot of money.

Web scraping is primarily an information-gathering tool. We can use it to understand our positions in Google better and benchmark ourselves against the competition.

If we look at our positions and compare ourselves to the top pages, we can generate a strategy to outrank them.

The same goes for PPC campaigns. Because ads appear at the top of every SERP – and sometimes at the bottom – we can tell our scraper to bring back the name, description, and link for every ad appearing at the top of the search results for our targeted keywords.

This research will help us find keywords we haven’t targeted yet, understand our competitors’ strategies, and evaluate their ad copy so we can differentiate ours.

Generate Content Ideas

Google also has many additional features in their SERPs like related searches, “people also ask” boxes and more. Scraping hundreds of keywords allows you to gather all this information in a couple of hours and organize it in an easy-to-analyze database.

These are just a few use cases. Depending on the type of data you need and your ultimate goal, you can use a Google scraper for many different purposes.

How to Build a Google Web Scraper Without Getting Blocked

As we stated earlier, for this example we’ll build our Google web scraper to collect competitors’ reviews. So, let’s imagine we’re a new startup building project management software, and we want to understand the state of the industry.

Let’s start from there:

1. Choose Your Target Keywords

Now that we know our main goal, it’s time to pick the keywords we want to scrape to support it.

To pick your target keywords, think of the terms consumers could be searching to find your offering, and identify your competitors. In this example, we’ll target four keywords:

“asana reviews”
“clickup reviews”
“best project management software”
“best project management software for small teams”

We could add many more keywords to this list, but for this scraper tutorial, they’ll be more than enough.

Also, notice that the first two queries are related to direct competitors, while the last two will help us identify other competitors and get an initial knowledge of the state of the industry.

2. Set Up Your Development Environment

The next step is to get our machine ready to develop our Google scraper. For this, we’ll need a few things:

Python version 3 or later
Pip – to install Scrapy and the other packages we might need
A ScraperAPI API key

Your machine may have a pre-installed Python version. Enter python --version into your command prompt to see if that’s the case.

If you need to install everything from scratch, follow our Python and Scrapy scraping tutorial. We’ll be using the same setup, so get that done and come back.

Note: something to keep in mind is that the team behind Scrapy recommends installing Scrapy in a virtual environment (VE) instead of globally on your PC or laptop. If you’re unfamiliar, the above Python and Scrapy tutorial shows you how to create the VE and install all dependencies.
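
If you just want the short version, the commands below are a minimal sketch of that setup (assuming Python 3 and pip are already on your PATH):

python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install scrapy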

In this tutorial, we’re also going to be using ScraperAPI to avoid any IP bans or repercussions. Google doesn’t really want us to scrape their SERPs – especially for free. As such, they have implemented advanced anti-scraping techniques that’ll quickly identify any bots trying to extract data automatically.

To get around this, we’ll use ScraperAPI: a system that combines third-party proxies, machine learning, huge browser farms, and years of statistical data to ensure that our scraper won’t get blocked from any site, rotating our IP address for every request, setting wait times between requests, and handling CAPTCHAs.

In other words, by just adding a few lines of code, ScraperAPI  will supercharge our scraper, saving us headaches and hours of work.

All we need for this tutorial is to get our API Key from ScraperAPI. To get it, just create a free ScraperAPI account to redeem 5000 free API requests.

3. Create Your Project’s Folder

After installing Scrapy in your VE, enter this snippet into your terminal to create the necessary folders:

scrapy startproject google_scraper
cd google_scraper
scrapy genspider google api.scraperapi.com

Scrapy will first create a new project folder called “google_scraper,” which is also the project’s name. Next, go into this folder and run the genspider command to create a web scraper named “google”.

We now have many configuration files, a “spiders” folder containing our scraper, and a Python modules folder containing package files.
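
On a typical installation, the resulting structure looks roughly like this (genspider adds google.py inside the spiders folder):

google_scraper/
    scrapy.cfg
    google_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            google.py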

4. Import All Necessary Dependencies to Your google.py File

The next step is to build a few components that will make our script as efficient as possible. To do so, we’ll need to make our dependencies available to our scraper by adding them at the top of our file:

import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime
API_KEY = 'YOUR_API_KEY'

With these dependencies in place, we can use them to build requests and handle JSON files. This last detail is important because we’ll be using ScraperAPI’s autoparse functionality.

After sending the HTTP request, it will return the data in JSON format, simplifying the process and making it so that we don’t have to write and maintain our own parser.
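
To make the parsing step later easier to follow, here is a heavily simplified sketch of the shape that response takes once loaded into Python. The field names shown are only the ones our spider will use, and they are an assumption to verify against a real response:

# Simplified sketch (assumption) of ScraperAPI's autoparse payload after json.loads()
response_data = {
    'organic_results': [
        {'title': '...', 'snippet': '...', 'link': 'https://...'},
        # one entry per organic result on the page
    ],
    'pagination': {'nextPageUrl': 'https://www.google.com/search?q=...&start=100'},
}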

5. Construct the Google Search Query

Google employs a standard and query-able URL structure. You just need to know the URL parameters for the data you need and you can generate a URL to query Google with.

That said, the following makes up the URL structure for all Google search queries:

http://www.google.com/search

There are several standard parameters that make up Google search queries:

q represents the search keyword parameter.

http://www.google.com/search?q=tshirt, for example, will look for results containing the keyword “tshirt.”

The start parameter specifies the offset point for the results. http://www.google.com/search?q=tshirt&start=100 is an example. hl is the language parameter. http://www.google.com/search?q=tshirt&hl=en is a good example. The as_sitesearch parameter allows you to restrict the search to a domain (or website). http://www.google.com/search?q=tshirt&as_sitesearch=amazon.com is one example.

The num parameter specifies the number of results per page (the maximum is 100). http://www.google.com/search?q=tshirt&num=50 is an example. The safe parameter returns only “safe” results. http://www.google.com/search?q=tshirt&safe=active is a good example.

Note: Moz’s comprehensive list of Google search parameters is incredibly useful for building a query-able URL. Bookmark it for more complex scraping projects in the future.

Alright, let’s define a method to construct our Google URL using this information:

def create_google_url(query, site=''):
    # Build a Google search URL for the given keyword; num=100 requests 100 results per page
    google_dict = {'q': query, 'num': 100, }
    if site:
        # Optionally restrict the search to a single domain via as_sitesearch
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)

In this method, we set ‘q’ to the query argument because we’ll specify our actual keywords later in the script, which makes it easier to change what our scraper searches for.
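
As a quick sanity check, you can call the function from a Python shell. The outputs below are what the code above produces with Python 3.7+ dictionary ordering; the second call assumes you pass a full site URL so urlparse can extract the domain:

# Assumes create_google_url() from the snippet above is defined in the same session
print(create_google_url('best project management software'))
# http://www.google.com/search?q=best+project+management+software&num=100

print(create_google_url('tshirt', site='https://www.amazon.com'))
# http://www.google.com/search?q=tshirt&num=100&as_sitesearch=www.amazon.com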

6. Define the ScraperAPI Method

To use ScraperAPI, all we need to do is to send our request through ScraperAPI’s server by appending our query URL to the proxy URL provided by ScraperAPI using payload and urlencode. The code looks like this:

def get_url(url):
    # Wrap the target URL in ScraperAPI's proxy endpoint; autoparse returns Google results as JSON
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
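
As a rough illustration of what the wrapped request looks like (with API_KEY left as the placeholder value), calling the method on a Google URL produces something like:

# Sketch: the Google URL ends up percent-encoded inside ScraperAPI's query string
print(get_url('http://www.google.com/search?q=tshirt&num=100'))
# http://api.scraperapi.com/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dtshirt%26num%3D100&autoparse=true&country_code=us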

Now that we have defined the logic our scraper will use to construct our target URLs, it’s time to build the main spider.

7. Write the Spider Class

In Scrapy we can create different classes, called spiders, to scrape specific pages or groups of sites. Thanks to this feature, we can build different spiders inside the same project, making it much easier to scale and maintain.

class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10, 
                       'RETRY_TIMES': 5}

We need to give our spider a name, as this is how Scrapy will determine which script you want to run. The name you choose should be specific to what you’re trying to scrape, as projects with multiple spiders can get confusing if they aren’t clearly named. 

Because our URLs will start with ScraperAPI’s domain, we’ll also need to add “api.scraperapi.com” to allowed_domains. ScraperAPI will change the IP address and headers between every retry before returning a failed message (which doesn’t count against our total available API calls).

We also want to tell our scraper to ignore the directives in the robots.txt file. This is because, by default, Scrapy won’t scrape any page that the site’s robots.txt file disallows.

 Finally, we’ve set a few constraints so that we don’t exceed the limits of our free ScraperAPI account. As you can see in the custom_settings code above, we’re telling ScraperAPI to send 10 concurrent requests and to retry 5 times after any failed response.

8. Send the Initial Request

It’s finally time to send our HTTP request. It is very simple to do this with the start_requests(self) method:

def start_requests(self):
    queries = ['asana+reviews',
               'clickup+reviews',
               'best+project+management+software',
               'best+project+management+software+for+small+teams']
    for query in queries:
        url = create_google_url(query)
        yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

It will loop through a list of queries that will be passed to the create_google_url function as query URL keywords.

The query URL we created will then be sent to Google Search via the proxy connection we set up in the get_url function, utilizing Scrapy’s yield. The result will then be given to the parse function to be processed (it should be in JSON format). The {‘pos’: 0} key-value pair is also added to the meta parameter, which is used to count the number of pages scraped.

Note: when typing keywords, remember that every word in a keyword is separated by a + sign, rather than a space.
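
If your keywords are stored with normal spaces, a one-line conversion before adding them to the queries list does the trick (shown here purely as an illustration):

# 'best project management software' -> 'best+project+management+software'
query = 'best project management software'.replace(' ', '+')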

9. Write the Parse Function

Thanks to ScraperAPI’s auto parsing functionality, our scraper should receive a JSON file in response to each request. Make sure it does by enabling the ‘autoparse’: ‘true’ parameter in the get_url function.

Next, we’ll load the complete JSON response and cycle through each result, taking the data and combining it into a new item that we can utilize later.

This procedure checks to see whether another page of results is available. The request is invoked again if an additional page is present, repeating until there are no additional pages.

    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})

10. Run the Spider

Congratulations, we built our first Google scraper! Remember, our code can always be changed to add functionality we discover is missing, but for now we have a functional scraper. If you’ve been following along, your google.py file should look like this by now:

import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime
API_KEY = 'YOUR_API_KEY'
def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100, }
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)
class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'ROBOTSTXT_OBEY': False, 'LOG_LEVEL': 'INFO',
                       'CONCURRENT_REQUESTS_PER_DOMAIN': 10, 
                       'RETRY_TIMES': 5}
    def start_requests(self):
        queries = ['asana+reviews', 'clickup+reviews', 'best+project+management+software', 'best+project+management+software+for+small+teams']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})
    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})

Note: If you want to scrape Google SERPs from different countries (let’s say Italy), all you need to do is change the code inside the country_code parameter in the get_url function. Check out our documentation to learn every parameter you can customize in ScraperAPI.

To run our scraper, navigate to the project’s folder inside the terminal and use the following command:

scrapy crawl google -o serps.csv

Now our spider will run and store all the scraped data in a new CSV file named “serps.csv.” This feature is a big time-saver and one more reason to use Scrapy for web scraping Google.
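
Scrapy infers the export format from the file extension, so the same spider can just as easily produce the JSON output mentioned earlier:

scrapy crawl google -o serps.json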

The stored data can then be analyzed and used to provide insight for tools, marketing and more.
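
As a minimal sketch of that kind of analysis (assuming the serps.csv file produced above), you could count which domains appear most often across your target keywords:

import csv
from collections import Counter
from urllib.parse import urlparse

# Tally the domains that rank for our scraped keywords
domains = Counter()
with open('serps.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        domains[urlparse(row['link']).netloc] += 1

print(domains.most_common(10))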

If you’d like to dive deeper into web scraping with Python, check out our Python and Beautiful Soup tutorial. Beautiful Soup is a simpler web scraping library for Python that’s just as powerful for scraping static pages.

To make the most out of ScraperAPI, take a look at our web scraping and ScraperAPI best practices cheat sheet. You’ll learn about the most common challenges when scraping large sites and how to overcome them.

Happy scraping!

This post was originally published on ScraperAPI.

by Scraper API (@scraperapi). Scraper API is a software tool that allows companies to collect data from web pages with an API call.