How to Scrape Google News Results with Node JS

This post shows how to scrape Google News results with Node JS using Unirest and Cheerio.

Requirements:

Web Parsing with CSS selectors

Picking the right tags out of raw HTML by hand is both difficult and time-consuming. The CSS Selectors Gadget makes this easier by helping you find the right selectors for the elements you want.

Here is the link to the tutorial, which teaches you how to use this gadget to select the best CSS selectors for your needs.

User Agents

The User-Agent header identifies the application, operating system, vendor, and version of the requesting user agent. Sending a browser-like User-Agent helps the request to Google look like a visit from a real user.

You can also rotate User Agents. Read more about this in this article: How to fake and rotate User Agents using Python 3.
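The rotation idea can be sketched in Node JS as well. This is a minimal illustration, not a complete anti-blocking setup: `USER_AGENTS`, `pickUserAgent`, and `buildHeaders` are hypothetical names introduced here, and the pool simply reuses the two User-Agent strings that appear later in this post.

```javascript
// Hypothetical pool of User-Agent strings (both taken from this post).
// In practice you would keep a larger, up-to-date list.
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
];

// Illustrative helper: return a random entry from the pool.
const pickUserAgent = (pool = USER_AGENTS) =>
  pool[Math.floor(Math.random() * pool.length)];

// Build a headers object with a freshly picked User-Agent.
const buildHeaders = () => ({ "User-Agent": pickUserAgent() });

console.log(buildHeaders());
```

The object returned by `buildHeaders()` can be passed to the `.headers()` call used in the scraper, so each request may carry a different User-Agent.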

If you want to further safeguard your IP from being blocked by Google, you can try these 10 Tips to avoid getting Blocked while Scraping Google.

Install Libraries

To start scraping Google News Results we need to install some NPM libraries to move forward.

Node JS
Unirest JS
Cheerio JS

Before starting, make sure you have set up your Node JS project and installed both packages, Unirest JS and Cheerio JS. You can install them from the links above or by running npm install unirest cheerio in your project folder.

Target:

Process:

As stated in the Requirements section above, we will use Unirest JS to fetch the HTML and Cheerio JS to parse it.

Here is the full code:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getNewsData = () => {
  return unirest
    .get("https://www.google.com/search?q=football&gl=us&tbm=nws")
    .headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    })
    .then((response) => {
      // Load the returned HTML into Cheerio
      let $ = cheerio.load(response.body);

      let news_results = [];

      // Each news article sits in an element with the class BGxR7d
      $(".BGxR7d").each((i, el) => {
        news_results.push({
          link: $(el).find("a").attr("href"),
          title: $(el).find("div.mCBkyc").text(),
          snippet: $(el).find(".GI74Re").text(),
          date: $(el).find(".ZE0LJd span").text(),
          thumbnail: $(el).find(".NUnG9d img").attr("src"),
        });
      });

      console.log(news_results);
    });
};

getNewsData();

Or you can copy this code from the following link for better understanding: GoogleNewsScraper.

Code Explanation:

First, we declare constants from libraries:

const unirest = require("unirest");
const cheerio = require("cheerio");

Next, we use Unirest JS to make a GET request to our target URL, which in this case is:

https://www.google.com/search?q=football&gl=us&tbm=nws
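The target URL is a base address plus three query parameters: q is the search term, gl is the country, and tbm=nws selects the News tab. As a small sketch, the URL can be built from a search term rather than hard-coded. `buildNewsUrl` is a hypothetical helper added here for illustration; it relies only on the `URL` class built into Node JS.

```javascript
// Illustrative helper (not part of the original scraper): build the
// Google News search URL from a query and a country code.
const buildNewsUrl = (query, country = "us") => {
  const url = new URL("https://www.google.com/search");
  url.searchParams.set("q", query); // search term, encoded automatically
  url.searchParams.set("gl", country); // country of the search
  url.searchParams.set("tbm", "nws"); // restrict results to News
  return url.toString();
};

console.log(buildNewsUrl("football"));
// prints "https://www.google.com/search?q=football&gl=us&tbm=nws"
```

Building the URL this way also takes care of encoding search terms that contain spaces or special characters.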

We send this request with a User-Agent header. As explained in the User Agents section above, a browser-like User-Agent helps the request look like a visit from a real user.

    .headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    })

You can also pass the proxy URL while making the request like this:

    .get("https://www.google.com/search?q=football&gl=us&tbm=nws")
    .headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    })
    .proxy("PROXY URL")

Here "PROXY URL" refers to the URL of the proxy server you will use for making the requests. A proxy hides your actual IP address from the website you are scraping, which helps protect you from being blocked.

Then we load our response into Cheerio and initialize an empty array, news_results, to store our data.

    .then((response) => {
        console.log(response.body)
        let $ = cheerio.load(response.body);
        let news_results = [];


Inspecting the results page shows that every news article is contained in an element with the class BGxR7d. Searching inside this container gives the selectors for the other fields: mCBkyc for the title, GI74Re for the description, ZE0LJd span for the date, and NUnG9d img for the image.

Then we write the parser to extract the required information:

      $(".BGxR7d").each((i, el) => {
        news_results.push({
          link: $(el).find("a").attr("href"),
          title: $(el).find("div.mCBkyc").text().replace("\n", ""),
          snippet: $(el).find(".GI74Re").text().replace("\n", ""),
          date: $(el).find(".ZE0LJd span").text(),
          thumbnail: $(el).find(".NUnG9d img").attr("src"),
        });
      });
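One detail worth knowing about the parser above: when replace is called with a string pattern such as "\n", JavaScript replaces only the first occurrence. If a title or snippet contains several line breaks, a regular expression handles all of them. The sketch below shows one way to do this. `cleanText` is a hypothetical helper, not part of Cheerio.

```javascript
// Illustrative helper: collapse every run of whitespace (including
// newlines) into a single space and trim both ends.
// Falls back to an empty string if the input is undefined or null.
const cleanText = (text) => (text || "").replace(/\s+/g, " ").trim();

console.log(cleanText("  Line one\nline two\n"));
// prints "Line one line two"
```

In the parser, this could be used as `cleanText($(el).find("div.mCBkyc").text())` in place of the `.replace("\n", "")` calls.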

Result:

The script should print an array of objects, one per news article, each with the link, title, snippet, date, and thumbnail fields.

With Google News API

If you don't want to code and maintain the scraper in the long run, you can try a Google News API instead. The example below uses Axios to call Serpdog's news endpoint. Replace APIKEY with your own API key.

const axios = require("axios");

axios
  .get("https://api.serpdog.io/news?api_key=APIKEY&q=football&gl=us")
  .then((response) => {
    console.log(response.data);
  })
  .catch((error) => {
    console.log(error);
  });

Result:

Conclusion:

In this tutorial, we learned to scrape Google News Results using Node JS. Feel free to message me anything you need clarification on. Follow me on Twitter. Thanks for reading!

Additional Resources

Also published here.
