
How to scrape DHGate.com with Puppeteer


Scraping the internet in search of masks for sale

Looking for e-commerce platforms that still sell masks at normal prices, we quickly stumbled upon the Chinese platforms AliExpress and DHGate.com. Both openly sell PPE, but listings change frequently, so we want to scrape them to keep our website always up to date (we requested API access but were not issued a key).

Scraping AliExpress is easy with the many open-source libraries out there (this one is the most thorough).

Writing my own library

But there's no such library for DHGate, so I set out to make my own, using Puppeteer, a library that remote-controls headless Chrome.

I was surprised to find more hurdles than usual while scraping DHGate.com, some quite creative. This post gives a high level overview of how I developed a scraper that gets detailed product information from DHGate, including item details, prices, stock, shipping information, and reviews. It is not a tutorial and does not give an introduction to Puppeteer.

Because the project is not open-source, some parts are left vague, but much of the content applies to scraping in general.

Interested in scraping DHGate or AliExpress but don't want to code? Email us: [email protected]

Target website feature identification

Before writing a single line of code, the target website must be fully understood.

The easiest way to identify features is to.... not identify features. Instead, I started by building a data model. During the design of the model, all kinds of questions popped up that needed to be answered by clicking around. There is sadly no way of knowing that answers collected this way are 100% correct because they're derived from limited source data.

Feature identification should not be underestimated. Even after clicking on probably a hundred mask listings, I still found new features by accidentally clicking on a t-shirt listing.

Reading the whole HTML and JS source code can also provide clues. Look for attributes on DOM elements that aren't rendered and take special notice of any hardcoded JSON inside <script> tags, and search for "http" in obfuscated JavaScript to discover APIs. Breadcrumbs are available to us in nicely formatted JSON-LD:

[Image: JSON-LD breadcrumbs on DHGate.com]
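This kind of structured data can be pulled straight out of the page with Puppeteer. A minimal sketch, assuming the breadcrumbs follow schema.org's BreadcrumbList type (which is what the JSON-LD above suggests):

// Collect every JSON-LD block embedded in <script type="application/ld+json"> tags
const jsonLd = await page.$$eval('script[type="application/ld+json"]', scripts =>
	scripts.map(s => {
		try {
			return JSON.parse(s.textContent || '')
		} catch {
			return null // ignore malformed blocks
		}
	}).filter(Boolean)
)

// The breadcrumbs are the entry with the schema.org BreadcrumbList type
const breadcrumbs = jsonLd.find(d => d && d['@type'] === 'BreadcrumbList')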

Some discoveries

Items can have multiple SKUs. For example, a product could come in different colors.

[Image: Different SKUs of a DHGate.com listing]

Can an item be stocked in multiple locations? Possible, but rare.

If an item is stocked in more than one location, does the price depend on the stock location? Yes for AliExpress, no for DHGate.

Does changing the SKU of a listing have an impact on shipping data or stock data? No, shipping and stock data depend only on the product ID, not the SKU.

Does the shipping price depend on item quantity? Originally I thought not, but I was wrong: it depends on the listing. Luckily the legitimate mask listings all have free shipping, so there's no need to implement dynamic shipping pricing right now. I only found out about this after the scraper was already deployed to production.

Is the timestamp printed on reviews in the browser's timezone, the IP location's timezone, or some central timezone?

Is stock information present for all listings? No, but it seems to be there for almost all mask listings.

There are multiple prices, depending on the item quantity. Is that implemented like tax brackets, or do you actually get the same price on every item? Every item costs the same, so it can be cheaper to buy 10 items instead of 9.

Does the platform use dynamic pricing? Will I get a different price when I'm logged in? AliExpress runs promotions for new users with discounted prices; DHGate doesn't seem to do this. The prices seem to be quite fixed.

For every newly discovered feature, I saved the URL in a text file, for eventual testing of the scraper.

Finalizing the data model

Postgres is an easy choice for the database and since the project will likely not scrape more than a few hundred thousand listings, there should be no scaling concerns.

I decided to treat (product_id, sku_id) as the primary key, as the SKU ID has an impact on prices (and on AliExpress it also has an impact on stock count, and I want to keep the data models as similar as possible).

General product info, prices, shipping, stock, and reviews all have their own tables. At this point I also had to plan how to handle updating the data: some information, like prices, is replaced on every scrape (delete + insert), while other information, like reviews, is merged (insert ... on conflict do update).
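To make the two update strategies concrete, here's a minimal sketch using node-postgres; the table and column names are made up for the example and don't reflect the real schema.

import { Pool } from 'pg'

const pool = new Pool() // connection settings come from the usual PG* env vars

// Prices: replaced wholesale on every scrape (delete + insert)
async function replacePrices(productId: string, skuId: string,
		prices: { minPcs: number, maxPcs: number, price: number }[]) {
	await pool.query('DELETE FROM prices WHERE product_id = $1 AND sku_id = $2', [productId, skuId])
	for (const p of prices) {
		await pool.query(
			'INSERT INTO prices (product_id, sku_id, min_pcs, max_pcs, price) VALUES ($1, $2, $3, $4, $5)',
			[productId, skuId, p.minPcs, p.maxPcs, p.price]
		)
	}
}

// Reviews: merged, so re-scraping the same review updates it in place
async function upsertReview(productId: string, review: { id: string, rating: number, body: string }) {
	await pool.query(
		`INSERT INTO reviews (review_id, product_id, rating, body)
		 VALUES ($1, $2, $3, $4)
		 ON CONFLICT (review_id) DO UPDATE SET rating = EXCLUDED.rating, body = EXCLUDED.body`,
		[review.id, productId, review.rating, review.body]
	)
}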

Let's Puppeteer

Time to automate the browser.

A REPL workflow is handy for early-stage development. This Medium post describes how to use the Chrome Dev Tools with Node.js debugging so you can write Puppeteer code line by line and evaluate the results immediately (you'll need to invoke node with the --experimental-repl-await flag). Neat.

For development purposes, I'm using Puppeteer with the headless flag set to false, so I can visually see what's happening on the page.
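Combined with the REPL workflow above, getting a visible browser up takes just a few lines:

const puppeteer = require('puppeteer')

const browser = await puppeteer.launch({
	headless: false, // show the actual browser window during development
	slowMo: 50       // optional: slow every action down by 50 ms so it's easier to follow
})
const page = await browser.newPage()
await page.goto('https://www.dhgate.com')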

Using Puppeteer, DHgate greets you with this:

[Image: DHGate.com error page - HTTP 403 Access Denied]

I had to fix the user-agent header. While I was at it, I also added a mock WebDriver implementation.
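Roughly like this; the user-agent string is a placeholder for a current desktop Chrome one, and overriding navigator.webdriver is one common way to do the WebDriver mocking (not necessarily exactly what ended up in production):

// Replace the default "HeadlessChrome" user agent with a normal desktop one (placeholder string)
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36')

// Hide the webdriver flag before any page script gets a chance to check it
await page.evaluateOnNewDocument(() => {
	Object.defineProperty(navigator, 'webdriver', { get: () => undefined })
})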

More defenses

Once past the access check, I discovered the next level of scrape protection: the page loads, but it takes forever. Loading a page can take anywhere from 20 to 120 seconds. While this is fine for running a scraper (you can just run more instances), it makes development painfully slow, so I had to find ways around it.

Chrome Dev Tools revealed that the slow requests are images, so I decided not to load them:

await page.setRequestInterception(true);
page.on('request', (request) => {
	const url = request.url()

	if (url.endsWith('.png') || url.endsWith('.jpg')) {
		request.abort();
	} else {
		// every other request has to be continued explicitly once interception is enabled
		request.continue();
	}
});

Later on I found that some images use cache-busting params in the URL, so the code had to be updated to also block URLs like /someimage.png?v=2013-05.
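Matching on the URL path instead of the full URL takes care of that. A sketch of the updated check:

page.on('request', (request) => {
	// Strip the query string so /someimage.png?v=2013-05 is still recognized as an image
	const { pathname } = new URL(request.url())

	if (pathname.endsWith('.png') || pathname.endsWith('.jpg')) {
		request.abort()
	} else {
		request.continue()
	}
})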

Cookies

Adding cookies also helped. Using a Chrome extension to inspect cookies in the real browser while logged in, I tested multiple cookies by hand. Not all cookies are used in the scrape protection, so it's pointless to implement all of them.
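Setting the handful of cookies that do matter is straightforward; the names and values below are placeholders for the ones copied from a real logged-in session:

await page.setCookie(
	{ name: 'SESSION_COOKIE', value: 'value-copied-from-a-real-browser', domain: '.dhgate.com' },
	{ name: 'ANOTHER_RELEVANT_COOKIE', value: 'also-copied-by-hand', domain: '.dhgate.com' }
)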

Another benefit of adding a session cookie is that it stops some of their A/B tests I discovered while clicking around. Being able to force a certain version is a huge win.

Presumably, setting the cookie has an impact on the load balancer ("sticky sessions") and routes you to the older parts of the system that could perhaps be less complicated. That's pure speculation though.

Still loading slowly

OK, now the page loads a bit faster, but still not at normal speed. Adding some common request headers helped. While I was at it, I also added some more normal-seeming behavior: random clicking around and scrolling up and down. Pages load reasonably fast now.
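A sketch of both tweaks; the header values, coordinates and delays are arbitrary:

// Send a few headers a real browser would always send
await page.setExtraHTTPHeaders({
	'accept-language': 'en-US,en;q=0.9',
	'upgrade-insecure-requests': '1'
})

// Pretend to be a human skimming the page: scroll down, pause, wiggle the mouse, scroll back up
await page.evaluate(() => window.scrollBy(0, 600))
await new Promise(resolve => setTimeout(resolve, 500 + Math.random() * 1000))
await page.mouse.move(200 + Math.random() * 400, 300 + Math.random() * 200)
await page.evaluate(() => window.scrollBy(0, -300))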

Note that puppeteer can fill out form fields by modifying the DOM directly or by emulating keyboard events.

Emulating the keyboard is more robust because some websites using React only listen to the onkeydown event, but not the change event. Another benefit is that pages could run a keylogger to detect bots not typing letters one-by-one (pretty much no website runs a keylogger though).
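The two approaches side by side (the selector and text are made up for the example):

// Emulating the keyboard: fires real keydown/keypress/input events, one character every ~100 ms
await page.type('#searchInput', 'kn95 mask', { delay: 100 })

// Modifying the DOM directly: faster, but sites that listen for key events won't notice the change
await page.$eval('#searchInput', (el, value) => {
	(el as HTMLInputElement).value = value as string
}, 'kn95 mask')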

Extracting information

Now that we finally see an actual page, it's time to scrape data from it. Puppeteer has 2 handy features for that:

jQuery-esque parsing

My favorite function for parsing is not the popular page.evaluate(). I much prefer page.$eval() and page.$$eval().

You might remember jQuery's selector syntax $('.someclass').text(). The Chrome Dev Console actually implements a similar $ API.

Transforming code that's written in the Chrome Dev Console using $() or $$() into Puppeteer code is trivial, and testing in the REPL allows for a very short iteration cycle. Note: If you're used to using const all the time, start going back to var/let for the REPL.

let href = $('.list-item a').getAttribute('href') // Dev Console
const href = await page.$eval('.list-item a', a => a.getAttribute('href')) // puppeteer

Here's some actual production code:

const prices = await page.$$eval('.js-wholesale-list li',
	lis => lis.map(li => ({
		price: parseFloat(li.getAttribute('price')),
		minPcs: parseFloat(li.getAttribute('nums')?.split(' ')[0]),
		maxPcs: parseFloat(li.getAttribute('nums')?.split(' ').reverse()[0])
	}))
)

const storeName = await page.$eval('li.top-seller-name', li => (li as HTMLElement).innerText)
//                                                               ^^^^^^^^^^^^^^^^^
//                                                           type-hint for TypeScript

Don't parse if you can sniff

Data that is loaded dynamically into the page can be scraped from the DOM. But what's much easier is just listening to the network request that's loading the data in the first place. Oftentimes this is nicely structured JSON, and might even include more fields than are displayed in the UI.

Executing requests in the browser environment beats using a simple networking library because the browser will automatically include cookies and other tracking parameters.

page.on('response', async response => {
	if (response.url().includes('productshippingajax.do')) { // todo: more precise pattern match using regex
		const data: any = await response.json()
		// ... persist the shipping data
	}
})

To get a list of countries I could go to Wikipedia and find a list of ISO codes. But maybe DHGate doesn't adhere 100% to the ISO codes - better to extract them from the page itself.

Once they are in a variable in the Dev Console, I can call copy(countryCodes) to copy them into my system clipboard.
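In the Dev Console that's two lines; the selector is hypothetical and depends on the markup of DHGate's shipping country dropdown:

let countryCodes = $$('select.shipping-country option').map(o => o.value)
copy(countryCodes) // puts the array on the system clipboard as JSON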

Dealing with rate limits

Getting shipping data requires one request per country. To get data for all countries, a lot of requests are fired in quick succession, which leads to rate-limit bans.

DHGate implements (at least) these firewalls:

Throttling

If you access the site from an IP range they don't like, the request will simply never finish. Try running curl https://www.dhgate.com and you may see it hang for a minute or longer.

Bad IPs

If you have an IP from a subnet they don't like or your request headers (most likely user agent) aren't to their liking, they'll send you this:

curl -i https://www.dhgate.com
HTTP/2 403
server: AkamaiGHost
mime-version: 1.0
content-type: text/html
content-length: 263
expires: Tue, 21 Apr 2020 20:19:16 GMT
date: Tue, 21 Apr 2020 20:19:16 GMT
set-cookie: REDACTED
set-cookie: REDACTED

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http://www.dhgate.com/" on this server.<P>
Reference #18.c50b66ab.1587500356.73b2f54
</BODY>
</HTML>

Too many requests

If you send too many requests, you might also see a yellow error page with Chinese characters. The response code is still 200 OK - sneaky! Here is sample code that detects the error:

if ((await response.text()).includes('错误')) { // '错误' means 'error'
	console.error("Got Chinese error page - IP blocked")
	await proxy.rotate()
	continue;
}
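My first countermeasure was to simply pace the per-country shipping requests. A rough sketch, where selectShippingCountry is a hypothetical helper that picks a country in the shipping widget so the response listener shown earlier can capture the resulting JSON:

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms))

for (const code of countryCodes) {
	await selectShippingCountry(page, code) // hypothetical helper, triggers productshippingajax.do
	await sleep(2000 + Math.random() * 3000) // wait 2-5 seconds between countries
}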

Proxies

Slowing down requests was not enough, as scraping a single listing requires over 100 network requests. It was time to add proxies. I set up a few VPSes to act as my own proxies. Out of 10 hosting providers, 3 were not banned by DHGate. Hooray!
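Pointing headless Chrome at one of those proxies is just a launch argument; the host, port and credentials below are placeholders:

const browser = await puppeteer.launch({
	args: ['--proxy-server=http://123.45.67.89:3128'] // placeholder proxy address
})
const page = await browser.newPage()

// Only needed if the proxy requires authentication
await page.authenticate({ username: 'proxyuser', password: 'proxypass' })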

Paying for proxies

Alternatively, free proxies can be found with ProxyBroker, but I personally prefer a stable network in exchange for paying a few bucks. Working with unstable proxies adds more complexity to the scraper and is rarely worth it (e.g. the proxy dies while getting an item and you replace it with another, but a cookie is bound to the now-dead proxy).

Deploying

The entire thing is wrapped in Docker for easy deployment. Installing Puppeteer in Docker is a pain, but images with Puppeteer preinstalled exist on Docker Hub.

To be notified when the scraper breaks, I added Sentry error reporting. Instead of a dedicated queuing system, a simple Postgres table is used.
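One common way to use a Postgres table as a queue is FOR UPDATE SKIP LOCKED, which lets several scraper instances claim jobs without stepping on each other. A minimal sketch with node-postgres; the table and column names are made up and not necessarily how it's done here:

import { Pool } from 'pg'

const pool = new Pool()

// Claim one pending job, marking it as running in the same statement
async function claimJob() {
	const { rows } = await pool.query(`
		UPDATE scrape_jobs
		SET status = 'running', started_at = now()
		WHERE id = (
			SELECT id FROM scrape_jobs
			WHERE status = 'pending'
			ORDER BY created_at
			FOR UPDATE SKIP LOCKED
			LIMIT 1
		)
		RETURNING *`)
	return rows[0] // undefined when the queue is empty
}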

Sounds like a lot of work? Email us if you want access to our private live-scrape API or need to generate some reports: [email protected]

