The business of real-time data - is there an ethical approach to web scraping?

source link: https://diginomica.com/business-real-time-data-there-ethical-approach-web-scraping

By Jon Reed

June 21, 2023


It's extremely rare for me to turn an email Q/A into an article. It's just too hard to avoid sound bites. But I was intrigued by the assertions I received from Oxylabs.io about the impact of real-time web data.

Can real-time data be a source of competitive intelligence? The answer is: obviously. But the more important questions are:

1. How do you apply real-time data to competitive intelligence?
2. What industry sectors are finding success?
3. What about the privacy and ethical issues such projects are sure to smack into?

Meet my email Q/A foil, Aleksandras Šulženko, Product Owner at Oxylabs.io. Oxylabs has a provocative value prop: "Collect public data at scale with industry-leading web scraping solutions and the world’s largest ethical proxy network."

We started chatting after Šulženko's session at the AI and Big Data Expo in Santa Clara. I was specifically interested in how real-time/public web data fits into the competitive intelligence mix. Šulženko explained that financial services companies are early adopters here.

Web scraping - and the ROI of data gathering

A survey by Oxylabs and Censuswide of 1000+ financial services decision makers found that 44% plan to invest more in web scraping in the coming years - more than in any other data gathering method. Why? As Šulženko wrote to me:

This is no surprise since a quarter (26%) of respondents said web scraping had the greatest impact on revenue compared to other data gathering methods.

Just the term "web scraping" has an off-putting aspect to it, at least to me - and I got into the ethics of this with Šulženko. But first, what are the issues with gathering this type of data? How is it done? Šulženko:

Gathering web data is a challenging process in general. Firstly, to gather any web data, you will need to figure out what URLs you want to access. This can be done either by generating URLs (if they follow a certain pattern) or by crawling a site to figure out what URLs are present on it. Once you have the URLs, you may attempt to fetch the content from the web. The content will usually be in HTML format, so the next step is to parse the HTML into a simpler data structure, such as JSON or CSV, containing only the data points of interest. In the case of real-time data, complexity adds up as there is no room for error: the system must be up and running at all times.
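To make that pipeline concrete, here is a minimal Python sketch of the generate-fetch-parse loop, using the third-party requests and BeautifulSoup libraries; the URL pattern and CSS selectors are hypothetical placeholders, not any particular site's markup.

```python
# A minimal sketch of the fetch-and-parse pipeline described above.
# The URL pattern and CSS selectors are hypothetical placeholders.
import json

import requests
from bs4 import BeautifulSoup

def scrape_product(url: str) -> dict:
    """Fetch one page and reduce its HTML to the fields of interest."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Parse the HTML down to a simpler structure (here, a dict -> JSON).
    return {
        "url": url,
        "title": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

if __name__ == "__main__":
    # URLs generated from a known pattern, per the first step above.
    urls = [f"https://shop.example/products/{i}" for i in range(1, 4)]
    print(json.dumps([scrape_product(u) for u in urls], indent=2))
```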

Gathering accurate data is a real problem:

One of the biggest challenges is gathering accurate data, as wrong content comes in many different forms. Some scraping responses might seem legit, although they contain CAPTCHAs or, even worse, false information from so-called honey pots. Websites can also track and block scrapers based on fingerprints, which include the IP address, HTTP headers, cookies, JavaScript fingerprint attributes, and other data.
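As a rough illustration of catching such bad responses, here is a hedged validation sketch; the CAPTCHA markers and required snippets are invented examples, not a definitive list.

```python
# Illustrative response validation: a "successful" HTTP response may
# still be a CAPTCHA wall or honeypot content. Markers are invented.
CAPTCHA_MARKERS = ("g-recaptcha", "hcaptcha", "are you a robot")

def looks_valid(html: str, required_snippets: list[str]) -> bool:
    """Reject responses resembling CAPTCHA pages or missing expected fields."""
    lowered = html.lower()
    if any(marker in lowered for marker in CAPTCHA_MARKERS):
        return False
    # A page missing fields that real product pages always contain is
    # suspect - possibly an error page or a honeypot with false data.
    return all(snippet.lower() in lowered for snippet in required_snippets)

# Example: only trust pages that actually carry a price element.
# if not looks_valid(resp.text, ['class="price"']): ...retry elsewhere
```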

And yes, anti-scraping measures are out there:

Anti-scraping measures and browser fingerprinting are becoming increasingly sophisticated. To avoid unwanted interruptions, companies have to play with different parameter combinations for different sites, which again increases the complexity of their data gathering solution.
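One common way to keep those per-site parameter combinations manageable is a configuration profile per target; the sites and values below are purely illustrative assumptions.

```python
# Hypothetical per-site request profiles: different targets tolerate
# different header sets, pacing, and rendering requirements.
SITE_PROFILES = {
    "shop-a.example": {
        "headers": {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
        "min_delay_seconds": 2.0,
        "render_javascript": False,
    },
    "shop-b.example": {
        "headers": {"User-Agent": "Mozilla/5.0 (Macintosh)"},
        "min_delay_seconds": 5.0,
        "render_javascript": True,  # layout is built client-side
    },
}
```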

Is there such a thing as legitimate web scraping?

Some might argue that all web scraping is unethical. But Šulženko makes a firm distinction between malicious actors and "legitimate scrapers."

By the way, getting blocked by an anti-scraping solution does not mean that web scraping is a bad or illegitimate action. With anti-scraping measures, websites simply try to secure their servers from request overload and actions done by irresponsible or malicious actors. Distinguishing between these malicious actors and legitimate scrapers would be exceedingly difficult, so administrators just impose a blanket ban on both. Sometimes, the data is locked because of the location – many sites show different content in different countries. However, if a company is collecting competitor intelligence, for example, product prices, it needs to gather data in various locations. It would be impossible without an extensive proxy network.
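A minimal sketch of that location-specific collection, assuming a hypothetical proxy gateway (every provider has its own configuration scheme):

```python
# Fetch the same page as it appears from different countries, routed
# through a proxy network. Gateway URL and credentials are invented.
import requests

def fetch_from_country(url: str, country: str) -> str:
    proxy = f"http://user-country-{country}:password@gateway.proxy.example:7777"
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
    resp.raise_for_status()
    return resp.text

# The same product page, as shoppers in two markets would see it:
# us_html = fetch_from_country("https://shop.example/item/42", "us")
# de_html = fetch_from_country("https://shop.example/item/42", "de")
```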

When you're tracking in real time, you must adapt in real time, too:

When parsing data, the main challenge is adapting to the constant layout changes of the web pages. This requires constant maintenance of parsers – a task that is not particularly difficult but highly time-consuming, especially if the company is scraping many different page types.
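One defensive pattern for that maintenance burden is a chain of fallback selectors that fails loudly when none match, so the parser flags a layout change instead of silently emitting bad data; the selectors here are illustrative assumptions.

```python
# A parser written to survive (and loudly report) layout changes.
from bs4 import BeautifulSoup

PRICE_SELECTORS = ["span.price", "div.product-price", "[data-test='price']"]

def parse_price(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node:
            return node.get_text(strip=True)
    # Failing loudly is the point: silent gaps would quietly corrupt
    # real-time downstream analysis.
    raise ValueError("No known price selector matched; layout may have changed")
```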

Comparing data fields like online pricing also poses semantic problems:

Yet another interesting challenge when gathering public data from e-commerce marketplaces is product mapping. Imagine a company that needs to gather prices and reviews of five different models of Samsung headphones. In different online marketplaces, such products can be listed in different departments and subcategories or have slightly different product names. This makes it difficult to track the same product across multiple e-commerce sites, even with the use of scraping.
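A toy version of product mapping can be built on fuzzy title matching from Python's standard library; real pipelines typically add brand and model extraction, and the listings below are invented.

```python
# Map one product name onto the closest listing title from another
# marketplace using difflib's similarity ratio.
from difflib import SequenceMatcher

def best_match(target: str, candidates: list[str], threshold: float = 0.6):
    """Return the candidate title most similar to target, or None."""
    scored = [(SequenceMatcher(None, target.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, title = max(scored)
    return title if score >= threshold else None

listings = [
    "Samsung Galaxy Buds2 Pro Graphite",
    "Galaxy Buds 2 Pro (Graphite) by Samsung",
    "Sony WF-1000XM5 Black",
]
print(best_match("Samsung Galaxy Buds2 Pro", listings))
```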

From questionable large language model behavior to pro bono web scraping

In my view, the training of large language models (LLMs) on the open Internet is itself a form of scraping. I would argue that, at least to date, it's mostly been an unethical scrape, sweeping up plenty of copyrighted and/or creative material whose creators never opted in or were compensated. Case in point: a major component of Reddit's controversial API monetization decision is Reddit leadership's determination to be compensated by the LLM makers that scrape Reddit's site for training data.

With my fairly cynical view of web scraping, it was interesting to hear from Šulženko about what you might call scraping-for-good scenarios - none of which I had considered.

Non-profit organizations often have really interesting research topics that lend themselves to employing web scraping technology for the common good. For example, Oxylabs has been working on a pro bono initiative with the Communications Regulatory Authority in Lithuania to create an AI-powered tool for fighting illegal content online (mainly related to child sexual abuse).

We see huge untapped potential in such use cases, but they need more support to raise visibility and awareness. We have launched a free-of-charge initiative called Project4B, the aim of which is to transfer technical expertise and grant universities and NGOs free access to web intelligence collection tools.

During our back-and-forth, Šulženko asserted that public web scraping boosts the revenues of financial services companies. I wanted to know: how? By identifying prospects?

Financial service companies and investment banks were early adopters of web scraping technology. Several years ago, some investment firms started using alternative data sources instead of relying on traditional ones like government reports and company financial statements. The alternatives included satellite imagery, mobile app data, public web intelligence, etc., with the latter being the most popular.

There are different ways in which financial service companies gain value via web scraping, with identifying investment prospects being one among many. For example, data from stock tracking websites can be analyzed together with public data scraped from investment forums to identify connections between aggregate investor sentiment and the shifting value of stocks or other financial instruments. Interactive Brokers has already started producing reports that list the stock tickers mentioned most often each day on Reddit's r/wallstreetbets subreddit.
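As a toy illustration of that mention-counting signal (not Interactive Brokers' actual method), the sketch below tallies cashtag-style tickers across invented post titles; a real pipeline would scrape the posts and join the counts against price data.

```python
# Count $TICKER mentions in forum post titles. Posts are invented.
import re
from collections import Counter

posts = [
    "$GME to the moon", "Thoughts on $TSLA earnings?",
    "$GME short interest update", "Why I sold $AAPL",
]
mentions = Counter(
    ticker for post in posts for ticker in re.findall(r"\$([A-Z]{1,5})\b", post)
)
print(mentions.most_common(3))  # [('GME', 2), ('TSLA', 1), ('AAPL', 1)]
```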

Sidenote: I wonder if Reddit's leadership, given their stance against LLMs, would consider r/wallstreetbets fair game for "public" web scraping, or would they differ on this? Šulženko says real-time sentiment analysis is another solid use case:

Web intelligence-powered investor sentiment analysis can also alert investors about emotions and biases that can influence their decisions. CNN has developed a Fear & Greed Index, which is based on the idea that investors tend to be emotional and reactionary. The index is used to measure the mood in the market and calculate fear.

Analyzing macro-economic policy impact is another angle:

Alternative data, such as mobility data or prices of consumer goods scraped from the web, can also help to nowcast economic indicators, such as inflation and price indexes, and aid in evaluating the effects of macroeconomic policies.
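To show the underlying arithmetic, here is a toy, unweighted price index computed from a fixed basket scraped on two dates; the numbers are invented, and real nowcasts use far larger baskets and proper weighting.

```python
# Average price relative of a fixed basket between two scrape dates.
basket_last_month = {"milk": 1.10, "bread": 2.40, "eggs": 3.00}
basket_today = {"milk": 1.16, "bread": 2.52, "eggs": 3.09}

relatives = [basket_today[item] / basket_last_month[item]
             for item in basket_last_month]
index = 100 * sum(relatives) / len(relatives)
print(f"Unweighted price index: {index:.1f}")  # > 100 suggests inflation
```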

The ethics of web scraping

Now for the fun stuff. I asked Šulženko: what are the privacy issues? Does web scraping detect trends, or does it operate more by pre-qualifying or ruling out individuals based on their digital exhaust?

In a nutshell, web scraping is not about non-public data. Most businesses and other organizations, such as academia, need big data for aggregated analysis, and public data is often more than sufficient for this purpose.

Reputable scraping providers usually don’t extract data behind logins (legitimate exceptions might exist in such industries as cybersecurity). Most popular web scraping use cases are gathering product and service prices for optimizing pricing strategies, performing competitor analysis, extracting market trends, and analyzing consumer sentiment to improve product offering and sales. None of these cases need private data to get valuable business insights.

Fair enough, though I can think of a fair number of "public" sites with copyrighted data that aren't crazy about getting scraped - diginomica amongst them. For the record, Šulženko advises customers to get legal advice on these matters rather than scrape copyrighted content. I asked him: could you offer some insights into the ethical and legal aspects/implications of web scraping for companies to consider?

The web scraping industry is relatively new and developing rapidly. Legal regulation and case law, on the other hand, are still lagging behind, which creates a lot of obscurity around web data gathering practices, as well as some myths that portray data scraping as a bad act. It is, however, a normal practice that simply helps individuals and organizations gather massive amounts of publicly available data scattered around the web.

There are a few things that may determine whether web scraping actions are legal or not. The first question to consider is what kind of website will be scraped and what kind of data will be gathered from it. Reputable scraping providers gather public data and don't scrape data behind logins, except for some specific and rare exceptions (e.g. having permission to do so from the website and data owner).

Copyrighted data changes the web scraping rules:

Another thing to consider is how the company will be using the scraped data. Even if the data is publicly available, scraping it might be illegal if it is copyrighted. Copyright is simply another layer that's important to consider before engaging in any scraping. Therefore, our recommendation for companies that plan to gather web intelligence is to thoroughly evaluate their actions and always consult with legal experts.

Šulženko says ethical scraping also considers server load:

There are also distinct ethical considerations – for example, a reputable scraping company will take into account the health of the target servers and distribute its requests properly. Otherwise, the website can simply go down, or become very slow.
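A minimal sketch of that courtesy, assuming a hypothetical target site: check robots.txt and space requests out rather than hammering the server.

```python
# Polite fetching: honor robots.txt and pace requests to spread load.
import time
from urllib import robotparser

import requests

rp = robotparser.RobotFileParser("https://shop.example/robots.txt")
rp.read()

def polite_get(url: str, delay_seconds: float = 2.0):
    if not rp.can_fetch("*", url):
        return None  # respect the site's crawl rules
    resp = requests.get(url, timeout=10)
    time.sleep(delay_seconds)  # pause so the target server stays healthy
    return resp
```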

Not all proxies are created equal, either:

Additionally, the technology that is being used, such as proxies, can also have ethical considerations. There are proxies that are sourced ethically – for example, we get our endpoints from Honeygain – and proxies that are not.

It is crucial to educate industry players and the general public about ethical data gathering. To that end, Oxylabs co-founded the Ethical Web Data Collection Initiative (EWDCI), which seeks to establish the best industry practices and build trust around data scraping technologies.

My take

It goes without saying that Oxylabs is excited about the potential of ML/AI as applied to these scenarios - understandably so. But I've written plenty about AI/ML in recent weeks. In terms of Oxylabs' focus, I'm very interested in the field of competitive intelligence, including real-time information gathering. I'm less enthused, personally, about the tactic of automated web scraping, though there are some relatively harmless aspects, such as comparing pricing between product sites.

But when you have to take into account not overloading someone else's servers, that's just not the type of work that appeals to me personally. As diginomica's web project manager, I can tell you we are always on the lookout for scraping activity, and do not look on it fondly. Then again, there are many occupations I am not too interested in. To be honest, I probably wouldn't write about the business of web scraping again. Having said that, it's important to understand how scraping works, and why so many businesses are interested in it. Oxylabs was articulate about ethical scraping, as well as pro bono and non-profit implications. Those points deserve attention and debate.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK