How to block OpenAI from crawling your website

Not everyone was thrilled to learn that OpenAI, the creators of ChatGPT, had been training their AI on data taken from people’s websites without permission. While it’s too late to do anything about the data they’ve already crawled, you can stop these models from being trained on your current and future content — and all it takes is two lines of code.

However, just because you can block OpenAI from crawling your website, I’d highly recommend asking whether you should. For more on that, read this article: “Leaders: Don’t prematurely block OpenAI from your websites.”

How ChatGPT crawls the web for content

OpenAI uses a web crawler called GPTBot to gather training data for its AI models (such as GPT-4). Web crawling is when an automated bot systematically visits web pages and collects their content. It happens all the time; in fact, this is how Google builds its search index!

How to block GPTBot from crawling your site

The code below disallows GPTBot from accessing your site, thereby stopping it from using your content for training purposes.

First, open your website’s robots.txt file

If you’re not familiar with this concept, a robots.txt file lives at the root of your website. So, for www.pluralsight.com, it would live at www.pluralsight.com/robots.txt. This is a plain-text document that tells web crawlers which parts of your site they may access, and it is always publicly accessible. For instance, if you wanted to stop Google from crawling something, you’d enter:
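Here’s a minimal sketch (the /private/ path is just an illustrative placeholder):

    User-agent: Googlebot
    Disallow: /private/

Next, add the GPTBot entry

To block GPTBot across your entire site, add the two lines below to your robots.txt file. This matches the snippet OpenAI published in its GPTBot documentation:

    User-agent: GPTBot
    Disallow: /

If you only want to block part of your site, OpenAI’s documentation indicates that GPTBot also honors path-level rules along these lines (the directory names here are placeholders):

    User-agent: GPTBot
    Allow: /directory-1/
    Disallow: /directory-2/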

When ChatGPT may crawl your website, regardless of your robots.txt file

Currently, it’s unclear whether your robots.txt file will stop the web-browsing versions of ChatGPT (such as “Browse with Bing”) or ChatGPT plugins, because those requests don’t necessarily go through GPTBot.
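OpenAI has separately documented a ChatGPT-User agent for user-initiated browsing and plugin requests. If you want to cover that traffic as well, the entry follows the same pattern, though treating it as a complete block on browsing is an assumption rather than a guarantee:

    User-agent: ChatGPT-User
    Disallow: /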

What GPTBot won’t crawl, regardless of your robots.txt file

According to OpenAI, web pages crawled by the bot are filtered to “remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates (their) policies.” 

That said, it’s a gamble to rely on GPTBot not to crawl these things, so the safest bet is to use the robots.txt entries above (and maybe don’t have PII publicly searchable in the first place).

How can I tell if my website has already been crawled to train an AI?

OpenAI has been notoriously tight-lipped about what sites GPT-4, the current AI model behind ChatGPT, was trained on. For competitive reasons, OpenAI has said they will not share the details of the “architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

In short, there’s no way to tell if your website was crawled to train GPT-4, so all you can do is take the precautions listed above if you don’t want your website data crawled to train an AI model (or at least, the ones built by OpenAI).
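Going forward, however, you can watch your server’s access logs for GPTBot, which identifies itself in its user-agent string. At the time of writing, OpenAI documented the string as:

    Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot

Searching your logs for “GPTBot” will at least show whether the crawler has visited since you added the robots.txt entry.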

Conclusion

From reading this article, you should have a solid understanding of how the robots.txt file works, and how to add an entry that blocks OpenAI’s bot from crawling your site to train AI models.

Further learning about ChatGPT and AI

Worried about ChatGPT? Being informed is the best way to make measured decisions on how to handle AI use at your organization. There are a number of courses that Pluralsight offers that can help you learn the ins and outs of AI — you can sign up for a 10-day free trial with no commitments. Here are some you might want to check out:

If you’re wondering how to deal with your company’s usage of ChatGPT and similar products, here are some articles that may help:

