How to block OpenAI from crawling your website

Not everyone was thrilled to learn that OpenAI, the creators of ChatGPT, had been training their AI on data taken from people’s websites without permission. While it’s too late to do anything about the data they’ve already crawled, you can stop these models from being trained on your current and future content — and all it takes is two lines of code.

However, just because you can block OpenAI from crawling your website, I’d highly recommend asking whether you should. For more on that, read this article: “Leaders: Don’t prematurely block OpenAI from your websites.”

How ChatGPT crawls the web for content

OpenAI uses a web crawler called GPTBot to gather training data for its AI models (such as GPT-4). Web crawling is when an automated bot systematically visits web pages and collects their content. It happens all the time; in fact, this is how Google builds its search index!

How to block GPTBot from crawling your site

The code below disallows GPTBot from accessing your site, thereby stopping it from using your content for training purposes.

First, open your website’s robots.txt file

If you’re not familiar with this concept, a robots.txt file lives at the root of your website. So, for www.pluralsight.com, it would live at www.pluralsight.com/robots.txt. This is a plain-text document that tells web crawlers which parts of your site they may access, and it is always publicly accessible. For instance, if you wanted to stop Google from crawling something, you’d enter:
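Here’s a minimal sketch (the /private/ path is just an illustrative placeholder):

    User-agent: Googlebot
    Disallow: /private/

Next, add the GPTBot entry

To block GPTBot across your entire site, add the two lines below to your robots.txt file. This matches the snippet OpenAI published in its GPTBot documentation:

    User-agent: GPTBot
    Disallow: /

If you only want to block part of your site, OpenAI’s documentation indicates that GPTBot also honors path-level rules along these lines (the directory names here are placeholders):

    User-agent: GPTBot
    Allow: /directory-1/
    Disallow: /directory-2/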

When ChatGPT may crawl your website, regardless of your robots.txt file

Currently, it’s unclear whether your robots.txt file will stop the web-browsing versions of ChatGPT (such as “Browse with Bing”) or ChatGPT plugins, because those requests don’t necessarily go through GPTBot.
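OpenAI has separately documented a ChatGPT-User agent for user-initiated browsing and plugin requests. If you want to cover that traffic as well, the entry follows the same pattern, though treating it as a complete block on browsing is an assumption rather than a guarantee:

    User-agent: ChatGPT-User
    Disallow: /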

What GPTBot won’t crawl, regardless of your robots.txt file

According to OpenAI, web pages crawled by the bot are filtered to “remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates (their) policies.” 

That said, it’s a gamble to rely on GPTBot not to crawl these things, so the safest bet is to use the robots.txt entries above (and maybe don’t have PII publicly searchable in the first place).

How can I tell if my website has already been crawled to train an AI?

OpenAI has been notoriously tight-lipped about what sites GPT-4, the current AI model behind ChatGPT, was trained on. For competitive reasons, OpenAI has said they will not share the details of the “architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

In short, there’s no way to tell if your website was crawled to train GPT-4, so all you can do is take the precautions listed above if you don’t want your website data crawled to train an AI model (or at least, the ones built by OpenAI).
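Going forward, however, you can watch your server’s access logs for GPTBot, which identifies itself in its user-agent string. At the time of writing, OpenAI documented the string as:

    Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot

Searching your logs for “GPTBot” will at least show whether the crawler has visited since you added the robots.txt entry.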

Conclusion

From reading this article, you should have a solid understanding of how the robots.txt file works, and how to add an entry that blocks OpenAI’s bot from crawling your site to train AI models.

Further learning about ChatGPT and AI

Worried about ChatGPT? Being informed is the best way to make measured decisions on how to handle AI use at your organization. There are a number of courses that Pluralsight offers that can help you learn the ins and outs of AI — you can sign up for a 10-day free trial with no commitments. Here are some you might want to check out:

If you’re wondering how to deal with your company’s usage of ChatGPT and similar products, here are some articles that may help:

