source link: https://finance.yahoo.com/news/openai-just-admitted-bot-crawls-211032730.html

OpenAI just admitted it has a bot that crawls the web to collect AI training data. If you don't block GPTbot, that's self-sabotage.

Alistair Barr
Wed, August 9, 2023, 6:10 AM GMT+9 · 6 min read
Huntsman spider. Amith Nag Photography/Getty Images
  • Spiderbots have been crawling the web for years collecting data.

  • Some of these bots have been helpful because they send users to sources of original content online.

  • The rise of generative AI and LLMs is undermining this grand internet bargain.

I hate spiders. When I traveled around the world in 2003, the thought of chunky, hairy arachnids creeping beneath my mosquito net kept me awake on many a tropical night.

Unbeknownst to most people, there are digital spiders crawling all over the websites you read and create. The most active one is probably Googlebot, which automatically collects web information so Google can later rank and serve it up in Search results.

Right now, there are several of these spiderbots crawling all over these words I wrote here, which is kinda creepy.

Some of these digital crawlers have also been incredibly helpful. Take the book I wrote about my travels in 2003. When Google's bot crawls my book webpage, I'm happy because when people later search for travel books they might be sent to my book. Maybe they'll buy it and read it.

This is the grand bargain that has made the internet economy thrive: Google scrapes your content and sends you traffic so you have an incentive to keep posting information online.

AI is undermining the grand web bargain

Now the rise of generative AI and large language models is undermining this deal. OpenAI recently admitted that it has one of these spiders crawling around the web. It's called GPTbot and it's being used to scrape and collect online content for AI model training. The next big model, GPT-5, will likely be trained on the data scooped up by this bot.
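
For site owners wondering what blocking might look like in practice, here is a minimal sketch. It assumes the block is done with a robots.txt rule and that the crawler identifies itself with the user-agent token GPTBot; neither detail is spelled out in this article. The sample rules, the example.com URL, and the can_fetch helper are all hypothetical, and the check only simulates how a well-behaved crawler would read the file. Honoring robots.txt is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away GPTBot, leave every other crawler alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/my-travel-book"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        verdict = "allowed" if can_fetch(agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

With these sample rules, the AI crawler is refused while a search crawler such as Googlebot can still index the page, which keeps the traffic side of the bargain intact.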

GPT-4, ChatGPT, and other powerful models cleverly answer questions immediately, so there's less need to send users to the sources of the original information. This may be a great user experience, but the incentives to share high-quality free information online begin to break down pretty quickly.

Why would any producer of free online content let OpenAI scrape its material when that data will be used to train future LLMs that later compete with that creator by pulling users away from their site? You can already see this in action as fewer people visit Stack Overflow to get software coding help.
