OpenAI Unveils Its New Web Crawler: GPTBot
OpenAI, the organization behind ChatGPT, has published details about its web crawler, GPTBot. Because the crawler now identifies itself, website owners can see whether and how heavily OpenAI is crawling their sites, and can restrict access to all or part of a site through the robots.txt protocol.
OpenAI points to its GPTBot documentation from the URL embedded in the crawler’s user-agent string, https://openai.com/gptbot. The user-agent token for this crawler is GPTBot, and the full user-agent string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot). With the token published, website owners can disallow GPTBot in robots.txt as they would any other crawler.
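To see how heavily GPTBot is hitting a site, one option is to count requests whose user-agent contains the GPTBot token. The following is a minimal Python sketch, assuming the server writes logs in the common "combined" format; the sample log lines, the helper names, and the browser user-agent are invented for illustration and are not from OpenAI.

```python
import re
from collections import Counter

# Token taken from the user-agent string quoted in the article.
GPTBOT_TOKEN = "GPTBot"

# Assumed "combined" log format:
# ip - - [date] "METHOD PATH PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_gptbot_hits(lines):
    """Return a Counter mapping request path -> number of GPTBot requests."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and GPTBOT_TOKEN.lower() in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Hypothetical sample data, for illustration only.
GPTBOT_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
             "compatible; GPTBot/1.0; +https://openai.com/gptbot)")
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"


def sample_line(ip, path, status, agent):
    """Build one combined-format log line for the demo."""
    return (f'{ip} - - [10/Aug/2023:10:00:00 +0000] '
            f'"GET {path} HTTP/1.1" {status} 512 "-" "{agent}"')


SAMPLE_LOG = [
    sample_line("40.83.2.65", "/blog/post-1", 403, GPTBOT_UA),
    sample_line("40.83.2.66", "/blog/post-1", 403, GPTBOT_UA),
    sample_line("40.83.2.65", "/about", 403, GPTBOT_UA),
    sample_line("203.0.113.9", "/contact", 200, BROWSER_UA),
]

if __name__ == "__main__":
    for path, count in count_gptbot_hits(SAMPLE_LOG).most_common():
        print(path, count)
```

Matching on the user-agent alone is easy to spoof, so a count like this is best read as an estimate rather than proof of who made the requests.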
At the time of writing, the IP range for GPTBot is 40.83.2.64/28, but this is subject to change, so it’s worth checking OpenAI’s documentation for updates regularly. OpenAI says that pages crawled by GPTBot may be used to improve future models. According to OpenAI, crawled pages are filtered to exclude sources that require paywall access, gather personally identifiable information (PII), or contain text that violates OpenAI’s policies.
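Since the user-agent string can be forged, checking the source address against the published range is a useful second test. Here is a small Python sketch using the standard ipaddress module; it hard-codes only the range quoted above, which may be out of date by the time you read this, and the function name is an assumption for illustration.

```python
import ipaddress

# Range quoted in the article. OpenAI may change it, so treat this list
# as a snapshot and refresh it from OpenAI's documentation.
GPTBOT_NETWORKS = [ipaddress.ip_network("40.83.2.64/28")]


def is_gptbot_ip(address):
    """Return True if the address falls inside a known GPTBot range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in GPTBOT_NETWORKS)


if __name__ == "__main__":
    # A /28 covers 16 addresses: 40.83.2.64 through 40.83.2.79.
    for candidate in ("40.83.2.64", "40.83.2.79", "40.83.2.80"):
        print(candidate, is_gptbot_ip(candidate))
```

A request that claims to be GPTBot but arrives from outside these ranges is probably an impersonator, provided the list is current.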
According to OpenAI, allowing GPTBot to access your site can help improve the accuracy, overall capabilities, and safety of AI models. If you would rather keep GPTBot out, OpenAI provides instructions for disallowing it through robots.txt.
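As an illustration of what such robots.txt rules look like and how a compliant crawler would read them, the Python sketch below feeds two example rule sets to the standard library’s urllib.robotparser. The directory names and example.com URLs are hypothetical, and the parser only models how a well-behaved crawler interprets the file; it does not enforce anything on its own.

```python
from urllib.robotparser import RobotFileParser

# Block GPTBot from the entire site.
ROBOTS_BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

# Allow GPTBot into one directory and keep it out of another.
# The directory names are hypothetical.
ROBOTS_PARTIAL = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""


def gptbot_allowed(robots_txt, url):
    """Return True if these robots.txt rules let GPTBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    print(gptbot_allowed(ROBOTS_BLOCK_ALL, "https://example.com/any-page"))
    print(gptbot_allowed(ROBOTS_PARTIAL, "https://example.com/public/a"))
    print(gptbot_allowed(ROBOTS_PARTIAL, "https://example.com/private/b"))
```

Rules addressed to GPTBot do not affect other crawlers, so a search engine’s bot would still be allowed under either rule set above.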
There have been some complaints about GPTBot’s activity, with one webmaster reporting more than 1,000 hits from the bot on individual pages. The site automatically served a 403 response for each hit because the bot was not whitelisted and failed the site’s ‘human’ test.
Before GPTBot was documented, it was only possible to block ChatGPT plugins. Separately, Google and others appear to be developing an alternative to the robots.txt protocol specifically for AI search purposes.