OpenAI recently launched GPTBot, a web crawler that accesses websites directly to collect information for training its AI models, including GPT-4 and the future GPT-5.
The bot can be identified by the following user agent token and the entire user agent string:
User-agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
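As a rough illustration, a site could check incoming request headers or log lines for the token above. The Python sketch below is only a minimal example: the function name `is_gptbot` and the regular expression are assumptions made for illustration, not something OpenAI provides. It matches the token loosely so that a future version number would still be caught. Keep in mind that a user-agent header can be spoofed, so a match is a hint rather than proof.

```python
import re

# Token taken from the user-agent string quoted above. The version suffix is
# optional in this pattern because it may change (an assumption for robustness).
GPTBOT_PATTERN = re.compile(r"\bGPTBot(?:/[\d.]+)?", re.IGNORECASE)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the user-agent header contains the GPTBot token."""
    return bool(GPTBOT_PATTERN.search(user_agent or ""))


gptbot_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

print(is_gptbot(gptbot_ua))   # True
print(is_gptbot(browser_ua))  # False
```

A check like this is useful for measuring how often the crawler visits before deciding whether to restrict it.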
Website owners can block GPTBot from their entire site by adding the following to the robots.txt file:
User-agent: GPTBot
Disallow: /
Owners may also restrict the crawler only partially by specifying which directories GPTBot may and may not access. Separately, according to OpenAI, pages crawled by GPTBot are filtered to remove sources that sit behind a paywall, that violate OpenAI policies, or that gather personal information. To customize directory access, add the following to the robots.txt file:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
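Before publishing rules like these, it can help to confirm that they behave as intended. The following Python sketch uses the standard library's `urllib.robotparser` to evaluate the partial-access rules against sample URLs. The helper name `gptbot_allowed` and the `example.com` URLs are hypothetical, chosen only for illustration, and the standard-library parser may not interpret every rule exactly as GPTBot itself does.

```python
from urllib.robotparser import RobotFileParser

# The partial-access rules from the article, one directive per line.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse from text, no network access
    return parser.can_fetch("GPTBot", url)


# example.com is a placeholder domain used only for this sketch.
print(gptbot_allowed(ROBOTS_TXT, "https://example.com/directory-1/page.html"))  # True
print(gptbot_allowed(ROBOTS_TXT, "https://example.com/directory-2/page.html"))  # False
```

Passing the full-block rules (`User-agent: GPTBot` followed by `Disallow: /`) to the same helper should report every URL on the site as disallowed for GPTBot.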
Many top brands, however, have already blocked GPTBot from accessing their sites. Currently, 7% of the top 1,000 websites block GPTBot, and the percentage is increasing daily. These brands aim to protect their content from being used to train AI models without proper compensation or credit.
The decision to block GPTBot largely depends on a website owner's objectives for the site. OpenAI has stated that it will cite sources when GPTBot pulls data from third-party websites. This potential citation presents an opportunity for increased visibility and clicks to a website's original content. Blocking GPTBot could reduce a website's visibility, since GPTBot could turn to competitors to obtain the same information.
E-commerce websites can enhance their visibility through ChatGPT. By adapting content to answer user inquiries, websites can boost engagement and deliver tailored recommendations. Blocking GPTBot's access inhibits such interaction and may mean missing opportunities to reach prospective customers.
However, if a website features copyrighted content, or if its owner is concerned about information being shared inaccurately, GPTBot should be blocked to prevent content from being taken out of context and contributing to the spread of misinformation.
Overall, it is up to each website owner to decide whether blocking GPTBot is in their website’s best interest.