Listen up, all you webmasters and blog owners out there. Let’s talk about the new guests attending your digital party: AI chatbots. Yes, they’re coming in uninvited, scraping your precious content and reusing it without asking for your permission. You’re probably wondering if there’s a way to stop these cyber gatecrashers in their tracks. Well, don’t you worry, we’ve got a method lined up for you. But beware, it does come with a side dish of caveats.
The Bots at Your Doorstep
We’re all aware of how AI chatbots are trained. They’re fed a smorgasbord of datasets, some of which are public and open-source. Take GPT-3 for instance, which, according to OpenAI’s research paper, was trained on a mix of five different datasets:
- Common Crawl
- WebText2
- Books1
- Books2
- Wikipedia
Common Crawl is an extensive data collection from websites dating back to 2008, while WebText2 is OpenAI’s own creation, containing approximately 45 million web pages linked from Reddit posts with at least three upvotes. Now, in the case of ChatGPT, it’s not personally rifling through your web pages, but the recent addition of web browsing to ChatGPT has raised eyebrows and concerns.
While we’re keeping a close eye on ChatGPT, there’s another player in the game that’s worth noting: Bard. Unlike Google’s well-known search bots, whose crawling behavior is documented, Bard’s training methods remain a mystery, causing a bit of a stir among website owners.
Why Are Webmasters Breaking a Sweat?
The crux of the issue lies in the perception that AI bots like ChatGPT, Bard, and Bing Chat are undermining the value of original web content. These bots leverage existing content to generate responses, thereby eliminating the need for users to access the original source. This, in turn, reduces the traffic that the source website might have otherwise received.
Moreover, some AI chatbots, such as Bard, are notorious for not providing citations in their generative responses, which leaves website owners in the dark about where the information is sourced from. The net result is a system where generative AI tools exploit content creators’ work to replace them, which raises the question: what’s the incentive for website owners to continue publishing content?
Building a Cyber Fence: Blocking AI Bots
If you’re feeling protective about your web content, there’s a way to deter AI bots: by using the robots.txt file. The catch is, you have to specify each bot individually by name. For instance, to block Common Crawl’s bot (known as CCBot), you’d have to insert the following code into your robots.txt file:
User-agent: CCBot
Disallow: /
This will prevent Common Crawl from accessing your website in the future, but it won’t erase any data it has already gathered.
When it comes to ChatGPT, OpenAI has kindly provided instructions on how to block their bot, which goes by the name of ChatGPT-User. However, blocking search engine AI bots is a whole other ball game, and given how secretive Google is about their training data, it’s an uphill task to identify which bots you need to block.
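Based on OpenAI’s published guidance, the entry looks much like the Common Crawl one above, just with a different bot name (it’s worth double-checking the current user-agent string in OpenAI’s documentation, since these names can change):
User-agent: ChatGPT-User
Disallow: /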
How Foolproof Is This Method?
Let’s be clear: using the robots.txt file to block AI bots is the best we’ve got right now. But it’s far from foolproof. The glaring issue is that you need to identify each bot you want to block, which can be quite a challenge given the ever-growing number of AI bots being launched.
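To give a sense of what that maintenance involves, a combined robots.txt that turns away several AI crawlers at once might look something like the sketch below. CCBot and ChatGPT-User are the names covered above; any other user-agent strings you add (Google-Extended and the like) are examples you’d want to verify against each company’s current documentation before relying on them:
User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /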
Also, instructions in the robots.txt file are merely suggestions that bots can choose to ignore. And even though bots like Common Crawl and ChatGPT are known to respect these commands, there’s a sizable number of bots out there that simply turn a blind eye.
Another limitation is that while you can block AI bots from future visits, you can’t erase the data from their past crawls, nor can you ask companies like OpenAI to wipe your data clean.
Is It Worth the Trouble?
Blocking all AI bots from your website isn’t a walk in the park. You’d have to block each bot individually, which is a daunting task in itself. Plus, there’s no guarantee that all of them will heed the commands in your robots.txt file. All things considered, the effort probably isn’t worth the outcome.
Now, there’s another side to this coin. By blocking AI bots from your website, you lose out on valuable data that could help you understand whether tools like Bard are a boon or a bane to your search marketing strategy. Yes, you could assume that the lack of citations is detrimental, but without data to back your assumptions, you’re basically shooting in the dark.
Remember when Google first introduced featured snippets to Search? For specific queries, Google displays a snippet of content from web pages on the results page, answering the user’s question right away. This means users don’t need to click through to a website to get the answer they’re looking for. The introduction of featured snippets caused quite a stir among website owners and SEO experts who depend on generating traffic from search queries.
However, the kind of queries that trigger featured snippets are generally low-value searches like “what is X” or “what’s the weather like in New York”. Anyone who wants in-depth information or a comprehensive weather report is still going to click through, and those who don’t were never all that valuable in the first place.
You might find it’s a similar story with generative AI tools, but you’ll need the data to prove it.
Pause Before You Leap
This is a time of uncertainty for website owners and publishers, who are understandably perturbed by the rise of AI technology and the idea of bots using their content to generate responses instantly. But let’s not rush into any drastic measures. AI technology is a rapidly evolving field, and we should take this opportunity to observe how things progress and analyze the potential threats and opportunities AI brings to the table.
It’s clear that the current system, which exploits content creators’ work to replace them, isn’t sustainable. Whether it’s tech giants like Google and OpenAI that change their approach or governments stepping in with new regulations, something’s got to give. As the negative implications of AI chatbots on content creation become more apparent, website owners and content creators can use these challenges to their advantage.