For the past few months, there has been increasing discontent with web crawler bots ingesting content from Internet sites.
The information is free and available to the public, so it may seem contradictory to disallow bots from using open data.
Ultimately, it's about taking someone's work and not giving anything in return. There are no links back. There are no ads viewed. There is nothing gained by the content creator.
A few news agencies have already forced tech companies to license their content. I would not be surprised if companies start putting more content behind paywalls or logins. Tech companies may eventually face an RIAA v. Napster-style lawsuit.
One way that website developers and content owners are "taking action" is disallowing AI web crawlers via robots.txt entries. Other websites block bots outright by inspecting user agents, checking requests against known crawler IP addresses, and using captchas.
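As a minimal sketch, a robots.txt opt-out might list the user agents that crawler operators publish for their bots (GPTBot is OpenAI's documented crawler and CCBot is Common Crawl's; the entries below assume you want to block them from the entire site):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```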
Most of the ways to stop web crawling rely on the honor system, where the bot follows the robots.txt directives and declares itself via a known and unique User-Agent header.
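To illustrate what that honor system looks like in practice, here is a sketch of what a well-behaved crawler does before fetching a page: parse the site's robots.txt and announce itself with a distinct User-Agent. The bot name "ExampleBot/1.0" and the example.com URLs are placeholders, not any real crawler's identity.

```python
import urllib.robotparser
import urllib.request

# Placeholder identity and target; a real crawler would publish its bot name.
BOT_NAME = "ExampleBot/1.0"
TARGET_URL = "https://example.com/some-article"

# Fetch and parse the site's robots.txt.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Only proceed if robots.txt allows this user agent to fetch the page.
if parser.can_fetch(BOT_NAME, TARGET_URL):
    request = urllib.request.Request(TARGET_URL, headers={"User-Agent": BOT_NAME})
    with urllib.request.urlopen(request) as response:
        page = response.read()
        print(f"Fetched {len(page)} bytes")
else:
    print(f"robots.txt disallows {BOT_NAME} from fetching {TARGET_URL}")
```

Nothing forces a crawler to run these checks or to send an honest User-Agent; a bot that skips them sails right past every robots.txt rule, which is exactly why these defenses only work against cooperative bots.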
I wonder how far content creators will go to control their content. In a way, the past few years have really changed perspectives on "open ecosystems" such as open source software and the broader open web. Being open seemingly has limits.