Scraping the web for AI development is theft

If scraping the web is the only way to develop Large Language Models then LLM-driven AI is not worth building

Whenever a service offers a new “feature” with an opt-out option rather than requiring users to opt in, it is worth a second look. This is a common red flag that pops up when a company either knows that what it is doing lies in a moral grey area or suspects users would rather opt out than in. Facebook, for example, shows its users “highly targeted ads” but, knowing not everyone might want to be tracked so psychopathically, offers users the option to opt out of such advertising. This is sort of like a murderer seeking your permission to take your life after stabbing you half-a-dozen times. It is ridiculous, outrageous and despicable tokenism. Or, as Hidde de Vries put it more decently than I care to, it is rude.

OpenAI has argued in the past that what they are doing is for the good of humanity. That is a wild, vague and opaque reason for anything at all. There is a reason “for the good of humanity” does not curry favour at fundraisers: people can see through it. To argue that your for-profit company needs complete rights to content produced by everyone everywhere because you can “do good to humankind” with it (read: profit from it) is an outrageous demand.

The New York Times sued OpenAI for this, as I wrote in my margin notes last month. This is a step in the right direction and I hope others follow. As I said back then—

This makes me happy. What OpenAI has done is theft. There is a reason you cannot take an NYT article—or one from my own website if you wish—and do something with it and profit from it commercially. People would nod in agreement when you describe it this way because we all understand it amounts to theft. Yet this is precisely what OpenAI has done and is continuing to defend its actions in the name of humanity.

Since then OpenAI has come up with a solution for people who would rather their websites not condone theft: politely opt out of web scraping by telling GPTBot (the crawler that gathers website data) to keep off your site.

This is the red flag I spoke of earlier. What OpenAI should have done is put up a form on their website where website owners could opt into being crawled by GPTBot (or “plagiarism bot” as Mike Morris calls it) in the name of “contributing content for crawling and … doing good to humankind” or some such marketingspeak. But why would any for-profit company do that? Fewer people would opt in that way than would ever opt out via a line in their robots.txt file. And fewer sites to steal from means smaller profits.

The real question here—if we were to forget temporarily everything AI can ruin and only credit it for all it can do—is what price is worth paying for Artificial Intelligence powered by a Large Language Model. Is it worth allowing theft? Is it worth allowing an invasion of privacy?

Calling out OpenAI alone would be amiss on my part. Google, with its much less impressive but similarly built Bard AI, has argued for the same thing, calling it “a principled approach”. Hitler, too, believed in Mein Kampf that his hatred for Jews was justified by an “aristocratic principle of Nature”; the point being that justifications know no end when ulterior intentions are involved. For both OpenAI and Google—and like them for every other for-profit AI company currently in operation—the development of AI is only a means to a profitable end.

Google has itself come close to such thievery before but stopped short. Perhaps an OpenAI did not exist to embolden it back then. I have previously written about my displeasure with Google AMP, and others have fought Google over infringement (and Google has fought back) in the search engine’s use of headlines and cached pages in search results. Google’s attempts have all had one thing in common: taking control of both the content and the reading experience to keep people on its platforms so it can profit more easily. Scraping the web for AI takes that much further, with Google (or OpenAI) snatching away any content it needs.

After religion, our advances in AI are mankind’s greatest mistake. I am not against religion itself, which I prefer to keep strictly private; likewise, I am not against artificial intelligence, so long as we first create a moral and legal framework and only then develop AI. Until then AI will remain a lawless land and the likes of OpenAI and Google Bard will remain thieving wolves in sheep’s clothing.

Tip: If you want to opt out of GPTBot, simply add the following two lines to your robots.txt file, usually located at the root of your website’s public-facing folder:

User-agent: GPTBot
Disallow: /
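If you want to sanity-check that the rule behaves as intended, Python’s standard-library robots.txt parser can simulate how a well-behaved crawler would read it. This is just a sketch; the URL and the “OtherBot” name are placeholders, and it only tells you what a crawler that respects robots.txt would do:

```python
from urllib.robotparser import RobotFileParser

# Parse the two-line rule from the tip above, as a compliant crawler would.
rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# GPTBot is barred from every path on the site...
print(rp.can_fetch("GPTBot", "https://example.com/essay"))    # False

# ...while crawlers not named in the file remain unaffected.
print(rp.can_fetch("OtherBot", "https://example.com/essay"))  # True
```

Note that robots.txt is purely advisory: it only keeps out crawlers that choose to honour it, which is rather the point of this essay.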

If you need more help, Artefacto has a handy guide.

