The founder of SourceHut, an open source platform to build software collaboratively –sometimes referred to as a forge–, has written a post on his blog that shows the scale of the problems that misbehaving crawlers feeding AIs are causing:
If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
I know other people suffering from these new crawlers, which are very aggressive –or in some cases likely just broken–, don’t follow the rules, and try to disguise themselves as regular traffic. For example, read what Alex has been writing about trying to keep Emacs Wiki online –and a follow-up, and another, and it won’t stop, etc–.
As Drew mentions in his post, you can find a lot of system administrators struggling with this just because they are sharing source code publicly. As absurd as it sounds: because they have decided to distribute open source.
I have been self-hosting my repositories since 2023 and, although it is not really a forge, I managed to make it work for me at a small scale by providing a web interface at git.usebox.net. It has “about pages” –rendering the README, see for example the SpaceBeans page–, and together with email and RSS feeds to track releases, it just works.
I noticed some issues on my server last year, but I attributed them to a mistake in my cgit setup, honestly thinking that it was less performant than I was expecting, and just configured a cache. Caching is supported by the tool, but since it is optional I thought I probably didn’t need it.
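For reference, enabling cgit’s built-in cache only takes a few lines in cgitrc; the values below are just an example, not my exact configuration:

```
# /etc/cgitrc: enable cgit's built-in page cache (example values)
cache-root=/var/cache/cgit
cache-size=1000

# TTLs are in minutes
cache-dynamic-ttl=5
cache-static-ttl=10
```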
Setting the cache seemed to fix the issue, and I didn’t investigate further. Until a few weeks ago, when I was reading on Mastodon how someone was having a hard time dealing with these bots, and I realised it was probably happening to me as well and I just wasn’t paying attention. And that was the case!
Essentially it is what Drew is describing –if much smaller in my case–, and for me it seems to always come from the same IPs owned by Alibaba Cloud –which seems to have its own “Generative AI” product–. The day I checked the logs, I had over 200,000 requests in 24 hours coming from only two IPs.
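Checking is easy enough: counting requests per client IP in the access log is a few lines of Python –the log path here is just an example, not necessarily my setup–:

```python
from collections import Counter

# Count requests per client IP, assuming the common/combined log format
# where the IP is the first whitespace-separated field.
counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if line.strip():
            counts[line.split(maxsplit=1)[0]] += 1

# Show the ten noisiest clients.
for ip, total in counts.most_common(10):
    print(f"{total:>8}  {ip}")
```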
My first impulse was to check with whois who owned the IPs and, because they belong to a cloud company, block the whole range on the firewall. And I kept monitoring the situation for a couple of days.
Of course they kept coming, so I kept blocking. At some point there were a few /16, /15, and even some /14 ranges in my block list. That was already 681,504 IPs, all owned by the same cloud company.
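To put those prefixes in perspective, a quick bit of arithmetic with Python’s ipaddress module –the networks below are placeholders, not the actual ranges I blocked– shows how fast it adds up:

```python
import ipaddress

# A /16 is 65,536 addresses, a /15 is 131,072 and a /14 is 262,144,
# so a handful of these covers hundreds of thousands of IPs.
prefixes = ["10.0.0.0/16", "10.2.0.0/15", "10.4.0.0/14"]
total = 0
for p in prefixes:
    net = ipaddress.ip_network(p)
    print(f"{p:<14} {net.num_addresses:>7} addresses")
    total += net.num_addresses

print(f"{'total':<14} {total:>7} addresses")
```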
Because I have better things to do –really–, I wrote a small script that bans IPs when they make what I consider “an abusive number of requests in 24 hours”, and keeps the ban until they stop the abuse for 2 complete days. I don’t think this should affect legit users, but if you experience any issues, please contact me to justify why you need that volume of requests!
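I haven’t published the script, but the idea is simple enough that a rough sketch fits here –the paths, the threshold and the iptables calls are assumptions, not necessarily what I run–:

```python
#!/usr/bin/env python3
"""Sketch of a request-rate ban list (not my actual script, just the idea).

Counts requests per IP in a daily-rotated access log (so one file is roughly
24 hours), bans IPs over a threshold with iptables, and lifts the ban once an
IP has stayed quiet for two full days.
"""
import json
import subprocess
import time
from collections import Counter
from pathlib import Path

ACCESS_LOG = Path("/var/log/nginx/access.log")    # assumed log location
STATE_FILE = Path("/var/lib/git-ban/state.json")  # where bans are remembered
THRESHOLD = 10_000                                # "abusive" requests per 24h
QUIET_SECONDS = 2 * 24 * 3600                     # unban after 2 quiet days


def count_requests() -> Counter:
    """Requests per client IP, assuming the IP is the first log field."""
    counts: Counter = Counter()
    with ACCESS_LOG.open(encoding="utf-8", errors="replace") as log:
        for line in log:
            if line.strip():
                counts[line.split(maxsplit=1)[0]] += 1
    return counts


def iptables(action: str, ip: str) -> None:
    """Insert (-I) or delete (-D) a DROP rule for one address (needs root)."""
    subprocess.run(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"],
                   check=False)


def main() -> None:
    now = time.time()
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    counts = count_requests()

    # Ban new offenders and refresh the "last seen abusing" timestamp.
    for ip, total in counts.items():
        if total >= THRESHOLD:
            if ip not in state:
                iptables("-I", ip)
            state[ip] = now

    # Banned IPs stop showing up in the log, so once they have been quiet
    # for two complete days the rule is removed.
    for ip, last_abuse in list(state.items()):
        if now - last_abuse >= QUIET_SECONDS:
            iptables("-D", ip)
            del state[ip]

    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state))


if __name__ == "__main__":
    main()
```

Running something like this periodically –for example from cron, against the daily log– is more than enough for a forge this small.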
I did this on principle: my forge is very small and I can handle the load, so it wasn’t strictly necessary for me to block these bad actors; but I know people who couldn’t spend time on the problem and had to make all their open source repositories private. Which is, in my opinion, the tragedy of all this: we are sharing code with the rest of the world, and the abuse by these companies trying to profit from it is ruining it for everybody.
I can’t help thinking about the paperclip maximizer.