block scrapers with anubis?

In the should tsuki registration be open to the public? thread, the topic of blocking LLM scrapers with Anubis came up. Since it was off-topic there, I’ve created this topic to discuss it.

If you aren’t familiar, Anubis is a relatively new tool that blocks scrapers by requiring each connection to solve a proof-of-work challenge, making access expensive enough that most scrapers give up before they can harvest all of your data. It’s successful and viral enough that it’s used by some pretty big names now, including UNESCO.
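To give a rough sense of the mechanism, here’s a generic proof-of-work sketch (not Anubis’s exact scheme; the function names, hash construction, and difficulty value are just illustrative): the server hands the browser a random challenge, the browser grinds through nonces until a hash meets the difficulty target, and the server verifies the result with a single hash. The asymmetry is the point: a visitor pays the cost once per session, while a scraper hammering thousands of pages pays it over and over.

```python
import hashlib
import secrets

def solve(challenge: str, difficulty: int) -> int:
    """Client side: grind nonces until sha256(challenge + nonce) has `difficulty` leading hex zeroes."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a submitted nonce costs a single hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

challenge = secrets.token_hex(16)       # issued by the server per visitor
nonce = solve(challenge, difficulty=4)  # ~65k hashes on average at 4 hex zeroes
assert verify(challenge, nonce, difficulty=4)
print(f"solved challenge {challenge} with nonce {nonce}")
```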

However, accessibility is a concern with deploying something like Anubis. The Anubis docs say this:

Anubis is a bit of a nuclear response. This will result in your website being blocked from smaller scrapers and may inhibit “good bots” like the Internet Archive. You can configure bot policy definitions to explicitly allowlist them and we are working on a curated set of “known good” bots to allow for a compromise between discoverability and uptime.

And when Xe Iaso (the developer of Anubis) first posted about it, this is what they had to say about accessibility:

This will also lock out users who have JavaScript disabled, prevent your server from being indexed in search engines, require users to have HTTP cookies enabled, and require users to spend time solving the proof-of-work challenge.

This does mean that users using text-only browsers or older machines where they are unable to update their browser will be locked out of services protected by Anubis. This is a tradeoff that I am not happy about, but it is the world we live in now.

I think I concur with Xe Iaso that, unfortunately, something like this is necessary despite the downsides. I’ve actually had problems with LLM scrapers on a self-hosted Gitea instance that I ran with a friend, before making git.tsuki.games. The scraping took our service offline, and we solved it just by restricting access to logged-in users only. However, that’s not something I currently plan on doing with forum.tsuki.games.

At the moment, I’m inclined to install Anubis for forum.tsuki.games soon, maybe in two weeks or so, but until then I think it’s worthwhile to have a conversation about it.

3 Likes

We could definitely try it and see if it affects anyone

1 Like

i’m intrigued to see where this is going, but i’m not a fan of the wasted computation this solution requires. The quote about “wasting electricity to solve magical sudokus” comes to mind. Almost feels like we need some sort of identity system that can prove you’re not a scraper… but then again i don’t want to be tracked by some sort of identity system. Once again, the internet becomes a worse place…

Also, already looking forward to the chrome update that makes you upload your anubis cookies to google.

3 Likes

I can volunteer an ancient laptop with a Sandy Bridge i3 and a prehistoric one with a Celeron M 430 to test site accessibility :3

Regarding search engine indexing, do we know if any search bot user agents currently overlap with ai scrapers?

1 Like

if you visit https://anubis.techaro.lol/ you can see how well it works on your old machine.

as for search engine indexing, i’m not aware of any search bot user agents that overlap with ai scrapers, but i wouldn’t be surprised if some did, like in the case of google.

2 Likes

i agree, it’s not great. one way to solve this problem is to make the forum require a login before it can be accessed. but, as i brought up in the other topic, that’s not something i think we want to do. if we want to keep the forum visible to everyone except scrapers, we’ll need something which weighs the soul of your connection, as Xe put it.

3 Likes

just tried it, works for me

1 Like

I don’t think requiring a login is the worst idea in the world.

1 Like