Anubis is awesome! Stopping (AI) crawlbots
from sailorzoop@lemmy.librebun.com to selfhosted@lemmy.world on 12 Jul 17:51
https://lemmy.librebun.com/post/171140

Incoherent rant.

I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.

So I’ve decided to do some restructuring of how I run things. Ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate. And started looking into different options on how to combat things better.

Behold, Anubis.

“Weighs the soul of incoming HTTP requests to stop AI crawlers”

From how I understand it, it works like a reverse proxy for each service. It took me a while to actually understand how it’s supposed to integrate, but once I figured it out, all bot activity instantly stopped. Not a single one has gotten through yet.

My setup is basically just: home server -> Tailscale tunnel (not Funnel) -> VPS -> Caddy reverse proxy, now with Anubis integrated.
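For the curious, the Caddy side of that chain boils down to something like this (a rough sketch with made-up hostnames, ports and Tailscale addresses, not my literal config): Caddy terminates TLS and hands every request to Anubis, and Anubis is pointed at the actual service over the tailnet.

lemmy.example.com {
    # everything hits Anubis first; only requests that pass the challenge get through
    reverse_proxy localhost:8923
}

# Anubis itself then runs with roughly:
#   BIND=:8923
#   TARGET=http://100.x.y.z:1236   # the home server's Tailscale address (placeholder)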

I’m not really sure why I’m posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.

Anubis Github, Anubis Website

Edit: Further elaboration for those who care, since I realized that might be important.

Edit 2 for those who care: Well crap, turns out lemmy-ui crashing wasn’t due to crawlbots, but something else entirely.
I’ve just spent maybe 14 hours troubleshooting this thing, since after a couple of minutes of running, the lemmy-ui container healthcheck would show “unhealthy” and my instance couldn’t be accessed from anywhere (lemmy-ui, Photon, Jerboa, probably the API as well).
After some digging, I’ve disabled anubis to check if that had anything to do with it, it didn’t. But, I’ve also noticed my host ulimit -n was set to like 1000… (I’ve been on the same install for years and swear an update must have changed it)
After raising ulimit -n (nofile) and setting shm_size to 2G in Docker Compose, it hasn’t crashed yet. Fingers crossed.
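For those wondering what that looks like concretely, the compose-level change amounts to roughly this (service name and exact limits are illustrative, not my exact file):

services:
  lemmy-ui:
    # give the container more shared memory
    shm_size: 2gb
    # raise the open-file limit (nofile) well above the old ~1000
    ulimits:
      nofile:
        soft: 65536
        hard: 65536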
Boss, I’m tired and I want to get off Mr. Bones’ wild ride.
I’m very sorry for not being able to reply to you all, but it’s been hectic.

Cheers and I really hope someone finds this as useful as I did.

#selfhosted

threaded - newest

possiblylinux127@lemmy.zip on 12 Jul 18:08 next collapse

It doesn’t stop bots

All it does is make clients do as much or more work than the server, which makes it less tempting to hammer the site.

sailorzoop@lemmy.librebun.com on 12 Jul 18:12 collapse

Yeah, from what I understand it’s nothing crazy for any regular client, but really messes with the bots.
I don’t know, I’m just so glad and happy it works, it doesn’t mess with federation and it’s barely visible when accessing the sites.

possiblylinux127@lemmy.zip on 12 Jul 18:39 collapse

Personally my only real complaint is the lack of WASM. Outside of that it works fairly well.

e0qdk@reddthat.com on 12 Jul 18:14 next collapse

I don’t like Anubis because it requires me to enable JS – making me less secure. reddthat started using go-away recently as an alternative that doesn’t require JS when we were getting hammered by scrapers.

BakedCatboy@lemmy.ml on 12 Jul 18:32 next collapse

Fwiw, Anubis is adding a no-JS meta refresh challenge that, assuming it doesn’t have issues, will soon be the new default challenge.

dan@upvote.au on 12 Jul 23:48 collapse

Won’t the bots just switch to using that instead of the heavier JS challenge?

Sekoia@lemmy.blahaj.zone on 12 Jul 23:56 collapse

They can, but it’s not trivial. The challenge uses a bunch of modern browser features that these scrapers don’t use, regarding metadata and compression and a few other things. Things that are annoying to implement and not worth the effort. Check the recent discussion on lobste.rs if you’re interested in the exact details.

yetAnotherUser@discuss.tchncs.de on 14 Jul 07:57 next collapse

Plus even if they were to implement those features, the challenges would still get increasingly harder the more bot-like a scraper behaves.

You can’t prevent scraping entirely, but you can certainly prevent scraping that behaves like a DoS attack.

baod_rate@programming.dev on 14 Jul 12:03 collapse

Check the recent discussion on lobste.rs if you’re interested in the exact details.

For those coming from the future: lobste.rs/…/anubis_now_supports_non_js_challenges

Jumuta@sh.itjust.works on 13 Jul 02:52 collapse

iirc there are instructions for completing the Anubis challenge manually

danielquinn@lemmy.ca on 12 Jul 18:19 next collapse

I’ve been thinking about setting up Anubis to protect my blog from AI scrapers, but I’m not clear on whether this would also block search engines. It would, wouldn’t it?

sailorzoop@lemmy.librebun.com on 12 Jul 18:21 next collapse

I’m not entirely sure, but if you look here: github.com/TecharoHQ/anubis/tree/main/data/bots
They have separate configs for each bot: github.com/TecharoHQ/anubis/…/botPolicies.json

RedBauble@sh.itjust.works on 12 Jul 22:32 collapse

You can set up the policies to allow search engines through; the default policy linked in the docs does that.
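For reference, the policy file is basically a list of bot rules along these lines (paraphrased from memory, not the actual default policy; check the linked botPolicies.json for the real syntax):

{
  "bots": [
    { "name": "googlebot",     "user_agent_regex": "Googlebot",        "action": "ALLOW" },
    { "name": "ai-scrapers",   "user_agent_regex": "GPTBot|ClaudeBot", "action": "DENY" },
    { "name": "everyone-else", "path_regex": ".*",                     "action": "CHALLENGE" }
  ]
}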

danielquinn@lemmy.ca on 12 Jul 23:28 collapse

This all appears to be based on the user agent, so wouldn’t that mean that bad-faith scrapers could just declare themselves to be a typical search engine user agent?

SheeEttin@lemmy.zip on 13 Jul 00:05 next collapse

Yes. There’s no real way to differentiate.

SorteKanin@feddit.dk on 13 Jul 08:48 collapse

Actually, I think most search engine bots publish a list of verified IP addresses they crawl from, so you could check a search bot’s IP against that list to verify it.
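The usual way to use that is a reverse-then-forward DNS check against the hostnames the engines document (googlebot.com, search.msn.com, etc.); a minimal sketch, not taken from any particular project:

import socket

# Verify a claimed search-engine crawler: reverse-resolve the IP, check the
# hostname suffix, then forward-resolve the hostname and confirm it maps back.
def is_verified_crawler(ip, suffixes=(".googlebot.com", ".google.com", ".search.msn.com")):
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except socket.herror:
        return False
    if not host.endswith(suffixes):
        return False
    try:
        return ip in {info[4][0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return False

print(is_verified_crawler("66.249.66.1"))  # an address from Google's published crawler ranges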

ikidd@lemmy.world on 12 Jul 18:24 next collapse

Something that’s less annoying than Anubis: tarpit the scrapers by putting in a hidden honeypot page link that only bots will follow, and add anything that follows it to fail2ban.

petermolnar.net/…/anti-ai-nepenthes-fail2ban/
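The fail2ban half of that is just a filter matching the honeypot path plus a jail; a minimal sketch assuming an nginx-style access log and a hidden /honeypot/ link (path, log location and ban times are all illustrative):

# /etc/fail2ban/filter.d/honeypot.conf
[Definition]
failregex = ^<HOST> .*"GET /honeypot/

# /etc/fail2ban/jail.d/honeypot.local
[honeypot]
enabled  = true
port     = http,https
filter   = honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1          # a single hit on the hidden link is enough
findtime = 86400
bantime  = 604800     # ban for a week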

N0x0n@lemmy.ml on 13 Jul 08:29 collapse

Wow, what a combo! I guess this would reduce the tarpit’s overall power consumption?

I haven’t looked at your link yet and maybe it already contains my answer, but I’d like to customize how long they’re trapped in the tarpit before fail2ban kicks in, so I can still poison their AI while saving a lot of resources!!

Edit:

block anything that visits it more than X times with fail2ban

I guess this is it, but I’m not sure how that translates from nepenthes to fail2ban. Need further reading and testing!

Thanks for the link !

lambalicious@lemmy.sdf.org on 12 Jul 20:36 next collapse

Positives: nice uwu art.

Negatives: requires javascript, intrinsically ableist.

phase@lemmy.8th.world on 12 Jul 20:45 next collapse

There’s another challenge available, without javascript.

AmbitiousProcess@piefed.social on 12 Jul 23:42 next collapse

Could you elaborate on how it's ableist?

As far as I'm aware, not only are they making a version that doesn't even require JS, but the JS is only needed for the challenge itself, and the browser can then view the page(s) afterwards entirely without JS being necessary to parse the content in any way. Things like screen readers should still do perfectly fine at parsing content after the browser solves the challenge.

ohshit604@sh.itjust.works on 13 Jul 00:18 collapse

How is the art a positive?

lambalicious@lemmy.sdf.org on 13 Jul 15:09 collapse

What do you mean, how?

Cute anime catgirl, a staple of the internet, without having to be showy or anything. And there are hooks to change it.

(Was actually half-surprised they didn’t go with “anime!stereotypical egyptian priestess” given the context of the software, but I feel that would have ended up too thematically overloaded in the end)

Mora@pawb.social on 12 Jul 21:42 next collapse

Besides that point: why tf do they even crawl Lemmy? They could just as well create a “read only” instance with an account that subscribes to all communities… and the other instances would send their data. Oh, right, AI has to be as unethical as possible for most companies for some reason.

ZombiFrancis@sh.itjust.works on 12 Jul 22:03 next collapse

See your brain went immediately to a solution based on knowing how something works. That’s not in the AI wheelhouse.

wizardbeard@lemmy.dbzer0.com on 12 Jul 22:04 next collapse

They crawl Wikipedia too, adding significant extra load on its servers, even though Wikipedia offers a regularly updated torrent of all its content.

AmbitiousProcess@piefed.social on 12 Jul 23:36 next collapse

Because the easiest solution for them is a simple web scraper. If they don't give a shit about ethics, then something that just crawls every page it can find is loads easier for them to set up than a custom implementation to get torrent downloads for wikipedia, making lemmy/mastodon/pixelfed instances for the fediverse, using rss feeds and checking if they have full or only partial articles, implementing proper checks to prevent double (or more) downloading of the same content, etc.

dan@upvote.au on 12 Jul 23:49 collapse

They’re likely not intentionally crawling Lemmy. They’re probably just crawling all sites they can find.

TomAwezome@lemmy.world on 12 Jul 22:24 next collapse

Thanks for the “incoherent rant”, I’m setting some stuff up with Anubis and Caddy so hearing your story was very welcome :)

sic_semper_tyrannis@lemmy.today on 12 Jul 22:37 next collapse

Futo gave them a micro-grant this month

TheHobbyist@lemmy.zip on 12 Jul 22:58 next collapse

@demigodrick@lemmy.zip

Perhaps of interest? I don’t know how many bots you’re facing.

NotSteve_@piefed.ca on 12 Jul 23:07 next collapse

I love Anubis just because the dev is from my city, which never gets talked about (Ottawa)

SheeEttin@lemmy.zip on 13 Jul 00:04 collapse

Well not never, you’ve got the Senators.

Which will never not be funny to me since it’s Latin for “old men”.

NotSteve_@piefed.ca on 13 Jul 01:09 collapse

Hahaha I didn't know that but that is funny. Admittedly I'm not too big into hockey so I've got no gauge on how popular (edit: or unpopular 😅) the Sens are

rtxn@lemmy.world on 12 Jul 23:33 next collapse

But don’t you know that Anubis is MALWARE?

…according to some of the clowns at the FSF, which is definitely one of the opinions to have. www.fsf.org/…/our-small-team-vs-millions-of-bots

dan@upvote.au on 12 Jul 23:46 next collapse

tbh I kinda understand their viewpoint. Not saying I agree with it.

The Anubis JavaScript program’s calculations are the same kind of calculations done by crypto-currency mining programs. A program which does calculations that a user does not want done is a form of malware.

Natanox@discuss.tchncs.de on 12 Jul 23:54 next collapse

That’s guilt by association. Their viewpoint is awful.

I also wish there was no security at the gates of concerts, but I happily accept it if it means actual safety (if done reasonably, of course). And quite frankly, a cute anime girl doing some math is so, so much better than those god damn freaking captchas. Or the service literally dying under an AI DDoS.

Edit: Forgot to mention, proof of work wasn’t invented by or for cryptocurrency or blockchain. The concept has existed since the ’90s (as an idea for email spam prevention), making their argument completely nonsensical.
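For anyone wondering what “doing some math” actually means here: the hashcash-style idea is to make the client find a nonce whose hash, combined with a server-supplied challenge, starts with enough zeroes. Trivial for one human page view, expensive across millions of scraped pages. A toy sketch of the concept (not Anubis’s actual code):

import hashlib
from itertools import count

def solve(challenge, difficulty=4):
    # brute-force a nonce until the hash starts with `difficulty` zero hex digits
    target = "0" * difficulty
    for nonce in count():
        if hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith(target):
            return nonce

def verify(challenge, nonce, difficulty=4):
    # verification is a single hash, so the server's cost stays tiny
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith("0" * difficulty)

nonce = solve("example-challenge")
print(nonce, verify("example-challenge", nonce))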

Arghblarg@lemmy.ca on 13 Jul 02:17 next collapse

Ah, hashcash. Wish that had taken off, it was a good idea …

xavier666@lemmy.umucat.day on 14 Jul 10:49 collapse

TIL of hashcash

xavier666@lemmy.umucat.day on 14 Jul 10:48 collapse

And quite frankly, cute anime girl doing some math is so, so much better than those god damn freaking captchas

One user complained that a random anime girl popping up is making his gf think he’s watching hentai. So the mascot should be changed to something “normal”.

Natanox@discuss.tchncs.de on 14 Jul 12:24 collapse

Lol.

“My relationship is fragile and it’s the internets fault.”

interdimensionalmeme@lemmy.ml on 13 Jul 03:46 next collapse

Requiring clients to run client-side code, if tolerated, will lead to the extinction of pure HTTP clients. That in turn will enable DRM for the whole web. I’d rather see it all burn.

Jayjader@jlai.lu on 14 Jul 15:32 collapse

Ok but if it allows anubis to judge the soul of my bytes as being worthy of reaching a certain site I’m trying to access, then the program is not making any calculations that I don’t want it to.

Would the FSF prefer the challenge page to wait for user interaction before starting that proof of work? Along with giving the user a “don’t ask again” checkbox for future challenges?

chihuamaranian@lemmy.ca on 13 Jul 00:21 collapse

The FSF explanation of why they dislike Anubis could just as easily apply to the process of decrypting TLS/HTTPS. You know, something uncontroversial that every computer is expected to do when they want to communicate securely.

I don’t fundamentally see the difference between “The computer does math to ensure end-to-end privacy” and “The computer does math to mitigate DDoS attempts on the server”. Either way, without such protections the client/server relationship is lacking crucial fundamentals that many interactions depend on.

rtxn@lemmy.world on 13 Jul 00:44 next collapse

I’ve made that exact comparison before. TLS uses encryption; ransomware also uses encryption; by their logic, serving web content through HTTPS with no way to bypass it is a form of malware. The same goes for injecting their donation banner using an iframe.

SheeEttin@lemmy.zip on 13 Jul 14:12 collapse

Right. One of the facets of cryptography is rounds: if you apply the same algorithm 10,000 times instead of just one, it might make it slightly slower each time you need to run it, but it makes it vastly slower for someone trying to brute-force your password.
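In code, that knob is literally an iteration count, e.g. with the Python standard library’s PBKDF2 (values illustrative):

import hashlib, os

salt = os.urandom(16)
# 600,000 rounds of HMAC-SHA256: barely noticeable for one login,
# brutal for someone grinding through billions of guesses
key = hashlib.pbkdf2_hmac("sha256", b"hunter2", salt, 600_000)
print(key.hex())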

dan@upvote.au on 12 Jul 23:44 next collapse

The Anubis site thinks my phone is a bot :/

tbh I would have just configured a reasonable rate limit in Nginx and left it at that.

Won’t the bots just hammer the API instead now?

Flipper@feddit.org on 13 Jul 01:49 collapse

No. A rate limit doesn’t work because they use huge IP spaces to crawl. Each individual IP isn’t bad; they just use several thousand of them.

Using the API would require some basic changes. We don’t do that here. If they wanted that, they could run their own instance and would even get notified about changes. No crawling required at all.
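For reference, the per-IP limit dan mentions looks roughly like this in nginx (hostnames and ports made up), and since the key is the client address, a crawl spread over thousands of IPs barely trips it:

# in the http {} block
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=2r/s;

server {
    server_name lemmy.example.com;

    location / {
        limit_req zone=per_ip burst=20 nodelay;
        proxy_pass http://127.0.0.1:1234;   # lemmy-ui, port illustrative
    }
}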

blob42@lemmy.ml on 13 Jul 01:05 next collapse

I am planning to try it out, but for Caddy users: after being bombarded by AI crawlers for weeks, I came up with a solution that works.

It is a custom Caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.

Now here’s the fun part: the defender plugin can produce garbage as a response, so when an AI crawler matches, it gets served junk that poisons its training dataset.

Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)

git.blob42.xyz {
    # Match obvious crawler traffic: a zh-CN Accept-Language header or a
    # bot/crawler-looking User-Agent.
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL

    # Drop matched bots outright.
    abort @bot

    # caddy-defender: serve garbage responses to requests from known AI/cloud
    # ranges, poisoning whatever gets scraped.
    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    # caddy-ratelimit: per-IP limit on GET requests.
    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}

If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

azertyfun@sh.itjust.works on 13 Jul 15:04 collapse

If I am not mistaken the 47.0.0.0/8 ip block is for Alibaba cloud

That’s an ARIN block according to Wikipedia, so North America, under Northern Telecom until 2010. It does look like Alibaba operates many networks under that /8, but I very much doubt it’s the whole /8, which would be worth a lot; a /16 is apparently worth around $3–4M, and a /8 contains 256 /16s, so it can be extrapolated to be worth upwards of a billion dollars! I doubt they put all their eggs into that particular basket. So you’re probably matching a lot of innocent North American IPs with this.

blob42@lemmy.ml on 13 Jul 16:05 collapse

Right, I must have just blanket-banned the whole /8 to be sure Alibaba Cloud was included. I did it some time ago, so I forgot.

Cozog@feddit.dk on 13 Jul 20:35 collapse

When I blocked Alibaba, the AI crawlers immediately started coming from a different cloud provider (Huawei, I believe), and when I blocked that, it happened again. Eventually the crawlers started coming from North American and then European cloud providers.

Due to lack of time to change my setup to accommodate Anubis, I had to temporarily move my site behind Cloudflare (where it sadly still is).

blob42@lemmy.ml on 14 Jul 18:51 collapse

We need a decentralized, community-owned Cloudflare alternative. Anubis looks to be on a good track.

reddeadhead@awful.systems on 13 Jul 06:37 next collapse

Anubis just released the no-JS challenge in an update. The page loads for me with JS disabled. anubis.techaro.lol/blog/release/v1.20.0/

fireshell@kbin.earth on 13 Jul 08:31 next collapse

The development of Anubis remains enthusiasm-driven: Xe is funding the project through Patreon and GitHub sponsorships, but cannot yet afford to pursue it full-time, and would also like to hire a key community member, budget permitting.

SorteKanin@feddit.dk on 13 Jul 08:46 next collapse

I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing.

I’m just curious, how did you notice this in the first place? What are you monitoring to know and how do you present that information?

SorteKanin@feddit.dk on 13 Jul 08:53 next collapse

Also your avatar and the image posted here (not the thumbnail) seem broken - I wonder if that’s due to Anubis?

sailorzoop@lemmy.librebun.com on 13 Jul 18:20 collapse

Just updated the post again, yeah. But I think that was due to me changing nameservers for my domain at the time. Cheers.

fossilesque@mander.xyz on 13 Jul 12:38 next collapse

I’ve been planning on setting this up for ages. Love the creator’s vibe. Thanks for this.

Charlxmagne@lemmy.world on 13 Jul 14:33 next collapse

Been seeing this on people’s invidious instances

MITM0@lemmy.world on 14 Jul 06:58 next collapse

Go_Away is another alternative

smashing3606@feddit.online on 14 Jul 18:41 next collapse

You don't by chance use traefik with this? I've figured out how to use it with docker on the same device, but can't figure out how to use it with external services.

paraphrand@lemmy.world on 14 Jul 19:10 next collapse

I’ve seen some people reject this solution due to the anime.

MichaelMuse@programming.dev on 15 Jul 04:16 collapse

I think AI companies could provide an interface that lets users submit their sites for crawling, like some website scanners do (urlscan, for example). Otherwise, sites could just reject the AI crawlers.