Why are anime catgirls blocking my access to the Linux kernel? (lock.cmpxchg8b.com)
from tofu@lemmy.nocturnal.garden to selfhosted@lemmy.world on 21 Aug 09:02
https://lemmy.nocturnal.garden/post/194665

Some thoughts on how useful Anubis really is. Combined with comments I read elsewhere about scrapers starting to solve the challenges, I’m afraid Anubis will soon be outdated and we’ll need something else.

#selfhosted

threaded - newest

ryannathans@aussie.zone on 21 Aug 09:26 next collapse

Yeah, it has seemed like a bit of a waste of time; once that difficulty gets scaled up and the expiration scaled down, it’s gonna get annoying to use the web on phones

non_burglar@lemmy.world on 21 Aug 13:28 collapse

I had to get my glasses to re-read this comment.

You know why anubis is in place on so many sites, right? You are literally blaming the victims for the absolute bullshit AI is foisting on us all.

billwashere@lemmy.world on 21 Aug 14:20 next collapse

I don’t think so. I think he’s blaming the “solution” as being a stopgap at best and painful for end users at worst. Yes, the AI crawlers have caused the issue, but I’m not sure this is a great final solution.

As the article discussed, this is essentially “an expensive” math problem meant to deter AI crawlers, but in the end it ain’t really that expensive. It’s more like they put two door handles on a door hoping the bots are too lazy to turn both of them, while also severely slowing down all one-handed people. I’m not sure it will ever be feasible to essentially figure out how to have one bot determine if the other end is also a bot without human interaction.

ryannathans@aussie.zone on 21 Aug 22:41 collapse

It works because it’s a bit of obscurity, not because it’s expensive. Once it’s a big enough problem for the scrapers, the scrapers will adapt, and then the only option is to make it more obscure/different or crank up the difficulty, which will slow down genuine users much more

ryannathans@aussie.zone on 21 Aug 22:44 collapse

Yes, I manage cloudflare for a massive site that at times gets hit with millions of unique bot visits per hour

non_burglar@lemmy.world on 22 Aug 01:38 collapse

So you know that this is the lesser of the two evils? Seems like you’re viewing it from the client’s perspective only.

No one wants to burden clients with Anubis, and Anubis shouldn’t exist. We are all (server operators and users) stuck with this solution for now because there is nothing else at the moment that keeps these scrapers at bay.

Even the author of Anubis doesn’t like the way it works. We all know it’s just more wasted computing for no reason except big tech doesn’t give a care about anyone.

ryannathans@aussie.zone on 22 Aug 03:40 collapse

My point is, and the author’s point is, it’s not computation that’s keeping the bots away right now. It’s the obscurity and challenge itself getting in the way.

rtxn@lemmy.world on 21 Aug 09:36 next collapse

The current version of Anubis was made as a quick “good enough” solution to an emergency. The article is very enthusiastic about explaining why it shouldn’t work, but completely glosses over the fact that it has worked, at least to an extent where deploying it and maybe inconveniencing some users is preferable to having the entire web server choked out by a flood of indiscriminate scraper requests.

The purpose is to reduce the flood to a manageable level, not to block every single scraper request.

poVoq@slrpnk.net on 21 Aug 09:54 next collapse

And it was/is for sure the lesser evil compared to what most others did: put the site behind Cloudflare.

I feel people that complain about Anubis have never had their server overheat and shut down on an almost daily basis because of AI scrapers 🤦

tofu@lemmy.nocturnal.garden on 21 Aug 10:46 next collapse

Yeah, I’m just wondering what’s going to follow. I just hope everything isn’t going to need to go behind an authwall.

rtxn@lemmy.world on 21 Aug 11:01 collapse

The developer is working on upgrades and better tools. xeiaso.net/…/avoiding-becoming-peg-dependency/

tofu@lemmy.nocturnal.garden on 21 Aug 11:57 next collapse

Cool, thanks for posting! Also the reasoning for the image is cool.

grysbok@lemmy.sdf.org on 21 Aug 14:47 collapse

I’ll say the developer is also very responsive. They’re (ambiguous ‘they’, not sure of pronouns) active in a libraries-fighting-bots slack channel I’m on. Libraries have been hit hard by the bots: we have hoards of tasty archives and we don’t have money to throw resources at the problem.

lilith267@lemmy.blahaj.zone on 21 Aug 18:08 collapse

The Anubis repo has an enbyware emblem fun fact :D

grysbok@lemmy.sdf.org on 21 Aug 18:12 collapse

Yay! I won’t edit my comment (so your comment will make sense) but I checked and they also list they/them on their github profile

interdimensionalmeme@lemmy.ml on 21 Aug 14:59 next collapse

Unless you have a dirty heatsink, no amount of hammering would make the server overheat

poVoq@slrpnk.net on 21 Aug 15:13 collapse

Are you explaining my own server to me? 🙄

interdimensionalmeme@lemmy.ml on 21 Aug 15:30 collapse

What CPU made after 2004 do you have that doesn’t have automatic temperature control?
I don’t think there is any, unless you somehow managed to disable it?
Even a Raspberry Pi without a heatsink won’t overheat to the point of shutdown.

poVoq@slrpnk.net on 21 Aug 15:38 collapse

You are right, it is actually worse: it usually just overloads the CPU so badly that it starts to throttle, and then I can’t even access the server via SSH anymore. But sometimes it also crashes the server so that it reboots, and yes, that can happen on modern CPUs as well.

interdimensionalmeme@lemmy.ml on 21 Aug 15:59 collapse

You need to set your HTTP-serving process to a priority below the administrative processes, in the place where you are starting it; assuming a Linux server, that would be your init script or systemd service unit.
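
For example, a minimal sketch with systemd (the unit name and numbers are illustrative; adjust for whatever actually serves your HTTP traffic):

```ini
# Hypothetical drop-in: /etc/systemd/system/nginx.service.d/priority.conf
# Keeps the web server's CPU priority below sshd/journald so the admin session stays usable under load.
[Service]
Nice=10                    # lower scheduling priority than the default 0
CPUSchedulingPolicy=batch  # hint that this is throughput work, not latency-sensitive
CPUWeight=50               # cgroup v2: half the default weight of 100
```

Apply with `systemctl daemon-reload && systemctl restart nginx`.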

An actual crash causing a reboot? Do you have faulty RAM, maybe? That’s really not ever supposed to happen from anything happening in userland. That’s not AI; your stuff might be straight up broken.

The only thing that isn’t broken that could reboot a server is a watchdog timer.

Your server shouldn’t crash, reboot, or become unreachable from the admin interface even at 100% load, and it shouldn’t overheat either. Temperatures should never exceed 80°C no matter what you do; it’s supposed to be impossible with thermal management, which all processors have had for decades.

poVoq@slrpnk.net on 21 Aug 18:25 collapse

Great that this is all theoretical 🤷 My server hardware might not be the newest but it is definitely not broken.

And besides, what good is it that you can still barely access the server through SSH when the CPU is constantly maxed out and site visitors only get a timeout when trying to access the services?

I don’t even get what you are trying to argue here. That the AI scraper DDOS isn’t so bad because in theory it shouldn’t crash the server? Are you even reading what you are writing yourself? 🤡

interdimensionalmeme@lemmy.ml on 21 Aug 19:04 next collapse

Even if your server is a cell phone from 2015, if it’s operating correctly and the CPU is maxed out, that means it’s fully utilized and serving hundreds of megabits of information.

You’ve decided to let the entire world read from your server; that indiscriminate policy is letting people you don’t want to have your data get your data and use your resources.

You want to correct that by making everyone that comes in solve a puzzle, therefore in some way degrading their access; it’s not surprising that they’re going to complain. The other day I had to wait over 30 seconds at an Anubis puzzle page, when I know that the AI scrapers have no problem getting through. Something on my computer, probably some anti-cryptomining protection, is getting triggered by it, and now I can’t no-script the web either because of that thing, and it can’t even stop scrapers anyway!

So Anubis is going to be left behind; all the real users are going to be annoyed for years and have their entire internet degraded by it, while the scrapers got it institutionally figured out in days.

If it’s freely available public data, then the solution isn’t restricting access, playing a futile arms race with the scrapers and throwing the real users to the dogs; it’s to have standardized, incremental, efficient database dumps so the scrapers stop assuming every website is interoperability-hostile and scraping them. Let Facebook and xitter fight the scrapers; let anyone trying to leverage public (and especially user-contributed) data fight the scrapers.

poVoq@slrpnk.net on 21 Aug 19:21 next collapse

Aha, an apologist for AI scraper DDOS, why didn’t you say so directly instead of wasting my time?

interdimensionalmeme@lemmy.ml on 21 Aug 21:11 collapse

The DDoS is caused by the gatekeeping; there was no such issue before the 2023 API wars. Fork over the goods and nobody gets hurt, it’s not complicated: you want to publish information to the public, so don’t scrunch it up behind diseased trackers and ad-infested pages which burn your CPU cycles. Or just put it in a big tarball torrent. The web is turning into a cesspool; how long until our browsers don’t even query websites at all but a self-hosted crawler and search like searxng? At least then I won’t be catching cooties from your JavaScript cryptomining bots embedded in the pages!

Deathray5@lemmynsfw.com on 22 Aug 00:40 collapse

“fork over the goods and nobody gets hurt” mate you are not sounding like the good person here

tofu@lemmy.nocturnal.garden on 21 Aug 19:46 collapse

Even if one wanted to give them everything, they don’t care. They just burn through their resources and recursively scrape every single link on your page. Providing standardized database dumps absolutely does not help against your server being overloaded by scrapers from various companies with deep pockets.

interdimensionalmeme@lemmy.ml on 21 Aug 21:16 collapse

Like Anubis, that’s not going to last. The point isn’t to hammer the web servers off the net, it’s to get the precious data. The more standardized and streamlined that access is made, and provided there’s no preferential treatment for certain players (OpenAI / Google / Facebook), the dumb scrapers will burn themselves out.

One nice thing about Anubis and Nepenthes is that they’re going to burn out those dumb scrapers faster and force them to become more sophisticated and stealthy. That should resolve the DDoS problem on its own.

For the truly public data sources, I think coordinated database dumps are the way to go; for hostile carriers, like Reddit and Facebook, it’s going to be scraper arms-race warfare like Cory Doctorow predicted.

daniskarma@lemmy.dbzer0.com on 22 Aug 06:45 collapse

Why the hell don’t you limit the CPU usage of that service?

For any service that could hog resources so badly that it can block the entire system, the normal thing to do is to limit its max resource usage. This is trivial to do using containers; I do it constantly for leaky software.
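
For example (a rough sketch, assuming Docker; the container and image names are made up):

```sh
# Cap a leaky web service at 1.5 CPU cores and 1 GiB of RAM
docker run -d --name webapp --cpus="1.5" --memory="1g" example/webapp:latest

# Roughly the same thing for a non-containerized service via systemd:
#   systemctl set-property webapp.service CPUQuota=150% MemoryMax=1G
```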

poVoq@slrpnk.net on 22 Aug 08:22 collapse

Obviously I did that, but that just means the site becomes inaccessible even sooner.

mobotsar@sh.itjust.works on 21 Aug 16:26 next collapse

Is there a reason other than avoiding infrastructure centralization not to put a web server behind cloudflare?

bjoern_tantau@swg-empire.de on 21 Aug 17:14 next collapse

Cloudflare would need your HTTPS keys so they could read all the content you worked so hard to encrypt. If I wanted to do bad shit I would apply at Cloudflare.

mobotsar@sh.itjust.works on 21 Aug 17:38 collapse

Maybe I’m misunderstanding what “behind cloudflare” means in this context, but I have a couple of my sites proxied through cloudflare, and they definitely don’t have my keys.

I wouldn’t think using a cloudflare captcha would require such a thing either.

bjoern_tantau@swg-empire.de on 21 Aug 18:30 next collapse

Hmm, I should look up how that works.

Edit: developers.cloudflare.com/ssl/…/ssl-modes/#custom…

They don’t need your keys because they have their own CA. No way I’d use them.

Edit 2: And with their own DNS they could easily route any address through their own servers if they wanted to, without anyone noticing. They are entirely too powerful. Is there some way to prevent this?

starkzarn@infosec.pub on 21 Aug 19:26 collapse

That’s because they just terminate TLS at their end. Your DNS record is “poisoned” by the orange cloud and their infrastructure answers for you. They happen to have a trusted root CA so they just present one of their own certificates with a SAN that matches your domain and your browser trusts it. Bingo, TLS termination at CF servers. They have it in cleartext then and just re-encrypt it with your origin server if you enforce TLS, but at that point it’s meaningless.

mobotsar@sh.itjust.works on 21 Aug 22:17 collapse

Oh, I didn’t think about the fact that they’re a CA. That’s a good point; thanks for the info.

poVoq@slrpnk.net on 21 Aug 18:35 collapse

Yes, because Cloudflare routinely blocks entire IP ranges and puts people into endless captcha loops. And it snoops on all traffic and collects a lot of metadata about all your site visitors. And if you let them terminate TLS, they will even analyse the passwords that people use to log into the services you run. It’s basically a huge surveillance dragnet and probably a front for the NSA.

daniskarma@lemmy.dbzer0.com on 22 Aug 06:42 next collapse

I still think captchas are a better solution.

In order to get past them they have to run AI inference, which also comes with compute costs. But for legitimate users you don’t run unauthorized intensive tasks on their hardware.

poVoq@slrpnk.net on 22 Aug 08:28 collapse

They are much worse for accessibility, and they also take longer to solve and are more disruptive for the majority of users.

daniskarma@lemmy.dbzer0.com on 22 Aug 08:37 collapse

Anubis is worse for privacy, as you have to have JavaScript enabled. And worse for the environment, as the cryptographic PoW challenges are just a waste.

Also, reCaptcha-type challenges are not really that disruptive most of the time.

As I said, the polite thing would just be giving users options. Anubis PoW running directly just for entering a website is one of the rudest pieces of software I’ve seen lately. They should be more polite and just give the user an option: maybe the user could choose to solve a captcha or run the Anubis PoW, or even just have Anubis run only after a button the user clicks.

I don’t think it’s good practice to run that type of software just for entering a website. If that tendency were to grow, browsers would need to adapt and straight up block that behavior, like only allowing access to some client resources after a user action.

poVoq@slrpnk.net on 22 Aug 08:42 collapse

Are you seriously complaining about an (entirely false) negative privacy aspect of Anubis and then suggest reCaptcha from Google is better?

Look, no one thinks Anubis is great, but often it is that or the website becoming entirely inaccessible because it is DDOSed to death by the AI scrapers.

daniskarma@lemmy.dbzer0.com on 22 Aug 08:50 collapse

First, I said reCaptcha types, meaning captchas in the style of reCaptcha. Those could be implemented outside a Google environment. Secondly, I never said those types were better for privacy. I just said Anubis is bad for privacy. Traditional captchas that work without JavaScript would be the privacy-friendly way.

Third, it’s not a false proposition. Disabling JavaScript can protect your privacy a great deal. A lot of tracking is done through JavaScript.

Last, that’s just the Anubis PR slogan, not the truth. As I said, DDoS mitigation could be implemented in other ways, more polite and/or environmentally friendly.

Are you astroturfing for Anubis? Because I really cannot understand why something as simple as a landing page with a button “run PoW challenge” would be that bad.

poVoq@slrpnk.net on 22 Aug 09:00 collapse

Anubis is not bad for privacy, but rather the opposite. Server admins explicitly chose it over commonly available alternatives to preserve the privacy of their visitors.

If you don’t like random Javascript execution, just install an allow-list extension in your browser 🤷

And no, it is not a PR slogan, it is the lived experience of thousands of server admins (me included) who have been fighting with this for months now and are very grateful that Anubis has provided some (likely only temporary) relief from that.

And I don’t get what the point of an extra button would be when the result is exactly the same 🤷

grysbok@lemmy.sdf.org on 22 Aug 10:03 collapse

Latest version of Anubis has a JavaScript-free verification system. It isn’t as accurate, so I allow js-free visits only if the site isn’t being hammered. Which, tbf, prior to Anubis no one was getting in, JS or no JS.

moseschrute@crust.piefed.social on 22 Aug 23:09 collapse

Out of curiosity, what’s the issue with Cloudflare? Aside from the constant worry they may strong-arm you into their enterprise pricing if your site is too popular lol. I understand supporting open source, but why not let companies handle the expensive bits as long as they’re willing?

I guess I can answer my own question. If the point of the Fediverse is to remove a single point of failure, then I suppose Cloudflare could become a single point to take down the network. Still, we could always pivot away from those types of services later, right?

Limonene@lemmy.world on 23 Aug 00:27 collapse

Cloudflare has IP banned me before for no reason (no proxy, no VPN, residential ISP with no bot traffic). They’ve switched their captcha system a few times, and some years it’s easy, some years it’s impossible.

AnUnusualRelic@lemmy.world on 21 Aug 15:38 next collapse

The problem is that the purpose of Anubis was to make crawling more computationally expensive and that crawlers are apparently increasingly prepared to accept that additional cost. One option would be to pile some required cycles on top of what’s currently asked, but it’s a balancing act before it starts to really be an annoyance for the meat popsicle users.

rtxn@lemmy.world on 21 Aug 18:07 collapse

That’s why the developer is working on a better detection mechanism. xeiaso.net/…/avoiding-becoming-peg-dependency/

0_o7@lemmy.dbzer0.com on 22 Aug 05:36 next collapse

The article is very enthusiastic about explaining why it shouldn’t work, but completely glosses over the fact that it has worked

This post was originally written for ycombinator “Hacker” News which is vehemently against people hacking things together for greater good, and more importantly for free.

It’s more of a corporate PR release site and if you aren’t known by the “community”, calling out solutions they can’t profit off of brings all the tech-bros to the yard for engagement.

loudwhisper@infosec.pub on 22 Aug 07:15 collapse

Exactly my thoughts too. Lots of theory about why it won’t work, but not looking at the fact that if people use it, maybe it does work, and when it won’t work, they will stop using it.

mfed1122@discuss.tchncs.de on 21 Aug 09:37 next collapse

Yeah, well-written stuff. I think Anubis will come and go. This beautifully demonstrates and, best of all, quantifies the negligible cost of Anubis to scrapers.

It’s very interesting to try to think of what would work, even conceptually. Some sort of purely client-side captcha type of thing perhaps. I keep thinking about it in half-assed ways for minutes at a time.

Maybe something that scrambles the characters of the site according to some random “offset” of some sort, e.g. randomly selecting a modulus size and an offset to cycle them, or even just a good ol’ cipher. And the “captcha” consists of a slider that adjusts the offset. You as the viewer know it’s solved when the text becomes something sensible - so there’s no need for the client code to store a readable key that could be used to auto-undo the scrambling. You could maybe even have some values of the slider randomly chosen to produce English text if the scrapers got smart enough to check for legibility (not sure how to hide which slider positions would be these red herring ones though) - which could maybe be enough to trick the scraper into picking up junk text sometimes.

drkt@scribe.disroot.org on 21 Aug 10:23 next collapse

That type of captcha already exists. I don’t know about their specific implementation, but 4chan has it, and it is trivially bypassed by userscripts.

GuillaumeRossolini@infosec.exchange on 21 Aug 10:35 next collapse

@mfed1122 @tofu any client-side tech to avoid (some of the) bots is bound to, as its popularity grows, be either circumvented by the bot’s developers or the model behind the bot will have picked up enough to solve it

I don’t see how any of these are going to do better than a short term patch

rtxn@lemmy.world on 21 Aug 12:20 next collapse

That’s the great thing about Anubis: it’s not client-side. Not entirely anyways. Similar to public key encryption schemes, it exploits the computational complexity of certain functions to solve the challenge. It can’t just say “solved, let me through” because the client has to calculate a number, based on the parameters of the challenge, that fits certain mathematical criteria, and then present it to the server. That’s the “proof of work” component.

A challenge could be something like “find the two prime factors of the semiprime 1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139”. This number is known as RSA-100, it was first factorized in 1991, which took several days of CPU time, but checking the result is trivial since it’s just integer multiplication. A similar semiprime of 260 decimal digits still hasn’t been factorized to this day. You can’t get around mathematics, no matter how advanced your AI model is.
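
Anubis’s actual challenge is hash-based rather than factoring-based, but the asymmetry is the same: finding the answer takes many attempts, while checking it takes a single operation. A minimal sketch of that style of proof of work (illustrative parameters, not Anubis’s real ones):

```python
import hashlib
import secrets

def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose SHA-256 digest starts with `difficulty` zero hex digits."""
    nonce = 0
    prefix = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce  # expected work grows ~16x for each extra zero digit
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so validation is effectively free."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

challenge = secrets.token_hex(16)       # issued per visitor by the server
nonce = solve(challenge, difficulty=4)  # the client burns CPU time here
assert verify(challenge, nonce, difficulty=4)
```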

GuillaumeRossolini@infosec.exchange on 21 Aug 12:48 collapse

@rtxn I don’t understand how that isn’t client side?

Anything that is client side can be, if not spoofed, then at least delegated to a sub process, and my argument stands

rtxn@lemmy.world on 21 Aug 12:58 next collapse

It’s not client-side because validation happens on the server side. The content won’t be displayed until and unless the server receives a valid response, and the challenge is formulated in such a way that calculating a valid answer will always take a long time. It can’t be spoofed because the server will know that the answer is bullshit. In my example, the server will know that the prime factors returned by the client are wrong because their product won’t be equal to the original semiprime. Delegating to a sub-process won’t work either, because what’s the parent process supposed to do? Move on to another piece of content that is also protected by Anubis?

The point is to waste the client’s time and thus reduce the number of requests the server has to handle, not to prevent scraping altogether.

GuillaumeRossolini@infosec.exchange on 21 Aug 14:10 collapse

@rtxn validation of what?

This is a typical network thing: client asks for resource, server says here’s a challenge, client responds or doesn’t, has the correct response or not, but has the challenge regardless

rtxn@lemmy.world on 21 Aug 15:27 collapse

THEN (and this is the part you don’t seem to understand) the client process has to waste time solving the challenge, which is, by the way, orders of magnitude lighter on the server than serving the actual meaningful content, or cancel the request. If a new request is sent during that time, it will still have to waste time solving the challenge. The scraper will get through eventually, but the challenge delays the response and reduces the load on the server, because while the scrapers are busy computing, it doesn’t have to serve meaningful content to them.

GuillaumeRossolini@infosec.exchange on 21 Aug 15:47 collapse

@rtxn all right, that’s all you had to say initially, rather than try convincing me that the network client was out of the loop: it isn’t, that’s the whole point of Anubis

rtxn@lemmy.world on 21 Aug 18:05 collapse

With how much authority you wrote with before, I thought you’d be able to grasp the concept. I’m sorry I assumed better.

Passerby6497@lemmy.world on 21 Aug 13:16 collapse

Please, explain to us how you expect to spoof a math problem whose answer you have to provide to the server before proceeding.

You can math all you want on the client, but the server isn’t going to give you shit until you provide the right answer.

GuillaumeRossolini@infosec.exchange on 21 Aug 14:09 collapse

@Passerby6497 I really don’t understand the issue here

If there is a challenge to solve, then the server has provided that to the client

There is no way around this, is there?

Passerby6497@lemmy.world on 21 Aug 14:16 collapse

You’re given the challenge to solve by the server, yes. But just because the challenge is provided to you, that doesn’t mean you can fake your way through it.

You still have to calculate the answer before you can get any farther. You can’t bullshit/spoof your way through the math problem to bypass it, because your correct answer is required to proceed.

There is no way around this, is there?

Unless the server gives you a well-known problem you already have the answer to (or one that is easily calculated), or you find a vulnerability in something like Anubis to make it accept a wrong answer, not really. You’re stuck at the interstitial page with a math prompt until you solve it.

Unless I’m misunderstanding your position, I’m not sure what the disconnect is. The original question was about spoofing the challenge client side, but you can’t really spoof the answer to a complicated math problem unless there’s an issue with the server side validation.

GuillaumeRossolini@infosec.exchange on 21 Aug 14:19 collapse

@Passerby6497 my stance is that the LLM might recognize that the best way to solve the problem is to run chromium and get the answer from there, then pass it on?

Badabinski@kbin.earth on 21 Aug 14:28 next collapse

Anubis has worked if that's happening. The point is to make it computationally expensive to access a webpage, because that's a natural rate limiter. It kinda sounds like it needs to be made more computationally expensive, however.

zalgotext@sh.itjust.works on 21 Aug 15:17 next collapse

LLMs can’t just run chromium unless they’re tool aware and have an agent running alongside them to facilitate tool use. I highly suspect that AI web crawlers aren’t that sophisticated.

Passerby6497@lemmy.world on 21 Aug 19:08 next collapse

Congrats on doing it the way the website owner wants! You’re now into the content, and you had to waste seconds of processing power to do so (effectively being throttled by the owner), so everyone is happy. You can’t overload the site, but you can still get there after a short wait.

GuillaumeRossolini@infosec.exchange on 21 Aug 19:11 collapse

@Passerby6497 yes I’ve been told as much 😅

https://lemmy.world/comment/18919678

Jokes aside, I understand this was the point. I just wanted to make the point that it is feasible, if not currently economically viable

dabe@lemmy.zip on 22 Aug 04:06 collapse

That solution still introduces lots of friction. At the volume and rate that these bots want to be traversing the internet, they probably don’t want to be fully graphically rendering pages and spawning extra browser processes then doing text recognition to then pass on to the LLM training sets. Maybe I’m wrong there, but I don’t think it’s that simple and actually just shifts solving the math challenge horizontally (i.e., in both cases, the scraper or the network the scraper is running on still has to solve the challenge)

mfed1122@discuss.tchncs.de on 22 Aug 22:12 collapse

Yeah, you’re absolutely right and I agree. So then do we have to resign ourselves to an eternal back-and-forth of just developing random new challenges every time the scrapers adapt to them? Like antibiotics for viruses? Maybe that is the way it is. And honestly that’s what I suspect. But Anubis feels so clever and so close to something that would work. The concept of making it about a cost that adds up, so that it intrinsically only affects massive processes significantly, is really smart… since it’s not about coming up with a challenge a computer can’t complete, but just a challenge that makes it economically not worth it to complete. But it’s disappointing to see that, at least with the current wait times, it doesn’t seem like it will cost enough to dissuade scrapers. And worse, the cost is so low that it seems like making it significant to the scrapers will require really insufferable wait times for users.

JadedBlueEyes@programming.dev on 21 Aug 13:31 next collapse

That kind of captcha is trivial to bypass via frequency analysis. Text that looks like language, as opposed to random noise, is very statistically recognisable.

possiblylinux127@lemmy.zip on 22 Aug 06:01 collapse

Not to mention it relies on security through obscurity

It wouldn’t be that hard to figure out and bypass

dabe@lemmy.zip on 22 Aug 04:03 next collapse

I’m sure you meant to sound more analytical than anything… but this really comes off as arrogant.

You make the claim that Anubis is negligent and will come and go, and then admit to only spending minutes at a time thinking of solutions yourself, which you then just sorta spout. It’s fun to think about solutions to this problem collectively, but can you honestly believe that Anubis is negligent when it’s so clearly working and when the author has been so extremely clear about their own perception of its pitfalls and hasty development (go read their blog, it’s a fun time)?

mfed1122@discuss.tchncs.de on 22 Aug 21:54 collapse

By negligence, I meant that the cost is negligible to the companies running scrapers, not that the solution itself is negligent. I should have said “negligibility” of Anubis, sorry - that was poor clarity on my part.

But I do think that the cost of it is indeed negligible, as the article shows. It doesn’t really matter if the author is biased or not, their analysis of the costs seems reasonable. I would need a counter-argument against that to think they were wrong. Just because they’re biased isn’t enough to discount the quantification they attempted to bring to the debate.

Also, I don’t think there’s any hypocrisy in me saying I’ve only thought about other solutions here and there - I’m not maintaining an anti-scraping library. And there have already been indications that scrapers are just accepting the cost of Anubis on Codeberg, right? So I’m not trying to say I’m some sort of tech genius who has the right idea here, but from what Codeberg was saying, and from the numbers in this article, it sure looks like Anubis isn’t the right idea. I am indeed only having fun with my suggestions, not making whole libraries out of them and pronouncing them to be solutions. I personally haven’t seen evidence that Anubis is so clearly working? As the author points out, it seems like it’s only working right now because of how new it is, but if scrapers want to get through it, they easily can - which puts us in a sort of virus/antibiotic eternal war of attrition. And of course that is the case with many things in computing as well. So I guess my open wonderings are just about whether there’s ever any way to develop a countermeasure that the scrapers won’t find “worth it” to force through.

Edit for tone clarity: I don’t want to be antagonistic, rude, or hurtful in any way. Just trying to have a discussion and understand this situation. Perhaps I was arrogant; if so, I apologize. It was also not my intent, fwiw. Also, thanks for helping me understand why I was getting downvoted. I intended my post to just be constructive spitballing about what I see as the eventual inevitable weakness in Anubis. I think it’s a great project and it’s great that people are getting use out of it even temporarily, and of course the devs deserve lots of respect for making the thing. But as much as I wish I could like it and believe it will solve the problem, I still don’t think it will.

dabe@lemmy.zip on 27 Aug 12:36 collapse

Well I can agree on the fact that the arms race situation we’re in sucks. It’s an old problem, seen in malware attacks and defenses. I’m just glad we have people fighting on our side in their spare time :’)

And it’s all good on the tone, thank you for your clarifications

possiblylinux127@lemmy.zip on 22 Aug 06:01 collapse

Anubis is more of an economic solution. It doesn’t stop bots, but it does make companies pay more to access content instead of having server operators foot the bill.

unexposedhazard@discuss.tchncs.de on 21 Aug 14:50 next collapse

This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity.

Well it doesn’t fucking matter what “makes sense to you” because it is working…
It’s being deployed by people who had their sites DDoS’d to shit by crawlers and they are very happy with the results, so what even is the point of trying to argue here?

daniskarma@lemmy.dbzer0.com on 22 Aug 06:38 collapse

It’s working because it’s not widely used. It’s sort of a “pirate seagull” theory: as long as few people use it, it works, because scrapers don’t really count on Anubis, so they don’t implement systems to get past it.

If it were to become more common it would be really easy to implement systems that would defeat the purpose.

As of right now sites are OK because scrapers just send HTTPS requests and expect a full response. If someone wants to bypass Anubis protection they would need to take into account that they will receive a cryptographic challenge and have to solve it.

The thing is that cryptographic challenges can be heavily optimized. They are designed to run in a very inefficient environment, the browser. But if someone took the challenge and solved it in a better environment, using CUDA or something like that, it would take a fraction of the energy, defeating the purpose of “being so costly that it’s not worth scraping”.

At this point it’s only a matter of time before we start seeing scrapers like that, especially if more and more sites start using Anubis.
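
To make that concrete, here is a rough sketch (plain multiprocessing rather than CUDA, but the principle is the same) of solving this kind of hash-based challenge outside the browser, where it runs far faster than in a JavaScript engine:

```python
import hashlib
from multiprocessing import Pool

def solve_stride(args):
    """Each worker searches nonces start, start+step, start+2*step, ... so they never overlap."""
    challenge, difficulty, start, step = args
    prefix = "0" * difficulty
    nonce = start
    while True:
        if hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith(prefix):
            return nonce
        nonce += step

def solve_parallel(challenge: str, difficulty: int, workers: int = 8) -> int:
    """Return the first valid nonce found by any worker."""
    with Pool(workers) as pool:
        tasks = [(challenge, difficulty, i, workers) for i in range(workers)]
        for nonce in pool.imap_unordered(solve_stride, tasks):
            pool.terminate()  # first answer wins, stop the rest
            return nonce

if __name__ == "__main__":
    print(solve_parallel("example-challenge", difficulty=5))
```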

Lumisal@lemmy.world on 21 Aug 17:25 next collapse

Have you tried accessing it by using Nyarch?

TwiddleTwaddle@lemmy.blahaj.zone on 21 Aug 19:49 next collapse

I’m constantly unable to access Anubis sites on my primary mobile browser and have to switch over to Fennec.

VitabytesDev@feddit.nl on 21 Aug 22:31 next collapse

I love that domain name.

CrackedLinuxISO@lemmy.dbzer0.com on 21 Aug 23:30 next collapse

There are some sites where Anubis won’t let me through. Like, I just get immediately bounced.

So RIP dwarf fortress forums. I liked you.

sem@lemmy.blahaj.zone on 22 Aug 00:57 collapse

I don’t get it, I thought it allows all browsers with JavaScript enabled.

SL3wvmnas@discuss.tchncs.de on 22 Aug 08:35 collapse

I, too, get blocked by certain sites. I think it’s a configuration thing, where it does not like my combination of uBlock/NoScript, even when I explicitly allow their scripts…

rtxn@lemmy.world on 22 Aug 03:10 next collapse

New developments: just a few hours before I post this comment, The Register posted an article about AI crawler traffic. www.theregister.com/2025/…/ai_crawler_traffic/

Anubis’ developer was interviewed and they posted the responses on their website: xeiaso.net/notes/2025/el-reg-responses/

In particular:

Fastly’s claims that 80% of bot traffic is now AI crawlers

In some cases for open source projects, we’ve seen upwards of 95% of traffic being AI crawlers. For one, deploying Anubis almost instantly caused server load to crater by so much that it made them think they accidentally took their site offline. One of my customers had their power bills drop by a significant fraction after deploying Anubis. It’s nuts.

So, yeah. If we believe Xe, OOP’s article is complete hogwash.

tofu@lemmy.nocturnal.garden on 22 Aug 06:20 collapse

Cool article, thanks for linking! Not sure about that being a new development though, it’s just results, but we already knew it’s working. The question is, what’s going to work once the scrapers adapt?

possiblylinux127@lemmy.zip on 22 Aug 05:56 next collapse

Anubis sucks

However, the number of viable options is limited.

seralth@lemmy.world on 22 Aug 07:00 next collapse

Yeah but at least Anubis is cute.

I’ll take “sucks but cute” over a dead internet and endless swarms of zergling crawlers.

CommanderCloon@lemmy.ml on 22 Aug 12:45 collapse

What sucks about Anubis?

possiblylinux127@lemmy.zip on 22 Aug 17:46 collapse

The implementation

It runs JavaScript and the actual algorithm could use improvement.

daniskarma@lemmy.dbzer0.com on 22 Aug 05:59 next collapse

Sometimes I think: imagine if a company like Google or Facebook implemented something like Anubis, and suddenly most people’s browsers started constantly solving CPU-intensive cryptographic challenges. People would be outraged by the wasted energy. But somehow “cool small company” does it and it’s fine.

I do not think the Anubis system is sustainable for everyone to use; it’s just too wasteful energy-wise.

Tangent5280@lemmy.world on 22 Aug 08:05 collapse

What alternatives do you propose?

daniskarma@lemmy.dbzer0.com on 22 Aug 08:08 collapse

Captcha.

It does all Anubis does. If a scraper wants to solve it automatically it’s compute-intensive, as they have to run AI inference, but for the user it’s just a little time-consuming.

With captchas you don’t run aggressive software unauthorized on anyone’s computer.

Solutions did exist. But Anubis is “trendy”, and they are masters of PR within some specific circles of people who always want the latest, trendiest thing.

But good old captcha would achieve the same result as Anubis, in a more sustainable way.

Or at least give the user the option of running or not running the challenge and leaving the page. And make clear to the user that their hardware is going to run an intensive task. It really feels very aggressive to have a webpage run basically a cryptominer unauthorized on your computer. And for me, having a catgirl as a mascot does not excuse the rudeness of it.

tofu@lemmy.nocturnal.garden on 22 Aug 08:48 next collapse

“Good old captcha” is the most annoying thing ever for people and basically universally hated. Talking about leaving the page: what do you think will cause more people to leave the page, a captcha that’s often broken, or something where people don’t have to do anything but wait a little?

Randelung@lemmy.world on 22 Aug 08:58 next collapse

Also universally useless. Image recognition solved Captcha ages ago and the new version from Google is literal spyware.

Chuppl has a great video essay on it. youtu.be/VTsBP21-XpI

daniskarma@lemmy.dbzer0.com on 22 Aug 09:15 collapse

They don’t have to do anything but let an unknown program max out their CPU unauthorized.

Imagine if google would implement that. Billions of computers running PoW constantly, what could go wrong?

tofu@lemmy.nocturnal.garden on 22 Aug 09:17 collapse

They don’t have to do anything but let an unknown program max out their CPU unauthorized.

But they currently can’t and that’s the point.

cupcakezealot@piefed.blahaj.zone on 22 Aug 10:16 collapse

but captcha is trash whose only purpose is to train ai for google

daniskarma@lemmy.dbzer0.com on 22 Aug 10:17 collapse

What?

You don’t need to use the Google or Cloudflare captcha to have a captcha.

There are open-source implementations of reCaptcha-style challenges. And you can always run a classical captcha based on image recognition.

cupcakezealot@piefed.blahaj.zone on 22 Aug 10:54 collapse

google is like 95% of the captchas on the internet.

daniskarma@lemmy.dbzer0.com on 22 Aug 10:56 collapse

So? You have free will to use another captcha.

Dremor@lemmy.world on 22 Aug 09:12 next collapse

Anubis is not a challenge like a captcha. Anubis is a resource waster, forcing crawlers to solve a crypto challenge (basically like mining Bitcoin) before being allowed in. That’s how it defends so well against bots: they do not want to waste their resources on needless computing, so they just cancel the page load before it even happens and go crawl elsewhere.

tofu@lemmy.nocturnal.garden on 22 Aug 09:15 collapse

No, it works because the scraper bots don’t have it implemented yet. Of course the companies would rather not spend additional compute resources, but their pockets are deep and some have already adapted and solve the challenges.

Dremor@lemmy.world on 22 Aug 10:03 next collapse

Whether they solve it or not doesn’t change the fact that they have to use more resources for crawling, which is the objective here. And by contrast, the website sees a lot less load compared to before the use of Anubis. In any case, I see it as a win.

But despite that, it has its detractors, like any solution that becomes popular.

But let’s be honest, what are the arguments against it?
It takes a bit longer to access for the first time? Sure, but it’s not like you have to click anything or write anything.
It executes foreign code on your machine? Literally 90% of the web does these days. Just disable JavaScript to see how many websites are still functional. I’d be surprised if even a handful are.

The only people who get any advantage from not having Anubis are web crawlers, be they AI bots, indexing bots, or script kiddies trying to find a vulnerable target.

tofu@lemmy.nocturnal.garden on 22 Aug 10:27 next collapse

Sure, I’m not arguing against Anubis! I just don’t think the added compute cost is sufficient to keep them out once they adjust.

rumba@lemmy.zip on 23 Aug 02:40 collapse

Conceptually, you could just really twist the knobs up. A human can wait 15 seconds to read a page. But if you’re trying to scrape 100,000 pages and they each take 15 seconds… If you can make it expensive in both power and time, that’s a win.
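
Back-of-the-envelope (assuming one 15-second challenge per page fetch and a single-threaded client):

```python
pages = 100_000
seconds_per_challenge = 15
print(pages * seconds_per_challenge / 86_400)  # ~17.4 days of compute for one scraping thread
```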

daniskarma@lemmy.dbzer0.com on 22 Aug 11:39 next collapse

I’m against it for several reasons. It runs unauthorized heavy-duty code on your end. It’s not JS needed to make the site functional, it’s heavy calculations, unprompted. If they added a simple “click to run challenge” button it would at least be more polite and less “malware-like”.

For some old devices the challenge lasts over 30 seconds; I can type a captcha in less time than that.

It puts several sites that people (like the article author) tend to browse directly from a terminal behind the requirement to use a full browser.

It’s a delusion. As shown by the article author, solving the PoW challenge is not that much of an added cost. The reduction in scraping would be the same with any other novel method; crawlers are just not prepared for it. Any prepared crawler would have no issues whatsoever. People are seeing results just because of its obscurity, not because it really works as advertised. And in fact I believe some sites are starting to get crawled aggressively despite Anubis, as some crawlers are already catching up with this new Anubis trend.

Take into account that the challenge needs to be light enough that a good user can enter the website in a few seconds while running the challenge on a browser engine (very inefficient). A crawler interested in your site could easily put up a solution to mine the PoW using CUDA on a GPU, which would be hundreds if not thousands of times more efficient. So the balance of difficulty (still browsable for users but costly to crawl) is not feasible.

It’s not universally applicable. Imagine if all the internet were behind PoW challenges. It would be like constant Bitcoin mining, a total waste of resources.

The company behind Anubis seems shadier to me each day. They feed on anti-AI paranoia, they didn’t even answer the article author’s valid criticisms when he emailed them, and they clearly use PR language aimed to convince and please certain demographics in order to place their product. They are full of slogans but lack substance. I just don’t trust them.

Dremor@lemmy.world on 22 Aug 12:26 collapse

Fair point. I do agree with the “click to execute challenge” approach.

For the terminal browser, it has more to do with it not respecting web standards than with Anubis not working on it.

As for old hardware, I do agree that a temporization could be a good idea, if it weren’t so easy to circumvent. In such a case bots would just wait in the background and resume once the timer has elapsed, which would vastly decrease Anubis’s effectiveness, as they don’t use much power to do so. There isn’t really much that can be done here.

As for the CUDA solution, that will depend on the implemented hash algorithm. Some of them (like the one used by Monero) are made to be vastly more inefficient on a GPU than on a CPU. Moreover, GPU servers are far more expensive to run than CPU ones, so the result would be the same: crawling would be more expensive.
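
For illustration, a memory-hard function like scrypt (used here only to show the idea; it isn’t what Anubis or Monero actually use) forces each evaluation to touch tens of megabytes of scratch memory, which is exactly what erodes the GPU advantage:

```python
import hashlib, os

salt = os.urandom(16)
# n=2**15, r=8 requires roughly 32 MiB per evaluation (about 128 * n * r bytes)
digest = hashlib.scrypt(b"challenge-nonce", salt=salt, n=2**15, r=8, p=1,
                        maxmem=64 * 1024 * 1024)
print(digest.hex())
```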

In any case, the best solution would be by far to make it a legal requirement to respect robots.txt, but for now the legislators prefer to look the other way.

int32@lemmy.dbzer0.com on 22 Aug 22:11 collapse

I use uMatrix, which blocks JS by default, so it is a bit inconvenient to have to enable JS for some sites. Websites which didn’t need it before, which is often the reason I use them, now require JavaScript.

EncryptKeeper@lemmy.world on 22 Aug 22:53 collapse

The point was never that Anubis challenges are something scrapers can’t get past. The point is it’s expensive to do so.

Some bots don’t use JavaScript and can’t solve the challenges, so they’d be blocked, but there was never any point in time where no scrapers could solve them.

JuxtaposedJaguar@lemmy.ml on 22 Aug 23:09 collapse

Wait, so browsers that disable JavaScript won’t be able to access those websites? Then I hate it.

Not everyone wants unauthenticated RCE from thousands of servers around the world.

EncryptKeeper@lemmy.world on 22 Aug 23:52 collapse

Not everyone wants unauthenticated RCE from thousands of servers around the world.

I’ve got really bad news for you my friend

cupcakezealot@piefed.blahaj.zone on 22 Aug 10:16 next collapse

because anime catgirls are the best

Klear@quokk.au on 22 Aug 12:40 collapse

If that sounds familiar, it’s because it’s similar to how bitcoin mining works. Anubis is not literally mining cryptocurrency, but it is similar in concept to other projects that do exactly that

Did the author only now discover cryptography? It's like a cryptocurrency, just without currency, what a concept!

ChaoticEntropy@feddit.uk on 22 Aug 22:23 next collapse

It’s quite similar.

SkaveRat@discuss.tchncs.de on 23 Aug 01:23 collapse

It’s a perfectly valid way to explain it, though

If you try to show up with “cryptography” as an explanation, people will think of encrypting messages, not proof of work

“Cryptocurrency without the currency” really is the perfect single-sentence explanation