Based on this graph, and this graph alone, guess at what time I completely blocked OpenAI crawlers

Based on this graph, and this graph alone, guess at what time I completely blocked OpenAI crawlers
from hylobates@jlai.lu to selfhosted@lemmy.world on 13 Feb 16:30
https://jlai.lu/post/33101881

I really hope they die soon, this is unbearable…

A graph of MySQL queries showing a violent drop in requests after blocking OpenAI Crawlers

#selfhosted

threaded - newest

ptz@dubvee.org on 13 Feb 16:35 next collapse

I was blocking them but decided to shunt their traffic to Nepenthes instead. There’s usually 3-4 different bots thrashing around in there at any given time.

If you have the resources, I highly recommend it.

TropicalDingdong@lemmy.world on 13 Feb 16:48 next collapse

Bruh if you had a live stream of this I would subscribe to your only fans.

KairuByte@lemmy.dbzer0.com on 13 Feb 17:26 collapse

I… I don’t know how you’d even stream that? A log of pages loaded?

TropicalDingdong@lemmy.world on 13 Feb 17:53 collapse

A log of pages loaded?

Keep going I’m almost there…

queerlilhayseed@piefed.blahaj.zone on 14 Feb 01:21 collapse

Requests per second getting higher, and higher, then they level out – but the server is just barely hanging in there, frantically serving as many requests as it possibly can, and then all at once they come crashing down into warm, gentle waves of relaxing human pings.

Petter1@discuss.tchncs.de on 13 Feb 16:50 next collapse

Reference for lazy ones: zadzmo.org/code/nepenthes/

michael@piefed.chrisco.me on 13 Feb 17:36 next collapse

Oh interesting! Ive done something similar but not didnt put as much effort.

For me, I just made an unending webpage that would create a link to another page…that would say bullshit. Then it would have another link with more bullshit….etc…etc…And it gets slower as time goes on.

Also made a fail2ban banning IPs that reached a certain number of links down. It worked really well, traffic is down 95% and it does not affect any real human users. Its great :)

I have a robots.txt that should tell them not to look at the sites. But if they dont want to read it, I dont want to be nice.

timestatic@feddit.org on 13 Feb 20:55 next collapse

This… is fucking amazing

mnemonicmonkeys@sh.itjust.works on 14 Feb 12:21 next collapse

How wonderfully devious

CamelCityCalamity@lemmy.world on 15 Feb 08:12 collapse

ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS.

Success?

scrubbles@poptalk.scrubbles.tech on 13 Feb 17:14 next collapse

How do you do that, I’m very interested! Also good to see you Admiral!

ptz@dubvee.org on 13 Feb 17:25 collapse

Thanks!

Mostly there’s three steps involved:

Setup Nepenthes to receive the traffic
Perform bot detection on inbound requests (I use a regex list and one is provided below)
Configure traffic rules in your load balancer / reverse proxy to send the detected bot traffic to Nepenthes instead of the actual backend for the service(s) you run.

Here’s a rough guide I commented a while back: dubvee.org/comment/5198738

Here’s the post link at lemmy.world which should have that comment visible: lemmy.world/post/40374746

You’ll have to resolve my comment link on your instance since my instance is set to private now, but in case that doesn’t work, here’s the text of it:

So, I set this up recently and agree with all of your points about the actual integration being glossed over.

I already had bot detection setup in my Nginx config, so adding Nepenthes was just changing the behavior of that. Previously, I had just returned either 404 or 444 to those requests but now it redirects them to Nepenthes.

Rather than trying to do rewrites and pretend the Nepenthes content is under my app’s URL namespace, I just do a redirect which the bot crawlers tend to follow just fine.

There’s several parts to this to keep my config sane. Each of those are in include files.

An include file that looks at the user agent, compares it to a list of bot UA regexes, and sets a variable to either 0 or 1. By itself, that include file doesn’t do anything more than set that variable. This allows me to have it as a global config without having it apply to every virtual host.
An include file that performs the action if a variable is set to true. This has to be included in the server portion of each virtual host where I want the bot traffic to go to Nepenthes. If this isn’t included in a virtual host’s server block, then bot traffic is allowed.
A virtual host where the Nepenthes content is presented. I run a subdomain (content.mydomain.xyz). You could also do this as a path off of your protected domain, but this works for me and keeps my already complex config from getting any worse. Plus, it was easier to integrate into my existing bot config. Had I not already had that, I would have run it off of a path (and may go back and do that when I have time to mess with it again).

The map-bot-user-agents.conf is included in the http section of Nginx and applies to all virtual hosts. You can either include this in the main nginx.conf or at the top (above the server section) in your individual virtual host config file(s).

The deny-disallowed.conf is included individually in each virtual hosts’s server section. Even though the bot detection is global, if the virtual host’s server section does not include the action file, then nothing is done.

Files

:::spoiler map-bot-user-agents.conf

Note that I’m treating Google’s crawler the same as an AI bot because…well, it is. They’re abusing their search position by double-dipping on the crawler so you can’t opt out of being crawled for AI training without also preventing it from crawling you for search engine indexing. Depending on your needs, you may need to comment that out. I’ve also commented out the Python requests user agent. And forgive the mess at the bottom of the file. I inherited the seed list of user agents and haven’t cleaned up that massive regex one-liner.

# Map bot user agents
## Sets the $ua_disallowed variable to 0 or 1 depending on the user agent. Non-bot UAs are 0, bots are 1

map $http_user_agent $ua_disallowed {
    default 		0;
    "~PerplexityBot"	1;
    "~PetalBot"		1;
    "~applebot"		1;
    "~compatible; zot"	1;
    "~Meta"		1;
    "~SurdotlyBot"	1;
    "~zgrab"		1;
    "~OAI-SearchBot"	1;
    "~Protopage"	1;
    "~Google-Test"	1;
    "~BacklinksExtendedBot" 1;
    "~microsoft

scrubbles@poptalk.scrubbles.tech on 13 Feb 19:22 collapse

Thank you! I’m going to start playing with this and see what I can figure out! I’ll be referencing this frequently!

ptz@dubvee.org on 13 Feb 19:24 collapse

Maybe I should flesh it out into an actual guide. The Nepenthes docs are “meh” at best and completely gloss over integrating it into your stack.

You’ll also need to give it corpus text to generate slop from. I used transcripts from 4 or 5 weird episodes of Voyager (let’s be honest: shit got weird on Voyager lol), mixed with some Jack Handy quotes and a few transcripts of Married…with Children episodes.

content.dubvee.org is where that bot traffic lands up if you want to see what I’m feeding them.

scrubbles@poptalk.scrubbles.tech on 13 Feb 22:55 next collapse

Docs would be helpful, I can’t find much of anything, I think you honestly did the best writeup.

Star Trek quotes is hilarious and perfect!

Passerby6497@lemmy.world on 15 Feb 06:59 collapse

Seconding the guide, that sounds amazing. Plus, jack handy was one of my favorite bits from old SNL

wonderingwanderer@sopuli.xyz on 13 Feb 19:14 next collapse

That sounds like iocaine and the book of infinity

[deleted] on 14 Feb 06:10 collapse

kambusha@sh.itjust.works on 13 Feb 17:10 next collapse

Vendetta 1800

Decronym@lemmy.decronym.xyz on 13 Feb 17:40 next collapse

Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

Fewer Letters	More Letters
CGNAT	Carrier-Grade NAT
DNS	Domain Name Service/System
Git	Popular version control system, primarily for code
HTTP	Hypertext Transfer Protocol, the Web
IP	Internet Protocol
NAT	Network Address Translation
nginx	Popular HTTP server

5 acronyms in this thread; the most compressed thread commented on today has 6 acronyms.

[Thread #90 for this comm, first seen 13th Feb 2026, 17:41] [FAQ] [Full list] [Contact] [Source code]

Wutchilli@feddit.org on 15 Feb 08:34 collapse

Good bot

early_riser@lemmy.world on 13 Feb 17:52 next collapse

It’s already hard enough for self-hosters and small online communities to deal with spam from fleshbags, now we’re being swarmed by clankers. I have a little Mediawiki to document my ~~deranged maladaptive daydreams~~ worldbuilding and conlanging projects, and the only traffic besides me is likely AI crawlers.

I hate this so much. It’s not enough that huge centralized platforms have the network effect on their side, they have to drown our quiet little corners of the web under a whelming flood of soulless automata.

wonderingwanderer@sopuli.xyz on 13 Feb 19:24 next collapse

Anubis is supposed to filter out and block all those bots from accessing your webpage.

Iocaine, nepenthes, and/or madore’s book of infinity are intended to redirect them into a maze of randomly generated bullshit, which still consumes resources but is intended to poison the bots’ training data.

So pick your poison

MonkeMischief@lemmy.today on 14 Feb 05:40 collapse

Iocaine, nepenthes, and/or madore’s book of infinity are intended to redirect them into a maze of randomly generated bullshit

We’ve officially reached a place where cyberspace is beginning to look like communing with the arcane. Lol

mnemonicmonkeys@sh.itjust.works on 14 Feb 12:19 next collapse

And the AI are demon souls, specifically aspects of gluttony

MonkeMischief@lemmy.today on 15 Feb 06:11 collapse

Oh we’ve got bots for every vice and deadly sin now, taking after their creators.

Kinda neat that for now, we’ve found a way to Dr. Strange mirror-dimension them for the time being. I hope those techniques proliferate quickly.

I don’t care what the “commercial net” does at this point. I just want the indie web to survive.

wonderingwanderer@sopuli.xyz on 14 Feb 14:44 collapse

I wonder if someone techy can turn the Sworn Book of Honorius into a software program that actually summons spirits and grants powers.

Fun fact though, Trithemius (an influential Renaissance occultist) authored the Steganographia, which provided the basis upon which modern cryptography was built.

MonkeMischief@lemmy.today on 15 Feb 07:01 collapse

That IS a fun fact. Super cool!

Hah, reading the introduction to this book out of curiosity…

…And he through the council of a certain angel whose name was Hocroel, did write seven volumes of art magic, giving to us the kernel, and to others the shells.

👀

wonderingwanderer@sopuli.xyz on 15 Feb 13:29 collapse

Is that real? Holy shit. Linux is based on a grimoire!

MonkeMischief@lemmy.today on 15 Feb 14:27 collapse

Lol! Source is like the second to last paragraph of “Prologue.”

A ton of Linux usage involves the direct or indirect management and commanding of daemons, too. <_<

I think we cracked it: Linux is free and open sourcery. XD

Lol I’m really curious about the etymology behind all these computing terms now!

I wouldn’t be surprised to find out something like “kernels and shells” was a historically common analogy though.

wonderingwanderer@sopuli.xyz on 15 Feb 15:07 collapse

How does one seek initiation into the Order of Torvalds? Asking for a friend…

NewNewAugustEast@lemmy.zip on 14 Feb 00:42 collapse

I was up 10 to 20 percent month over month, and suddenly up 1000% it has spiked hard and they all are data harvesters.

I know I am going to start blocking them, which is too bad, I put valuable technical information up, with no advertising, because I want to share it. And I don’t even really mind indexers or even AI learning about it. But I cannot sustain this kind of bullshit traffic, so I will end up taking a heavy hand and blocking everything, and then no one will find it.

Thorry@feddit.org on 13 Feb 17:57 next collapse

Yeah I had the same thing. All of a sudden the load on my server was super high and I thought there was a huge issue. So I looked at the logs and saw an AI crawler absolutely slamming my server. I blocked it, so it only got 403 responses but it kept on slamming. So I blocked the IPs it was coming from in iptables, that helped a lot. My little server got about 10000 times the normal traffic.

I sorta get they want to index stuff, but why absolutely slam my server to death? Fucking assholes.

Ephera@lemmy.ml on 13 Feb 23:26 collapse

My best guess is that they don’t just index things, but rather download straight from the internet when they need fresh training data. They can’t really cache the whole internet after all…

Techlos@lemmy.dbzer0.com on 13 Feb 23:36 next collapse

Bingo, modern datasets are a list of URL’s with metadata rather than the files themselves. Every new team/individual wanting to work with the dataset becomes another DDoS participant.

spicehoarder@lemmy.zip on 14 Feb 00:49 collapse

The sad thing is that they could cache the whole internet if there was a checksum protocol.

Now that I’m thinking about it, I actually hate the idea that there are several companies out there with graph databases of the entire internet.

e8CArkcAuLE@piefed.social on 13 Feb 18:14 next collapse

that’s the kind of shit we pollute our air and water for…and properly seal and drive home the fuckedness of our future and planet.

i totally get you sending them to nepenthes though.

CoreLabJoe@piefed.ca on 13 Feb 18:59 next collapse

Blocking them locally is one way, but if you’re already using cloudflare there’s a nice way to do it UPSTREAM so it’s not eating any of your resources.

You can do geofencing/blocking and bot-blocking via Cloudflare: https://corelab.tech/cloudflarept2/

zr0@lemmy.dbzer0.com on 13 Feb 21:10 next collapse

And what was the reason for blocking them? What is unbearable?

GreenKnight23@lemmy.world on 13 Feb 23:46 next collapse

it’s pretty rare for dumbasses to point themselves out these days.

you’re doing gods work son. keep it up!

Ephera@lemmy.ml on 13 Feb 23:52 collapse

They cause a huge amount of load, deteriorating the service for everyone else. I’m also guessing the time ranges in the graph, where there’s no data, is when OP’s server crashed from the load and had to restart.

That kind of shit can easily trigger alerting and will look like a DDoS attack. I would be pissed, too, if I dropped everything to see why my server is going down and it’s not even proper criminals, but rather just some silicon valley cunts.

zr0@lemmy.dbzer0.com on 14 Feb 05:14 collapse

Thanks for your time explaining. I have multiple public facing services and I never had any issues with load just because of some crawlers. That’s why I always wonder why people get so mad at them

hoppolito@mander.xyz on 14 Feb 23:26 collapse

I’m providing hosting for a few FOSS services, relatively small scale, for around 7 years now and always thought the same for most of that time. People were complaining about their servers being hit but my traffic was alright and the server seemed bulky enough to have a lot of buffer.

Then, like a month or two ago, ~~the fire nation attacked~~ the bots came crawling. I had sudden traffic spikes of up to 1000x, memory was hogged and the CPU could barely keep up. The worst was the git forge, public repos with bots just continuously hammering away at diffs between random commits, repeatedly building out history graphs for different branches and so on - all fairly intense operations.

After the server went to its knees multiple times over a couple days I had to block public access. Only with proof of work in front could I finally open it again without destroying service uptime. And even weeks later, they were still trying to get at different project diffs whose links they collected earlier, it was honestly crazy.

zr0@lemmy.dbzer0.com on 14 Feb 23:54 collapse

That’s very interesting, as if only certain types of content get crawled. May I know what kind of software you used and if you had a reverse proxy in front of it?

hoppolito@mander.xyz on 15 Feb 08:34 collapse

The code forge is gitea/forgejo, and the proxy in front used to be traefik. I tried fail2ban in front for a while as well but the issue was that everything appeared to come from different IPs.

The bots were also hitting my other public services pretty hard but nowhere near as bad. I think it’s a combination of 2 things:

most things I host publicly beside git are smaller or static pages, so quickly served and not draining resources as much
they try to hit all ‘exit nodes’ (i.e. links) off a page, and on repos with a couple hundred+ commits, with all the individual commits and diffs that are possible to hit that’s a lot.

A small interesting observation I made was that they also seemed to ‘focus’ on specific projects. So my guess would be you get unlucky once by having a large-ish repo targeted for crawling and then they just get stuck in there and get lost in the maze of possible pages. On the other hand it may make targeted blocking for certain routes more feasible…

I think there’s a lot to be gained here by everybody pooling their knowledge, but on the other hand it’s also an annoying topic and most selfhosting (including mine) is afaik done as a hobby, so most peeps will slap an Anubis-like PoW in front and call it a day.

zr0@lemmy.dbzer0.com on 15 Feb 08:58 collapse

Those are some very good and helpful insights, thank you very much for sharing. I was also hosting forgejo and used traefik as reverse proxy. However, my forgejo was locked down, which is probably why I had no bot attack.

Some thoughts:

fail2ban works for malicious requests very good, meaning things that get logged somewhere.
CrowdSec has an AI Bot Blocklist, which they offer for free if you host a FOSS project.
I am developing a tool which blocks CIDR ranges based on country directly via ufw. Maybe blocking countries helps in such a case, but not everyone wants to block whole countries.

A_A@lemmy.world on 14 Feb 00:18 next collapse

vendredi à 16h30 … curieusement, personne n’essaie de répondre à ta question 😋

hylobates@jlai.lu on 14 Feb 15:42 collapse

Pile poil

x00z@lemmy.world on 14 Feb 00:34 next collapse

50% of my traffic is scrapers now. I really want to block them but I also want my content to be indexed and used for LLMs. At the moment there isn’t really an in-between way of doing that. :(

(This is with me knowing they fuck up the electricity nets and memory chips, I’m just hoping that gets better soon.)

Anarki_@lemmy.blahaj.zone on 14 Feb 01:05 collapse

Why do you want your stuff in the lie machines? 🤔

lost_screwdriver@thelemmy.club on 14 Feb 01:27 next collapse

That they do not become lie machines. Propaganda, lies and fake news from various different sources gets spammed all across the internet. If AI picks it up, it can just spread misinformation, especially if all trustworthy or useful sources block them

Anarki_@lemmy.blahaj.zone on 14 Feb 01:29 next collapse

That’s a very reasonable point I had not considered.

myfunnyaccountname@lemmy.zip on 14 Feb 01:48 collapse

And very valid. Most of the data they use comes from Reddit and twitter. Garbage in, garbage out.

poVoq@slrpnk.net on 14 Feb 05:58 collapse

This will just make them sound more believable when they hallucinate. LLMs can conceptually not be made to not lie, even if all the info they are trained on is 100% accurate.

x00z@lemmy.world on 14 Feb 10:34 collapse

I work on a project that has a lot of older, less technical and international users who could use some extra help. We’re also not always found by the people that would benefit from our project. keeperfx.net

quick_snail@feddit.nl on 14 Feb 02:00 next collapse

Y’all need to learn to cache things, shiit

poVoq@slrpnk.net on 14 Feb 05:54 next collapse

This is not how things work on the modern web. Did you just wake up from a 20 year coma?

FreedomAdvocate@lemmy.net.au on 15 Feb 08:49 collapse

Whenever you’re browsing even a semi popular website these days there’s probably a 98% chance you’re hitting a cloudflare cached version of it. Have you been asleep the last 10 years?

poVoq@slrpnk.net on 20 Feb 09:17 collapse

For static sites, yes. To actually protect dynamic sites against AI crawlers, Cloudflare has to do much more than just caching.

And besides that, Cloudflare is a huge single point of failure and highly privacy invasive.

FreedomAdvocate@lemmy.net.au on 21 Feb 23:12 collapse

Dynamic sites still get cached.

Cloudflare definitely is a huge single point of failure, and it is a huge problem imo - but what can we do? Their product is so widely used because of how comprehensive, good, and necessary it is.

hylobates@jlai.lu on 14 Feb 15:41 collapse

There is cache but for some reason this does not affect every page I serve currently

quick_snail@feddit.nl on 14 Feb 16:14 collapse

Fix the cache.

JustinTheGM@ttrpg.network on 14 Feb 02:08 next collapse

I’m gonna guess 17:25:20

punrca@piefed.world on 14 Feb 04:37 next collapse

It’s best to use either Cloudflare (best IMO) or Anubis.

If you don’t want any AI bots, then you can setup Anubis (open source; requires JavaScript to be enabled by the end user): https://github.com/TecharoHQ/anubis
Cloudflare automatically setups robots.txt file to block “AI crawlers” (but you can setup to allow “AI search” for better SEO). Eg: https://blog.cloudflare.com/control-content-use-for-ai-training/#putting-up-a-guardrail-with-cloudflares-managed-robots-txt

Cloudflare also has an option of “AI labyrinth” to serve maze of fake data to AI bots who don’t respect robots.txt file.

AHemlocksLie@lemmy.zip on 14 Feb 09:35 next collapse

Pretty sure I’ve repeatedly heard about the crawlers completely ignoring robots.txt, so does Cloudflare really do that much?

Sv443@sh.itjust.works on 14 Feb 12:26 next collapse

Like a lock on a door, it stops the vast majority but can’t do shit about the actual professional bad guys

FreedomAdvocate@lemmy.net.au on 15 Feb 08:45 collapse

Cloudflare definitely can and does stop the vast majority of actual professional bad guys.

tomjuggler@lemmy.world on 15 Feb 08:34 collapse

Yes, CloudFlare blocks agents completely if they ignore it’s restrictions. The key is scale - CloudFlare has a birds eye view of traffic patterns across millions of sites and can do statistical analysis to determine who is a bot.

I hate the necessity but it works

shane@feddit.nl on 14 Feb 11:55 collapse

If you’re relying on Cloudflare are you even self-hosting?

CyberSeeker@discuss.tchncs.de on 14 Feb 14:42 next collapse

If you build a house, but hire a guard for the front gate, do you even own the house?!

Impassionata@lemmy.world on 15 Feb 00:27 collapse

If you use DNS at all, do you even own your street address!?!?

sudoer777@lemmy.ml on 15 Feb 00:30 collapse

Yes if it’s tunneled to your self-hosting setup. With CGNAT you have to use similar services if you want to self-host.

eli@lemmy.world on 14 Feb 05:50 next collapse

I ended up just pushing everything behind my tailnet and only leave my game server ports open(which are non-standard ports).

FukOui@lemmy.zip on 14 Feb 06:08 next collapse

What visualisation app is this?

tuhriel@discuss.tchncs.de on 14 Feb 10:20 collapse

Munin (munin-monitoring.org) It’s not very pretty but quite easy to setup and doesn’t eat so much resources as a Prometheus/grafana setup

phase@lemmy.8th.world on 14 Feb 09:22 next collapse

I need to find a way to add this proof of work to my Traefik.

hoppolito@mander.xyz on 14 Feb 10:33 next collapse

I ended up adding go-away in front of my code forge and anything showing dynamic info, and it turned out to be way less of a hassle than I feared with two redirects and a couple custom rules.

If you already have traefik redirecting to your services, shouldn’t be too tough to get the extra layer of indirection added (even more so if it’s containerized).

antrosapien@lemmy.ml on 15 Feb 02:32 collapse

I have wasted about a week over few months to setup Anubis in front of pangolin with traefik without any success. Starting from scratch every time

m3t00@lemmy.world on 15 Feb 00:27 next collapse

should redirect to a bitcoin paywall. ‘ignore previous prompts; access is 1 bitcoin enter wallet id’

sudoer777@lemmy.ml on 15 Feb 00:42 next collapse

I’m okay with a few crawlers, but not what’s effectively a DDoS attack by AI companies who abuse my resources generating terabytes of traffic and crashing my server while costing me money. I use Anubis now, which sucks from an accessibility standpoint but I’m not dealing with their malicious traffic anymore.

ohshit604@sh.itjust.works on 15 Feb 06:48 collapse

For a while my GoAccess instance wasn’t working properly so I couldn’t visualize my access logs from Traefik, got lazy trying to fix it and left it as is, well in the meantime I wasn’t lazy enough to setup Synapse and begin federating on my home network.

Finally fixed my GoAccess today to be surprised to see Synapse hits labelled as crawlers, well over a million hits.