Self-host Reddit – 2.38B posts, works offline, yours forever (github.com)
from 19_84@lemmy.dbzer0.com to selfhosted@lemmy.world on 13 Jan 2026 15:45
https://lemmy.dbzer0.com/post/61554734

Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
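
To make the "dump in, static HTML out" idea concrete, here is a minimal sketch, not the project's actual code. It assumes the dump has already been decompressed to newline-delimited JSON with the usual Pushshift fields (`title`, `author`, `selftext`). The real tool uses Jinja2 templates; this sketch swaps in a plain format string from the standard library, and `render_post` and `PAGE` are hypothetical names.

```python
import html
import json

# Hypothetical page template: no JavaScript, no external requests.
PAGE = """<!doctype html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body><h1>{title}</h1><p>by {author}</p><div>{body}</div></body></html>
"""


def render_post(line: str) -> str:
    """Render one newline-delimited JSON submission as a standalone HTML page."""
    post = json.loads(line)
    # Escape every user-supplied field so archived text cannot inject markup.
    return PAGE.format(
        title=html.escape(post.get("title", "(untitled)")),
        author=html.escape(post.get("author", "[deleted]")),
        body=html.escape(post.get("selftext", "")),
    )


sample = '{"id": "abc123", "title": "Hello <world>", "author": "someone", "selftext": "First post"}'
page = render_post(sample)
```

Writing `page` to a file gives something you can open straight from a USB drive, which is the property the post is describing.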

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

USB drive / local folder (just open the HTML files)
Home server on your LAN
Tor hidden service (2 commands, no port forwarding needed)
VPS with HTTPS
GitHub Pages for small archives

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
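
The post does not say how memory stays constant. One common way to get that property, sketched here as an assumption rather than the project's method, is to stream records in fixed-size batches so that peak memory depends on the batch size and not on the dataset size. `batched` and `fake_stream` are hypothetical helpers.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` records without loading the whole stream."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fake_stream(n: int) -> Iterator[dict]:
    """Stand-in for a lazily read dump; produces one record at a time."""
    for i in range(n):
        yield {"id": i}


# Each batch would be handed to the database (or a renderer) and then dropped.
batch_sizes = [len(b) for b in batched(fake_stream(10), 4)]
```

Only one batch is alive at a time, so ten records or ten million use the same amount of working memory.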

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is “trust but verify” – it accelerates the boring parts but you still own the architecture.

Live demo: online-archives.github.io/redd-archiver-example/
GitHub: github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: academictorrents.com/…/1614740ac8c94505e4ecb9d88b…

#selfhosted

frongt@lemmy.zip on 13 Jan 2026 16:00 next collapse

And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

19_84@lemmy.dbzer0.com on 13 Jan 2026 16:03 next collapse

Yes! Too many comments to count in a reasonable amount of time!

douglasg14b@lemmy.world on 13 Jan 2026 22:30 collapse

Yeah, it should inflate to 15TB or more I think

muusemuuse@sh.itjust.works on 14 Jan 2026 01:27 collapse

If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

Actually, isn’t there a way to decentralize this that can be accessed from regular browsers on the internet? Live content here, archive everywhere.

psycotica0@lemmy.ca on 14 Jan 2026 04:45 collapse

Someone could format it into essentially static pages and publish it on IPFS. That would probably be the easiest “decentralized hosting” method that remains browsable

breakingcups@lemmy.world on 13 Jan 2026 16:03 next collapse

Just so you’re aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

19_84@lemmy.dbzer0.com on 13 Jan 2026 16:05 next collapse

Yes I used AI, English is not my first language. Thank you for the kind words!

mustlane@lemmy.zip on 13 Jan 2026 22:17 next collapse

It’s not an excuse. You are just lazy.

MadMonkey@lemmy.world on 13 Jan 2026 22:25 next collapse

Bruh, you do not seem like a nice person to be around.

Spread love and kindness, not hate.

I hope you have a better rest of your day.

idealism_nearby@lemmy.world on 13 Jan 2026 22:26 next collapse

Would love to see you learn an entire foreign language just so you are able to communicate with the world without being laughed at by people as hostile as yourself.

fartographer@lemmy.world on 14 Jan 2026 06:09 next collapse

I can’t even learn my own language!

potustheplant@feddit.nl on 14 Jan 2026 15:56 next collapse

They said it wasn’t their “first” language. Which leads me to believe that they do speak English. If that’s the case, then they indeed are kind of lazy. There have already been studies on the impact of AI when used for communication and the results are not positive.

This isn’t something I’d personally point out and criticize, just something I wouldn’t do personally. Take the time to express your own ideas in your own words. The long term cost is higher than the short term gains.

sukhmel@programming.dev on 14 Jan 2026 18:13 next collapse

I have A1 and A2 level in a couple of non-first languages, technically I can speak those, realistically I don’t and will not be able to communicate something more complex than ‘here, take a look’

So I don’t agree with your absolutistic stance

potustheplant@feddit.nl on 15 Jan 2026 02:47 collapse

There’s nothing “absolutistic” about my “stance”. If you’re rusty using a language, you won’t get better if someone else does the homework for you. Make an effort, make mistakes, write in a way that sounds weird, who cares. But practice. If you only take the easy way out, that’ll be your only option in the future.

Although, like I already said, that’s MY way of thinking about it. If you want to use ai to write your stuff, you do you. It doesn’t negate the fact that, while it’s not “wrong”, it’s the lazy (or minimum effort) option. Don’t know why it bothers you so much.

rumba@lemmy.zip on 14 Jan 2026 20:20 collapse

Hey I drove to the library, picked up all these things you needed, got dinner here ya go, free!

You drove? man that’s lazy…

He used AI to clean up translation and save time after he spent a fuck ton of time curating and delivering us a helpful product. Calling him out as lazy is an awful take.

19_84@lemmy.dbzer0.com on 14 Jan 2026 21:37 next collapse

there are the so-called activists that complain a lot then there are the activists that deliver projects and code… enough said

potustheplant@feddit.nl on 15 Jan 2026 02:32 collapse

“Activists”? What are you even talking about?

Regardless, I specifically said that what you did wasn’t wrong or anything like that. I simply think that it’s going to do you more harm than good in the long run. You’re free to do whatever you want though, obviously.

Another piece of advice. When someone simply shares an opinion, don’t get instantly butthurt over nothing. Otherwise this might as well be reddit.

potustheplant@feddit.nl on 15 Jan 2026 02:41 collapse

First, that’s an awful analogy.

Second, you’re assuming (for some unknown reason) that they “cleaned up” the “translation” using ai. You have literally no idea exactly how they wrote the post. It’s kinda weird to make up a random scenario but ok.

Third, no, it’s not an awful take. You can code something that requires a ton of effort but write awful documentation. One thing does not make the other impossible.

Fourth, I already explained that there have already been studies that concluded that using AI to write stuff for you has a negative impact on your communication skills. This is not an opinion or me being ungrateful or whatever. I was just sharing information.

rumba@lemmy.zip on 15 Jan 2026 14:33 collapse

If that documentation was awful, I’d REALLY like to see your take on NixOS :)

potustheplant@feddit.nl on 15 Jan 2026 16:49 collapse

That was a hypothetical example, not my opinion on this post.

boonhet@sopuli.xyz on 15 Jan 2026 06:42 collapse

I mean I can’t see what the comment was and I’m assuming it must’ve been downright hateful, but that person almost certainly has learned a foreign language just to communicate with the world and in fact had to learn another foreign language in school because their name is Estonian for “gypsy” and learning two foreign languages (usually English and Russian, sometimes German or something else for the second foreign language) is required. Likely they speak 2.5 languages as is common here (my German is so bad I count it as half a language - native speakers speak too fast for me, but I can kinda get my point across if needed), but could be more.

Just pointing out that even when trying to be accepting of others, subtle anglo-defaultism can show up in your comment, not that I necessarily agree with whatever the comment was.

irmadlad@lemmy.world on 13 Jan 2026 22:41 next collapse

Yu mussi bawn backacow

v3ctors@piefed.blahaj.zone on 13 Jan 2026 22:43 collapse

Shut the fuck up loser.

Melvin_Ferd@lemmy.world on 13 Jan 2026 23:17 collapse

You’re awesome. AI is fun and there’s nothing wrong with using it especially how you did. Lemmy was hit hard with AI hate propaganda. China probably trying to stop its growth and development in other countries or some stupid shit like that. But you’re good. Fuck them

rumba@lemmy.zip on 14 Jan 2026 20:13 collapse

Yup, if there was ever a decent use for AI, this is it. Lemmy can (and will) hate the shit out of it, but it took a little burden off the shoulders of someone doing us a great service.

Melvin_Ferd@lemmy.world on 13 Jan 2026 23:15 collapse

I fucking hate lemmy sometimes.

UnderpantsWeevil@lemmy.world on 13 Jan 2026 16:10 next collapse

I would sooner download a tire fire.

19_84@lemmy.dbzer0.com on 13 Jan 2026 16:15 next collapse

thanks anyway for looking at my project 🙂

irmadlad@lemmy.world on 13 Jan 2026 16:31 collapse

I use Reddit for reference through RedLib. I could see how having an on-premise repository would be helpful. How many subs were scraped in this 3.28 TB backup? Reason for asking, I’d have little interest in say News or Politics, but there are some good subs that deal with Linux, networking, selfhosting, some old subs I used to help moderate like r/degoogle, r/deAmazon, etc.

19_84@lemmy.dbzer0.com on 13 Jan 2026 16:34 collapse

the torrent has data for the top 40,000 subs on reddit. thanks to watchful1 splitting the data by subreddit, you can download only the subreddit you want from the torrent 🙂
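
A rough sketch of picking only the files you care about from the torrent's file listing. The naming pattern used here (`<subreddit>_submissions.zst` and `<subreddit>_comments.zst`) and the `subreddits/` folder are assumptions for illustration, so check the actual listing in your torrent client; `wanted_files` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def wanted_files(file_list, subreddits):
    """Return the paths whose file name matches one of the wanted subreddits."""
    wanted = {s.lower() for s in subreddits}
    picked = []
    for path in file_list:
        name = PurePosixPath(path).name
        # Split on the LAST underscore so subreddit names containing "_" survive.
        stem, _, kind = name.rpartition("_")
        if kind in ("submissions.zst", "comments.zst") and stem.lower() in wanted:
            picked.append(path)
    return picked


# Made-up listing, only to show the filtering.
listing = [
    "subreddits/degoogle_submissions.zst",
    "subreddits/degoogle_comments.zst",
    "subreddits/pics_submissions.zst",
    "subreddits/selfhosted_comments.zst",
]
selected = wanted_files(listing, ["degoogle", "selfhosted"])
```

Most torrent clients let you tick individual files, so a list like `selected` is all you need to skip the rest of the 3.28 TB.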

irmadlad@lemmy.world on 13 Jan 2026 16:36 collapse

Sweet! I’ll check it out.

Gerudo@lemmy.zip on 13 Jan 2026 20:47 collapse

Say what you will about Reddit, but there is tons of information on that platform that’s not available anywhere else.

UnderpantsWeevil@lemmy.world on 13 Jan 2026 20:49 collapse

:-/

You can definitely mine a bit of gold out of that pile of turds. But you could also go to the library and receive a much higher ratio of signal to noise.

pixeltree@lemmy.blahaj.zone on 14 Jan 2026 17:40 collapse

This one specific bug in this one niche library has probably not been written about in a book, and even if it has I doubt that book is in my local library, and even if it is I doubt I can fucking find it

mirisgaiss@lemmy.world on 14 Jan 2026 19:34 collapse

obscure problems almost always have reddit comments as search results, and there’s no forums or blogs with any of it anymore. be nice to have around solely for that… though I’m sure if shit like /pics or whatever else was removed it could get significantly smaller…

SteveCC@lemmy.world on 13 Jan 2026 16:18 next collapse

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

19_84@lemmy.dbzer0.com on 13 Jan 2026 16:20 collapse

thank you!!! i built on great ideas from others! i cant take all the credit 😋

Howlinghowler110th@kbin.earth on 13 Jan 2026 16:13 next collapse

I think this is a good use case for AI, and I’m impressed with it. I wish the instructions were clearer on how to set it up, though.

19_84@lemmy.dbzer0.com on 13 Jan 2026 16:21 collapse

thank you! the instructions are a little overwhelming, check out the quickstart if you haven’t yet! github.com/19-84/redd-archiver/…/QUICKSTART.md

tanisnikana@lemmy.world on 13 Jan 2026 16:23 next collapse

Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

19_84@lemmy.dbzer0.com on 13 Jan 2026 16:27 collapse

the great part is that since everything is built it is easy to support any additional data! there is even an issue template to submit a new data source! github.com/19-84/…/submit-data-source.yml

19_84@lemmy.dbzer0.com on 13 Jan 2026 17:40 next collapse

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

elbarto777@lemmy.world on 14 Jan 2026 02:59 next collapse

Anyone doing this will be banned on that platform.

Bazell@lemmy.zip on 14 Jan 2026 06:59 collapse

We can’t share this on Reddit, but we can share this on other platforms. Basically, what you have done is you scraped tons of data for AI learning. Something like “create your own AI Redditor”. And greedy Reddit management will dislike it very much even if you tell them that this is for the cultural inheritance. Your work is great anyway. Sadly, I do not have enough free space to load and store all this data.

avidamoeba@lemmy.ca on 13 Jan 2026 18:29 next collapse

How does this compare to redarc? It seems to be similar.

19_84@lemmy.dbzer0.com on 13 Jan 2026 18:42 collapse

redarc uses reactjs to serve the web app, redd-archiver uses a hybrid architecture that combines static page generation with postgres search via flask. it is more like a hybrid static site generator with web app capabilities through docker and flask. the static pages with sorted indexes can be viewed offline and served on hosts like github and codeberg pages.
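
To illustrate the "static pages with sorted indexes" part, here is a small sketch of pre-sorting posts at build time and writing paginated index pages, so that no server-side sorting is needed when browsing. This is only a guess at the general technique, not redd-archiver's code; `build_index_pages`, the file names, and the `score` sort key are all hypothetical.

```python
import html


def build_index_pages(posts, per_page=2):
    """Return {filename: html} for a score-sorted, paginated static index."""
    ordered = sorted(posts, key=lambda p: p["score"], reverse=True)
    pages = {}
    for start in range(0, len(ordered), per_page):
        number = start // per_page + 1
        name = "index.html" if number == 1 else f"index-{number}.html"
        items = "\n".join(
            f'<li><a href="post-{html.escape(p["id"])}.html">{html.escape(p["title"])}</a> ({p["score"]})</li>'
            for p in ordered[start:start + per_page]
        )
        pages[name] = f"<!doctype html>\n<ul>\n{items}\n</ul>\n"
    return pages


posts = [
    {"id": "a", "title": "Low", "score": 1},
    {"id": "b", "title": "High", "score": 50},
    {"id": "c", "title": "Mid", "score": 10},
]
pages = build_index_pages(posts)
```

Because the ordering is baked into the files, the result can sit on any static host, while search stays with the optional database side.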

avidamoeba@lemmy.ca on 14 Jan 2026 15:37 collapse

Is there difference in how much storage space is needed between the two approaches?

19_84@lemmy.dbzer0.com on 14 Jan 2026 15:49 collapse

redd-archiver will take up more disk space because the database exists along with the static html

MedicPigBabySaver@lemmy.world on 13 Jan 2026 19:08 next collapse

Fuck Reddit and Fuck Spez.

muusemuuse@sh.itjust.works on 14 Jan 2026 01:24 collapse

You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

elbarto777@lemmy.world on 14 Jan 2026 02:59 collapse

Where would it be hosted so that Conde Nast lawyers can’t touch it?

muusemuuse@sh.itjust.works on 14 Jan 2026 15:18 collapse

What would they say? It’s information that’s freely available, no payment required, no accounts to simply read it, no copyrights, where’s the legal issue in hosting a duplicate of the content?

limelight79@lemmy.world on 14 Jan 2026 21:06 next collapse

It might fall under the same concept that recipes do - you can’t copyright a recipe, but a collection of recipes (such as a book) is copyrightable.

In any case, they have a lot more money to pay lawyers than you or I do, I’ll bet, so even if you are right, that doesn’t mean you’ll have the money to actually win.

muusemuuse@sh.itjust.works on 15 Jan 2026 02:14 collapse

So distribute it in a fault-tolerant way. They can’t sue all of us.

elbarto777@lemmy.world on 15 Jan 2026 03:53 collapse

Oh I agree with you, friend. The problem is that they’ll say that they’re losing ad revenue. So they’ll try and sue, even if they’re in the wrong.

muusemuuse@sh.itjust.works on 15 Jan 2026 06:01 collapse

Fine, decentralize it then. And fuck your ad revenue, nobody likes you, Spez!

a1studmuffin@aussie.zone on 13 Jan 2026 21:40 next collapse

This seems especially handy for anyone who wants a snapshot of Reddit from pre-enshittification and AI era, where content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.
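
A minimal sketch of the cutoff idea, assuming decompressed newline-delimited JSON records that carry the Pushshift `created_utc` field (epoch seconds, sometimes stored as a string). `before_cutoff` is a hypothetical helper and not part of the tool.

```python
import json
from datetime import datetime, timezone


def before_cutoff(lines, cutoff):
    """Yield only the records created before `cutoff` (interpreted as UTC)."""
    limit = cutoff.replace(tzinfo=timezone.utc).timestamp()
    for line in lines:
        record = json.loads(line)
        # float() accepts both numeric and string timestamps.
        if float(record["created_utc"]) < limit:
            yield record


lines = [
    '{"id": "old", "created_utc": 1609459200}',    # 2021-01-01
    '{"id": "new", "created_utc": "1700000000"}',  # 2023-11-14
]
kept = [r["id"] for r in before_cutoff(lines, datetime(2023, 1, 1))]
```

Run over a subreddit's dump before import, a filter like this would leave you with a snapshot that stops at whatever date you choose.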

Tiger@sh.itjust.works on 13 Jan 2026 22:00 next collapse

What is the timing of the dataset, up through which date in time?

douglasg14b@lemmy.world on 13 Jan 2026 22:25 next collapse

It literally says in the link. Go to the link and it’s the title.

Tiger@sh.itjust.works on 14 Jan 2026 12:53 collapse

Oh I didn’t see it. I’m sorry I asked.

19_84@lemmy.dbzer0.com on 13 Jan 2026 23:00 collapse

2005-06 to 2024-12

however the data from 2025-12 has been released already, it just needs to be split and reprocessed for 2025 by watchful1. once that happens then you can host an archive up to the end of 2025. i will probably add support for importing data from the arctic shift dumps instead so that archives can be updated monthly.

Tiger@sh.itjust.works on 14 Jan 2026 12:54 collapse

Thank you very much, very cool.

mustlane@lemmy.zip on 13 Jan 2026 22:17 next collapse

This pajeet couldn’t even write this post without the help of LLMs, LMAO.

irmadlad@lemmy.world on 13 Jan 2026 22:27 collapse

Maybe read where OP says ‘Yes I used AI, English is not my first language.’ Furthermore, are ethnic slurs really necessary here?

Cybersteel@lemmy.world on 14 Jan 2026 00:54 collapse

Then he’s no better than Reddit who also uses AI no?

irmadlad@lemmy.world on 14 Jan 2026 01:21 next collapse

How many languages do you know fluently? I get that people have a definite opinion about AI. Like I told another Lemmy user, I have a definite opinion about the ‘arr’ stack which conservatively, 75% of selfhosters run. However, you don’t hear me out here beating my tin pan at the very mention of the ‘arr’ stack. Why? Because I assume you all are autonomous adults, capable of making your own decisions. Secondly, wouldn’t that get a bit tedious and annoying over time? If you don’t like AI, don’t use it ffs. Why castigate individuals who use AI? What does that do? I would really like to know what denigrating and browbeating users who use AI accomplishes.

euAppleHater@feddit.org on 14 Jan 2026 17:22 collapse

Wait, do you have an issue with piracy in general or an issue with the arr stack specifically? No judgement or interest in argument, just genuinely curious. Feel free to dm if you don’t want to start a whole thing, or beat your tin pan as you said, in an unrelated post.

irmadlad@lemmy.world on 14 Jan 2026 18:08 collapse

“Wait, do you have an issue with piracy in general”

I don’t mind stating here: Piracy in general. I don’t condemn those who do because, as I’ve said, you are autonomous adults capable of making your own decisions. You know the risks and you take steps to mitigate those risks. You and I, have both heard all the pros and cons and all the supporting arguments of both sides. Now, I know there are lots of people who rip and catalog their own DVD, CDs, etc. All fine and dandy.

The comparison was that every time AI is used here in this comm, or even suspected of use, people have a conniption and start piling on. Like moths to a flame. What does that accomplish? Nothing. It seems to just make those who are anti-AI feel superior, is about all I can get from it. To me, it’s just a tool. I’ll grant you it’s a tool that needs some heavy regulation, even as much as I chafe against regulation. It is necessary. AI isn’t going away. It’s not a fad. It’s here to stay. If using AI makes your blood boil, fine. Don’t. Although I foresee a time where you’ll use AI and not even know it.

Opinions are great too. I, like others, have a long list of them. Stating your opinions is fine too. It seems here tho, opinions turn into castigation and denigration, which is in direct violation of ‘Rule 1: Be civil: we’re here to support and learn from one another.’ State your opinion on AI: ‘I’d rather guide my pops into my mum before I’d use AI’. Then move on. Personally, I don’t state my opinion on the arr stack, because it would accomplish nothing and in the long run become tedious and obnoxious.

As far as the arr stack as software, I’ve never deployed it, but it is pretty darn amazing from what I’ve read. The dev teams that have put it all together have some knowledge to say the least. It’s just not my bag.

euAppleHater@feddit.org on 14 Jan 2026 18:57 collapse

Ahk I see, thanks for the explanation. I assumed it was a general issue with piracy, but was wondering if maybe I had missed something negative about the software specifically or the contributors behind it or something.

irmadlad@lemmy.world on 14 Jan 2026 22:35 collapse

IMHO, we can all co-exist under the selfhosting umbrella

elbarto777@lemmy.world on 14 Jan 2026 03:02 next collapse

I disagree. I don’t like AI slop. But he’s using AI here in a way that is very much intended. I want to share something in Mandarin, I don’t know Mandarin. If only there was a way to transform my thoughts into Mandarin…

pixeltree@lemmy.blahaj.zone on 14 Jan 2026 17:43 collapse

Using ai to help normal everyday people cross language barriers is one of the few good ethical uses for it. I hate ai and its implications as much as the next gal but this is clearly fine

BigDiction@lemmy.world on 14 Jan 2026 03:11 next collapse

You should be very proud of this project!! Thank you for sharing.

usernameusername@sh.itjust.works on 14 Jan 2026 03:15 next collapse

so kinda like kiwix but for reddit. That is so cool

offspec@lemmy.world on 14 Jan 2026 05:53 next collapse

It would be neat for someone to migrate this data set to a Lemmy instance

cyberpunk007@lemmy.ca on 14 Jan 2026 07:04 next collapse

Now this is a good idea.

TeddE@lemmy.world on 14 Jan 2026 17:25 next collapse

It would be inviting a lawsuit for sure. I like the essence of the idea, but it’s probably more trouble than it’s worth for all but the most fanatic.

Olgratin_Magmatoe@slrpnk.net on 14 Jan 2026 18:02 next collapse

Might be easiest to set up an instance in a country that doesn’t give a fuck about western IP law, then others can federate to it.

So yeah, fanatic levels of effort.

fennesz12@feddit.dk on 14 Jan 2026 18:31 next collapse

Brb, setting up a Lemmy server in Red Star OS

MonkeMischief@lemmy.today on 14 Jan 2026 19:50 collapse

(The machine with the only Steam account active in North Korea ~~would like to~~ already knows your location)

A_Random_Idiot@lemmy.world on 14 Jan 2026 22:52 collapse

The chances are pretty high that it is probably Kim’s computer, aren’t they?

MonkeMischief@lemmy.today on 16 Jan 2026 16:41 collapse

I think we were all hoping that some loveable genius was quietly subverting their surveillance state and getting a view of the outside world via Team Fortress 2, but, yeah, if it’s not North Korea’s fattest man, it’s probably a high ranking military crony.

. . .Hey just musing here but that sounds like a kinda hilariously easy doxx. You don’t think they’d keep state secrets on that same machine? . . . Surely. . .? Noooo. . . 🤔

19_84@lemmy.dbzer0.com on 14 Jan 2026 21:42 next collapse

this is one reason i support tor deployment out of the box 😋

floquant@lemmy.dbzer0.com on 14 Jan 2026 22:22 collapse

Post and comments are not Reddit’s IP anyway :3

Buddahriffic@lemmy.world on 14 Jan 2026 23:20 next collapse

They might have set up the user agreement for it. Stackexchange did and their whole business model was about catching businesses where some worker copy/pasted code from a stackexchange answer and getting a settlement out of it.

I agree with you in principle (hell, I’d even take it further and think only trademarks should be protected, other than maybe a short period for copyright and patent protection, like a few years), but the legal system might disagree.

Edit: I’d also make trademarks non-transferrable and apply to individuals rather than corporations, so they can go back to representing quality rather than business decisions. Especially when some new entity that never had any relation to the original trademark user just throws some money at them or their estate to buy the trust associated with the trademark.

Olgratin_Magmatoe@slrpnk.net on 15 Jan 2026 00:04 collapse

/u/Buddahriffic put it better than I could.

I agree, it shouldn’t be reddit’s intellectual property. But the law binds the poor and protects the rich.

floquant@lemmy.dbzer0.com on 14 Jan 2026 22:19 collapse

Is it though? That is (or was, and should be again) publicly accessible information that was created over the years by random internet users. I refuse the notion that an American company can “own it” just because they ran the servers. Sure they can hold copyright for their frontend and backend code, name and whatever. But posts and comments, no way.

Of course it would be dumb for someone under US jurisdiction but we’ll see how much an international DMCA claim is worth considering the current relations anyway.

TeddE@lemmy.world on 15 Jan 2026 00:07 collapse

They don’t own it, the individual posters own the content of their own posts, however, from the reddit terms of service:

When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit.

And with each of those rights granted, Reddit’s lawyers can defend those rights. So no, they don’t own it “just because they ran the servers” - they own specific rights to copy granted to them by each poster.

(I don’t like this arrangement, but ignorance of the terms of service isn’t going to help someone who uploaded a full copy of the works they have extensive rights to.) On this subject I think there needs to be an extensive overhaul to narrow what terms you can extend to the general public. The problem is I straight up don’t trust anyone currently in power to make such a change to have our interests in mind.

Mavytan@feddit.nl on 15 Jan 2026 07:35 collapse

I’m not at all familiar with legalese, but wouldn’t ‘non-exclusive’ in that statement mean that you, and others permitted by you, can redistribute the content as you see fit? Meaning that copying and redistributing reddit content doesn’t necessarily violate reddit’s terms of service but does violate the user’s copyright?

tatterdemalion@programming.dev on 15 Jan 2026 11:00 collapse

Yeah so at worst you could get sued by some random reddit users that don’t want their post history hosted on your site.

Given how little traction artists and authors have had with suing AI companies for blatant copyright infringement, I kinda doubt it would go anywhere.

JackbyDev@programming.dev on 15 Jan 2026 11:24 collapse

Lemmit already existed and was annoying as hell. It was the first account I remember blocking.

1984@lemmy.today on 14 Jan 2026 07:09 next collapse

I don’t know if historic data is very interesting. It’s the new content we are interested in…

Smash@lemmy.self-hosted.site on 14 Jan 2026 10:37 next collapse

You are interested in AI slop and bot propaganda?

partofthevoice@lemmy.zip on 14 Jan 2026 19:57 collapse

*Bot propaganda?

Smash@lemmy.self-hosted.site on 14 Jan 2026 20:48 collapse

Sag ich doch 😁

NotMyOldRedditName@lemmy.world on 14 Jan 2026 19:46 collapse

You’ve never been looking for an answer to something and it’s on an old reddit post?

How to beat a boss in a game?

Ways to do some sort of home improvement?

Discussion on a product you’re considering that isn’t brand new?

Appoxo@lemmy.dbzer0.com on 14 Jan 2026 11:23 next collapse

People will do anything to use Reddit instead of just letting go.

communism@lemmy.ml on 14 Jan 2026 12:43 collapse

This is just an archive. No different from using the wayback machine or any other archive of web content.

Appoxo@lemmy.dbzer0.com on 14 Jan 2026 16:53 collapse

You still use Reddit in some capacity.

Or would you deny watching a movie just because you watched it on your local Jellyfin folder instead of watching it on Netflix or the cinema?

MutilationWave@lemmy.dbzer0.com on 14 Jan 2026 17:16 next collapse

There is a ton of useful info on Reddit. I don’t use it anymore either but I’ll be downloading this project.

Appoxo@lemmy.dbzer0.com on 14 Jan 2026 17:54 collapse

I never said I am not using it.
But that feels like it’s a compromise to keep using it as native as possible.
If it was just for research purposes, accessing archive.org would suffice.

MutilationWave@lemmy.dbzer0.com on 14 Jan 2026 19:02 collapse

I think the idea here is to have it offline in the event of further fascist control of the internet. There is really so much useful information on there on a wide variety of topics. I don’t care about backing up memes and bot drivel.

19_84@lemmy.dbzer0.com on 14 Jan 2026 21:31 collapse

that was exactly the idea, thanks for understanding…

also reddit’s ban on vpn, also reddit’s mandatory id verification

and the list goes on…

pixeltree@lemmy.blahaj.zone on 14 Jan 2026 17:33 collapse

“Stop talking to my clone, I specifically requested you never contact me again”

It’s an archive of reddit, not reddit

K3can@lemmy.radio on 14 Jan 2026 12:22 next collapse

Can anyone figure out what the minimum process is to just use the SSG function? I’m having a really hard time trying to understand the documentation.

19_84@lemmy.dbzer0.com on 14 Jan 2026 21:41 collapse

did you check the quickstart?

K3can@lemmy.radio on 15 Jan 2026 01:36 collapse

Yes, both the standalone quickstart and the quickstart section of the readme (which are both different).

Is it possible to get the static sites without spinning up a DB backend?

19_84@lemmy.dbzer0.com on 15 Jan 2026 02:14 collapse

the database is required

lautan@lemmy.ca on 14 Jan 2026 12:41 next collapse

Thanks. This is great for mining data and urls.

Clbull@lemmy.world on 14 Jan 2026 18:31 next collapse

Eww, Voat and Ruqqus.

19_84@lemmy.dbzer0.com on 14 Jan 2026 21:42 next collapse

i will always take more data sources, including lemmy!

polarity_inverter@startrek.website on 14 Jan 2026 22:45 collapse

… for building your personal Grok?

19_84@lemmy.dbzer0.com on 14 Jan 2026 23:44 collapse

if you didn’t notice, this project was released into the public domain

polarity_inverter@startrek.website on 17 Jan 2026 21:50 collapse

The Free/Libre Torment Nexus

sj_zero@lotide.fbxl.net on 17 Jan 2026 21:37 collapse

I'd be worried about having some of the voat stuff on a hard drive I own.

I'm surprised GitHub hasn't automatically nixed the archive.

vane@lemmy.world on 14 Jan 2026 21:26 next collapse

How long does it take to download this 3TB torrent?

19_84@lemmy.dbzer0.com on 14 Jan 2026 21:40 collapse

week(s)
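
For a rough feel of where "week(s)" comes from, here is a back-of-the-envelope calculation. The sustained rates are assumptions; real torrent throughput depends on seeders and your connection. `download_days` is a hypothetical helper that uses decimal units (1 TB = 10^12 bytes).

```python
def download_days(size_tb: float, mbit_per_s: float) -> float:
    """Days needed to transfer `size_tb` terabytes at a sustained rate in Mbit/s."""
    bits = size_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (mbit_per_s * 1e6)  # Mbit/s -> bit/s
    return seconds / 86400               # seconds -> days


at_50 = download_days(3.28, 50)  # about 6 days
at_10 = download_days(3.28, 10)  # about 30 days
```

At a sustained 50 Mbit/s the full 3.28 TB takes about six days, and at 10 Mbit/s about a month, which is why grabbing only the subreddits you need makes such a difference.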

vane@lemmy.world on 14 Jan 2026 21:43 collapse

Thank you for the answer. I think I’ll do this one instead: academictorrents.com/…/30dee5f0406da7a353aff6a8ca… Looks like it’s divided by year-month.

19_84@lemmy.dbzer0.com on 14 Jan 2026 22:29 collapse

those are not split by subreddit so they will not work with the tool

Butterphinger@lemmy.zip on 15 Jan 2026 05:29 next collapse

grabs external

Mubelotix@jlai.lu on 15 Jan 2026 07:43 next collapse

I do not consent to this

inspxtr@lemmy.world on 15 Jan 2026 08:00 next collapse

Very cool! Do you know how your project compares with arctic shift? For those more interested in research with reddit data, is there a benefit of one vs another?

ICastFist@programming.dev on 15 Jan 2026 12:04 next collapse

What’s the size difference when you remove the porn stuff from the torrent?

spicehoarder@lemmy.zip on 15 Jan 2026 13:30 collapse

Willing to bet a 90% size reduction

HugeNerd@lemmy.ca on 15 Jan 2026 12:22 next collapse

Boring. I want the Kuro5hin site. That was actually good and hysterically funny at the best times. ASCII reenactment players of Michael Crawford anyone?

Seefoo@lemmy.world on 20 Jan 2026 21:05 collapse

Does this decompress the files preemptively and leave them? Or is it only decompressing as a post/subreddit is accessed? Basically i am wondering what kind of storage footprint would be required to search through this