Experiences with zfs deduplication?
from needanke@feddit.org to selfhosted@lemmy.world on 09 Jan 22:46
https://feddit.org/post/6663267

I recently moved my files to a new zfs-pool and used that chance to properly configure my datasets.

This led me to discovering zfs-deduplication.

As most of my storage is used by my Jellyfin library (~7-8 TB), which is mostly uncompressed Blu-ray rips, I thought I might be able to save some storage using deduplication in addition to compression.

Has anyone here used that for similar files before? What was your experience with it?

I am not too worried about performance. The dataset in question is rarely changed, basically only when I add more media every couple of months. I also overshot my CPU target when originally configuring my server, so there is a lot of headroom there. I have 32 GB of RAM, which is not really fully utilized either (but I also would not mind upgrading to 64 too much).

My main concern is that I am unsure it is useful. I suspect that, just because of the amount of data and similarity in type, there would statistically be a lot of block-level duplication, but I could not find any real-world data or experiences on that.

#selfhosted

pimeys@lemmy.nauk.io on 09 Jan 22:54

You should maybe read about the use cases for deduplication before using it. Here’s one recent article:

despairlabs.com/…/2024-10-27-openzfs-dedup-is-goo…

If you mostly store legit Blu-ray rips, the answer is probably no, you should not use zfs deduplication.

friend_of_satan@lemmy.world on 09 Jan 23:23

I was also going to link this. I started using zfs 10-ish years ago and used dedup when it came out, and it was really not worth it except for archiving a bunch of stuff I knew had gigs of duplicate data. Performance was so poor.

undefined@lemmy.hogru.ch on 10 Jan 06:55

I’m in almost the exact same situation as OP, 8 TB of raw Blu-ray dumps except I’m on XFS. I ran duperemove and freed ~200 GB.

needanke@feddit.org on 11 Jan 20:26

I think I was a bit unclear on that. I meant uncompressed rips as in I ripped the relevant media to uncompressed MKVs; I didn’t save the entire disc dump. I mostly have such rips, but also a bit of media from other sources™ which is already compressed. So I suspect my results would be even worse.

undefined@lemmy.hogru.ch on 11 Jan 21:17

I agree. Most of my duplicates came from the raw disc files. I too dump some content to MKV (mainly TV episodes) but those files likely have much less duplication, though I do recall some of the duplicates coming from The Office in MKV.

(I do wonder if those The Office duplicates were something like the opening title, or scenes from the episode showing clips from previous episodes because it seems highly unlikely that the raw video streams were similar.)

AbouBenAdhem@lemmy.world on 09 Jan 23:01

I haven’t tried it because I’ve read a lot of negative discussions of it, and because (by my understanding) the only reasonable use case would be a large number of users who are each likely to have copies of the same files but don’t want to expose their files to each other (so you can’t just manually de-dupe).

[deleted] on 09 Jan 23:09

.

nottelling@lemmy.world on 09 Jan 23:05

ZFS dedup is memory constrained, and the memory use scales with the block hashes.
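To put a rough number on that scaling, here is a minimal sketch. It assumes the commonly quoted rule of thumb of about 320 bytes of RAM per unique block in the dedup table (DDT); that figure is not from this thread and real usage depends on the OpenZFS version and pool layout. The 8 TiB data size and the two recordsize values are also just illustrative.

```python
# Rough estimate of in-memory dedup table (DDT) size.
# ASSUMPTION: ~320 bytes per unique block is a commonly quoted rule of thumb,
# not a measured value; treat the result as an order-of-magnitude guess.
BYTES_PER_DDT_ENTRY = 320

TIB = 1024 ** 4
GIB = 1024 ** 3


def estimate_ddt_bytes(data_bytes: int, recordsize: int = 128 * 1024) -> int:
    """Approximate DDT size if every block in the dataset is unique."""
    blocks = -(-data_bytes // recordsize)  # ceiling division
    return blocks * BYTES_PER_DDT_ENTRY


for recordsize in (128 * 1024, 1024 * 1024):
    ddt = estimate_ddt_bytes(8 * TIB, recordsize)
    print(f"recordsize={recordsize // 1024}K -> ~{ddt / GIB:.1f} GiB of DDT")
# recordsize=128K -> ~20.0 GiB of DDT
# recordsize=1024K -> ~2.5 GiB of DDT
```

The point is only that the table grows with the number of blocks, not with how much duplication you actually have, so a library with almost no duplicates still pays close to the full memory cost.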

If performance isn’t a concern, you’re better off compressing your media. You’ll get similar storage efficiency with less crash consistency risk.

IsoKiero@sopuli.xyz on 09 Jan 23:40

ZFS in general is pretty memory hungry. I set up my Proxmox server with ZFS pools a while ago and now I kind of regret it. ZFS in itself is very nice and has a ton of useful features, but I just don’t have the hardware or the usage pattern to benefit from it that much on my server. I’d rather have that thing running on LVM and/or software RAID to have more usable memory for my VMs. And that’s one of the projects I’ve been planning for the server: replace the ZFS pools with something which suits my usage patterns better. But that’s a whole other story and requires some spare money and some spare time, neither of which I really have at hand right now.

emptiestplace@lemmy.ml on 11 Jan 08:47

Just adjust it if you actually need the RAM and it isn’t relinquishing quickly enough.

Put options zfs zfs_arc_max=17179869184 in /etc/modprobe.d/zfs.conf, run update-initramfs -u, and reboot - this will limit ZFS ARC to 16GiB.

Run arc_summary to see what it’s using now.
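The value for zfs_arc_max is just the cap expressed in bytes. A small sketch of the arithmetic, in case you want a different limit; the helper name arc_max_line is made up for this example.

```python
# Convert a desired ARC cap in GiB into the byte value zfs_arc_max expects.
GIB = 1024 ** 3


def arc_max_line(gib: int) -> str:
    """Build the modprobe options line for a given ARC cap in GiB (hypothetical helper)."""
    return f"options zfs zfs_arc_max={gib * GIB}"


print(arc_max_line(16))  # options zfs zfs_arc_max=17179869184
print(arc_max_line(8))   # options zfs zfs_arc_max=8589934592
```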

As for using a simple fs on LVM, do you not care about data integrity?

IsoKiero@sopuli.xyz on 11 Jan 16:46

this will limit ZFS ARC to 16GiB.

But if I have 32GB to start with, that’s still quite a lot and, as mentioned, my current usage pattern doesn’t really benefit from zfs over any other common filesystem.

As for using a simple fs on LVM, do you not care about data integrity?

Where do you get that from? LVM has options to create RAID volumes and, again as mentioned, I can mix and match those with software RAID however I like. Also, on a single host, no matter how sophisticated the filesystems and RAID setups, none of that really matters when talking about keeping data safe; that’s what backups are for, and it’s a whole other discussion.

Decipher0771@lemmy.ca on 09 Jan 23:27

I think the universal consensus is that outside of a very specific use case (multiple VDI desktops that share the same image), ZFS dedupe is completely useless at best and will destroy your dataset at worst by causing it to be unmountable on any system that has less RAM than needed. In every other use case, the savings are not worth the trouble.

Even in the VDI use case, unless you have MANY copies of said disk images (like 5+ copies of each), it’s still not worth the increase in system resources needed to use ZFS dedupe.

It’s one of those “oooh shiny” nice features that everyone wants to use but will regret nearly every time.

greyfox@lemmy.world on 10 Jan 01:44

Like most have said, it is best to stay away from ZFS deduplication. Especially if your data set is media, the chances of an entire ZFS block being the same as any other are small unless you somehow have multiple copies of the same content.

Imagine two mp3s with the exact same music content but with slightly different artist metadata. Make one a single bit longer or shorter at the beginning of the file and, even if the file spans multiple blocks, ZFS won’t be able to deduplicate a single byte. A single bit offsetting the rest of the file just a little is enough to throw off the block checksums across every block in the file.
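The offset problem is easy to reproduce. Below is a minimal sketch that hashes fixed-size, offset-aligned blocks the way block-level dedup compares them. It is an illustration, not ZFS code: the 128K block size matches the default recordsize, the random payload stands in for audio or video data, and the shift is one byte rather than one bit because files grow in whole bytes.

```python
import hashlib
import random

BLOCK = 128 * 1024  # illustrative block size (ZFS default recordsize is 128K)


def block_hashes(data: bytes, block: int = BLOCK) -> list[str]:
    """Hash fixed-size, offset-aligned blocks of data."""
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]


def shared_blocks(a: bytes, b: bytes) -> int:
    """Count blocks of b whose hash also appears among the blocks of a."""
    seen = set(block_hashes(a))
    return sum(1 for h in block_hashes(b) if h in seen)


payload = random.Random(0).randbytes(8 * BLOCK)  # stand-in for media content
identical_copy = payload
shifted_copy = b"\x00" + payload  # one extra byte of "metadata" at the front

print(shared_blocks(payload, identical_copy))  # 8
print(shared_blocks(payload, shifted_copy))    # 0
```

An exact copy shares every block, while the copy with one extra leading byte shares none, even though all of the original content is still there.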

To contrast with ZFS, enterprise backup/NAS appliances with deduplication usually do a lot more than block level checks. They usually check for data with sliding window sizes/offsets to find more duplicate data.
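One way such products can survive shifted data is content-defined chunking, where chunk boundaries are picked from the bytes themselves instead of from fixed offsets. The sketch below uses a simple Gear-style rolling hash; this is my own illustration of the general idea, not how any particular appliance works, and the table seed, mask width, and payload are arbitrary choices.

```python
import hashlib
import random

_rng = random.Random(1)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # random value per byte
MASK = (1 << 13) - 1  # cut when the low 13 bits are zero (~8 KiB average chunk)
U64 = 0xFFFFFFFFFFFFFFFF


def chunks(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen by a Gear-style rolling hash."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & U64
        if h & MASK == 0:  # boundary depends only on the most recent bytes
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out


def shared_chunks(a: bytes, b: bytes) -> int:
    """Count chunks of b that also occur as chunks of a."""
    seen = {hashlib.sha256(c).digest() for c in chunks(a)}
    return sum(1 for c in chunks(b) if hashlib.sha256(c).digest() in seen)


payload = random.Random(0).randbytes(512 * 1024)
shifted = b"\x00" + payload  # same one-byte shift that defeats fixed blocks
total = len(chunks(shifted))
shared = shared_chunks(payload, shifted)
print(f"{shared} of {total} chunks shared after a 1-byte shift")
```

Because each boundary depends only on nearby bytes, the chunking resynchronizes shortly after the inserted byte and nearly all chunks match again. The price is more CPU work and variable-size chunks, which is a different design from fixed-record block dedup.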

There are still some use cases where ZFS dedup can help, like if you were doing multiple full backups of VMs. A VM image has a fixed size, so the offset issue above isn’t a problem. But beware that enabling deduplication for even a single ZFS filesystem affects the entire pool, even ZFS filesystems that have deduplication disabled. The deduplication table is global for the pool, and once you have turned it on you really can’t get rid of it. If you get into a situation where you don’t have enough memory to keep the deduplication table in memory, ZFS will grind to a halt, and the only way to completely remove deduplication is to copy all of your data to a new ZFS pool.

If you think this feature would still be useful for you, you might want to wait for 2.3 to release (which isn’t too far off) for the new fast dedup feature, which fixes or at least prevents a lot of the major issues with ZFS dedup.

More info on the fast dedup feature here github.com/openzfs/zfs/discussions/15896

BCsven@lemmy.ca on 10 Jan 02:22

Something like fslint might be what you want. It scans folders and lists duplicates, and you set how you want to deal with them. It’s more manual, but I think it is what you are actually trying to achieve.
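The same file-level idea can be sketched in a few lines. This is not how fslint itself is implemented, just a minimal illustration: it groups files by size first, then hashes only the candidates. The function name find_duplicates is made up for this example.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root: str) -> list[list[str]]:
    """Group regular files under root that have identical content (by SHA-256)."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB pieces so large media files do not fill RAM.
                for piece in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(piece)
            by_hash[digest.hexdigest()].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling find_duplicates on your media directory returns lists of paths with identical content, and what to do with them (delete, hardlink, ignore) stays a manual decision. This only catches whole-file duplicates, not partially matching files.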

Moonrise2473@feddit.it on 10 Jan 08:14

Uncompressed Blu-ray rips are almost the same size when compressed with lossless compression. The binary content of H.264 files is almost random bits, so deduplication is almost a waste of CPU time. Maybe you can save space from the useless repeated media like trailers and other ads in the Blu-ray ISOs.
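A quick way to see why already-encoded video gains nothing from lossless compression is to compare high-entropy data with repetitive data. This is a minimal sketch: seeded pseudo-random bytes stand in for an H.264 stream (an assumption, since real video is only close to random), and zlib stands in for whatever compressor the filesystem uses.

```python
import random
import zlib


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(data, 9)) / len(data)


size = 1024 * 1024
noise = random.Random(0).randbytes(size)  # stand-in for already-compressed video
line = b"the quick brown fox jumps over the lazy dog\n"
text = line * (size // len(line))         # highly repetitive data

print(f"high-entropy data: {ratio(noise):.3f}")
print(f"repetitive text:   {ratio(text):.3f}")
```

The high-entropy input comes out at roughly its original size, while the repetitive input shrinks to a small fraction. Block-level dedup faces the same problem: data that looks random almost never produces two identical blocks.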

needanke@feddit.org on 11 Jan 20:17

Thank you, I didn’t know that, but that was the type of answer I was looking for :D.

I only ripped the relevant media anyways, so no trailers to remove.

ShawiniganHandshake@sh.itjust.works on 11 Jan 15:14

I worked with dedupe products at a previous job. Media files generally deduplicate poorly.