Zpool scrub taking days? And HDD issues... Am I cooked?
from Imaginary_Stand4909@lemmy.blahaj.zone to selfhosted@lemmy.world on 06 May 2026 06:03
https://lemmy.blahaj.zone/post/42307518

So I was trying to download a torrent (while seeding like 5 others) when I noticed my rates just kept gradually falling to 0 B/s upload/download until spiking back up to 1-2 MB/s before falling again. I checked the Proxmox SMART results for my drives and they showed one disk as degraded. When I try to view the overall “disks” tab in Proxmox it just times out and shows an error [communication failure (0)]

So I try to do a zpool scrub tank_name, which started Monday May 4 22:02:21 2026…

While scrubbing, the checksum errors on the online, repairing disk (wwn-0x5000c5004d033fc1) just keep climbing… I took the degraded disk offline. Here’s the current output of zpool status tank_name:

root@nova:~# zpool status Orico2tera4
  pool: Orico2tera4
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon May  4 22:02:21 2026
        3.53G / 378G scanned at 36.9K/s, 3.47G / 378G issued at 36.3K/s
        9.61M repaired, 0.92% done, no estimated completion time
config:

        NAME                                              STATE     READ WRITE CKSUM
        Orico2tera4                                       DEGRADED     0     0     0
          mirror-0                                        ONLINE       0     0     0
            ata-ST2000NM0011_Z1P2D6SC                     ONLINE       0    13     1
            usb-External_USB3.0_DISK01_20170331000C3-0:1  ONLINE       0     0     3  (repairing)
          mirror-1                                        DEGRADED     0     1     0
            wwn-0x5000c500357c0b91                        OFFLINE      0     0    21
            wwn-0x5000c5004d033fc1                        ONLINE       0     1 2.00K  (repairing)

errors: 49 data errors, use '-v' for a list
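
To pull just the devices with non-zero counters out of that output, a small awk filter is enough. This is a minimal sketch rather than anything from the original troubleshooting: the function name failing_vdevs is made up, and it assumes the column layout shown above (NAME STATE READ WRITE CKSUM).

```sh
# Read `zpool status` text on stdin and print every row of the config
# table whose READ, WRITE, or CKSUM counter is not "0".
# Hypothetical helper; assumes the NAME STATE READ WRITE CKSUM layout.
failing_vdevs() {
  awk '
    # The header row marks the start of the config table.
    $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
    # The first blank line after the header ends the table.
    in_config && NF == 0 { in_config = 0 }
    # Compare as strings so abbreviated counts such as "2.00K" also match.
    in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
      print $1, "read=" $3, "write=" $4, "cksum=" $5
    }
  '
}

# Usage (needs a real pool):
#   zpool status Orico2tera4 | failing_vdevs
```

Errors showing up on every disk at once, as they do here, point at something the disks share (cable, enclosure, controller, heat) at least as much as at any single drive.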

I haven’t used these disks for super long; it’s only been about 5 months of my homelab actually being used, and I wasn’t doing constant torrenting until February. The disks are refurbished, 2TB each, and they’re stored in a USB-connected drive bay. My usage is pretty low, just 432.80 GB of 4TB (11.13%).

I’ve looked at my snapshots with zfs list -t snapshot. I’m not sure when I should try to restore from a snap, and I’ve never done it before. I’ll make sure to take backups more seriously from now on, don’t be me…
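
For anyone else who has never restored from a snapshot: ZFS exposes each snapshot read-only under the hidden .zfs/snapshot directory of the dataset's mountpoint, so single files can be copied back without rolling the whole dataset back. A minimal sketch, assuming that layout; the function names, the dataset name Orico2tera4/data, and the paths in the usage comments are all made up for illustration.

```sh
# Build the path where a snapshot exposes an older copy of a file.
# $1 = dataset mountpoint, $2 = snapshot name (the part after "@"),
# $3 = file path relative to the mountpoint. Hypothetical helper.
snapshot_copy_path() {
  printf '%s/.zfs/snapshot/%s/%s\n' "${1%/}" "$2" "${3#/}"
}

# Copy one file back from a snapshot, keeping the current copy
# next to it as <name>.damaged instead of overwriting it.
restore_from_snapshot() (
  src=$(snapshot_copy_path "$1" "$2" "$3")
  dst="${1%/}/${3#/}"
  if [ ! -f "$src" ]; then
    echo "not in snapshot: $src" >&2
    exit 1
  fi
  if [ -e "$dst" ]; then
    mv -- "$dst" "$dst.damaged" || exit 1
  fi
  cp -p -- "$src" "$dst"
)

# Usage (names are examples only):
#   zfs list -t snapshot -r -o name,creation -s creation Orico2tera4/data
#   restore_from_snapshot /Orico2tera4/data some-snapshot-name path/to/file
```

One caveat: a snapshot shares its blocks with the live file for as long as the file has not changed, so a block that is corrupt on disk can be corrupt in the snapshot too. In that case the copy should fail with an I/O error rather than quietly returning bad data.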

Update:

Turned off the machine and bay, realized it had shit ventilation and that the drives were pretty hot, let it cool and gave everything a quick dust down. Nothing seemed to be bad or visibly fucked up?

After letting it chill out for about 2-3 hours I put the drive bay in a better vented spot and did a scrub, then resilvered the drive, then did another scrub. About to do some SMART tests.
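
For the SMART side, a handful of attributes carry most of the signal. The sketch below filters the attribute table printed by smartctl -A down to those; the function name smart_red_flags is made up, and it assumes the usual ten-column ATA attribute table with the raw value in the last column.

```sh
# Read `smartctl -A` output on stdin and print the raw values of a few
# attributes worth checking first. Hypothetical helper; assumes the
# standard ATA attribute table where column 10 is RAW_VALUE.
smart_red_flags() {
  awk '
    $2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" ||
    $2 == "Offline_Uncorrectable" || $2 == "UDMA_CRC_Error_Count" {
      print $2 "=" $10
    }
  '
}

# Usage (replace /dev/sdX with the real device):
#   smartctl -t short /dev/sdX          # start a short self-test
#   smartctl -A /dev/sdX | smart_red_flags
#   smartctl -d sat -A /dev/sdX         # some USB bridges need -d sat
```

Roughly speaking, reallocated, pending, or uncorrectable sectors above zero point at the disk itself, while a growing UDMA_CRC_Error_Count points at the link between disk and host (cable, bay, or controller).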

Here’s zpool status -v:

zpool status -v Orico2tera4
  pool: Orico2tera4
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:56:51 with 0 errors on Wed May  6 23:37:43 2026
config:

        NAME                                              STATE     READ WRITE CKSUM
        Orico2tera4                                       ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            ata-ST2000NM0011_Z1P2D6SC                     ONLINE       0     0   199
            usb-External_USB3.0_DISK01_20170331000C3-0:1  ONLINE       0     0   125
          mirror-1                                        ONLINE       0     0     0
            wwn-0x5000c500357c0b91                        ONLINE       0     0   100
            wwn-0x5000c5004d033fc1                        ONLINE       0     0   462

errors: No known data errors

And then again after a clear:

zpool status -v Orico2tera4 
  pool: Orico2tera4
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:57:18 with 0 errors on Thu May  7 01:28:30 2026
config:

        NAME                                              STATE     READ WRITE CKSUM
        Orico2tera4                                       ONLINE       0     0     0
          mirror-0                                        ONLINE       0     0     0
            ata-ST2000NM0011_Z1P2D6SC                     ONLINE       0     0     0
            usb-External_USB3.0_DISK01_20170331000C3-0:1  ONLINE       0     0     0
          mirror-1                                        ONLINE       0     0     0
            wwn-0x5000c500357c0b91                        ONLINE       0     0     0
            wwn-0x5000c5004d033fc1                        ONLINE       0     0     0

errors: No known data errors
root@nova:~#

What have we learned?

Do biweekly scrubs
Put your drives in a not shit location
Do trims like, once a month maybe
Make way more frequent snapshots
Backup your shit!!! NOW!!! To literally anywhere else but just do it!!!
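
The scrub part of that list is easy to automate. A minimal sketch, assuming a small wrapper script called from cron; the function name scrub_if_idle, the ZPOOL override variable, and the script path in the cron line are made up for illustration.

```sh
ZPOOL=${ZPOOL:-zpool}   # command to run; overridable for dry runs

# Start a scrub unless the pool is already scrubbing or resilvering.
scrub_if_idle() (
  pool=$1
  if "$ZPOOL" status "$pool" | grep -Eq 'scan: (scrub|resilver) in progress'; then
    echo "skip: $pool is busy"
    exit 0
  fi
  "$ZPOOL" scrub "$pool" && echo "started scrub on $pool"
)

# Example /etc/cron.d entry, 03:00 on the 1st and 15th of each month
# (roughly every two weeks; the script path is hypothetical):
#   0 3 1,15 * * root /usr/local/sbin/scrub-if-idle.sh Orico2tera4
```

It is worth checking whether the distribution already ships a periodic scrub job before adding another one.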

#selfhosted


androidul@lemmy.world on 06 May 2026 06:21

yikes!

How often were you running scrub & trim?

Imaginary_Stand4909@lemmy.blahaj.zone on 06 May 2026 16:52

I did do a scrub in the past when I had an issue with accidentally pulling the bay power cord a few months ago, but I genuinely forgot to run these weekly. I have no idea what trimming is…

mlfh@lm.mlfh.org on 06 May 2026 07:19

You have enough failures on each disk to make me suspect an issue with the usb-connected drive bay. I ran into similar issues with a cheap pci-e sata adapter, where little hiccups and latency in the communication layer would cause zfs to take disks offline randomly. Read, write, and checksum errors would slowly accumulate across all of the disks. Switched that machine to a proper enterprise hba, the issues vanished, and the disks are all healthy 3-4 years later.

Dran_Arcana@lemmy.world on 06 May 2026 11:40

+1 to this observation. I run zfs arrays at both home and work and it’s way more likely that your controller is flaking than you have that many simultaneous drive failures.

The unfortunate reality though is that you can’t trust the current copy of this data, even the snapshots, unless the restored copy passes a scrub.

xep@discuss.online on 06 May 2026 07:52

As another poster mentioned, I’d suspect the drive bay. Those things aren’t known for being reliable.

desentizised@lemmy.zip on 06 May 2026 10:47

I’ve tried recycling a USB-based 5-bay enclosure (which I previously used in hardware RAID mode) for my Unraid-based backup, and even in that lower-criticality use case it was an absolute showstopper. It’s kind of a shame that USB seemingly can’t be used this way. It would make redundant data storage so much more affordable.

suzune@ani.social on 06 May 2026 11:28

A scrub can run for days and puts additional strain on the hardware.

Check the physical attachment and obvious hardware failures first.

In case the hardware seems fine, try to zfs send the most important data to a safe place.
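
A minimal sketch of what that could look like, assuming the target is a plain file on a different disk. The function name send_to_file, the ZFS override variable, the rescue- snapshot prefix, and the dataset and directory names in the usage comments are all made up for illustration.

```sh
ZFS=${ZFS:-zfs}   # command to run; overridable for dry runs

# Snapshot a dataset and write the full send stream to a file in $2.
# Prints the path of the stream file on success.
send_to_file() (
  dataset=$1
  outdir=$2
  snap="rescue-$(date +%Y%m%d-%H%M%S)"
  # Slashes in the dataset name would create subdirectories, so swap them.
  out="$outdir/$(printf '%s' "$dataset" | tr / _)@$snap.zfs"

  "$ZFS" snapshot "$dataset@$snap" || exit 1
  if ! "$ZFS" send "$dataset@$snap" > "$out"; then
    rm -f -- "$out"   # do not leave a truncated stream behind
    echo "send failed for $dataset@$snap" >&2
    exit 1
  fi
  echo "$out"
)

# Usage (names are examples only):
#   send_to_file Orico2tera4/important /mnt/other-disk
# Restoring later on a healthy pool:
#   zfs receive newpool/important < /mnt/other-disk/<stream file>
```

If the send hits a block that cannot be read, it should stop with an error instead of producing a stream with bad data in it, which is why the sketch deletes the partial file in that case.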