My ZFS mirror pool got some checksum errors. I replaced the controller, thinking that was the most likely cause, but the errors won’t clear. zpool clear temporarily resets them, but they come back the next time I run a scrub. How can I clear them for good?
I have had a ZFS mirror (mirror-0) set up and running on Ubuntu 20.04.2 LTS for some time. When one of the drives died, I took advantage of the failure to replace both drives with larger ones, and added a SATA III PCI card for the new drives (the old ones had been connected to the on-board SATA II controller, as I had no more SATA III ports available). After running on the new drives and controller for a few weeks, ZFS complained about checksum errors on both new drives and put the array into a “degraded” state as a result.
Some research led me to the conclusion that, since both drives were showing the exact same number of checksum errors, the problem was much more likely to be the controller than the drives themselves. So I pulled the new controller and put the drives back on the on-board SATA II controller for now, intending to replace the controller card once I have verified that it is the cause. I then deleted the two files that zpool status -v showed as having permanent errors, issued a zpool clear data to reset the errors, and ran a scrub.
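For reference, the sequence described above can be sketched roughly as follows. This is a reconstruction, not a transcript: the pool name data comes from the status output, while the run helper, the DRY_RUN variable, the recover_attempt function, and any file paths passed to it are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Sketch of the recovery attempt described in the post.
# POOL matches the pool name in the post; the file paths passed to
# recover_attempt are placeholders supplied by the caller.
POOL="data"

# Print each command; actually run it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

recover_attempt() {
    run zpool status -v "$POOL"      # list files with permanent errors
    for f in "$@"; do
        run rm -- "$f"               # delete each file reported as damaged
    done
    run zpool clear "$POOL"          # reset the error counters
    run zpool scrub "$POOL"          # start a scrub (the command returns immediately)
    run zpool status -v "$POOL"      # check scrub progress and results
}
```

Setting DRY_RUN=1 before calling recover_attempt with the affected paths prints the commands without touching the pool.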
Unfortunately, after the scrub the errors reappeared, only now zpool status -v no longer showed a file, but just the address (inode, I believe), presumably for one of the files I had deleted earlier. I tried again, with the same result. Every time I run a scrub, it comes back with the following result:
root@watchman:~# zpool status -v
  pool: data
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 16K in 0 days 09:10:20 with 1 errors on Sat Jul 24 15:48:21 2021
config:

        NAME                                 STATE     READ WRITE CKSUM
        data                                 DEGRADED     0     0     0
          mirror-0                           DEGRADED     0     0     0
            ata-ST8000VE000-2P6101_WSD1M5NW  DEGRADED     0     0    15  too many errors
            ata-ST8000VE000-2P6101_WSD1HEJX  DEGRADED     0     0    15  too many errors

errors: Permanent errors have been detected in the following files:

        data:<0x380508>
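The observation that both mirror members report the same CKSUM count can be checked mechanically from the status output. The following is a minimal sketch; it assumes device names start with ata- (as in the output above) and the column order NAME STATE READ WRITE CKSUM. The function names cksum_counts and same_cksum are hypothetical.

```shell
#!/bin/sh
# Print "device cksum" for every leaf device found in `zpool status`
# output read from stdin. Assumes device names begin with "ata-" and
# that CKSUM is the fifth whitespace-separated field.
cksum_counts() {
    awk '$1 ~ /^ata-/ { print $1, $5 }'
}

# Succeed only when at least two devices are listed and every CKSUM
# value equals the first one.
same_cksum() {
    cksum_counts | awk '
        NR == 1      { first = $2 }
        $2 != first  { differ = 1 }
        END          { exit (NR < 2 || differ) ? 1 : 0 }'
}
```

Usage would look like: zpool status data | same_cksum && echo "identical CKSUM counts on all devices"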
From what I can tell, this is the same error that already existed, presumably caused by the bad controller, but I can’t seem to clear it. How can I restore my mirror to a fully functioning state?