lvm – mdadm RAID5 random read errors. Dying disk?


First the long story:
I have a RAID5 with mdadm on Debian 9. The array consists of five 4 TB hard drives: four HGST Deskstar NAS and one Toshiba N300 NAS that was added later.

In the last few days, I've noticed some read errors from this RAID. For example, I had a 10 GB RAR archive split into several parts. When I try to extract it, I get CRC errors in some parts; if I try a second time, the errors appear in different parts. The same happens with torrents when I re-check them after the download.

After a reboot, the BIOS reported that the S.M.A.R.T. status of an HGST drive on SATA port 3 is bad. smartctl showed DMA CRC errors, but still claimed that the drive was fine.

Another reboot later, I can no longer see the CRC errors in SMART. But now I get this output:

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-4-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1989
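
For completeness, a minimal sketch of how the full SMART report and a long self-test could be pulled from the suspect drive (assuming the HGST is still /dev/sdc; adjust the device name):

smartctl -a /dev/sdc            # full attribute table, error log and self-test log
smartctl -t long /dev/sdc       # start an extended (long) offline self-test
smartctl -l selftest /dev/sdc   # read the self-test result once it has finished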

Since the HGST is no longer available at a normal price, I bought another Toshiba N300 to replace it. Both are labelled as 4 TB. I tried to create a partition of exactly the same size, but that did not work: the partitioning tool claimed my number was too big (I tried both bytes and sectors). So I made the partition as big as possible. But now it looks like it is the same size anyway, which confuses me a bit.
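
As a side note, a sketch of how the old GPT could be copied to the new drive with sgdisk instead of sizing the partition by hand — assuming /dev/sdc is the old drive and /dev/sdh the new one, and that overwriting the manually created sdh1 is acceptable:

sgdisk -R=/dev/sdh /dev/sdc    # replicate sdc's partition table onto sdh (destroys sdh's current GPT)
sgdisk -G /dev/sdh             # give sdh new random disk and partition GUIDs
sgdisk -t 1:fd00 /dev/sdh      # only if needed: set partition 1's type to Linux RAID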

sdc is the old drive and sdh is the new one:

Disk /dev/sdc: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 4CAD956D-E627-42D4-B6BB-53F48DF8AABC

Device     Start        End    Sectors  Size Type
/dev/sdc1   2048 7814028976 7814026929  3.7T Linux RAID


Disk /dev/sdh: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 3A173902-47DE-4C96-8360-BE5DBED1EAD3

Device     Start        End    Sectors  Size Type
/dev/sdh1   2048 7814037134 7814035087  3.7T Linux filesystem

For now, I have added the new drive as a spare. The RAID still runs with the old drive, and I still get read errors, especially with large files.
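
For reference, a minimal sketch of how the spare could take over without first dropping redundancy — assuming the failing member is /dev/sdc1, the new partition is /dev/sdh1, and mdadm is version 3.3 or newer (required for --replace):

mdadm /dev/md0 --add /dev/sdh1                        # add the new partition as a spare (already done)
mdadm /dev/md0 --replace /dev/sdc1 --with /dev/sdh1   # rebuild sdc1's content onto sdh1 while the array stays redundant, then drop sdc1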

This is what my RAID looks like:

/dev/md/0:
        Version : 1.2
  Creation Time : Sun Dec 17 22:03:20 2017
     Raid Level : raid5
     Array Size : 15627528192 (14903.57 GiB 16002.59 GB)
  Used Dev Size : 3906882048 (3725.89 GiB 4000.65 GB)
   Raid Devices : 5
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat Jan  5 09:48:49 2019
          State : clean
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : SERVER:0  (local to host SERVER)
           UUID : 16ee60d0:f055dedf:7bd40adc:f3415deb
         Events : 25839

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       33        1      active sync   /dev/sdc1
       3       8        1        2      active sync   /dev/sda1
       4       8       17        3      active sync   /dev/sdb1
       5       8       80        4      active sync   /dev/sdf

       6       8      113        -      spare   /dev/sdh1

And the disk layout looks like this:

NAME                      MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                         8:0    0  3.7T  0 disk
└─sda1                      8:1    0  3.7T  0 part
  └─md0                     9:0    0 14.6T  0 raid5
    └─storageRaid         253:4    0 14.6T  0 crypt
      └─vg_raid-raidVolume 253:5   0 14.6T  0 lvm   /media/raidVolume
sdb                         8:16   0  3.7T  0 disk
└─sdb1                      8:17   0  3.7T  0 part
  └─md0                     9:0    0 14.6T  0 raid5
    └─storageRaid         253:4    0 14.6T  0 crypt
      └─vg_raid-raidVolume 253:5   0 14.6T  0 lvm   /media/raidVolume
sdc                         8:32   0  3.7T  0 disk
└─sdc1                      8:33   0  3.7T  0 part
  └─md0                     9:0    0 14.6T  0 raid5
    └─storageRaid         253:4    0 14.6T  0 crypt
      └─vg_raid-raidVolume 253:5   0 14.6T  0 lvm   /media/raidVolume
sdd                         8:48   0  3.7T  0 disk
└─sdd1                      8:49   0  3.7T  0 part
  └─md0                     9:0    0 14.6T  0 raid5
    └─storageRaid         253:4    0 14.6T  0 crypt
      └─vg_raid-raidVolume 253:5   0 14.6T  0 lvm   /media/raidVolume
sdf                         8:80   1  3.7T  0 disk
└─md0                       9:0    0 14.6T  0 raid5
  └─storageRaid           253:4    0 14.6T  0 crypt
    └─vg_raid-raidVolume  253:5    0 14.6T  0 lvm   /media/raidVolume
sdh                         8:112  1  3.7T  0 disk
└─sdh1                      8:113  1  3.7T  0 part
  └─md0                     9:0    0 14.6T  0 raid5
    └─storageRaid         253:4    0 14.6T  0 crypt
      └─vg_raid-raidVolume 253:5   0 14.6T  0 lvm   /media/raidVolume

I am a little confused that the spare drive (sdh) already shows up under the encrypted volume.

Questions:
By what criteria does mdadm decide that a disk has failed?
Can these random read errors be caused by a single defective drive?
Doesn't the RAID notice when a disk returns wrong data? (see the check sketch below)
Is it dangerous to manually mark a disk as faulty when the replacement disk is not exactly the same size?
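
Regarding the last two questions, a minimal sketch of a consistency check (scrub) and of manually failing the old member — assuming the array node is /dev/md0 (a.k.a. /dev/md/0) and the suspect member is /dev/sdc1:

echo check > /sys/block/md0/md/sync_action    # start a read/compare scrub of the whole array
cat /proc/mdstat                              # follow the progress of the check
cat /sys/block/md0/md/mismatch_cnt            # mismatches found by the last check (0 means parity was consistent)

mdadm /dev/md0 --fail /dev/sdc1               # manually mark the old member as faulty; md then rebuilds onto the spare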