Introduction

In the past I’ve written a post about the instability of the ASRock C2750D4I. Guess what: the problems with this motherboard aren’t gone.
I suspect one of the storage controllers on the motherboard. When the server is under heavy load, at least two disks disconnect, bringing down the software RAID.

Troubleshooting

Let’s start by finding out the disk layout of my RAID5.

cat /proc/mdstat
md5 : inactive sde1[3](S) sdh1[1](S) sdf1[0](S) sdd1[4](S)
      11720536064 blocks super 1.2

This shows that my RAID is spread across sde1, sdh1, sdf1 and sdd1. The last error logs from dmesg showed me that sdh1 and sdf1 both went down before the RAID crashed.
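If you don’t want to pick the member partitions out of that line by eye, a small helper can do it. This is only a sketch: md_members is a name I made up for illustration, and the sample line is simply copied from the output above.

```shell
# Sample line copied from the /proc/mdstat output above.
mdstat_line='md5 : inactive sde1[3](S) sdh1[1](S) sdf1[0](S) sdd1[4](S)'

# md_members is a hypothetical helper, not part of mdadm.
# It prints the member partitions of one mdstat array line, one per line.
md_members() {
    # Each member looks like "sde1[3](S)": match "name[index]",
    # then cut everything from the "[" onwards.
    printf '%s\n' "$1" | grep -oE '[a-z0-9]+\[[0-9]+\]' | sed 's/\[.*//'
}

md_members "$mdstat_line"
```

On the live system the same helper could be fed with `md_members "$(grep '^md5' /proc/mdstat)"`, assuming the array is still called md5.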

So let’s try to find some more information about these two crashed drives.

sudo lshw -c disk

The result will show you a little bit more information about each drive.

  *-disk
       description: ATA Disk
       product: SAMSUNG HD103SI
       physical id: 0.0.0
       bus info: scsi@2:0.0.0
       logical name: /dev/sda
       version: 1AG0
       serial: S20XJDWS700323
       size: 931GiB (1TB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 sectorsize=512 signature=0007f8a5
  *-disk
       description: ATA Disk
       product: SAMSUNG HD103SI
       physical id: 0.0.0
       bus info: scsi@3:0.0.0
       logical name: /dev/sdb
       version: 1AG0
       serial: S20XJDWZ118279
       size: 931GiB (1TB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 sectorsize=512 signature=00071895
  *-disk
       description: ATA Disk
       product: KINGSTON SVP200S
       physical id: 0.0.0
       bus info: scsi@5:0.0.0
       logical name: /dev/sdc
       version: 502A
       serial: 50026B7331033DD9
       size: 55GiB (60GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 sectorsize=512 signature=91a29a16
  *-disk
       description: ATA Disk
       product: ST3000DM001-1CH1
       vendor: Seagate
       physical id: 0.0.0
       bus info: scsi@6:0.0.0
       logical name: /dev/sdd
       version: CC24
       serial: Z1F27VHM
       size: 2794GiB (3TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=0556e5e5-1e62-42f4-a89c-29813a6f4a18 sectorsize=4096
  *-disk
       description: ATA Disk
       product: Hitachi HDS5C303
       vendor: Hitachi
       physical id: 0.0.0
       bus info: scsi@7:0.0.0
       logical name: /dev/sde
       version: MZ6O
       serial: MCE9215Q0B5MLW
       size: 2794GiB (3TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=ec9054e2-94c3-4d74-8fea-2d34ce0b92ac sectorsize=4096
  *-disk
       description: ATA Disk
       product: Hitachi HDS5C303
       vendor: Hitachi
       physical id: 0.0.0
       bus info: scsi@8:0.0.0
       logical name: /dev/sdf
       version: MZ6O
       serial: MCE9215Q0BHTDV
       size: 2794GiB (3TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=2f6f5a9b-441e-467d-861c-852e2bdefb5e sectorsize=4096
  *-disk
       description: ATA Disk
       product: WDC WD40EFRX-68W
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@9:0.0.0
       logical name: /dev/sdg
       version: 80.0
       serial: WD-WCC4E1653628
       size: 3726GiB (4TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=4ac4a5a9-ccd1-42c5-907a-9272c076a15c sectorsize=4096
  *-disk
       description: ATA Disk
       product: TOSHIBA DT01ACA3
       vendor: Toshiba
       physical id: 0.0.0
       bus info: scsi@10:0.0.0
       logical name: /dev/sdh
       version: MX6O
       serial: 63NZKNRKS
       size: 2794GiB (3TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=24069398-46d0-4b01-9e8e-2530cb9f1cf8 sectorsize=4096

The logical name field shows that my Toshiba drive (sdh) and one of my Hitachi drives (sdf) were affected by the last drive/SATA error on the board. This information can be used to physically trace the SATA cables to the correct drives.
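The lshw output is long, so it can help to condense it to one line per drive before crawling around in the case. A rough sketch, assuming the field layout shown above; summarize_disks is a hypothetical helper name and the sample data is two of the records from my output:

```shell
# Two records copied from the "lshw -c disk" output above.
lshw_sample='  *-disk
       description: ATA Disk
       product: Hitachi HDS5C303
       vendor: Hitachi
       logical name: /dev/sdf
       serial: MCE9215Q0BHTDV
  *-disk
       description: ATA Disk
       product: TOSHIBA DT01ACA3
       vendor: Toshiba
       logical name: /dev/sdh
       serial: 63NZKNRKS'

# summarize_disks is a hypothetical helper: it reads lshw text on stdin and
# prints "device<TAB>product<TAB>serial" for every *-disk record.
summarize_disks() {
    awk '
        function emit() {
            if (name != "") print name "\t" product "\t" serial
            name = product = serial = ""
        }
        /^ *\*-disk/        { emit(); next }            # a new record starts
        /^ *product: /      { sub(/^ *product: /, ""); product = $0 }
        /^ *logical name: / { name = $NF }
        /^ *serial: /       { serial = $NF }
        END                 { emit() }                  # flush the last record
    '
}

printf '%s\n' "$lshw_sample" | summarize_disks
```

On the real machine you would pipe the command itself into it: `sudo lshw -c disk | summarize_disks`. The serial number is usually printed on the drive label, which makes it the easiest field to match.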
So now that we have the disk names, we need to find out which controller is the culprit for throwing these errors.

First let’s identify the bus addresses of all SATA controllers available on the motherboard.

sudo lshw -c storage

The bus info field shows the PCI address of each controller.

  *-storage
       description: SATA controller
       product: 88SE9172 SATA 6Gb/s Controller
       vendor: Marvell Technology Group Ltd.
       physical id: 0
       bus info: pci@0000:04:00.0
       version: 11
       width: 32 bits
       clock: 33MHz
       capabilities: storage pm msi pciexpress ahci_1.0 bus_master cap_list rom
       configuration: driver=ahci latency=0
       resources: irq:55 ioport:c040(size=8) ioport:c030(size=4) ioport:c020(size=8) ioport:c010(size=4) ioport:c000(size=16) memory:df410000-df4101ff memory:df400000-df40ffff
  *-storage
       description: SATA controller
       product: 88SE9230 PCIe SATA 6Gb/s Controller
       vendor: Marvell Technology Group Ltd.
       physical id: 0
       bus info: pci@0000:09:00.0
       version: 11
       width: 32 bits
       clock: 33MHz
       capabilities: storage pm msi pciexpress ahci_1.0 bus_master cap_list rom
       configuration: driver=ahci latency=0
       resources: irq:56 ioport:d050(size=8) ioport:d040(size=4) ioport:d030(size=8) ioport:d020(size=4) ioport:d000(size=32) memory:df610000-df6107ff memory:df600000-df60ffff
  *-storage:0
       description: SATA controller
       product: Atom processor C2000 AHCI SATA2 Controller
       vendor: Intel Corporation
       physical id: 17
       bus info: pci@0000:00:17.0
       version: 02
       width: 32 bits
       clock: 66MHz
       capabilities: storage msi pm ahci_1.0 bus_master cap_list
       configuration: driver=ahci latency=0
       resources: irq:48 ioport:e0d0(size=8) ioport:e0c0(size=4) ioport:e0b0(size=8) ioport:e0a0(size=4) ioport:e040(size=32) memory:df762000-df7627ff
  *-storage:1
       description: SATA controller
       product: Atom processor C2000 AHCI SATA3 Controller
       vendor: Intel Corporation
       physical id: 18
       bus info: pci@0000:00:18.0
       version: 02
       width: 32 bits
       clock: 66MHz
       capabilities: storage msi pm ahci_1.0 bus_master cap_list
       configuration: driver=ahci latency=0
       resources: irq:54 ioport:e090(size=8) ioport:e080(size=4) ioport:e070(size=8) ioport:e060(size=4) ioport:e020(size=32) memory:df761000-df7617ff

Now, for each drive, we can look up the address of the SATA controller it is attached to. It will match one of the pci values found above.

sudo udevadm info -q all -n /dev/sde | grep DEVPATH
E: DEVPATH=/devices/pci0000:00/0000:00:03.0/0000:02:00.0/0000:03:01.0/0000:04:00.0/ata8/host7/target7:0:0/7:0:0:0/block/sde
 
sudo udevadm info -q all -n /dev/sdd | grep DEVPATH
E: DEVPATH=/devices/pci0000:00/0000:00:03.0/0000:02:00.0/0000:03:01.0/0000:04:00.0/ata7/host6/target6:0:0/6:0:0:0/block/sdd
 
sudo udevadm info -q all -n /dev/sdf | grep DEVPATH
E: DEVPATH=/devices/pci0000:00/0000:00:04.0/0000:09:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdf
 
sudo udevadm info -q all -n /dev/sdh | grep DEVPATH
E: DEVPATH=/devices/pci0000:00/0000:00:04.0/0000:09:00.0/ata11/host10/target10:0:0/10:0:0:0/block/sdh

The last PCI address before /ata is the controller the drive is connected to. So this means that sde and sdd are connected to the controller at 0000:04:00.0, which is the Marvell 88SE9172 SATA 6Gb/s Controller.
The drives sdf and sdh are connected to the ATA device at 0000:09:00.0, which translates to the Marvell 88SE9230 PCIe SATA 6Gb/s Controller.

Which is the asshole throwing me errors.
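Reading the controller address out of each DEVPATH by hand gets old quickly, so here is a sketch that does it for all four RAID members. The helper name controller_of is made up for illustration, and the drive list is hard-coded to my layout. The loop skips drives that do not exist on the machine it runs on.

```shell
# controller_of is a hypothetical helper: given a DEVPATH, it prints the last
# PCI address (domain:bus:device.function) that sits right before /ataN.
controller_of() {
    printf '%s\n' "$1" |
        sed -nE 's|.*/([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])/ata[0-9]+/.*|\1|p'
}

# DEVPATH of sdh, copied from the udevadm output above.
controller_of '/devices/pci0000:00/0000:00:04.0/0000:09:00.0/ata11/host10/target10:0:0/10:0:0:0/block/sdh'
# prints 0000:09:00.0

# Same lookup for every RAID member that is present on this machine.
for d in sdd sde sdf sdh; do
    [ -b "/dev/$d" ] || continue
    path=$(udevadm info -q all -n "/dev/$d" | sed -n 's/^E: DEVPATH=//p')
    printf '%s -> %s\n' "$d" "$(controller_of "$path")"
done
```

The address it prints can be compared with the bus info lines from lshw -c storage, or passed to `lspci -s` to get the controller name directly.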

Now, with this information, we can unplug the disks from the controller throwing the errors. The location of the Marvell 88SE9230 ports is explained in the manual at http://www.asrockrack.com/general/productdetail.asp?Model=C2750D4I#Manual. You can verify their physical location on the board and match them with the disk names found previously.

So I rerouted all disks (I prefer 3Gbps SATA over a dysfunctional 6Gbps any day) and since then the NAS has been stable.