Introduction
Software RAID (md) is the backbone of self-hosted storage infrastructure, providing redundancy and performance for everything from home NAS devices to enterprise database servers. But RAID arrays don’t protect themselves — they need proactive health monitoring to detect failing drives, degraded arrays, and silent data corruption before they cause data loss. This guide compares three essential Linux tools for RAID health monitoring: mdadm monitor, smartctl, and raid-check.
Tool Comparison
| Feature | mdadm Monitor | smartctl (smartmontools) | raid-check |
|---|---|---|---|
| Purpose | Array event detection | Disk health prediction | Data integrity verification |
| Stars | N/A (part of mdadm, the userspace tool for the kernel md driver) | 1,160+ | N/A (part of mdadm package) |
| Monitoring Scope | RAID array state | Individual disk SMART attributes | Array data consistency |
| Alert Mechanism | Email / syslog | smartd daemon / email | Cron / email |
| Detection Speed | Seconds (events) | Hours/days (trending) | Hours (scrub duration) |
| Predictive | No (reactive) | Yes (SMART predicts failure) | No (detects corruption) |
| Installation | apt install mdadm | apt install smartmontools | apt install mdadm (checkarray on Debian/Ubuntu); dnf install mdadm (raid-check on RHEL/Fedora) |
mdadm Monitor — Real-Time Array Event Detection
The mdadm --monitor daemon watches RAID arrays for state changes: disk failures, rebuilds, spare activations, and degraded states. It can send email alerts or execute custom scripts when problems occur.
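A minimal sketch of starting the monitor by hand, assuming the arrays can be discovered with `--scan`. The function names and the 60-second default interval are illustrative choices, not part of mdadm. Many distributions already run the monitor through a packaged service, in which case setting `MAILADDR` in mdadm.conf is usually enough.

```shell
# Start the monitor in the background for all arrays found by --scan.
# $1 = alert recipient, $2 = polling interval in seconds (default 60).
start_md_monitor() {
    mail_to=$1
    poll_seconds=${2:-60}
    mdadm --monitor --scan --daemonise --delay="$poll_seconds" --mail="$mail_to"
}

# Check every array once, send a TestMessage alert for each, then exit.
# Handy for confirming that alert delivery works end to end.
send_md_test_alert() {
    mdadm --monitor --scan --oneshot --test
}
```

For example, `start_md_monitor admin@example.com 120` (a placeholder address) would poll every two minutes.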
The monitor daemon captures critical events like Fail, FailSpare, DeviceDisappeared, and RebuildFinished. Configure it to send alerts through your notification pipeline.
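When a handler is configured with `--program` (or `PROGRAM` in mdadm.conf), mdadm calls it with the event name, the md device, and, where relevant, the component device. The sketch below shows one way such a handler might classify and format events. The function names and the severity mapping are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical alert handler for `mdadm --monitor --program=...`.
# mdadm passes: $1 = event name, $2 = md device, $3 = component device (may be absent).

# Map an event name to a severity label (the mapping is an illustrative choice).
md_event_severity() {
    case "$1" in
        Fail|FailSpare|DeviceDisappeared|DegradedArray) echo "CRITICAL" ;;
        RebuildStarted|RebuildFinished|SpareActive)     echo "WARNING" ;;
        *)                                              echo "INFO" ;;
    esac
}

# Build a one-line message suitable for logger, mail, or a webhook.
format_md_event() {
    event=$1
    array=$2
    component=${3:-none}
    printf '%s: md event %s on %s (component: %s)\n' \
        "$(md_event_severity "$event")" "$event" "$array" "$component"
}

# When installed as the handler script, forward the message to syslog:
#   format_md_event "$@" | logger -t mdadm-alert
```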
smartctl — Predictive Drive Health Monitoring
SMART (Self-Monitoring, Analysis, and Reporting Technology) provides early warning of drive failures through dozens of attributes. smartctl reads these attributes, and smartd continuously monitors them.
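A short sketch of the two most common smartctl invocations, wrapped in helper functions whose names are made up for this example. It assumes the usual health line printed by `smartctl -H` ("test result: PASSED" on ATA drives, "SMART Health Status: OK" on SCSI/SAS); smartctl generally needs root.

```shell
# Print the overall health verdict for one device (e.g. PASSED or OK).
smart_overall_health() {
    smartctl -H "$1" |
        awk -F': *' '/overall-health self-assessment|SMART Health Status/ { print $2; exit }'
}

# Start a short self-test in the background; the result shows up later
# in the self-test log (`smartctl -l selftest DEVICE`).
start_short_selftest() {
    smartctl -t short "$1"
}
```

`smartctl -a DEVICE` prints the full report, including the attribute table, when more detail is needed.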
smartd is configured for continuous monitoring through its configuration file, typically /etc/smartd.conf.
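As a starting point, the sketch below writes a single `DEVICESCAN` line that follows the pattern documented in the smartd.conf manual page. The helper name `write_smartd_conf`, the test schedule, and the recipient address are assumptions; adjust them to your environment, and note that the function overwrites the target file.

```shell
# Write a minimal smartd configuration.
# $1 = alert recipient, $2 = target file (default /etc/smartd.conf).
write_smartd_conf() {
    mail_to=$1
    conf_path=${2:-/etc/smartd.conf}
    cat > "$conf_path" <<EOF
# Monitor all detected drives (-a), enable automatic offline testing (-o on)
# and attribute autosave (-S on), run a short self-test daily at 02:00 and a
# long self-test every Saturday at 03:00, and mail warnings to $mail_to.
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m $mail_to
EOF
}
```

After writing the file, restart the smartd service (its unit name differs between distributions). Appending `-M test` to the directive makes smartd send a test email at startup, which is a quick way to verify delivery.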
Key SMART attributes that predict drive failure: Reallocated_Sector_Ct (ID 5), Current_Pending_Sector (ID 197), Offline_Uncorrectable (ID 198), and UDMA_CRC_Error_Count (ID 199). Any non-zero value in the first three warrants immediate attention.
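The check described above can be sketched as a filter over `smartctl -A` output. This assumes the classic ATA attribute table, where the attribute name is the second column and the raw value the tenth; NVMe drives report health differently and are not covered. Both function names are illustrative.

```shell
# Print the raw value of one attribute; reads `smartctl -A` output on stdin.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Print NAME=VALUE for each of the three critical counters that is non-zero.
smart_critical_nonzero() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" {
             if ($10 + 0 > 0) print $2 "=" $10
         }'
}
```

Typical usage would be `smartctl -A /dev/sda | smart_critical_nonzero`, where any output at all means the drive deserves a closer look.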
raid-check — Data Integrity Verification
raid-check performs periodic scrubs of RAID arrays, reading every block and verifying parity/mirror consistency. This catches silent data corruption that SMART cannot detect.
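The kernel exposes scrubbing through the md sysfs files `sync_action` and `mismatch_cnt`, which is the interface that scrub scripts build on. The sketch below drives it directly. The helper names and the `SYSFS_ROOT` override (useful for a dry run against a scratch directory) are assumptions made for this example.

```shell
# SYSFS_ROOT defaults to /sys; override it to dry-run against a scratch tree.

# Request a consistency check of one array, e.g. start_md_check md0 (needs root).
start_md_check() {
    echo check > "${SYSFS_ROOT:-/sys}/block/$1/md/sync_action"
}

# Current activity of the array, such as idle, check, resync, or recover.
md_sync_action() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/md/sync_action"
}

# Mismatch count recorded by the most recent check.
md_mismatch_count() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/md/mismatch_cnt"
}
```

Once `md_sync_action md0` reports idle again, a non-zero `md_mismatch_count md0` indicates that the check found inconsistent data.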
On Debian and Ubuntu, the mdadm package ships the checkarray script, which fills the same role.
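A sketch of driving checkarray by hand, assuming the script lives at /usr/share/mdadm/checkarray as in current Debian packaging. The wrapper names and the `CHECKARRAY` override variable are illustrative.

```shell
# Location used by the Debian/Ubuntu mdadm package; overridable if yours differs.
CHECKARRAY=${CHECKARRAY:-/usr/share/mdadm/checkarray}

# Start a check on every array, running at idle I/O priority.
scrub_all_arrays() { "$CHECKARRAY" --all --idle; }

# Show the check status of every array.
scrub_status() { "$CHECKARRAY" --all --status; }

# Ask any running check to stop.
scrub_cancel() { "$CHECKARRAY" --all --cancel; }
```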
Building a Comprehensive RAID Monitoring Pipeline
Using mdadm, smartctl, and raid-check individually provides good coverage, but integrating them into a single monitoring pipeline gives you complete visibility and automated response to storage issues.
Unified Alerting with a Monitoring Script
Combine all three monitoring sources into a single health check script that runs periodically and reports the overall status.
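One possible shape for such a script is sketched below. It is an assumption-laden outline, not a drop-in tool: every function name is invented here, `MDSTAT` and `SYSFS_ROOT` exist only so the logic can be dry-run against sample data, and the SMART check relies on the standard health line printed by `smartctl -H`.

```shell
#!/bin/sh
# Sketch of a combined RAID health check; all names are illustrative.
MDSTAT=${MDSTAT:-/proc/mdstat}
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Arrays whose member map (e.g. [U_]) shows a missing device.
degraded_arrays() {
    awk '/^md[^ ]* :/      { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "$MDSTAT"
}

# Arrays whose last scrub left a non-zero mismatch count.
arrays_with_mismatches() {
    for f in "$SYSFS_ROOT"/block/md*/md/mismatch_cnt; do
        [ -r "$f" ] || continue
        count=$(cat "$f")
        if [ "$count" -gt 0 ]; then
            basename "${f%/md/mismatch_cnt}"
        fi
    done
}

# Disks (passed as arguments) whose SMART overall health is not reported as passing.
unhealthy_disks() {
    for dev in "$@"; do
        if ! smartctl -H "$dev" 2>/dev/null |
             grep -Eq 'test result: PASSED|SMART Health Status: OK'; then
            echo "$dev"
        fi
    done
}

# Print one line per problem; return non-zero if any problem was found.
raid_health_report() {
    problems=0
    for a in $(degraded_arrays); do
        echo "DEGRADED array: $a"; problems=$((problems + 1))
    done
    for a in $(arrays_with_mismatches); do
        echo "MISMATCHES on array: $a"; problems=$((problems + 1))
    done
    for d in $(unhealthy_disks "$@"); do
        echo "SMART problem on disk: $d"; problems=$((problems + 1))
    done
    [ "$problems" -eq 0 ]
}

# Example (placeholder disks):
#   raid_health_report /dev/sda /dev/sdb || logger -t raid-health "problems found"
```

Returning a non-zero status on any problem keeps the script easy to wire into cron, systemd timers, or an existing check framework.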
Integrating with Prometheus and Grafana
For production environments, export RAID and SMART metrics to Prometheus for dashboard visualization and alerting.
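One lightweight route is node_exporter's textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below assumes that setup; the metric name `raid_md_degraded` and the function names are hypothetical. node_exporter also includes its own mdadm collector, so treat this as an illustration of the textfile approach rather than a required component.

```shell
# Emit one gauge per array in Prometheus text format: 1 = degraded, 0 = healthy.
# $1 = mdstat-format file (default /proc/mdstat).
md_metrics() {
    awk 'BEGIN {
             print "# HELP raid_md_degraded 1 if the md array is missing a member, else 0."
             print "# TYPE raid_md_degraded gauge"
         }
         /^md[^ ]* :/ { name = $1 }
         match($0, /\[[U_]+\]/) {
             map = substr($0, RSTART, RLENGTH)
             printf "raid_md_degraded{array=\"%s\"} %d\n", name, (index(map, "_") > 0)
         }' "${1:-/proc/mdstat}"
}

# Write the metrics atomically so the collector never reads a partial file.
# $1 = textfile collector directory, $2 = optional mdstat path.
write_md_metrics() {
    out_dir=$1
    tmp_file="$out_dir/md_raid.prom.$$"
    md_metrics "${2:-/proc/mdstat}" > "$tmp_file" && mv "$tmp_file" "$out_dir/md_raid.prom"
}
```

Run `write_md_metrics` from cron or a systemd timer at roughly the scrape interval, and alert in Prometheus on `raid_md_degraded == 1`.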
Automated Drive Replacement Workflow
When SMART or mdadm detects a failing drive, a structured response workflow minimizes downtime and data risk:
- Detection: smartd alerts on Reallocated_Sector_Ct > 0 or mdadm reports a Failed device
- Isolation: The failed drive is marked faulty; the array continues operating in degraded mode
- Preparation: Identify the physical drive by serial number (smartctl -i), locate it in the chassis, and procure a replacement
- Replacement: Hot-swap the drive if supported, or schedule maintenance window
- Rebuild: Add the new drive with mdadm --manage /dev/mdX --add /dev/sdY; monitor rebuild progress with cat /proc/mdstat
- Verification: After rebuild completes, run a full check to verify data integrity
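The mdadm side of this workflow can be sketched as three small helpers. The function names are illustrative, device paths are placeholders, and the physical swap happens between the first and second step.

```shell
# Mark a member faulty and detach it from the array before pulling the drive.
# $1 = array (e.g. /dev/md0), $2 = member device (e.g. /dev/sdb1).
retire_member() {
    mdadm --manage "$1" --fail "$2" &&
    mdadm --manage "$1" --remove "$2"
}

# Add the replacement device; the kernel starts rebuilding onto it.
add_member() {
    mdadm --manage "$1" --add "$2"
}

# Print "ARRAY PERCENT" for every array that is currently rebuilding or resyncing.
# $1 = mdstat-format file (default /proc/mdstat).
rebuild_progress() {
    awk '/^md[^ ]* :/ { name = $1 }
         /(recovery|resync) *= *[0-9.]+%/ {
             for (i = 1; i <= NF; i++) if ($i ~ /%$/) print name, $i
         }' "${1:-/proc/mdstat}"
}
```

Something like `watch -n 30 cat /proc/mdstat` gives the same information interactively; the parsing helper is mainly useful for logging or alerting on stalled rebuilds.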
For production servers, keep at least one cold spare drive on hand for each drive model in your arrays. The time between ordering a replacement and installing it is the window where a second drive failure would mean permanent data loss.
Why Self-Host Your Storage Monitoring
Cloud storage providers handle RAID and drive health transparently — but you get zero visibility. Self-hosting your own storage with proper monitoring gives you data you can act on: SMART attribute trends that predict failures weeks in advance, real-time alerts when a drive drops from the array, and schedule-driven data integrity checks that catch silent corruption before it spreads. Combined, these three tools form a defense-in-depth strategy that cloud abstractions simply cannot match. For more storage management, see our guide on Linux LVM management. Our Btrfs snapshot management comparison covers another layer of data protection. For full filesystem integrity checking, read our fsck and repair guide.
FAQ
How often should I run RAID data checks?
Run a full check (scrub) at least once a month on all arrays. For arrays larger than 4TB, consider weekly checks, and budget time accordingly: scrub duration grows roughly linearly with array size. During a check, the array remains fully operational; the performance impact is typically 5-15%, depending on the array's sync_speed_max setting (/sys/block/mdX/md/sync_speed_max). Schedule checks during low-usage windows (e.g., Sunday 1 AM).
What SMART values indicate an imminent drive failure?
Three counters demand immediate action: Reallocated_Sector_Ct > 0 (drive has found and remapped bad sectors), Current_Pending_Sector > 0 (sectors that can’t be read, pending remap), and Offline_Uncorrectable > 0 (permanently unreadable sectors). A drive with any of these above zero should be replaced as soon as practical. Also watch the Raw_Read_Error_Rate — an increasing trend, even if below threshold, signals degrading media.
Can mdadm monitor catch all failure modes?
No, which is why you need all three tools. mdadm monitor catches events that change the array state (disk failures, rebuilds), but it cannot predict failures or detect silent data corruption. A drive with corrupted data that still responds to I/O will not trigger an mdadm event — the data is simply wrong. This is where SMART (predicts failures) and raid-check (detects data corruption) fill the gaps.
How do I test that my monitoring alerts work?
Simulate failures in a test environment. For mdadm: mark a member as faulty with mdadm --manage /dev/md0 --fail /dev/sdb1 and verify that you receive an alert. Then remove it with mdadm --manage /dev/md0 --remove /dev/sdb1 and re-add it with mdadm --manage /dev/md0 --re-add /dev/sdb1. For smartd: use smartd -q onecheck to run a one-time check; adding the -M test directive in smartd.conf makes smartd send a test notification at startup. For raid-check: manually run a check and verify the email notification configured in /etc/cron.d/raid-check.
Can I use these tools with hardware RAID controllers?
Partially. smartctl works with most hardware RAID controllers through the -d flag (e.g., -d megaraid,0 for LSI MegaRAID). mdadm only monitors Linux software RAID (md) and does not work with hardware RAID. For hardware RAID, use the vendor’s management utility — storcli for Broadcom/LSI, hpssacli for HPE, perccli for Dell PERC — which provide equivalent monitoring and alerting capabilities.