Master the process of recovering and re-adding failed drives to your mdadm RAID array while ensuring data integrity and optimal performance.

Re-adding Failed Drives in mdadm

Understanding RAID Failures

When a drive fails in an mdadm RAID array:

  • Array enters degraded mode
  • Hot spare may be activated
  • Performance may be impacted
  • Data redundancy is reduced
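These symptoms are visible directly in /proc/mdstat: a failed member is marked (F) and an underscore replaces its U in the [UU] status string. A minimal sketch of detecting that, run against a captured sample rather than the live file:

```shell
# Sample /proc/mdstat content for a two-disk RAID1 with one failed member
# (sdb1 is marked (F)); on a real system, read /proc/mdstat itself.
mdstat='md0 : active raid1 sda1[0] sdb1[1](F)
      1048512 blocks [2/1] [U_]'

# An underscore in the [UU] status string marks a failed/missing member
if printf '%s\n' "$mdstat" | grep -q '_]'; then
    state=degraded
else
    state=healthy
fi
echo "array state: $state"
```

On a real system, substitute `cat /proc/mdstat` for the sample variable.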

Initial Assessment

1. Check Array Status

# View array status
mdadm --detail /dev/md0

# Check drive status
cat /proc/mdstat

# View detailed drive information
smartctl -a /dev/sda

2. Identify Failed Drive

# Examine RAID metadata on all member partitions
mdadm --examine /dev/sd[a-z]1

# Check specific drive
mdadm --examine /dev/sdb1
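The (F) marker in /proc/mdstat identifies the failed member without examining every drive. As a sketch, extracting that device name from a sample status line:

```shell
# Sample device line from /proc/mdstat; failed members carry an (F) suffix
line='md0 : active raid1 sda1[0] sdb1[1](F)'

# Split on whitespace and strip the [n](F) suffix from the matching token
failed=$(printf '%s\n' "$line" | tr ' ' '\n' | sed -n 's/^\([a-z0-9]*\)\[[0-9]*](F)$/\1/p')
echo "failed member: /dev/$failed"
```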

Recovery Process

1. Testing Failed Drive

# Check for bad blocks (read-only test; the -w write test would destroy data)
badblocks -v /dev/sdb > badblocks.txt

# Run SMART tests (wait for the long test to finish before reading results)
smartctl -t long /dev/sdb
smartctl -l selftest /dev/sdb

# Check drive health
smartctl -H /dev/sdb
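For scripted health checks, smartctl's exit code is a bitmask (see the EXIT STATUS section of smartctl(8)); bit 3, for example, is set when the drive's SMART status is "DISK FAILING". A sketch that decodes a sample exit code:

```shell
# smartctl's exit code is a bitmask; bit 3 = drive reports DISK FAILING
rc=8   # sample exit code; on a real system: smartctl -H /dev/sdb; rc=$?

if [ $(( rc & 8 )) -ne 0 ]; then
    verdict="disk failing"
else
    verdict="ok"
fi
echo "SMART verdict: $verdict"
```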

2. Re-adding the Drive

# Mark drive as failed (if needed)
mdadm /dev/md0 -f /dev/sdb1

# Remove the failed drive
mdadm /dev/md0 -r /dev/sdb1

# Re-add the drive (fast resync if the array has a write-intent bitmap)
mdadm /dev/md0 --re-add /dev/sdb1

# If re-add is refused, add it as a fresh member (triggers a full rebuild)
mdadm /dev/md0 -a /dev/sdb1

3. Monitor Rebuild Process

# Watch rebuild progress
watch cat /proc/mdstat

# Check detailed status
mdadm --detail /dev/md0
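The recovery line in /proc/mdstat carries both the percent complete and an estimated finish time, which can be extracted for dashboards or alerts. A sketch against a sample progress line:

```shell
# Sample rebuild progress line from /proc/mdstat
line='      [=>...................]  recovery = 12.6% (132096/1048512) finish=1.2min speed=102400K/sec'

# Percent complete is the field ending in %; the ETA follows "finish="
pct=$(printf '%s\n' "$line" | awk '{for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i}')
eta=$(printf '%s\n' "$line" | sed -n 's/.*finish=\([^ ]*\).*/\1/p')
echo "rebuild $pct done, ETA $eta"
```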

Advanced Recovery Scenarios

1. Forced Assembly

# Force array assembly
mdadm --assemble --force /dev/md0 /dev/sd[b-e]1

# Record the array in the config (check for duplicate ARRAY lines afterwards)
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# Verify the array assembles cleanly from the config
mdadm --assemble --scan
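Forced assembly is typically needed when event counts diverge after an unclean shutdown: mdadm --examine reports an Events counter per member, and the member with the lowest count missed the most recent writes. A sketch with sample counts (a real script would parse the Events line of mdadm --examine for each device):

```shell
# Events counters as reported by "mdadm --examine" on each member
# (sample values, not from a real array)
counts='sdb1 10421
sdc1 10421
sdd1 10398'

# The member with the lowest event count is the stale one
stale=$(printf '%s\n' "$counts" | sort -n -k2 | head -n1 | awk '{print $1}')
echo "stale member: /dev/$stale"
```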

2. Partial Array Recovery

# Start array with missing drive
mdadm --run /dev/md0

# Add replacement drive
mdadm --add /dev/md0 /dev/sdb1

Performance Optimization

1. Rebuild Speed Control

# View current rebuild speed limits (KiB/s per device)
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# Raise the limits to speed up the rebuild (at the cost of more I/O load)
echo 50000 > /proc/sys/dev/raid/speed_limit_min
echo 100000 > /proc/sys/dev/raid/speed_limit_max
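Writes to /proc/sys do not survive a reboot. The same tunables are exposed as the sysctls dev.raid.speed_limit_min/max, so they can be made persistent with a drop-in file; a sketch (the file name is just a convention):

```
# /etc/sysctl.d/90-raid-rebuild.conf (example path)
# Rebuild speed limits in KiB/s per device
dev.raid.speed_limit_min = 50000
dev.raid.speed_limit_max = 100000
```

Apply without rebooting via `sysctl --system`.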

2. Stripe Cache Size

# Check current cache size
cat /sys/block/md0/md/stripe_cache_size

# Optimize cache size
echo 8192 > /sys/block/md0/md/stripe_cache_size
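stripe_cache_size is counted in pages per member device, so its memory cost grows with array width: roughly stripe_cache_size × page size × number of members. Working that out for the value above on a hypothetical 4-drive RAID5:

```shell
# Estimate stripe cache memory use (assumed formula: pages * page size * devices)
stripe_cache_size=8192
page_size=4096          # bytes; typical on x86_64
devices=4               # hypothetical 4-drive RAID5

bytes=$(( stripe_cache_size * page_size * devices ))
echo "stripe cache: $(( bytes / 1024 / 1024 )) MiB"
```

Large values help sequential rebuild throughput but permanently reserve that memory, so size it against available RAM.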

Preventive Maintenance

1. Regular Health Checks

#!/bin/bash
# raid-health-check.sh

check_raid() {
    local array="$1"

    # Array state: third field of the "State :" line (e.g. "clean")
    status=$(mdadm --detail "$array" | awk '/State :/ {print $3}')

    if [ "$status" != "clean" ]; then
        echo "Warning: Array $array is in $status state"
        mdadm --detail "$array"
    fi

    # Check each active member drive (device path is the last field,
    # which also handles RAID10 "active sync set-A" lines)
    mdadm --detail "$array" | awk '/active/ {print $NF}' | while read -r drive; do
        smart_status=$(smartctl -H "$drive" | awk '/overall-health/ {print $6}')

        if [ "$smart_status" != "PASSED" ]; then
            echo "Warning: Drive $drive failed SMART check"
        fi
    done
}

# Check all arrays
for md in /dev/md[0-9]*; do
    check_raid "$md"
done
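A check like this is typically driven from cron. Assuming the script is installed at /usr/local/bin/raid-health-check.sh (the same path referenced by the monit configuration), a hedged example:

```
# /etc/cron.d/raid-health-check (example path)
# Run the health check daily at 06:00 as root
0 6 * * * root /usr/local/bin/raid-health-check.sh
```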

2. Automated Monitoring

# /etc/monit/conf.d/raid-monitor
check program raid-status with path "/usr/local/bin/raid-health-check.sh"
    if status != 0 then alert

Best Practices

1. Documentation

#!/bin/bash
# document-raid.sh

echo "RAID Configuration Documentation" > raid-doc.txt
echo "Generated on $(date)" >> raid-doc.txt
echo "------------------------" >> raid-doc.txt

# Document array configuration
mdadm --detail --scan >> raid-doc.txt

# Document drive details
for drive in /dev/sd[a-z]; do
    echo -e "\nDrive: $drive" >> raid-doc.txt
    smartctl -i "$drive" >> raid-doc.txt
done

2. Backup Strategy

# Backup array configuration
cp /etc/mdadm/mdadm.conf /etc/mdadm/mdadm.conf.backup

# Save array details
mdadm --detail /dev/md0 > array_details.txt
mdadm --examine /dev/sd[a-z]1 > drive_details.txt

Troubleshooting Guide

1. Common Issues

# Follow system logs for md0 events (Ctrl-C to stop)
journalctl -f | grep md0

# View detailed errors
dmesg | grep raid

# Check drive errors
smartctl -l error /dev/sdb

2. Recovery Verification

# Check array consistency
echo check > /sys/block/md0/md/sync_action

# Monitor check progress
watch cat /proc/mdstat

# Verify data integrity (requires checksums recorded before the failure)
md5sum -c checksum.txt
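md5sum -c can only confirm integrity if the checksums were recorded before the failure. A self-contained demonstration of the record-then-verify cycle on a temporary file:

```shell
# Record-and-verify cycle demonstrated on a throwaway file
tmpdir=$(mktemp -d)
echo "important data" > "$tmpdir/file"

( cd "$tmpdir" && md5sum file > checksum.txt )   # record before the failure
( cd "$tmpdir" && md5sum -c checksum.txt )       # verify after recovery
rc=$?

rm -rf "$tmpdir"
echo "verification exit code: $rc"
```

A non-zero exit code means at least one file no longer matches its recorded checksum.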

Remember to always maintain backups and document your RAID configuration. Regular monitoring and proactive maintenance can help prevent drive failures and ensure quick recovery when issues occur.