Master the process of recovering and re-adding failed drives to your mdadm RAID array while ensuring data integrity and optimal performance.

Re-adding Failed Drives in mdadm

Understanding RAID Failures

When a drive fails in an mdadm RAID array:

  • Array enters degraded mode
  • Hot spare may be activated
  • Performance may be impacted
  • Data redundancy is reduced
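These symptoms are visible directly in /proc/mdstat: a failed member is marked (F) and an underscore replaces its U in the [UU] status string. A minimal sketch of detecting that, run against a captured sample rather than the live file:

```shell
# Sample /proc/mdstat content for a two-disk RAID1 with one failed member
# (sdb1 is marked (F)); on a real system, read /proc/mdstat itself.
mdstat='md0 : active raid1 sda1[0] sdb1[1](F)
      1048512 blocks [2/1] [U_]'

# An underscore in the [UU] status string marks a failed/missing member
if printf '%s\n' "$mdstat" | grep -q '_]'; then
    state=degraded
else
    state=healthy
fi
echo "array state: $state"
```

On a real system, substitute `cat /proc/mdstat` for the sample variable.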

Initial Assessment

1. Check Array Status

# View array status
mdadm --detail /dev/md0

# Check drive status
cat /proc/mdstat

# View detailed drive information
smartctl -a /dev/sda

2. Identify Failed Drive

# Examine RAID metadata on all member partitions
mdadm --examine /dev/sd[a-z]1

# Check specific drive
mdadm --examine /dev/sdb1
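The (F) marker in /proc/mdstat identifies the failed member without examining every drive. As a sketch, extracting that device name from a sample status line:

```shell
# Sample device line from /proc/mdstat; failed members carry an (F) suffix
line='md0 : active raid1 sda1[0] sdb1[1](F)'

# Split on whitespace and strip the [n](F) suffix from the matching token
failed=$(printf '%s\n' "$line" | tr ' ' '\n' | sed -n 's/^\([a-z0-9]*\)\[[0-9]*](F)$/\1/p')
echo "failed member: /dev/$failed"
```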

Recovery Process

1. Testing Failed Drive

# Check for bad blocks (read-only test; the -w write test would destroy data)
badblocks -v /dev/sdb > badblocks.txt

# Run SMART tests (wait for the long test to finish before reading results)
smartctl -t long /dev/sdb
smartctl -l selftest /dev/sdb

# Check drive health
smartctl -H /dev/sdb
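For scripted health checks, smartctl's exit code is a bitmask (see the EXIT STATUS section of smartctl(8)); bit 3, for example, is set when the drive's SMART status is "DISK FAILING". A sketch that decodes a sample exit code:

```shell
# smartctl's exit code is a bitmask; bit 3 = drive reports DISK FAILING
rc=8   # sample exit code; on a real system: smartctl -H /dev/sdb; rc=$?

if [ $(( rc & 8 )) -ne 0 ]; then
    verdict="disk failing"
else
    verdict="ok"
fi
echo "SMART verdict: $verdict"
```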

2. Re-adding the Drive

# Mark drive as failed (if needed)
mdadm /dev/md0 -f /dev/sdb1

# Remove the failed drive
mdadm /dev/md0 -r /dev/sdb1

# Re-add the drive (fast resync if the array has a write-intent bitmap)
mdadm /dev/md0 --re-add /dev/sdb1

# If re-add is refused, add it as a fresh member (triggers a full rebuild)
mdadm /dev/md0 -a /dev/sdb1

3. Monitor Rebuild Process

# Watch rebuild progress
watch cat /proc/mdstat

# Check detailed status
mdadm --detail /dev/md0
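The recovery line in /proc/mdstat carries both the percent complete and an estimated finish time, which can be extracted for dashboards or alerts. A sketch against a sample progress line:

```shell
# Sample rebuild progress line from /proc/mdstat
line='      [=>...................]  recovery = 12.6% (132096/1048512) finish=1.2min speed=102400K/sec'

# Percent complete is the field ending in %; the ETA follows "finish="
pct=$(printf '%s\n' "$line" | awk '{for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i}')
eta=$(printf '%s\n' "$line" | sed -n 's/.*finish=\([^ ]*\).*/\1/p')
echo "rebuild $pct done, ETA $eta"
```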

Advanced Recovery Scenarios

1. Forced Assembly

# Force array assembly
mdadm --assemble --force /dev/md0 /dev/sd[b-e]1

# Record the array in the config (check for duplicate ARRAY lines afterwards)
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# Verify the array assembles cleanly from the config
mdadm --assemble --scan
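Forced assembly is typically needed when event counts diverge after an unclean shutdown: mdadm --examine reports an Events counter per member, and the member with the lowest count missed the most recent writes. A sketch with sample counts (a real script would parse the Events line of mdadm --examine for each device):

```shell
# Events counters as reported by "mdadm --examine" on each member
# (sample values, not from a real array)
counts='sdb1 10421
sdc1 10421
sdd1 10398'

# The member with the lowest event count is the stale one
stale=$(printf '%s\n' "$counts" | sort -n -k2 | head -n1 | awk '{print $1}')
echo "stale member: /dev/$stale"
```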

2. Partial Array Recovery

# Start array with missing drive
mdadm --run /dev/md0

# Add replacement drive
mdadm --add /dev/md0 /dev/sdb1

Performance Optimization

1. Rebuild Speed Control

# View current rebuild speed limits (KiB/s per device)
cat /proc/sys/dev/raid/speed_limit_min
cat /proc/sys/dev/raid/speed_limit_max

# Raise the limits to speed up the rebuild (at the cost of more I/O load)
echo 50000 > /proc/sys/dev/raid/speed_limit_min
echo 100000 > /proc/sys/dev/raid/speed_limit_max
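Writes to /proc/sys do not survive a reboot. The same tunables are exposed as the sysctls dev.raid.speed_limit_min/max, so they can be made persistent with a drop-in file; a sketch (the file name is just a convention):

```
# /etc/sysctl.d/90-raid-rebuild.conf (example path)
# Rebuild speed limits in KiB/s per device
dev.raid.speed_limit_min = 50000
dev.raid.speed_limit_max = 100000
```

Apply without rebooting via `sysctl --system`.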

2. Stripe Cache Size

# Check current cache size
cat /sys/block/md0/md/stripe_cache_size

# Optimize cache size
echo 8192 > /sys/block/md0/md/stripe_cache_size
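stripe_cache_size is counted in pages per member device, so its memory cost grows with array width: roughly stripe_cache_size × page size × number of members. Working that out for the value above on a hypothetical 4-drive RAID5:

```shell
# Estimate stripe cache memory use (assumed formula: pages * page size * devices)
stripe_cache_size=8192
page_size=4096          # bytes; typical on x86_64
devices=4               # hypothetical 4-drive RAID5

bytes=$(( stripe_cache_size * page_size * devices ))
echo "stripe cache: $(( bytes / 1024 / 1024 )) MiB"
```

Large values help sequential rebuild throughput but permanently reserve that memory, so size it against available RAM.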

Preventive Maintenance

1. Regular Health Checks

#!/bin/bash
# raid-health-check.sh

check_raid() {
    local array="$1"

    # Array state: third field of the "State :" line (e.g. "clean")
    status=$(mdadm --detail "$array" | awk '/State :/ {print $3}')

    if [ "$status" != "clean" ]; then
        echo "Warning: Array $array is in $status state"
        mdadm --detail "$array"
    fi

    # Check each active member drive (device path is the last field,
    # which also handles RAID10 "active sync set-A" lines)
    mdadm --detail "$array" | awk '/active/ {print $NF}' | while read -r drive; do
        smart_status=$(smartctl -H "$drive" | awk '/overall-health/ {print $6}')

        if [ "$smart_status" != "PASSED" ]; then
            echo "Warning: Drive $drive failed SMART check"
        fi
    done
}

# Check all arrays
for md in /dev/md[0-9]*; do
    check_raid "$md"
done
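A check like this is typically driven from cron. Assuming the script is installed at /usr/local/bin/raid-health-check.sh (the same path referenced by the monit configuration), a hedged example:

```
# /etc/cron.d/raid-health-check (example path)
# Run the health check daily at 06:00 as root
0 6 * * * root /usr/local/bin/raid-health-check.sh
```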

2. Automated Monitoring

# /etc/monit/conf.d/raid-monitor
check program raid-status with path "/usr/local/bin/raid-health-check.sh"
    if status != 0 then alert

Best Practices

1. Documentation

#!/bin/bash
# document-raid.sh

echo "RAID Configuration Documentation" > raid-doc.txt
echo "Generated on $(date)" >> raid-doc.txt
echo "------------------------" >> raid-doc.txt

# Document array configuration
mdadm --detail --scan >> raid-doc.txt

# Document drive details
for drive in /dev/sd[a-z]; do
    echo -e "\nDrive: $drive" >> raid-doc.txt
    smartctl -i "$drive" >> raid-doc.txt
done

2. Backup Strategy

# Backup array configuration
cp /etc/mdadm/mdadm.conf /etc/mdadm/mdadm.conf.backup

# Save array details
mdadm --detail /dev/md0 > array_details.txt
mdadm --examine /dev/sd[a-z]1 > drive_details.txt

Troubleshooting Guide

1. Common Issues

# Follow system logs for md0 events (Ctrl-C to stop)
journalctl -f | grep md0

# View detailed errors
dmesg | grep raid

# Check drive errors
smartctl -l error /dev/sdb

2. Recovery Verification

# Check array consistency
echo check > /sys/block/md0/md/sync_action

# Monitor check progress
watch cat /proc/mdstat

# Verify data integrity (requires checksums recorded before the failure)
md5sum -c checksum.txt
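md5sum -c can only confirm integrity if the checksums were recorded before the failure. A self-contained demonstration of the record-then-verify cycle on a temporary file:

```shell
# Record-and-verify cycle demonstrated on a throwaway file
tmpdir=$(mktemp -d)
echo "important data" > "$tmpdir/file"

( cd "$tmpdir" && md5sum file > checksum.txt )   # record before the failure
( cd "$tmpdir" && md5sum -c checksum.txt )       # verify after recovery
rc=$?

rm -rf "$tmpdir"
echo "verification exit code: $rc"
```

A non-zero exit code means at least one file no longer matches its recorded checksum.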

Remember to always maintain backups and document your RAID configuration. Regular monitoring and proactive maintenance can help prevent drive failures and ensure quick recovery when issues occur.