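A server at our company with a 10-disk RAID 10 array went down. Three disks had failed, which was more than the array could tolerate, so the data could no longer be recovered through the array. Three disks failing at the same moment is very unlikely, so it is quite possible that one or two failed first without anyone noticing, and the problem was only discovered when the third failure took the machine down.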
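
Whether three failed disks are fatal for RAID 10 depends on which disks fail: the array survives as long as no mirror pair loses both of its members. The sketch below illustrates this under the assumption that the 10 disks form five two-way mirror pairs (0,1), (2,3), and so on. The real pairing on the controller is not known here, so the layout is hypothetical.

```python
from itertools import combinations

# Assumed layout: 10 disks as five two-way mirrors (0,1), (2,3), ...
# The real pairing on the controller may differ.
MIRROR_PAIRS = [(i, i + 1) for i in range(0, 10, 2)]


def array_survives(failed_disks):
    """Return True if every mirror pair still has at least one healthy disk."""
    failed = set(failed_disks)
    return all(not (a in failed and b in failed) for a, b in MIRROR_PAIRS)


def fatal_fraction(num_failed, num_disks=10):
    """Fraction of num_failed-disk failure combinations that destroy the array."""
    combos = list(combinations(range(num_disks), num_failed))
    fatal = sum(1 for combo in combos if not array_survives(combo))
    return fatal / len(combos)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "failed disk(s): fatal in", round(fatal_fraction(n) * 100, 1), "% of cases")
```

Under that assumption a single failure is never fatal, two failures are fatal in 5 of 45 combinations, and three failures in 40 of 120. Each unnoticed failure makes the next one far more dangerous, which is why catching the first one early matters.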

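Because we have many machines, checking each one by hand every day takes too long, while a weekly check leaves too large a gap. So I wrote a small Python script that can run once an hour. When it detects a fault, it immediately notifies the technical staff so they can investigate further and swap the disk before disaster strikes.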

1. Install MegaCLI

# wget https://raw.githubusercontent.com/crazy-zhangcong/tools/master/MegaCli8.07.10.tar.gz && tar -zxf MegaCli8.07.10.tar.gz && cd MegaCli8.07.10/Linux/ && rpm -ivh Lib_Utils-1.00-09.noarch.rpm MegaCli-8.02.21-1.noarch.rpm && ln -s /opt/MegaRAID/MegaCli/MegaCli64 /usr/local/bin/MegaCli && MegaCli -v 

If the following output appears, the installation completed successfully:

MegaCLI SAS RAID Management Tool Ver 8.02.21 Oct 21, 2011
(c)Copyright 2011, LSI Corporation, All Rights Reserved.

Exit Code: 0x00
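
Before scheduling any checks, it can help to confirm from Python that the binary is where the monitoring script expects it. A minimal sketch, assuming the default install path from the command above; `megacli_available` is a hypothetical helper name, not part of MegaCLI.

```python
import os

# Path created by the RPM install above; adjust if your layout differs.
MEGACLI_PATH = "/opt/MegaRAID/MegaCli/MegaCli64"


def megacli_available(path=MEGACLI_PATH):
    """Return True if `path` is an existing file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    if megacli_available():
        print("MegaCli found at", MEGACLI_PATH)
    else:
        print("MegaCli not found or not executable:", MEGACLI_PATH)
```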

2. Python script

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import requests

node = 'Broadband VPS host server 1'  # name of this node
error = 0

def get_status(value):
    # Return the text after the first colon, e.g. "State : Optimal" -> "Optimal"
    return value.split(":", 1)[1].strip()

def send_warning():
    # Voice notification, based on the https://www.mysubmail.com voice API
    voice_url = 'https://api.mysubmail.com/voice/send.json'
    voice_params = { 'appid': '',
                      'to': '13200000000',
                      'content': 'URGENT: ' + node + ' disk status abnormal, check immediately',
                      'signature': ''  # fill in the app key
                   }
    voice_res = requests.post(voice_url, data=voice_params)
    # print(voice_res.text)

    # SMS notification, based on the https://www.mysubmail.com SMS API
    message_url = 'https://api.mysubmail.com/message/send.json'
    message_params = { 'appid': '',
                       'to': '13200000000',
                       'content': '【xx Tech】URGENT: ' + node + ' disk status abnormal, check immediately',
                       'signature': ''  # fill in the app key
                     }
    message_res = requests.post(message_url, data=message_params)
    # print(message_res.text)


# Check the RAID (logical drive) state
# raidinfos = open('raid.log','r')
# for raidinfo in raidinfos.readlines():
raidinfos = os.popen('/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL -NoLOG').readlines()
for raidinfo in raidinfos:
    raidinfo = raidinfo.strip('\n')
    if "State" in raidinfo:
        status = get_status(raidinfo)
        if status != 'Optimal':
            error = 1
        print(raidinfo+'\n')


# Check the state of every physical disk
# pdlist = open('raid_pdlist.log','r')
# for line in pdlist.readlines():
pdlist = os.popen('/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL -NoLOG').readlines()
for line in pdlist:
    line = line.strip('\n')

    if "Media Error Count" in line:
        status = get_status(line)
        status = int(status)
        if status != 0:
            error = 1
        print(line)

    if "Other Error Count" in line:
        status = get_status(line)
        status = int(status)
        if status != 0:  # treat any non-zero count as a fault, like the other counters
            error = 1
        print(line)

    if "Predictive Failure Count" in line:
        status = get_status(line)
        status = int(status)
        if status != 0:
            error = 1
        print(line)

    if "Firmware state" in line:
        status = get_status(line)
        if status != 'Online, Spun Up':
            error = 1
        print(line+'\n')

# Send the notification
if error == 1:
    send_warning()
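
One thing to be aware of: `os.popen` does not report a failed command, so if MegaCli is missing or crashes the loops simply see no lines and no alert is sent. A minimal sketch of a stricter runner follows, assuming Python 3.5+ and assuming MegaCli exits with status 0 on success (worth verifying on your own machine). `run_lines` is a hypothetical helper name.

```python
import subprocess
import sys


def run_lines(argv, timeout=60):
    """Run a command and return (ok, lines).

    ok is False when the command cannot be started, times out, or exits
    with a non-zero status, so the caller can treat "no data" as a fault.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False, []
    return proc.returncode == 0, proc.stdout.splitlines()


if __name__ == "__main__":
    # Demonstration with the Python interpreter itself instead of MegaCli.
    ok, lines = run_lines([sys.executable, "-c", "print('State : Optimal')"])
    print(ok, lines)
```

In the monitoring script, the call would take an argument list such as `['/opt/MegaRAID/MegaCli/MegaCli64', '-PDList', '-aALL', '-NoLOG']`, and a result of `ok == False` could set `error = 1` so that a broken check also triggers a notification.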

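After that, just run the script on a schedule with crontab. There is plenty of room for improvement, such as recording the ID of the faulty disk and which parameters returned unexpected values.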
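
As one possible take on that improvement, the sketch below tracks the current `Slot Number` while walking the `-PDList` output and records every field that deviates from its expected value. The sample text is a hand-written, heavily abridged illustration of the output format rather than a real log, and `find_faults` and `EXPECTED` are hypothetical names.

```python
# Hand-written, abridged illustration of -PDList output (not a real log).
SAMPLE_PDLIST = """\
Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up

Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 12
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Failed
"""

# Expected value per field; any other value is reported.
EXPECTED = {
    "Media Error Count": "0",
    "Other Error Count": "0",
    "Predictive Failure Count": "0",
    "Firmware state": "Online, Spun Up",
}


def find_faults(pdlist_text):
    """Return a list of (slot, field, value) for every field that is off."""
    faults = []
    slot = None
    for line in pdlist_text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Slot Number":
            slot = value
        elif key in EXPECTED and value != EXPECTED[key]:
            faults.append((slot, key, value))
    return faults


if __name__ == "__main__":
    for slot, field, value in find_faults(SAMPLE_PDLIST):
        print("slot %s: %s = %s" % (slot, field, value))
```

The resulting list could be appended to the notification text so that whoever receives the alert already knows which slot to look at.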

Tags: Linux administration and maintenance, Python
