Faults can be collected by enabling SNMP on the monitored hosts and installing an open-source tool that receives SNMP traps and polls SNMP values. When using, for example, Zabbix, an agent can also run on each host to catch faults that SNMP does not cover. Below is an initial list of high-level faults and how they can be caught. The list assumes SNMP is enabled and a tool like Zabbix is used. Pacemaker is also mentioned where applicable; since Pacemaker only scales to a limited number of nodes, it is better suited to controller nodes.
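As a sketch of the trap-receiving side, net-snmp's snmptrapd can hand every incoming trap to the trap receiver script shipped with Zabbix. The community string and script path below are placeholders to adjust for the local setup:

```
# /etc/snmp/snmptrapd.conf (sketch; community string is a placeholder)
authCommunity log,execute,net public
# Hand every received trap to a script that reformats it for Zabbix
traphandle default /usr/bin/zabbix_trap_receiver.pl
```

On the Zabbix side, items of type "SNMP trap" then match on the text the receiver script writes out.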
Where | Description | Method | Comment |
---|---|---|---|
chassis | Blade not present | SNMP | |
chassis | Chassis fan degraded | SNMP | |
chassis | Chassis fan failed | SNMP | |
chassis | Chassis fan not present | SNMP | |
chassis | Chassis manager degraded | SNMP | |
chassis | Chassis manager failed | SNMP | |
chassis | Chassis manager not present | SNMP | |
chassis | Chassis power degraded | SNMP | |
chassis | Chassis power failed | SNMP | |
chassis | Chassis power input line status error | SNMP | |
chassis | Chassis power not present | SNMP | |
chassis | Chassis removal | SNMP | |
chassis | Network connector not present | SNMP | |
disk array | Disk array error | SNMP | |
libvirt | State of a virtual machine has changed | SNMP | |
openstack | OpenStack service is in failed state | zabbix agent | |
openstack | OpenStack status | zabbix agent | |
openstack | Open vSwitch daemon is not in active state | zabbix agent | |
openstack | Open vSwitch status | zabbix agent | |
os | Available memory too low | zabbix agent | |
os | Free FS space is less than 10% on volume {#FSNAME} | zabbix agent | |
os | Host information has changed | zabbix agent | |
os | Processor load too high | zabbix agent | |
os | System has restarted | zabbix agent | |
os | Zabbix agent is unreachable | | |
pacemaker | Corosync is not in active state | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker is not in active state | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker node {#NODENAME} status has changed on {HOST.NAME} | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker PCS daemon is not in active state | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker resource {#RESOURCENAME} status has changed on {HOST.NAME} | SNMP | controller nodes only (node count limited) |
server | Cold start | SNMP | |
server | CPU condition not OK | SNMP | |
server | Fan degraded | SNMP | |
server | Fan failed | SNMP | |
server | Fan not present | SNMP | |
server | Fan redundancy lost | SNMP | |
server | HW SNMP agent authentication failure | SNMP | |
server | Network adapter connectivity lost | SNMP | |
server | Memory condition not OK | SNMP | |
server | POST error | SNMP | |
server | Power degraded | SNMP | |
server | Power failed | SNMP | |
server | Power not present | SNMP | |
server | Power redundancy lost | SNMP | |
server | Power threshold exceeded | SNMP | |
server | Security override engaged | SNMP | |
server | Self-test error | SNMP | |
server | Server power off | SNMP | |
server | Server power on | SNMP | |
server | Server power on failure | SNMP | |
server | Server reset | SNMP | |
server | Temperature status degraded | SNMP | |
server | Thermal condition not OK | SNMP | |
server | Thermal confirmation | SNMP | Up again after thermal shutdown |
switch | Link down | SNMP | |
switch | Link up | SNMP | |
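The "daemon is not in active state" rows above are the kind of check a Zabbix agent UserParameter can implement. A minimal sketch, assuming systemd-managed services; the key name, script path, and default service name are illustrative, not from any standard template:

```shell
#!/bin/sh
# Sketch of a Zabbix agent custom check for "daemon is not in active state".
# A matching (hypothetical) agent config line would be:
#   UserParameter=service.active[*],/usr/local/bin/service_active.sh $1

service_active() {
  # systemctl prints "active" when the unit is running; map that to 1/0
  # so a trigger can simply alert when the item's last value is 0.
  state=$(systemctl is-active "$1" 2>/dev/null)
  [ "$state" = "active" ] && echo 1 || echo 0
}

# Default to checking Open vSwitch if no service name is given.
service_active "${1:-openvswitch}"
```

The numeric 1/0 result keeps the trigger expression trivial and works the same for any service name passed in.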
Many of the faults need to be configurable, while others do not. Hardware faults in particular may need different triggers on different hardware, while some OpenStack-internal faults will always be caught the same way.
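As an example of a configurable fault, the free-filesystem-space row above corresponds to a Zabbix trigger where only the threshold changes per deployment. A sketch in the classic Zabbix trigger syntax; the host name is a placeholder:

```
# Fires when free space on the discovered filesystem drops below 10%
{example-host:vfs.fs.size[{#FSNAME},pfree].last()}<10
```

A hardware fault caught via SNMP trap, by contrast, would typically use a fixed trigger matching the trap text, with any variation handled per hardware model rather than per threshold.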