Faults

Initial list of faults

Faults can be gathered by enabling SNMP and installing an open-source tool that receives SNMP traps and polls SNMP agents. With a tool such as Zabbix, one can also run an agent on each host to catch other faults. Below is an initial list of high-level faults and how they can be detected. The list assumes that SNMP is enabled and that a tool like Zabbix is used. Pacemaker is also listed, for deployments that use it. Pacemaker is limited in the number of nodes it can handle, so it works best for controller nodes only.
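As a rough illustration of the setup described above, the following sketch picks which monitoring components to enable on a node. The role names, the component labels and the `monitoring_for_node` function are hypothetical and not taken from any tool; the only rule encoded is that Pacemaker checks are considered for controller nodes only.

```python
from enum import Enum
from typing import List


class Role(Enum):
    # Hypothetical node roles, used only for this illustration.
    CONTROLLER = "controller"
    COMPUTE = "compute"


def monitoring_for_node(role: Role, use_pacemaker: bool = False) -> List[str]:
    """Return the monitoring components assumed to be enabled on a node."""
    # SNMP covers hardware faults; the Zabbix agent covers host-level faults.
    components = ["snmp", "zabbix-agent"]
    # Pacemaker is optional and, being limited in node count,
    # is only considered for controller nodes here.
    if use_pacemaker and role is Role.CONTROLLER:
        components.append("pacemaker")
    return components
```

For example, `monitoring_for_node(Role.COMPUTE, use_pacemaker=True)` still leaves Pacemaker out, because the node is not a controller.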

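^ Raw fault: where ^ Raw fault: description ^ How to detect: method ^ How to detect: comment ^ Affected virtual resource ^ Actions to recover ^
| chassis | Blade not present | SNMP | | | |
| chassis | Chassis fan degraded | SNMP | | | |
| chassis | Chassis fan failed | SNMP | | | |
| chassis | Chassis fan not present | SNMP | | | |
| chassis | Chassis manager degraded | SNMP | | | |
| chassis | Chassis manager failed | SNMP | | | |
| chassis | Chassis manager not present | SNMP | | | |
| chassis | Chassis power degraded | SNMP | | | |
| chassis | Chassis power failed | SNMP | | | |
| chassis | Chassis power input line status error | SNMP | | | |
| chassis | Chassis power not present | SNMP | | | |
| chassis | Chassis removal | SNMP | | | |
| chassis | Network connector not present | SNMP | | | |
| disk array | Disk array error | SNMP | | | |
| libvirt | State of a virtual machine has changed | SNMP | | | |
| openstack | OpenStack service is in failed state, such as nova-compute | Zabbix agent | | | |
| openstack | OpenStack status | Zabbix agent | | | |
| openstack | Open vSwitch daemon is not in active state | Zabbix agent | | | |
| openstack | Open vSwitch status | Zabbix agent | | | |
| os | Available memory too low | Zabbix agent | | | |
| os | Free FS space is less than 10% on volume {#FSNAME} | Zabbix agent | | | |
| os | Host information has changed | Zabbix agent | | | |
| os | Processor load too high | Zabbix agent | | | |
| os | System has restarted | Zabbix agent | | | |
| os | Zabbix agent is unreachable | | | | |
| pacemaker | Corosync is not in active state | SNMP | controller node, as limited? | | |
| pacemaker | Pacemaker is not in active state | SNMP | controller node, as limited? | | |
| pacemaker | Pacemaker node {#NODENAME} status has changed on {HOST.NAME} | SNMP | controller node, as limited? | | |
| pacemaker | Pacemaker PCS daemon is not in active state | SNMP | controller node, as limited? | | |
| pacemaker | Pacemaker resource {#RESOURCENAME} status has changed on {HOST.NAME} | SNMP | controller node, as limited? | | |
| server | Cold start | SNMP | | | |
| server | CPU condition not OK | SNMP | | | |
| server | Fan degraded | SNMP | | | |
| server | Fan failed | SNMP | | | |
| server | Fan not present | SNMP | | | |
| server | Fan redundancy lost | SNMP | | | |
| server | HW SNMP agent authentication failure | SNMP | | | |
| server | Network adapter connectivity lost | SNMP | | | |
| server | Memory condition not OK | SNMP | | | |
| server | POST error | SNMP | | | |
| server | Power degraded | SNMP | | | |
| server | Power failed | SNMP | | | |
| server | Power not present | SNMP | | | |
| server | Power redundancy lost | SNMP | | | |
| server | Power threshold exceeded | SNMP | | | |
| server | Security override engaged | SNMP | | | |
| server | Self-test error | SNMP | | | |
| server | Server power off | SNMP | | | |
| server | Server power on | SNMP | | | |
| server | Server power on failure | SNMP | | | |
| server | Server reset | SNMP | | | |
| server | Temperature status degraded | SNMP | | | |
| server | Thermal condition not OK | SNMP | | | |
| server | Thermal confirmation | SNMP | Up again after thermal shutdown | | |
| switch | Link down | SNMP | | | |
| switch | Link up | SNMP | | | |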
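One of the agent-side triggers in the list, "Free FS space is less than 10% on volume", can be sketched in plain Python as follows. This is only an illustration of the condition, not how Zabbix evaluates it; the function names are hypothetical, and the 10% default is the figure from the list.

```python
import shutil


def is_low_on_space(total_bytes: int, free_bytes: int,
                    threshold_pct: float = 10.0) -> bool:
    """Return True when free space is below threshold_pct percent of the volume."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Compare without dividing, to avoid floating-point rounding at the boundary.
    return free_bytes * 100 < threshold_pct * total_bytes


def check_volume(path: str, threshold_pct: float = 10.0) -> bool:
    """Evaluate the trigger for the file system that contains path."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return is_low_on_space(usage.total, usage.free, threshold_pct)
```

Calling `check_volume("/")` on a host would report whether the root volume currently meets the fault condition. Keeping the comparison in a separate pure function makes the threshold easy to test and to change.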

Describing faults

Many of the faults need to be configurable, while others do not. Hardware faults in particular might need different triggers on different hardware, while an OpenStack-internal fault will always be caught the same way.
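A minimal sketch of such configurability, assuming triggers are expressed as numeric thresholds: a default value per fault, with optional overrides per hardware model. The class name, the fault keys, the model name and the temperature values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TriggerCatalogue:
    # Default threshold for each fault name.
    defaults: Dict[str, float]
    # Per-hardware overrides: model name -> {fault name -> threshold}.
    overrides: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def threshold(self, fault: str, hw_model: Optional[str] = None) -> float:
        """Return the override for hw_model if one exists, else the default."""
        per_model = self.overrides.get(hw_model, {}) if hw_model else {}
        return per_model.get(fault, self.defaults[fault])


# Illustrative values only; real thresholds depend on the hardware in use.
catalogue = TriggerCatalogue(
    defaults={"temperature_degraded_celsius": 70.0, "free_fs_space_pct": 10.0},
    overrides={"hypothetical-blade-a": {"temperature_degraded_celsius": 80.0}},
)
```

With this layout, a hardware fault such as the temperature trigger can differ per model, while a fault with no override resolves to the same default everywhere.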

doctor/faults.1419385820.txt.gz · Last modified: 2014/12/24 01:50 by Ryota Mibu