Faults

Initial list of faults

Faults can be gathered by enabling SNMP and installing an open-source tool that receives SNMP traps and polls SNMP data. With Zabbix, for example, an agent can also run on each host to catch other faults. Below is an initial list of high-level faults and how each can be caught. The list assumes that SNMP is enabled and that a tool such as Zabbix is used. Pacemaker is also listed for deployments that use it. Pacemaker is limited in the number of nodes it can handle, so it is better suited to controller nodes only.

| Where | Description | Method | Comment |
| chassis | Blade not present | SNMP | |
| chassis | Chassis fan degraded | SNMP | |
| chassis | Chassis fan failed | SNMP | |
| chassis | Chassis fan not present | SNMP | |
| chassis | Chassis manager degraded | SNMP | |
| chassis | Chassis manager failed | SNMP | |
| chassis | Chassis manager not present | SNMP | |
| chassis | Chassis power degraded | SNMP | |
| chassis | Chassis power failed | SNMP | |
| chassis | Chassis power input line status error | SNMP | |
| chassis | Chassis power not present | SNMP | |
| chassis | Chassis removal | SNMP | |
| chassis | Network connector not present | SNMP | |
| disk array | Disk array error | SNMP | |
| libvirt | State of a virtual machine has changed | SNMP | |
| openstack | Openstack service is in failed state | Zabbix agent | |
| openstack | Openstack status | Zabbix agent | |
| openstack | Openvswitch daemon is not in active state | Zabbix agent | |
| openstack | Openvswitch status | Zabbix agent | |
| os | Available memory too low | Zabbix agent | |
| os | Free FS space is less than 10% on volume {#FSNAME} | Zabbix agent | |
| os | Host information has changed | Zabbix agent | |
| os | Processor load too high | Zabbix agent | |
| os | System has restarted | Zabbix agent | |
| os | Zabbix agent is unreachable | | |
| pacemaker | Corosync is not in active state | SNMP | controller node, as limited? |
| pacemaker | Pacemaker is not in active state | SNMP | controller node, as limited? |
| pacemaker | Pacemaker node {#NODENAME} status has changed on {HOST.NAME} | SNMP | controller node, as limited? |
| pacemaker | Pacemaker PCS daemon is not in active state | SNMP | controller node, as limited? |
| pacemaker | Pacemaker resource {#RESOURCENAME} status has changed on {HOST.NAME} | SNMP | controller node, as limited? |
| server | Cold start | SNMP | |
| server | CPU condition not ok | SNMP | |
| server | Fan degraded | SNMP | |
| server | Fan failed | SNMP | |
| server | Fan not present | SNMP | |
| server | Fan redundancy lost | SNMP | |
| server | HW SNMP agent authentication failure | SNMP | |
| server | Network adapter connectivity lost | SNMP | |
| server | Memory condition not ok | SNMP | |
| server | Post error | SNMP | |
| server | Power degraded | SNMP | |
| server | Power failed | SNMP | |
| server | Power not present | SNMP | |
| server | Power redundancy lost | SNMP | |
| server | Power threshold exceeded | SNMP | |
| server | Security override engaged | SNMP | |
| server | Self test error | SNMP | |
| server | Server power off | SNMP | |
| server | Server power on | SNMP | |
| server | Server power on failure | SNMP | |
| server | Server reset | SNMP | |
| server | Temperature status degraded | SNMP | |
| server | Thermal condition not ok | SNMP | |
| server | Thermal confirmation | SNMP | Up again after thermal shutdown |
| switch | Link down | SNMP | |
| switch | Link up | SNMP | |
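
As one illustration of an agent-side check from the list, the "Free FS space is less than 10% on volume" fault could be evaluated roughly as follows. This is only a minimal sketch using the Python standard library, not how Zabbix itself implements the trigger; the function names `is_free_space_low` and `check_volume` are hypothetical, and the 10% default simply mirrors the threshold named in the list.

```python
import shutil


def is_free_space_low(total_bytes: int, free_bytes: int,
                      threshold_pct: float = 10.0) -> bool:
    """Return True when free space is strictly below threshold_pct of total.

    Hypothetical helper; the 10% default mirrors the fault list entry.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Compare without dividing to avoid floating-point rounding at the boundary.
    return free_bytes * 100 < threshold_pct * total_bytes


def check_volume(path: str, threshold_pct: float = 10.0) -> bool:
    """Evaluate the low-free-space condition for the volume holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return is_free_space_low(usage.total, usage.free, threshold_pct)
```

Calling `check_volume("/var")` on a host would then report whether that volume currently meets the fault condition; a monitoring agent would run such a check periodically and report state changes.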

Describing faults

Many of the faults need to be configurable, while others do not. Hardware faults in particular may need different triggers on different hardware, whereas an OpenStack-internal fault will always be caught the same way.
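
The split between configurable and fixed faults could be modelled along these lines. This is a sketch under stated assumptions, not a format defined by this page: the `FaultDefinition` class, its fields, the hardware model names, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class FaultDefinition:
    """Hypothetical description of one fault from the list."""
    where: str                      # e.g. "server", "openstack"
    description: str                # fault text as shown in the list
    method: str                     # e.g. "SNMP", "Zabbix agent"
    default_threshold: Optional[float] = None
    configurable: bool = False      # True if triggers may differ per hardware
    overrides: Dict[str, float] = field(default_factory=dict)

    def threshold_for(self, hw_model: str) -> Optional[float]:
        """Return the trigger threshold to use for a given hardware model."""
        if not self.configurable:
            # Fixed faults are always caught the same way.
            return self.default_threshold
        return self.overrides.get(hw_model, self.default_threshold)


# Hardware fault: the trigger value is assumed to vary by hardware model.
power = FaultDefinition(
    where="server",
    description="Power threshold exceeded",
    method="SNMP",
    default_threshold=400.0,             # made-up value
    configurable=True,
    overrides={"example-model-b": 650.0},  # made-up model and value
)

# OpenStack-internal fault: no per-hardware variation.
service = FaultDefinition(
    where="openstack",
    description="Openstack service is in failed state",
    method="Zabbix agent",
)
```

With these definitions, `power.threshold_for("example-model-b")` uses the override, any other model falls back to the default, and `service.threshold_for(...)` ignores the hardware model entirely.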

doctor/faults.1419324168.txt.gz · Last modified: 2014/12/23 08:42 by Tomi Juvonen