Faults

Initial list of faults

Faults in the listed elements must be reported to the VNFM immediately so that it can take immediate action, such as live migration or a switch to a hot standby entity. In addition, a maintenance action should be triggered, e.g. to reboot the server or replace a defective hardware element.

Faults can have different severities: critical, warning, maintenance, or info. Critical faults require immediate action because a severe degradation of the system has happened or is expected. Warnings indicate that system performance is degrading; related actions include closer (e.g. more frequent) monitoring of that part of the system or preparation for a cold migration to a backup VM. Maintenance faults may trigger maintenance actions such as a reboot of the server or replacement of faulty but redundant hardware. Info messages do not require any action.
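The severity levels above can be sketched as a small lookup from severity to typical reactions. This is only an illustration: the names `Severity`, `DEFAULT_REACTIONS`, and `reactions_for` are hypothetical, and the reaction strings paraphrase the paragraph above rather than any defined Doctor interface.

```python
from enum import Enum
from typing import List


class Severity(Enum):
    """The four severities named on this page."""
    CRITICAL = "critical"
    WARNING = "warning"
    MAINTENANCE = "maintenance"
    INFO = "info"


# Typical reactions per severity, paraphrased from the text (illustrative only).
DEFAULT_REACTIONS = {
    Severity.CRITICAL: [
        "notify VNFM immediately",
        "live migration or switch to hot standby",
    ],
    Severity.WARNING: [
        "monitor more frequently",
        "prepare cold migration to backup VM",
    ],
    Severity.MAINTENANCE: [
        "reboot server or replace faulty redundant hardware",
    ],
    Severity.INFO: [],  # info messages do not require any action
}


def reactions_for(severity_name: str) -> List[str]:
    """Return the default reactions for a severity given as text (case-insensitive)."""
    severity = Severity(severity_name.strip().lower())
    return list(DEFAULT_REACTIONS[severity])
```

An unknown severity name raises `ValueError`, which seems preferable to silently treating an unclassified fault as harmless.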

Faults can be gathered by, e.g., enabling SNMP and installing open source tools to catch and poll SNMP data. With Zabbix, for example, an agent can also run on the hosts to catch any other faults. Table 1 provides a list of high-level faults that are considered within the scope of the Doctor project and that require immediate action by the VNFM.
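As one possible way to poll a monitoring system, the sketch below only builds (and does not send) a JSON-RPC request body for the Zabbix API method `trigger.get`, asking for triggers that are currently in the problem state at or above a given severity. The helper name `build_problem_trigger_request`, the token placeholder, and the selection of parameters are assumptions for illustration; this page does not prescribe how Zabbix should be queried, and passing the token in an `auth` field is the style used by older Zabbix releases, so the authentication mechanism may differ in a given deployment.

```python
import json


def build_problem_trigger_request(auth_token: str,
                                  min_severity: int = 4,
                                  request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 body for Zabbix trigger.get (hypothetical helper).

    min_severity uses Zabbix's numeric trigger severity scale,
    where 4 corresponds to "High" and 5 to "Disaster".
    """
    payload = {
        "jsonrpc": "2.0",
        "method": "trigger.get",
        "params": {
            "output": ["triggerid", "description", "priority"],
            "filter": {"value": 1},        # 1 = trigger is in the problem state
            "min_severity": min_severity,  # skip lower-severity triggers
            "selectHosts": ["host"],       # include the affected host names
            "expandDescription": True,     # resolve macros in the description
        },
        "auth": auth_token,                # assumption: token obtained beforehand
        "id": request_id,
    }
    return json.dumps(payload)


# The body would be POSTed to the server's api_jsonrpc.php endpoint.
body = build_problem_trigger_request("PLACEHOLDER_TOKEN")
```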

High level list of faults:

| Service | Fault | Severity | How to detect? | Comment | Action to recover |
| Compute Hardware | Processor/CPU failure, CPU condition not ok | Critical | Zabbix | | Switch to hot standby |
| | Memory failure / Memory condition not ok | Critical | Zabbix (IPMI) | | Switch to hot standby |
| | Network card failure, e.g. Network adapter connectivity lost | Critical | Zabbix / Ceilometer | | Switch to hot standby |
| | Disk crash | Info | RAID monitoring | Network storage is very redundant (e.g. RAID system) and can guarantee high availability. | Inform OAM |
| | Disk aging | Info | S.M.A.R.T (IPMI or OS) | | Inform OAM |
| | Storage controller | Critical | Zabbix (IPMI) | | Live migration if storage is still accessible; otherwise hot standby |
| | PDU/power failure, power off, server reset | Critical | Zabbix / Ceilometer | | Switch to hot standby |
| | Power degradation, Power redundancy lost, Power threshold exceeded | Warning | SNMP | | Live migration |
| | Chassis problem (e.g. fan degraded/failed, chassis power degraded), CPU fan problem, Temperature/thermal condition not okay | Warning | SNMP | | Live migration |
| | Mainboard failure | Critical | Zabbix (IPMI) | | Switch to hot standby |
| | OS crash (e.g. kernel panic) | Critical | Zabbix | | Switch to hot standby |
| Hypervisor | System has restarted | Critical | Zabbix | | Switch to hot standby |
| | Hypervisor failure | Warning / Critical | Zabbix / Ceilometer | | Migration / switch to hot standby |
| | Zabbix / Ceilometer is unreachable | Warning | ? | | Live migration |
| Networking | SDN/OpenFlow Switch/Controller degraded/failed | Critical | ? | | Switch to hot standby or reconfigure virtual network topology |
| | HW failure of physical switch/router | Warning | SNMP | Redundancy of physical infrastructure is reduced or no longer available. | Inform OAM to replace defective HW and configure new HW |
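
The high-level fault list above can be read as a lookup from fault to recovery action. The following is a minimal sketch of that idea using a few of the listed rows; the names `FaultEntry`, `HIGH_LEVEL_FAULTS`, and `recovery_action` are hypothetical, and the shortened fault keys are chosen for illustration only.

```python
from typing import NamedTuple, Optional


class FaultEntry(NamedTuple):
    """One row of the high-level fault list (subset of its columns)."""
    service: str
    severity: str
    detection: str
    action: str


# A few rows copied from the list above; keys are shortened fault names.
HIGH_LEVEL_FAULTS = {
    "Memory failure": FaultEntry(
        "Compute Hardware", "Critical", "Zabbix (IPMI)", "Switch to hot standby"),
    "Disk aging": FaultEntry(
        "Compute Hardware", "Info", "S.M.A.R.T (IPMI or OS)", "Inform OAM"),
    "Power redundancy lost": FaultEntry(
        "Compute Hardware", "Warning", "SNMP", "Live migration"),
    "System has restarted": FaultEntry(
        "Hypervisor", "Critical", "Zabbix", "Switch to hot standby"),
    "HW failure of physical switch/router": FaultEntry(
        "Networking", "Warning", "SNMP",
        "Inform OAM to replace defective HW and configure new HW"),
}


def recovery_action(fault: str) -> Optional[str]:
    """Return the listed recovery action, or None for a fault not in the list."""
    entry = HIGH_LEVEL_FAULTS.get(fault)
    return entry.action if entry else None
```

Returning `None` for unlisted faults leaves it to the caller to decide how to handle a fault that the list does not cover.
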
| Raw faults | | | How to detect | | Affected virtual resource | Actions to recover |
| Where | Description | Severity | Method | Comment | | |
| chassis | Blade not present | | SNMP | | | |
| | Chassis fan degraded | Warning | SNMP | | | |
| | Chassis fan failed | | SNMP | | | |
| | Chassis fan not present | | SNMP | | | |
| | Chassis manager degraded | | SNMP | | | |
| | Chassis manager failed | | SNMP | | | |
| | Chassis manager not present | | SNMP | | | |
| | Chassis power degraded | | SNMP | | | |
| | Chassis power failed | | SNMP | | | |
| | Chassis power input line status error | | SNMP | | | |
| | Chassis power not present | | SNMP | | | |
| | Chassis removal | | SNMP | | | |
| | Network connector not present | | SNMP | | | |
| disk array | Disk array error | | SNMP | | | |
| libvirt | State of a virtual machine has changed | | SNMP | | | |
| openstack | Openstack service is in failed state such as nova-compute | | zabbix agent | | | |
| | Openstack status | | zabbix agent | | | |
| | Openvswitch daemon is not in active state | | zabbix agent | | | |
| | Openvswitch status | | zabbix agent | | | |
| os | Available memory too low | | zabbix agent | | | |
| | Free FS space is less than 10% on volume {#FSNAME} | | zabbix agent | | | |
| | Host information has changed | | zabbix agent | | | |
| | Processor load too high | | zabbix agent | | | |
| | System has restarted | | zabbix agent | | | |
| | Zabbix agent is unreachable | | | | | |
| pacemaker | Corosync is not in active state | | SNMP | controller node, as limited? | | |
| | Pacemaker is not in active state | | SNMP | controller node, as limited? | | |
| | Pacemaker node {#NODENAME} status has changed on {HOST.NAME} | | SNMP | controller node, as limited? | | |
| | Pacemaker PCS daemon is not in active state | | SNMP | controller node, as limited? | | |
| | Pacemaker resource {#RESOURCENAME} status has changed on {HOST.NAME} | | SNMP | controller node, as limited? | | |
| server | Cold start | | SNMP | | | |
| | Cpu condition not ok | | SNMP | | | |
| | Fan degraded | | SNMP | | | |
| | Fan failed | | SNMP | | | |
| | Fan not present | | SNMP | | | |
| | Fan redundancy lost | | SNMP | | | |
| | HW SNMP agent authentication failure | | SNMP | | | |
| | Network adapter connectivity lost | | SNMP | | | |
| | Memory condition not ok | | SNMP | | | |
| | POST error | | SNMP | | | |
| | Power degraded | | SNMP | | | |
| | Power failed | Critical | SNMP | | | |
| | Power not present | Critical | SNMP | | | |
| | Power redundancy lost | Warning | SNMP | | | |
| | Power threshold exceeded | Warning | SNMP | | | |
| | security override engaged | | SNMP | | | |
| | self test error | | SNMP | | | |
| | Server power off | Critical | SNMP | | | |
| | Server power on | | SNMP | | | |
| | Server power on failure | | SNMP | | | |
| | Server reset | | SNMP | | | |
| | Temperature status degraded | Warning | SNMP | | | |
| | Thermal condition not ok | Warning | SNMP | | | |
| | Thermal confirmation | | SNMP | Up again after thermal shutdown | | |
| switch | Link down | Critical | SNMP | | | |
| | Link up | Info | SNMP | | | |

Describing faults

Many of the faults need to be configurable, while others do not. Hardware faults in particular might need different triggers on different hardware, whereas some OpenStack-internal faults will always be caught the same way.
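One way to picture this distinction is a fault description that carries a default trigger threshold plus optional per-hardware overrides. The sketch below is an assumption-laden illustration, not a proposed Doctor data model: the class `FaultTrigger`, the hardware model name `vendor-a-blade`, and all threshold values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FaultTrigger:
    """A fault whose trigger threshold may be overridden per hardware model."""
    name: str
    default_threshold: float
    configurable: bool = True
    overrides: Dict[str, float] = field(default_factory=dict)

    def threshold_for(self, hw_model: str) -> float:
        """Use the per-model override only if this fault is configurable."""
        if self.configurable:
            return self.overrides.get(hw_model, self.default_threshold)
        return self.default_threshold

    def is_raised(self, hw_model: str, measured: float) -> bool:
        """The fault is raised once the measured value reaches the threshold."""
        return measured >= self.threshold_for(hw_model)


# Hardware fault: the trigger differs between hardware models (values made up).
thermal = FaultTrigger(
    name="Thermal condition not ok",
    default_threshold=80.0,
    overrides={"vendor-a-blade": 70.0},
)

# OpenStack-internal fault: always caught the same way, so not configurable.
service_failed = FaultTrigger(
    name="Openstack service is in failed state",
    default_threshold=1.0,
    configurable=False,
)
```

With these example values, a reading of 75.0 raises the thermal fault on `vendor-a-blade` but not on a model without an override, while `service_failed` ignores any overrides that might be supplied.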

doctor/faults.txt · Last modified: 2015/02/17 15:24 by Gerald Kunzmann