====== Faults ======

===== Initial list of faults =====

Faults in the listed elements must be reported to the VNFM immediately so that it can take immediate action, such as live migration or a switch to a hot standby entity. In addition, a maintenance action should be triggered, e.g. to reboot the server or replace a defective hardware element.

Faults can be of different severity, i.e. critical, warning, maintenance, or info. Critical faults require immediate action because a severe degradation of the system has happened or is expected. Warnings indicate that system performance is degrading: related actions include closer (e.g. more frequent) monitoring of that part of the system or preparation for a cold migration to a backup VM. Faults of type maintenance may trigger maintenance actions such as a reboot of the server or replacement of faulty but redundant HW. Info messages do not require any action.

Faults can be gathered by, e.g., enabling SNMP and installing open source tools to catch SNMP traps and poll SNMP. When using Zabbix, for example, one can also run an agent on the hosts to catch any other fault.

Table 1 provides a list of high level faults that are considered within the scope of the Doctor project and that require immediate action by the VNFM.

=== High level list of faults: ===

^ Service ^ Fault ^ Severity ^ How to detect? ^ Comment ^ Action to recover ^
| **Compute Hardware** | Processor/CPU failure, CPU condition not ok | Critical | Zabbix | | Switch to hot standby |
| ::: | Memory failure / Memory condition not ok | Critical | Zabbix (IPMI) | | Switch to hot standby |
| ::: | Network card failure, e.g. Network adapter connectivity lost | Critical | Zabbix / Ceilometer | | Switch to hot standby |
| ::: | Disk crash | Info | RAID monitoring | Network storage is very redundant (e.g. RAID system) and can guarantee high availability. | Inform OAM |
| ::: | Disk aging | Info | S.M.A.R.T. (IPMI or OS) | | Inform OAM |
| ::: | Storage controller | Critical | Zabbix (IPMI) | | Live migration if storage is still accessible; otherwise hot standby |
| ::: | PDU/power failure, power off, server reset | Critical | Zabbix / Ceilometer | | Switch to hot standby |
| ::: | Power degradation, Power redundancy lost, Power threshold exceeded | Warning | SNMP | | Live migration |
| ::: | Chassis problem (e.g. fan degraded/failed, chassis power degraded), CPU fan problem, Temperature/thermal condition not okay | Warning | SNMP | | Live migration |
| ::: | Mainboard failure | Critical | Zabbix (IPMI) | | Switch to hot standby |
| ::: | OS crash (e.g. kernel panic) | Critical | Zabbix | | Switch to hot standby |
| **Hypervisor** | System has restarted | Critical | Zabbix | | Switch to hot standby |
| ::: | Hypervisor failure | Warning / Critical | Zabbix / Ceilometer | | Migration / switch to hot standby |
| ::: | Zabbix / Ceilometer is unreachable | Warning | ? | | Live migration |
| **Networking** | SDN/OpenFlow Switch/Controller degraded/failed | Critical | ? | | Switch to hot standby or reconfigure virtual network topology |
| ::: | HW failure of physical switch/router | Warning | SNMP | Redundancy of physical infrastructure is reduced or no longer available. | Inform OAM to replace defective HW and configure new HW |

^ Raw faults ||^ How to detect |^ Affected virtual resource ^ Actions to recover ^
^ Where ^ Description ^ Severity ^ Method ^ Comment | | |
| chassis | Blade not present | | SNMP | | | |
| ::: | Chassis fan degraded | Warning | SNMP | | | |
| ::: | Chassis fan failed | | SNMP | | | |
| ::: | Chassis fan not present | | SNMP | | | |
| ::: | Chassis manager degraded | | SNMP | | | |
| ::: | Chassis manager failed | | SNMP | | | |
| ::: | Chassis manager not present | | SNMP | | | |
| ::: | Chassis power degraded | | SNMP | | | |
| ::: | Chassis power failed | | SNMP | | | |
| ::: | Chassis power input line status error | | SNMP | | | |
| ::: | Chassis power not present | | SNMP | | | |
| ::: | Chassis removal | | SNMP | | | |
| ::: | Network connector not present | | SNMP | | | |
| disk array | Disk array error | | SNMP | | | |
| libvirt | State of a virtual machine has changed | | SNMP | | | |
| openstack | OpenStack service is in failed state, such as nova-compute | | Zabbix agent | | | |
| ::: | OpenStack status | | Zabbix agent | | | |
| ::: | Openvswitch daemon is not in active state | | Zabbix agent | | | |
| ::: | Openvswitch status | | Zabbix agent | | | |
| os | Available memory too low | | Zabbix agent | | | |
| ::: | Free FS space is less than 10% on volume {#FSNAME} | | Zabbix agent | | | |
| ::: | Host information has changed | | Zabbix agent | | | |
| ::: | Processor load too high | | Zabbix agent | | | |
| ::: | System has restarted | | Zabbix agent | | | |
| ::: | Zabbix agent is unreachable | | | | | |
| pacemaker | Corosync is not in active state | | SNMP | controller node, as limited? | | |
| ::: | Pacemaker is not in active state | | SNMP | controller node, as limited? | | |
| ::: | Pacemaker node {#NODENAME} status has changed on {HOST.NAME} | | SNMP | controller node, as limited? | | |
| ::: | Pacemaker PCS daemon is not in active state | | SNMP | controller node, as limited? | | |
| ::: | Pacemaker resource {#RESOURCENAME} status has changed on {HOST.NAME} | | SNMP | controller node, as limited? | | |
| server | Cold start | | SNMP | | | |
| ::: | CPU condition not ok | | SNMP | | | |
| ::: | Fan degraded | | SNMP | | | |
| ::: | Fan failed | | SNMP | | | |
| ::: | Fan not present | | SNMP | | | |
| ::: | Fan redundancy lost | | SNMP | | | |
| ::: | HW SNMP agent authentication failure | | SNMP | | | |
| ::: | Network adapter connectivity lost | | SNMP | | | |
| ::: | Memory condition not ok | | SNMP | | | |
| ::: | POST error | | SNMP | | | |
| ::: | Power degraded | | SNMP | | | |
| ::: | Power failed | Critical | SNMP | | | |
| ::: | Power not present | Critical | SNMP | | | |
| ::: | Power redundancy lost | Warning | SNMP | | | |
| ::: | Power threshold exceeded | Warning | SNMP | | | |
| ::: | Security override engaged | | SNMP | | | |
| ::: | Self test error | | SNMP | | | |
| ::: | Server power off | Critical | SNMP | | | |
| ::: | Server power on | | SNMP | | | |
| ::: | Server power on failure | | SNMP | | | |
| ::: | Server reset | | SNMP | | | |
| ::: | Temperature status degraded | Warning | SNMP | | | |
| ::: | Thermal condition not ok | Warning | SNMP | | | |
| ::: | Thermal confirmation | | SNMP | Up again after thermal shutdown | | |
| switch | Link down | Critical | SNMP | | | |
| ::: | Link up | Info | SNMP | | | |

===== Describing faults =====

Many of the faults need to be configurable, while others do not. Hardware faults in particular might need different triggers on different HW, whereas some OpenStack-internal faults will always be caught the same way.
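
One possible way to picture such a fault description is sketched below in Python. This is only an illustration, not a format defined by the Doctor project: the class `FaultDescription`, the field names, the hardware model name `vendor-a-blade`, the trigger strings, and the severity assigned to the nova-compute fault are all assumptions made for the example. The sketch shows a configurable hardware fault whose trigger differs per hardware model, next to an OpenStack-internal fault that always uses the same trigger, with the four severity levels mapped to the reactions described earlier on this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    MAINTENANCE = "maintenance"
    INFO = "info"


# Reaction per severity, paraphrasing the severity description on this page.
DEFAULT_REACTION = {
    Severity.CRITICAL: "immediate action",
    Severity.WARNING: "closer monitoring or prepare migration",
    Severity.MAINTENANCE: "maintenance action",
    Severity.INFO: "no action",
}


@dataclass
class FaultDescription:
    """Hypothetical description of one raw fault (all field names are assumptions)."""
    where: str
    description: str
    method: str
    severity: Severity
    default_trigger: Optional[str] = None
    configurable: bool = False
    # Per-hardware trigger overrides; keys are made-up hardware model names.
    triggers: Dict[str, str] = field(default_factory=dict)

    def trigger_for(self, hw_model: str) -> Optional[str]:
        # Only configurable faults may override the default trigger per HW model.
        if self.configurable and hw_model in self.triggers:
            return self.triggers[hw_model]
        return self.default_trigger

    def reaction(self) -> str:
        return DEFAULT_REACTION[self.severity]


# Configurable hardware fault: the threshold values are invented for illustration.
thermal = FaultDescription(
    where="server",
    description="Thermal condition not ok",
    method="SNMP",
    severity=Severity.WARNING,
    default_trigger="temperature > 80",
    configurable=True,
    triggers={"vendor-a-blade": "temperature > 75"},
)

# OpenStack-internal fault: same trigger everywhere; the severity is an assumption.
nova_failed = FaultDescription(
    where="openstack",
    description="OpenStack service is in failed state, such as nova-compute",
    method="Zabbix agent",
    severity=Severity.CRITICAL,
    default_trigger="service nova-compute failed",
)
```

With these definitions, `thermal.trigger_for("vendor-a-blade")` returns the model-specific trigger, while `nova_failed.trigger_for(...)` returns the same trigger for every hardware model because the fault is not marked configurable.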