Faults

Initial list of faults

Faults in the listed elements need to be reported to the VNFM immediately so that it can take immediate action, such as live migration or a switch to a hot standby entity. In addition, a maintenance action should be triggered, e.g. to reboot the server or replace a defective hardware element.

Faults can have different severities: critical, warning, maintenance, or info. Critical faults require immediate action because a severe degradation of the system has happened or is expected. Warnings indicate that system performance is degrading; related actions include closer (e.g. more frequent) monitoring of that part of the system or preparation for a cold migration to a backup VM. Maintenance faults may trigger maintenance actions such as a reboot of the server or replacement of faulty but redundant HW. Info messages do not require any action.
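
The four severity levels and their typical responses can be captured in a small lookup. The following is a minimal sketch only: the names `Severity`, `DEFAULT_RESPONSE`, `requires_immediate_action`, and `default_response` are hypothetical and not part of the Doctor project, and the response strings merely summarize the paragraph above.

```python
from enum import Enum


class Severity(Enum):
    """Fault severities as listed in the text (hypothetical type name)."""
    CRITICAL = "critical"
    WARNING = "warning"
    MAINTENANCE = "maintenance"
    INFO = "info"


# Assumed summary of the typical response per severity, taken from the text.
DEFAULT_RESPONSE = {
    Severity.CRITICAL: "immediate action",
    Severity.WARNING: "closer monitoring or prepare cold migration",
    Severity.MAINTENANCE: "maintenance action (e.g. reboot, replace HW)",
    Severity.INFO: "no action",
}


def requires_immediate_action(severity: Severity) -> bool:
    """Only critical faults call for immediate action."""
    return severity is Severity.CRITICAL


def default_response(severity: Severity) -> str:
    """Return the typical response for a severity level."""
    return DEFAULT_RESPONSE[severity]
```

A consumer of fault notifications could call `requires_immediate_action(Severity.CRITICAL)` to decide whether to act at once or merely record the event.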

Faults can be gathered by, e.g., enabling SNMP and installing open source tools to catch and poll SNMP data. With Zabbix, for example, an agent can also run on the hosts to catch other faults. Table 1 lists the high-level faults that are considered within the scope of the Doctor project and require immediate action by the VNFM.

High-level list of faults:

Service	Fault	Severity	How to detect?	Comment	Action to recover
Compute Hardware	Processor/CPU failure, CPU condition not ok	Critical	Zabbix		Switch to hot standby
	Memory failure / Memory condition not ok	Critical	Zabbix (IPMI)		Switch to hot standby
	Network card failure, e.g. Network adapter connectivity lost	Critical	Zabbix / Ceilometer		Switch to hot standby
	Disk crash	Info	RAID monitoring	Network storage is very redundant (e.g. RAID system) and can guarantee high availability.	Inform OAM
	Disk aging	Info	S.M.A.R.T (IPMI or OS)		Inform OAM
	Storage controller	Critical	Zabbix (IPMI)		Live migration if storage is still accessible; otherwise switch to hot standby
	PDU/power failure, power off, server reset	Critical	Zabbix / Ceilometer		Switch to hot standby
	Power degradation, Power redundancy lost, Power threshold exceeded	Warning	SNMP		Live migration
	Chassis problem (e.g. fan degraded/failed, chassis power degraded), CPU fan problem, Temperature/thermal condition not okay	Warning	SNMP		Live migration
	Mainboard failure	Critical	Zabbix (IPMI)		Switch to hot standby
	OS crash (e.g. kernel panic)	Critical	Zabbix		Switch to hot standby
Hypervisor	System has restarted	Critical	Zabbix		Switch to hot standby
	Hypervisor failure	Warning / Critical	Zabbix / Ceilometer		Migration / switch to hot standby
	Zabbix / Ceilometer is unreachable	Warning	?		Live migration
Networking	SDN/OpenFlow Switch/Controller degraded/failed	Critical	?		Switch to hot standby or reconfigure virtual network topology
Networking	HW failure of physical switch/router	Warning	SNMP	Redundancy of physical infrastructure is reduced or no longer available.	Inform OAM to replace defective HW and configure new HW
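
To illustrate how a consumer might use the table above, the sketch below encodes a few of its rows and looks up the recovery action for a reported fault. It is an assumption-laden example, not the Doctor implementation: `FaultRule`, `RULES`, `find_rule`, and `recovery_action` are hypothetical names, and only a subset of rows is included.

```python
from typing import NamedTuple, Optional


class FaultRule(NamedTuple):
    """One row of the high-level fault table (hypothetical structure)."""
    service: str
    fault: str
    severity: str
    detection: str
    action: str


# A subset of rows copied from the table above.
RULES = [
    FaultRule("Compute Hardware", "Mainboard failure", "Critical",
              "Zabbix (IPMI)", "Switch to hot standby"),
    FaultRule("Compute Hardware", "Disk aging", "Info",
              "S.M.A.R.T (IPMI or OS)", "Inform OAM"),
    FaultRule("Compute Hardware", "OS crash (e.g. kernel panic)", "Critical",
              "Zabbix", "Switch to hot standby"),
    FaultRule("Hypervisor", "System has restarted", "Critical",
              "Zabbix", "Switch to hot standby"),
]


def find_rule(fault: str) -> Optional[FaultRule]:
    """Return the first rule whose fault name matches, ignoring case."""
    wanted = fault.strip().lower()
    for rule in RULES:
        if rule.fault.lower() == wanted:
            return rule
    return None


def recovery_action(fault: str) -> Optional[str]:
    """Return the recovery action for a fault, or None if it is not listed."""
    rule = find_rule(fault)
    return rule.action if rule else None
```

For example, `recovery_action("Disk aging")` yields `"Inform OAM"`, while a fault that is not in the table yields `None`, leaving the decision to the caller.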

Raw faults			How to detect
Where	Description	Severity	Method	Comment
chassis	Blade not present		SNMP
	Chassis fan degraded	Warning	SNMP
	Chassis fan failed		SNMP
	Chassis fan not present		SNMP
	Chassis manager degraded		SNMP
	Chassis manager failed		SNMP
	Chassis manager not present		SNMP
	Chassis power degraded		SNMP
	Chassis power failed		SNMP
	Chassis power input line status error		SNMP
	Chassis power not present		SNMP
	Chassis removal		SNMP
	Network connector not present		SNMP
disk array	Disk array error		SNMP
libvirt	State of a virtual machine has changed		SNMP
openstack	Openstack service is in failed state such as nova-compute		zabbix agent
	Openstack status		zabbix agent
	Openvswitch daemon is not in active state		zabbix agent
	Openvswitch status		zabbix agent
os	Available memory too low		zabbix agent
	Free FS space is less than 10% on volume {#FSNAME}		zabbix agent
	Host information has changed		zabbix agent
	Processor load too high		zabbix agent
	System has restarted		zabbix agent
	Zabbix agent is unreachable
pacemaker	Corosync is not in active state		SNMP	controller node, as limited?
	Pacemaker is not in active state		SNMP	controller node, as limited?
	Pacemaker node {#NODENAME} status has changed on {HOST.NAME}		SNMP	controller node, as limited?
	Pacemaker PCS daemon is not in active state		SNMP	controller node, as limited?
	Pacemaker resource {#RESOURCENAME} status has changed on {HOST.NAME}		SNMP	controller node, as limited?
server	Cold start		SNMP
	Cpu condition not ok		SNMP
	Fan degraded		SNMP
	Fan failed		SNMP
	Fan not present		SNMP
	Fan redundancy lost		SNMP
	HW SNMP agent authentication failure		SNMP
	Network adapter connectivity lost		SNMP
	Memory condition not ok		SNMP
	POST error		SNMP
	Power degraded		SNMP
	Power failed	Critical	SNMP
	Power not present	Critical	SNMP
	Power redundancy lost	Warning	SNMP
	Power threshold exceeded	Warning	SNMP
	security override engaged		SNMP
	self test error		SNMP
	Server power off	Critical	SNMP
	Server power on		SNMP
	Server power on failure		SNMP
	Server reset		SNMP
	Temperature status degraded	Warning	SNMP
	Thermal condition not ok	Warning	SNMP
	Thermal confirmation		SNMP	Up again after thermal shutdown
switch	Link down	Critical	SNMP
switch	Link up	Info	SNMP

Describing faults

Many of the faults need to be configurable, while others do not. Hardware faults in particular might need different triggers on different HW, while some OpenStack-internal faults will always be caught the same way.
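
One way to picture this distinction is a set of default triggers that individual hardware models can override, alongside fixed triggers that are never overridden. The sketch below is purely illustrative: every name, hardware model, and threshold value in it is a made-up assumption, not a value defined by the Doctor project.

```python
# Hypothetical default trigger thresholds for hardware faults.
DEFAULT_HW_TRIGGERS = {
    "temperature_warning_celsius": 75,
    "fan_min_rpm": 1000,
}

# Hypothetical per-hardware overrides; the model names are invented.
HW_OVERRIDES = {
    "example-blade-a": {"temperature_warning_celsius": 85},
    "example-rack-b": {"fan_min_rpm": 1500},
}

# Triggers that are caught the same way everywhere (not configurable).
FIXED_TRIGGERS = {
    "nova-compute": "service is in failed state",
}


def triggers_for(hw_model: str) -> dict:
    """Merge defaults, per-hardware overrides, and fixed triggers."""
    merged = dict(DEFAULT_HW_TRIGGERS)
    merged.update(HW_OVERRIDES.get(hw_model, {}))
    # Fixed triggers are applied last so overrides cannot change them.
    merged.update(FIXED_TRIGGERS)
    return merged
```

Applying the fixed triggers last keeps them identical for every hardware model, while an unknown model simply falls back to the defaults.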
