Faults can be collected by enabling SNMP on the monitored hosts and installing an open-source tool that receives SNMP traps and polls SNMP values. When using, for example, Zabbix, an agent can also run on each host to catch faults that SNMP does not cover. Below is an initial list of high-level faults and how they can be caught. The list assumes SNMP is enabled and a tool like Zabbix is used. Pacemaker is also mentioned where applicable; since Pacemaker only scales to a limited number of nodes, it is better suited to controller nodes.
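As a sketch of the trap-receiving side, net-snmp's snmptrapd can hand every incoming trap to the trap receiver script shipped with Zabbix. The community string and script path below are placeholders to adjust for the local setup:

```
# /etc/snmp/snmptrapd.conf (sketch; community string is a placeholder)
authCommunity log,execute,net public
# Hand every received trap to a script that reformats it for Zabbix
traphandle default /usr/bin/zabbix_trap_receiver.pl
```

On the Zabbix side, items of type "SNMP trap" then match on the text the receiver script writes out.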
Where | Description | Method | Comment |
---|---|---|---|
chassis | Blade not present | SNMP | |
chassis | Chassis fan degraded | SNMP | |
chassis | Chassis fan failed | SNMP | |
chassis | Chassis fan not present | SNMP | |
chassis | Chassis manager degraded | SNMP | |
chassis | Chassis manager failed | SNMP | |
chassis | Chassis manager not present | SNMP | |
chassis | Chassis power degraded | SNMP | |
chassis | Chassis power failed | SNMP | |
chassis | Chassis power input line status error | SNMP | |
chassis | Chassis power not present | SNMP | |
chassis | Chassis removal | SNMP | |
chassis | Network connector not present | SNMP | |
disk array | Disk array error | SNMP | |
libvirt | State of a virtual machine has changed | SNMP | |
openstack | OpenStack service is in failed state | zabbix agent | |
openstack | OpenStack status | zabbix agent | |
openstack | Open vSwitch daemon is not in active state | zabbix agent | |
openstack | Open vSwitch status | zabbix agent | |
os | Available memory too low | zabbix agent | |
os | Free FS space is less than 10% on volume {#FSNAME} | zabbix agent | |
os | Host information has changed | zabbix agent | |
os | Processor load too high | zabbix agent | |
os | System has restarted | zabbix agent | |
os | Zabbix agent is unreachable | | |
pacemaker | Corosync is not in active state | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker is not in active state | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker node {#NODENAME} status has changed on {HOST.NAME} | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker PCS daemon is not in active state | SNMP | controller nodes only (node count limited) |
pacemaker | Pacemaker resource {#RESOURCENAME} status has changed on {HOST.NAME} | SNMP | controller nodes only (node count limited) |
server | Cold start | SNMP | |
server | CPU condition not OK | SNMP | |
server | Fan degraded | SNMP | |
server | Fan failed | SNMP | |
server | Fan not present | SNMP | |
server | Fan redundancy lost | SNMP | |
server | HW SNMP agent authentication failure | SNMP | |
server | Network adapter connectivity lost | SNMP | |
server | Memory condition not OK | SNMP | |
server | POST error | SNMP | |
server | Power degraded | SNMP | |
server | Power failed | SNMP | |
server | Power not present | SNMP | |
server | Power redundancy lost | SNMP | |
server | Power threshold exceeded | SNMP | |
server | Security override engaged | SNMP | |
server | Self-test error | SNMP | |
server | Server power off | SNMP | |
server | Server power on | SNMP | |
server | Server power on failure | SNMP | |
server | Server reset | SNMP | |
server | Temperature status degraded | SNMP | |
server | Thermal condition not OK | SNMP | |
server | Thermal confirmation | SNMP | Up again after thermal shutdown |
switch | Link down | SNMP | |
switch | Link up | SNMP | |
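The "daemon is not in active state" rows above are the kind of check a Zabbix agent UserParameter can implement. A minimal sketch, assuming systemd-managed services; the key name, script path, and default service name are illustrative, not from any standard template:

```shell
#!/bin/sh
# Sketch of a Zabbix agent custom check for "daemon is not in active state".
# A matching (hypothetical) agent config line would be:
#   UserParameter=service.active[*],/usr/local/bin/service_active.sh $1

service_active() {
  # systemctl prints "active" when the unit is running; map that to 1/0
  # so a trigger can simply alert when the item's last value is 0.
  state=$(systemctl is-active "$1" 2>/dev/null)
  [ "$state" = "active" ] && echo 1 || echo 0
}

# Default to checking Open vSwitch if no service name is given.
service_active "${1:-openvswitch}"
```

The numeric 1/0 result keeps the trigger expression trivial and works the same for any service name passed in.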
Many of the faults need to be configurable, while others do not. Hardware faults in particular may need different triggers on different hardware, while some OpenStack-internal faults will always be caught the same way.
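As an example of a configurable fault, the free-filesystem-space row above corresponds to a Zabbix trigger where only the threshold changes per deployment. A sketch in the classic Zabbix trigger syntax; the host name is a placeholder:

```
# Fires when free space on the discovered filesystem drops below 10%
{example-host:vfs.fs.size[{#FSNAME},pfree].last()}<10
```

A hardware fault caught via SNMP trap, by contrast, would typically use a fixed trigger matching the trap text, with any variation handled per hardware model rather than per threshold.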