This shows you the differences between two versions of the page.
doctor:faults [2015/02/06 16:14] Gerald Kunzmann [Table] |
doctor:faults [2015/02/17 15:24] (current) Gerald Kunzmann [Initial list of faults] |
||
---|---|---|---|
Line 3: | Line 3: | ||
===== Initial list of faults ===== | ===== Initial list of faults ===== | ||
- | Faults can be gathered by enabling SNMP and installing some opensource tool to catch and poll SNMP. When using for example Zabbix one can also put agent running on host to catch any other fault. Here is some initial list of high level faults and how they can be caught. List assumes that one enables usage of SNMP and then would use tool like Zabbix. There is also Pacemaker mentioned if used. Usage of that is limited to number of nodes, so it works better only for controller nodes. | + | Faults in the listed elements must be reported to the VNFM immediately so that it can take immediate action, such as live migration or a switch to a hot standby entity. In addition, a maintenance action should be triggered to, e.g., reboot the server or replace a defective hardware element. |
- | Faults can be of different **severity**, i.e. critical, warning, maintanance, or info. //Critical// faults require immediate action as a severe degradation of the system has happened or is expected. //Warnings// indicate that the system performance is going down: related actions include closer (e.g. more frequent) monitoring of that part of the system or preparation for a cold migration to a backup VM. Type //maintenance// may trigger maintenance actions like a re-boot of the server or replacement of a faulty, but redundant HW. //Info// messages do not require any action. | + | Faults can be of different severity, i.e. critical, warning, maintenance, or info. Critical faults require immediate action because a severe degradation of the system has happened or is expected. Warnings indicate that system performance is degrading: related actions include closer (e.g. more frequent) monitoring of that part of the system or preparation for a cold migration to a backup VM. Faults of type maintenance may trigger maintenance actions such as a reboot of the server or replacement of faulty but redundant hardware. Info messages do not require any action. |
- | Faults in the listed elements need to be immediately notified to the VNFM in order to perform an immediate action like live migration or switch to a hot standby entity. In addition, a maintenance action should be triggered to, e.g., reboot the server or replace a defect hardware element. | + | Faults can be gathered by, e.g., enabling SNMP and installing open source tools to catch SNMP notifications and poll SNMP data. When using Zabbix, for example, one can also run an agent on the hosts to catch any other faults. Table 1 lists the high-level faults that are considered within the scope of the Doctor project and that require immediate action by the VNFM. |
- | === Proposal for high level list of faults: === | + | === High level list of faults: === |
^ Service ^ Fault ^ Severity ^ How to detect? ^ Comment ^ Action to recover ^ | ^ Service ^ Fault ^ Severity ^ How to detect? ^ Comment ^ Action to recover ^ |