Fault Management and Maintenance:

Project description:

The manager of a resource pool or cloud computing platform needs to detect faults that affect the proper functioning of Virtual Machines (VMs). OpenStack, a prominent candidate for such a resource pool/cloud manager, does not at present detect specific hardware faults that are critical to the lifecycle management of virtualized telecom network functions running in VMs. This proposal reflects operators’ real-life operational requirement that OpenStack detect failures in the NFVI, such as hardware faults and failures of the hypervisor, host OS, other controllers (storage or network), or software components providing the virtual resources (Step 1, Fig. 1). After detection, OpenStack uses its mapping of Physical Machine (PM) to VM to VM user/client to identify the user/client of each affected VM (Step 2, Fig. 1). Then, through a northbound interface to be designed, OpenStack informs the user/client (the VNF Manager in the ETSI NFV GSs) of the affected VMs (Step 3, Fig. 1).

OpenStack should not unilaterally initiate a recovery mechanism, because telecom network functions (e.g. MME, S/P-GW) often have an active-standby configuration. Upon receiving a fault notification from OpenStack, the responsible user/client, i.e. the VNFM, may perform application-level reconfiguration, e.g. switch over to a standby instance. The procedure at the user/client is outside the scope of OPNFV.

This project also provides repair capability in the VIM. Since there are many types of failures and repair actions, use cases will be categorized and detailed in this project document. Any missing functions in the VIM (OpenStack) will also be developed.

After the necessary application-level reconfiguration, the user/client of the VMs notifies OpenStack through the northbound interface (Step 4, Fig. 1) that OpenStack can now take action on the fault-affected VMs. Upon receiving this instruction, OpenStack can use its existing capabilities, e.g. VM migration, to recover the affected VMs (Step 5, Fig. 1).
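
The five steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions, not OpenStack code: the class `Vim`, its methods `on_fault` and `on_user_ready`, and the notification tuple format are all hypothetical names chosen for this example, and the northbound interface is reduced to plain method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Vim:
    """Toy stand-in for the VIM (OpenStack). All names are hypothetical."""
    vm_host: dict    # VM name -> PM name
    vm_owner: dict   # VM name -> user/client name (e.g. a VNFM)
    notifications: list = field(default_factory=list)  # sent northbound
    pending: set = field(default_factory=set)          # VMs awaiting go-ahead

    def on_fault(self, pm):
        """Steps 1-3: a fault on `pm` was detected; find and notify owners."""
        # Step 2: map the faulty PM to its VMs and their users/clients.
        affected = sorted(vm for vm, host in self.vm_host.items() if host == pm)
        for vm in affected:
            # Step 3: notify the owner; do NOT start recovery unilaterally.
            self.notifications.append((self.vm_owner[vm], vm, "fault"))
            self.pending.add(vm)
        return affected

    def on_user_ready(self, vm, target_pm):
        """Steps 4-5: the owner signals go-ahead, then the VIM recovers the VM."""
        if vm not in self.pending:
            raise ValueError(f"{vm} has no outstanding fault notification")
        self.pending.discard(vm)
        # Step 5: recovery, modelled here as moving the VM to another PM.
        self.vm_host[vm] = target_pm


vim = Vim(
    vm_host={"vm1": "pm1", "vm2": "pm1", "vm3": "pm2"},
    vm_owner={"vm1": "vnfm-a", "vm2": "vnfm-b", "vm3": "vnfm-a"},
)
vim.on_fault("pm1")              # owners of vm1 and vm2 are notified
vim.on_user_ready("vm1", "pm2")  # vnfm-a has switched over; vm1 may be moved
```

The point of the split into two methods is that recovery (Step 5) can only happen after the user/client has explicitly released the VM (Step 4), which mirrors the active-standby concern described above.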

To summarize, the following three new features are required from OpenStack:

In addition, operators periodically perform maintenance of their resource pools, which encompasses PM replacement, hypervisor update/upgrade, host OS update/upgrade, etc. For this, one or more PMs need to be emptied, i.e. no VM may be running on them or using them. From the OpenStack point of view, only one feature is necessary on top of the Fault Management requirements described above: receiving a maintenance notification/instruction through the northbound interface that points to the particular PM or PMs (Step 1, Fig. 2). Steps 2-5 in this resource pool maintenance scenario are identical to Steps 2-5 in the Fault Management scenario, except that the notification in Step 3 has different semantics from that of Fault Management.
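
The maintenance flow can be sketched in the same spirit. This is a minimal illustration, not an OpenStack API: the functions `request_maintenance` and `is_emptied`, and the dictionaries used as inventory, are hypothetical names introduced only for this example.

```python
def request_maintenance(vm_host, vm_owner, pms):
    """Steps 1-3: given the PMs named in a maintenance instruction, return
    the notifications to send, as (owner, vm, pm) tuples."""
    targets = set(pms)
    return [
        (vm_owner[vm], vm, pm)
        for vm, pm in sorted(vm_host.items())
        if pm in targets
    ]


def is_emptied(vm_host, pm):
    """A PM is ready for maintenance once no VM is placed on it."""
    return pm not in vm_host.values()


vm_host = {"vm1": "pm1", "vm2": "pm2"}         # VM -> PM
vm_owner = {"vm1": "vnfm-a", "vm2": "vnfm-b"}  # VM -> user/client

notes = request_maintenance(vm_host, vm_owner, ["pm1"])
# Steps 4-5: after the owner's go-ahead, the VIM moves vm1 off pm1.
vm_host["vm1"] = "pm2"
ready = is_emptied(vm_host, "pm1")
```

The only difference from the fault case is the trigger: the list of PMs arrives as an instruction through the northbound interface rather than from fault detection.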

To summarize, the Maintenance scenario requires the following feature on top of the features required for Fault Management:

Scope:

This may affect various upstream projects within NFVI. In fact, every upstream project within the scope of NFVI that provides resources needs to provide fault detection for those resources, or the fault detection must be done by a separate tool. Please see a few examples here:

The Doctor project will analyze and select the necessary mechanisms and affected upstream projects.

Testability: ''(optional, Project Categories: Integration & Testing)''

N/A (Integration with OPNFV is TBD based on development in upstream projects)

Documentation: ''(optional, Project Categories: Documentation)''

Dependencies:

Committers and Contributors:

Planned deliverables

The documentation phase will run in parallel with both of the above-mentioned steps.

Proposed Release Schedule: