Fault Management and Maintenance:

Proposed name for the project: Doctor
Proposed name for the repository: repo-name
Project Categories: (Documentation, Requirements, Collaborative Development)

Project description:

The manager of a resource pool or cloud computing platform needs to detect faults that affect the proper functioning of the Virtual Machines (VMs). OpenStack, a prominent candidate for such a resource pool/cloud manager, at present does not detect specific hardware faults that are critical to the lifecycle management of virtualized telecom network functions running in the VMs.

This proposal reflects operators’ real-life operational requirements on OpenStack. OpenStack needs to detect failures in the NFVI, such as hardware faults and failures of the hypervisor, the host OS, other controllers (storage or network), or software components providing the virtual resources (Step 1, Fig. 1). After detection, OpenStack uses its Physical Machine (PM) to VM to VM user/client mapping information to identify the users/clients of the affected VMs (Step 2, Fig. 1). Then, through the northbound interface to be designed, OpenStack needs to inform the user/client (the VNF Manager in the ETSI NFV GSs) of the affected VMs (Step 3, Fig. 1).

OpenStack should not unilaterally initiate a recovery mechanism at this point, because telecom network functions (e.g. MME, S/P-GW) often have an active-standby configuration. Upon receiving a fault notification from OpenStack, the responsible user/client, i.e. the VNFM, may perform application-level reconfiguration, e.g. switch over to a standby instance. The procedure at the user/client is outside the scope of OPNFV. After the necessary application-level reconfiguration, the user/client of the VMs notifies OpenStack through the OpenStack northbound interface (Step 4, Fig. 1) that OpenStack can now take action on the fault-affected VMs. Upon receiving such an instruction, OpenStack can execute its existing capabilities, e.g. VM migration, to recover the affected VMs (Step 5).

This project also provides repair capability in the VIM. Since there are many types of failures and repair actions, use cases will be categorized and detailed in this project document. Any missing functions in the VIM (OpenStack) will also be developed.
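
The ordering of Steps 1-5 can be sketched as follows. This is a minimal illustration in Python, not OpenStack code: the `FaultManager` class, its method names, and the `"migrate"` instruction string are hypothetical and only show that recovery is deferred until the user/client has sent its instruction.

```python
from collections import defaultdict


class FaultManager:
    """Hypothetical stand-in for the VIM-side logic of Steps 1-5."""

    def __init__(self, placement, owners):
        # placement: physical machine -> list of VM ids hosted on it
        # owners: VM id -> user/client (e.g. a VNFM identifier)
        self.placement = placement
        self.owners = owners
        self.pending = set()   # VMs waiting for an instruction from their owner
        self.sent = []         # notifications issued (Step 3)
        self.actions = []      # recovery actions executed (Step 5)

    def on_fault(self, pm, fault):
        """Steps 1-3: a fault on `pm` is detected; find and notify the owners."""
        affected = defaultdict(list)
        for vm in self.placement.get(pm, []):
            affected[self.owners[vm]].append(vm)      # Step 2: owner lookup
            self.pending.add(vm)
        for owner, vms in affected.items():
            self.sent.append((owner, fault, vms))     # Step 3: notify the owner
        return dict(affected)

    def on_instruction(self, vm, instruction):
        """Steps 4-5: act only after the owner says the VM may be handled."""
        if vm not in self.pending:
            raise ValueError(f"no outstanding fault for {vm}")
        self.pending.discard(vm)
        self.actions.append((vm, instruction))        # Step 5: e.g. migration


manager = FaultManager(
    placement={"pm-1": ["vm-a", "vm-b"], "pm-2": ["vm-c"]},
    owners={"vm-a": "vnfm-1", "vm-b": "vnfm-2", "vm-c": "vnfm-1"},
)
affected = manager.on_fault("pm-1", "hardware fault")
# No recovery has happened yet: the VIM waits for the owners' instructions.
manager.on_instruction("vm-a", "migrate")
```

In this sketch `vm-b` stays in the pending set until its owner responds, which mirrors the requirement that OpenStack does not recover VMs unilaterally.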

To summarize, the following three new features are required from OpenStack:

- detect or receive notification of resource pool faults (e.g. hardware, hypervisor, host OS, storage, network controller or other NFVI software)
- inform the users/clients of the affected VMs about the faults
- receive instructions on how the affected VMs need to be recovered

In addition, operators periodically perform maintenance of their resource pool, which encompasses PM replacement, hypervisor update/upgrade, host OS update/upgrade, etc. For this, one or more PMs need to be emptied, i.e. no VM is running on them or using them. From the OpenStack point of view, only one feature is necessary on top of the Fault Management requirements described above: receiving a maintenance notification/instruction through the northbound interface that points to one or more particular PMs (Step 1, Fig. 2). Steps 2-5 in this resource pool maintenance scenario are identical to Steps 2-5 in the Fault Management scenario, except that the notification in Step 3 has different semantics from that of Fault Management.

To summarize, the Maintenance scenario requires the following feature on top of the features required for Fault Management:

- receive maintenance notification/instruction to empty particular PMs
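
As a rough sketch of the maintenance case, the snippet below turns a maintenance instruction for a set of PMs into one notice per user/client and checks whether a PM is already empty. The function names and the `"maintenance"` type field are hypothetical; the actual notification format is left to the interface specification.

```python
def maintenance_notifications(pms, placement, owners):
    """Build one maintenance notice per user/client for the PMs to be emptied.

    pms: physical machines named in the maintenance instruction (Step 1)
    placement: physical machine -> list of VM ids hosted on it
    owners: VM id -> user/client (e.g. a VNFM identifier)
    """
    notices = {}
    for pm in pms:
        for vm in placement.get(pm, []):
            notice = notices.setdefault(
                owners[vm], {"type": "maintenance", "vms": []}
            )
            notice["vms"].append(vm)
    return notices


def is_empty(pm, placement):
    """A PM can be serviced once no VM is running on it."""
    return not placement.get(pm)


placement = {"pm-1": ["vm-a", "vm-b"], "pm-2": []}
owners = {"vm-a": "vnfm-1", "vm-b": "vnfm-1"}
notices = maintenance_notifications(["pm-1", "pm-2"], placement, owners)
```

The type field is what distinguishes this notice from a fault notice in Step 3; the rest of the flow is shared with Fault Management.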

Scope:

Describe the problem being solved by the project
- OpenStack lacks the capability to detect hardware faults that affect the proper functioning of a telecom virtualized network function (VNF) and to notify the VNFMs that own the affected resources of those failures as soon as possible. OpenStack also lacks the capability to autonomously perform operators’ periodic resource pool/cloud platform maintenance process. This project would solve these two problems, whose solutions overlap to a large extent.

Specify any interface/API specification proposed
- A high-level interface functionality/feature would be proposed and then detailed into a uniform interface specification in the Collaborative Development phase.

Identify a list of features and functionality that will be developed.
- NFVI shall be able to detect resource failures (details to be defined)

This may affect various upstream projects within NFVI. In fact, every upstream project within the scope of NFVI that provides resources needs to provide fault detection for those resources, or the fault detection must be done via a separate tool. A few examples:

- ODL should be able to monitor and report faults of network resources, e.g. a switch outage.
- KVM or OVS might need to detect and report faults.
- A hardware acceleration or DPI engine, if present, might need to detect and report faults.
- Out-of-band hardware management (e.g. OpenIPMI or OpenHPI), if available, is already able to detect and report faults, but its interfaces for collecting and reporting such faults need to be checked.
- It might be good to add popular system monitoring tools (e.g. Zabbix or Nagios).

The Doctor project will analyze and select the necessary mechanisms and affected upstream projects.

- NFVI shall be able to report failures to the VIM (OpenStack)
  - The Doctor project will select and define the necessary interfaces.
- OpenStack shall be able to receive hardware fault notifications when faults occur
- OpenStack shall be able to identify the virtual machines (VMs) affected by the fault
- OpenStack shall be able to identify the users/clients, i.e. the VNFMs, of the affected VMs
- OpenStack shall be able to inform the users/clients, i.e. the VNFMs, of the affected VMs about the fault immediately
- OpenStack shall be able to receive instructions from the users/clients, i.e. the VNFMs, on what to do with the affected VMs
- OpenStack shall be able to receive maintenance instructions for NFVI resources, per entity providing the resources, e.g. physical servers, hypervisors, controllers of network or storage resources, …
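
The reporting interface between the NFVI and the VIM is still to be selected and defined by the project. Purely as an illustration of what a fault report might carry, the sketch below defines a hypothetical `FaultEvent` record and serializes it to JSON. Every field name and the example values are assumptions, not part of any existing OpenStack or NFVI interface.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FaultEvent:
    """Hypothetical fault report sent from an NFVI monitor to the VIM."""
    source: str         # reporting tool, e.g. an out-of-band hardware monitor
    resource_type: str  # e.g. "physical_server", "hypervisor", "switch"
    resource_id: str    # identifier of the faulty resource
    description: str    # free-text description of the fault


def encode(event):
    """Serialize the event so that it can be sent over any transport."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(payload):
    """Rebuild the event on the VIM side."""
    return FaultEvent(**json.loads(payload))


event = FaultEvent("ipmi-monitor", "physical_server", "pm-1", "fan failure")
roundtrip = decode(encode(event))
```

Keeping the resource type and identifier explicit lets the VIM map any reported fault, whether from a server, a hypervisor, or a network controller, onto the VMs that depend on that resource.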

Identify what is in or out of scope, so that discussion is reduced during the development phase.
- In scope
  - Fault detection function for NFVI resources
  - Northbound interfaces (Vi-Vnfm, Or-Vi) from/to OpenStack
  - Southbound interface to OpenStack if resource fault detection function is external to OpenStack
- Out of scope
  - User/Client i.e. VNFM side implementation
  - Existing fault detection mechanisms between VMs and their virtual resources via the Vn-nf

Describe how the project is extensible in the future
- This project proposal will provide an initial set of fault events and maintenance events deemed necessary for operators’ network operation. As new events appear, they would be incorporated.

Testability: ''(optional, Project Categories: Integration & Testing)''

N/A (Integration with OPNFV is TBD based on development in upstream projects)

Documentation: ''(optional, Project Categories: Documentation)''

- API docs and Northbound Interface specification docs
- Functional block description

Dependencies:

Identify similar projects that are underway or being proposed in OPNFV or upstream projects
- N/A
Identify any open source upstream projects and release timeline.
- OpenStack, L-release (Oct, 2015)
  - Our plan is to engage upstream projects at an early stage, concurrently with the requirement phase, to seek a feasible implementation and gain supporters, so that our concept and implementation are accepted by the upstream community.
  - Based on the requirement study, our concept has to be introduced and discussed at the L-release design summit. To do that, we have to propose our topic during the L-release design summit arrangement (April 2015).
  - The specification should be approved by spec freeze (July 2015).
  - The implementation (code) should be merged by feature freeze (Sep 2015).
- TBD for other affected upstream projects e.g. in NFVI

Identify any specific development to be staged with respect to the upstream project and releases.
- Code development for
  - Detection of fault events
  - Internal database to keep physical machine-virtual machine-client/owner mapping
  - Informing the client/owner of the affected VMs about the fault through the northbound interface
  - Receiving acknowledgement/instruction on affected VM handling
  - Receiving maintenance instructions on Physical Machines
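
The internal database mentioned in the list above could take many forms. The following is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table name, column names, and the `affected_by` helper are hypothetical and only show the lookup from a physical machine to its VMs and their clients/owners.

```python
import sqlite3

# In-memory database used only for illustration.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE vm (
        vm_id  TEXT PRIMARY KEY,
        pm_id  TEXT NOT NULL,   -- physical machine hosting the VM
        owner  TEXT NOT NULL    -- client/owner, e.g. a VNFM identifier
    );
    CREATE INDEX vm_by_pm ON vm (pm_id);
    """
)
db.executemany(
    "INSERT INTO vm (vm_id, pm_id, owner) VALUES (?, ?, ?)",
    [
        ("vm-a", "pm-1", "vnfm-1"),
        ("vm-b", "pm-1", "vnfm-2"),
        ("vm-c", "pm-2", "vnfm-1"),
    ],
)


def affected_by(pm_id):
    """Return (owner, vm_id) pairs for every VM hosted on the given PM."""
    rows = db.execute(
        "SELECT owner, vm_id FROM vm WHERE pm_id = ? ORDER BY owner, vm_id",
        (pm_id,),
    )
    return rows.fetchall()
```

The index on the physical machine column reflects the access pattern of both scenarios: a fault or a maintenance instruction names a PM, and the lookup has to return the affected VMs and owners quickly.
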
Are there any external fora or standard development organization dependencies? If possible, list informative and normative reference specifications.
- ETSI NFV MANO GS, ETSI NFV INF GSs

If the project is an integration and test project, identify hardware dependencies.
- Although this project proposal doesn’t include the subsequent integration and testing in the OPNFV integration platform at this stage, the artifacts of this project shall be hardware independent.

Committers and Contributors:

Name and affiliation of the maintainer:
- Ashiq Khan (DOCOMO, khan@nttdocomo.com)
Names and affiliations of the committers:
- Carlos Goncalves (NEC)
- Dirk Kutscher (NEC)
- Jarmo Virtanen (Nokia)
- Petri Kemppainen (Nokia)
- Ryota Mibu (NEC)
- Tapio Tallgren (Nokia)
- Tomi Juvonen (Nokia)
- Uli Kleber (Huawei, ulrich.kleber@huawei.com)
- Zhangyu (Huawei, zhangyu11@huawei.com)
Any other contributors:
- Palani Chinnakannan (Cisco)
- Peter Lee (ClearPath)
- Serge Manning (Sprint)

Planned deliverables

Describe the project release package as OPNFV or open source upstream projects.
- Documentation of the features shall be completed by March 2015
- Developed code shall be released as part of OpenStack; potentially Nova, Neutron, Cinder, but this list could be changed by requirement survey.

If project deliverables have multiple dependencies across other project categories, describe the linkage of the deliverables.
The project will mainly have the following two steps:
- Step 1: Requirement phase
  - Detailing of the requirements of Fault Management and Resource pool maintenance
- Step 2: Collaborative development phase
  - Development of code in upstream project

Documentation phase will run in parallel to both of the above-mentioned steps.

Proposed Release Schedule:

When is the first release planned?
- Documents with basic API and Northbound Interface specification by March 2015
- API framework and initial implementation within OpenStack and relevant upstream projects by September, 2015
Will this align with the current release cadence?
- Yes.