Wiki

This is an old revision of the document!

Doctor Requirements

This section is under development. You can find the current doc at HERE.

Feature

Detect unavailability of physical resources (receive failure/maintenance notification from various functions)
Identify affected virtualized resources
Notify unavailability of virtualized resources to the owner
Execute actions to process fault recovery and maintenance

Unavailability of physical resource

Unavailability of a physical resource is detected by various functions that monitor and/or manage individual H/W and S/W components. The causes of physical resource unavailability to be detected shall be configurable.

What do we need to detect?

Critical failure to provide virtualized resource
- Fault of devices on physical machine: CPU, Memory, Disk, IPMB Bus, Fan and Management Card
- External storage faults
- Host OS error: Kernel, File System, Block Device, Boot, etc.
- Hypervisor error
- Physical/virtual switch/port failure
- Network error (link down, communication error)
- Unexpected power down and system halt
Warning (will take some action soon)
- Device warning: Abnormal Temperature and Abnormal Voltage
- Failure of monitoring or management function such as machine/chassis Management Card
- Other failure of daemon such as NTP, health-check agent
- Undesirable state (e.g. one of the bonded links is down)
- Abnormal behavior (e.g. frequent network flooding)
Maintenance of Host
- Upgrade of hypervisor or other software on host
- Host OS upgrade and reboot
- Patching on host OS or other running software
- Firmware update
- Replacement of machine or device
Retirement
Other?
- Controller Process?
More details about faults can be found here
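
The requirement that the causes to detect be configurable could be sketched as a simple lookup table. This is only an illustration under assumptions: the cause names, the `Severity` enum, `DETECTION_CONFIG` and the `classify` helper are all hypothetical and are not part of any Doctor or OpenStack API.

```python
from enum import Enum


class Severity(Enum):
    """Categories taken from the list above."""
    CRITICAL = "critical"
    WARNING = "warning"
    MAINTENANCE = "maintenance"
    RETIREMENT = "retirement"


# Hypothetical configuration: which raw causes are detected and how
# they are classified. An operator would edit this table.
DETECTION_CONFIG = {
    "cpu_fault": Severity.CRITICAL,
    "hypervisor_error": Severity.CRITICAL,
    "link_down": Severity.CRITICAL,
    "abnormal_temperature": Severity.WARNING,
    "ntp_daemon_failure": Severity.WARNING,
    "firmware_update": Severity.MAINTENANCE,
}


def classify(cause, config=DETECTION_CONFIG):
    """Return the configured severity, or None if the cause is not
    configured for detection and should be ignored."""
    return config.get(cause)
```

A cause that is absent from the table is simply not detected, which is one possible reading of "configurable".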

TODO: Check that the terminology fits that of the INF GSs.

Note: “Existing fault detection mechanisms between VMs and their virtual resources via the Vn-nf” are out of scope.

What do we need to integrate with to detect those causes, and which candidates should we analyze?

H/W manager (control PDU, EMC or BMC through IPMI, OpenHPI, etc.) # Ironic?
Hypervisor (KVM)
vSwitch (OVS)
Hardware accelerator or DPI engine which can report failure
System monitoring tools (Zabbix, Nagios, own scripts, etc.)
Storage Controller
Network Controller (e.g. OpenDaylight)
OSS (maintenance notification sender)

Unavailability of virtualized resource

Unavailability of a virtualized resource is determined by consulting the mapping between physical and virtualized resources. The cause of unavailability of a virtualized resource can differ from case to case, so the relation from physical resources to virtualized resources shall be configurable.

What do we need to map from physical resources?

Compute
- VM including CPU, memory and ephemeral disk (“server” in OpenStack term)
Storage
- Virtual block disk (“volume” in OpenStack term)
Network
- Virtual network (“network” in OpenStack term)
- Virtual port (“port” in OpenStack term)
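
The mapping from physical to virtualized resources listed above could be sketched as follows. This is a minimal sketch, not the VIM's actual data model: the `ResourceMap` class, its methods and the identifiers (`host-1`, `vm-a`, etc.) are hypothetical.

```python
from collections import defaultdict


class ResourceMap:
    """Hypothetical map from a physical resource to the virtualized
    resources ("server", "volume", "network", "port") that depend on it."""

    def __init__(self):
        self._by_physical = defaultdict(set)

    def add(self, physical_id, virtual_type, virtual_id):
        """Record that virtual_id of virtual_type depends on physical_id."""
        self._by_physical[physical_id].add((virtual_type, virtual_id))

    def affected(self, physical_id):
        """Return the sorted (type, id) pairs affected by a fault on
        physical_id; empty if nothing is mapped to it."""
        return sorted(self._by_physical.get(physical_id, set()))


resource_map = ResourceMap()
resource_map.add("host-1", "server", "vm-a")
resource_map.add("host-1", "port", "port-1")
resource_map.add("storage-1", "volume", "vol-1")
```

Because the relation is held as data and not hard-coded, it can be changed per deployment, in line with the requirement that the relation be configurable.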

How do we describe unavailability of virtual resources?

Type (e.g. “server”, “volume”, “network”)
Event ID?
Current status, i.e. UP, DOWN, ERROR or UNKNOWN
Flag or time to be stopped or deleted by VIM
Free format description e.g. VM crash, Virtual network ports errors, Storage disconnection, VM retirement
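
The fields above could be gathered into a single record, for instance as below. This is a sketch under assumptions: the class and field names are hypothetical, and `resource_id` is added only so that the example can identify a resource; it is not in the list above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Status(Enum):
    UP = "UP"
    DOWN = "DOWN"
    ERROR = "ERROR"
    UNKNOWN = "UNKNOWN"


@dataclass
class UnavailabilityEvent:
    resource_type: str                          # e.g. "server", "volume", "network"
    resource_id: str                            # assumption: not in the list above
    status: Status                              # current status
    event_id: Optional[str] = None              # still an open question above
    scheduled_stop: Optional[datetime] = None   # time to be stopped or deleted by VIM
    description: str = ""                       # free-format text


event = UnavailabilityEvent(
    resource_type="server",
    resource_id="vm-a",
    status=Status.DOWN,
    description="VM crash",
)
```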

Notification

There are two types of notification: events on virtualized resources and updates to the capacity of the Resource Pool. All notifications should be delivered immediately, to minimize network service stalls and to avoid over-assignment caused by delayed capacity updates.

An event on a virtualized resource is a description of its unavailability, sent to inform the user (VNFM or Orchestrator). Flexibility of notification is important: the receiver function in a user-side implementation may have a different schema, location and policy (whether to receive events, whether to aggregate events with the same cause, etc.).
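
Per-receiver policy could be sketched roughly as follows. The `dispatch` function, the event keys (`type`, `cause`, `resource_id`) and the policy keys (`types`, `aggregate`) are hypothetical names chosen for illustration; a real implementation would also have to handle receiver-specific schemas and locations.

```python
from collections import defaultdict


def dispatch(events, subscribers):
    """Build one outbox per subscriber according to its policy.

    events: list of dicts with keys "type", "cause", "resource_id".
    subscribers: dict of name -> {"types": set of accepted resource
        types, "aggregate": bool}.
    Returns a dict of name -> list of notifications.
    """
    outbox = {}
    for name, policy in subscribers.items():
        # "Receive or not": keep only the resource types this receiver wants.
        wanted = [e for e in events if e["type"] in policy["types"]]
        if policy.get("aggregate"):
            # Aggregate events that share the same cause into one notification.
            grouped = defaultdict(list)
            for e in wanted:
                grouped[e["cause"]].append(e["resource_id"])
            outbox[name] = [
                {"cause": cause, "resource_ids": ids}
                for cause, ids in grouped.items()
            ]
        else:
            outbox[name] = wanted
    return outbox


events = [
    {"type": "server", "cause": "host-1 down", "resource_id": "vm-a"},
    {"type": "server", "cause": "host-1 down", "resource_id": "vm-b"},
    {"type": "volume", "cause": "storage-1 fault", "resource_id": "vol-1"},
]
subscribers = {
    "vnfm": {"types": {"server"}, "aggregate": True},
    "orchestrator": {"types": {"server", "volume"}, "aggregate": False},
}
outbox = dispatch(events, subscribers)
```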

Capacity updates should be calculated in the VIM, which sends the latest capacity to the user or administrator (VNFM or Orchestrator).
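
As a rough illustration of a capacity calculation, the sketch below counts only free vCPUs on hosts that are still available. The function name and the host fields (`vcpus_total`, `vcpus_used`, `available`) are assumptions; a real VIM would track more dimensions (memory, disk, network).

```python
def remaining_vcpus(hosts):
    """Return the free vCPUs in the pool, ignoring unavailable hosts.

    hosts: list of dicts with keys "vcpus_total", "vcpus_used" and
    "available" (False once the host has failed or is under maintenance).
    """
    return sum(
        h["vcpus_total"] - h["vcpus_used"]
        for h in hosts
        if h["available"]
    )


pool = [
    {"name": "host-1", "vcpus_total": 16, "vcpus_used": 10, "available": True},
    {"name": "host-2", "vcpus_total": 16, "vcpus_used": 4, "available": False},
    {"name": "host-3", "vcpus_total": 32, "vcpus_used": 8, "available": True},
]
```

Here `remaining_vcpus(pool)` counts host-1 and host-3 only, so a failure of host-2 immediately reduces the capacity reported to the user.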

Action

All actions performed by the VIM and NFVI after those notifications should be instructed by the owner of the resources or the administrator of the infrastructure. Instructions are not always required after a notification: a delegated action can proceed automatically if it has already been instructed, e.g. the VIM can automatically evacuate a VM that the owner has labeled ‘allow live-migration’.
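
The delegated-action case could be sketched as below. This is only an assumption about how such a check might look: the `plan_actions` function, the `labels` field and the `("evacuate", id)` tuple format are hypothetical, and the label string is taken from the example above.

```python
def plan_actions(affected_vms):
    """Split affected VMs into automatic actions and VMs awaiting instruction.

    affected_vms: list of dicts with keys "id" and "labels" (set of str).
    Returns (automatic, pending): a list of (action, vm_id) tuples that
    may proceed without further instruction, and a list of VM ids that
    must wait for the owner or administrator.
    """
    automatic, pending = [], []
    for vm in affected_vms:
        if "allow live-migration" in vm["labels"]:
            # The owner delegated this action in advance.
            automatic.append(("evacuate", vm["id"]))
        else:
            pending.append(vm["id"])
    return automatic, pending


automatic, pending = plan_actions([
    {"id": "vm-a", "labels": {"allow live-migration"}},
    {"id": "vm-b", "labels": set()},
])
```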

Note: “User/Client i.e. VNFM side implementation” is out of scope.

Actions to complete fault management and maintenance in VIM and NFVI:

Delete affected virtual resources
Recreate affected virtual resources
Create new virtual resources
Fence physical resources (e.g. power down uncertain hosts)
Receive acceptance of, or readiness for, pausing or shutting down a resource

Problem Description (Gap Analysis)

NOTE: We are working on this in an etherpad page.
