Slides

Source

Compute Node HA in OpenStack - Theory (HO128394.pdf)

Compute Node HA in OpenStack - Hands On (HO128394-lab.pdf)

Compute Node HA Training (May 2016) (http://suse.github.io/compute-ha-training/#/about)

Summary


# Typical HA Control Plane

- Automatic restart of controller services

- Increases uptime of cloud


# Compute failure


# When is Compute HA important?

- Pets vs Cattle

- Pet: Service downtime when a pet dies

- Pet: VM instances often stateful, with mission-critical data

- Pet: Needs automated recovery with data protection

- Cattle: Service resilient to instances dying

- Cattle: Stateless, or ephemeral (disposable storage)

- Cattle: Already ideal for cloud, but automated recovery still needed!


# If compute node is hosting cattle

- To handle failures at scale, we need to restart VMs automatically somehow.


# If compute node is hosting pets

- We have to resurrect them very carefully in order to avoid any zombie pets!


# Do we really need compute HA in OpenStack?

- Compute HA needed for cattle as well as pets

- Valid reasons for running pets in OpenStack

- Manageability benefits

- Want to avoid multiple virtual estates

- Too expensive to cloudify legacy workloads


# Architectural Challenges

- Configurability

- Compute Plane Needs to Scale

- Full Mesh Clusters Don't Scale
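The scaling problem with a full mesh can be made concrete with a little arithmetic. The sketch below assumes "full mesh" means every cluster member keeps a link to every other member, and it uses the number of pairwise links as a rough stand-in for cluster overhead. The function name and the node counts are illustrative only, not taken from the slides.

```python
def full_mesh_links(nodes: int) -> int:
    """Return the number of pairwise links in a full mesh of `nodes` members."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    # Each of the n members links to the other n - 1; every link is counted once.
    return nodes * (nodes - 1) // 2


if __name__ == "__main__":
    # Illustrative sizes only: a small control plane versus a large compute plane.
    for n in (3, 16, 100, 1000):
        print(f"{n:>5} nodes -> {full_mesh_links(n):>7} links")
```

The link count grows quadratically with the node count, which is tolerable for a handful of controllers but not for a compute plane that may grow to hundreds of nodes.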


# Addressing Scalability

- The obvious workarounds are ugly!

- Multiple compute clusters introduce unwanted artificial boundaries

- Clusters inside / between guest VM instances are not OS-agnostic, and require cloud users to modify guest images (installing & configuring cluster software)

- Cloud is supposed to make things easier, not harder!



# Compute HA in SUSE OpenStack Cloud

- OCF (Open Cluster Framework)

- Pros

    Ready for production now

    Commercially supported by SUSE

    RAs upstream in openstack-resource-agents repo:

    https://github.com/openstack/openstack-resource-agents/tree/master/ocf

- Cons

    Known limitations (not bugs):

    Only handles failure of the compute node, not of individual VMs or of the nova-compute service

    Some corner cases still problematic, e.g. if nova fails during recovery


# Shared Storage

- Where can we have Shared Storage? (Two key areas)

    /var/lib/glance/images on controller nodes

    /var/lib/nova/instances on compute nodes

- When do we need Shared Storage?

    If /var/lib/nova/instances is shared, VM's ephemeral disk will be preserved during recovery

    Otherwise: VM disk will be lost, recovery will need to rebuild VM from image

    Either way, /var/lib/glance/images should be shared across all controllers (unless using Swift / Ceph)

    If it is not shared, nova might fail to retrieve the image from glance
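The shared-storage rules above can be summarised as a small decision function. This is a minimal sketch, not part of any OpenStack API: `StorageLayout`, `recovery_outcome`, and the field names are hypothetical, and the returned strings simply restate the points listed in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StorageLayout:
    """Hypothetical description of a deployment's storage choices."""
    nova_instances_shared: bool   # /var/lib/nova/instances shared across compute nodes
    glance_images_shared: bool    # /var/lib/glance/images shared across controllers
    glance_uses_swift_or_ceph: bool = False  # image store backed by Swift / Ceph


def recovery_outcome(layout: StorageLayout) -> List[str]:
    """List what to expect when a VM is recovered after a compute node failure."""
    notes = []
    if layout.nova_instances_shared:
        notes.append("ephemeral disk preserved")
    else:
        notes.append("ephemeral disk lost; VM rebuilt from image")
    # The image directory only needs sharing when Swift / Ceph is not in use.
    if not (layout.glance_images_shared or layout.glance_uses_swift_or_ceph):
        notes.append("risk: nova might fail to retrieve image from glance")
    return notes


if __name__ == "__main__":
    layout = StorageLayout(nova_instances_shared=False, glance_images_shared=False)
    for note in recovery_outcome(layout):
        print(note)
```

A layout with neither directory shared and no Swift / Ceph backend is the worst case: the disk is lost and the rebuild from image may itself fail.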