Non-violent communication with your highly available Linux cluster

by Lars Marowsky-Brée (SUSE Linux Products GmbH)

Linux clusters based on the now widely adopted Corosync and Pacemaker stack are common in data centers today. They protect data and services such as databases, SAP systems, firewalls, virtual machine hosting, and even air traffic control against service and hardware faults.

Compared to single-node systems, such clustered systems provide significant availability gains for mission- and business-critical services, provided they are set up correctly. Such distributed services are necessarily more complex, which makes log file analysis and troubleshooting non-trivial. Their continuous and timing-sensitive use of CPU, storage, and network resources also tends to expose even rare and transient faults in the environment.

Based on 14 years of experience at SUSE, we will discuss current best practices, show how to avoid common pitfalls in both design and implementation, and introduce the audience to the tools that the cluster stack provides for tracing and debugging.

The target audience is system administrators who manage or design HA clusters on Linux.

About the author, Lars Marowsky-Brée:

Lars works as the architect for the SUSE Linux Enterprise High Availability product. He contributes to various High Availability projects on Linux and is a frequent speaker at conferences. He finds disaster resilience, business continuity, and distributed systems strangely fascinating, and enjoys thinking up ways in which things will go wrong. Previous roles include network administrator, consultant, and team lead of one of the SUSE Labs kernel teams.

A Linux user since 1994, he is passionate about Free and Open Source software. He holds a Master of Science degree from the University of Liverpool and was named a SUSE Distinguished Engineer in 2013. He wears black.