Global Update Manager in Windows Failover Clusters

Yesterday we experienced some issues in a large Hyper-V cluster: nodes were being evicted and were restarting the cluster service, which resulted in VMs being restarted on other nodes. That is not great when you are trying to provide a highly available service for the end users.

Reading about the Global Update Manager (GUM) and how it works, and getting help from Microsoft CSS, helped us get out of the issue. In the default asynchronous mode in 2012 R2, a cluster update is committed once a majority of the nodes have processed it. When reading the cluster state, a node must then check with a majority of the nodes once again to get a valid state, which means more network traffic.

The problem appears when you have a large Hyper-V 2012 R2 cluster with lots of cluster resource updates, plus VMM and SCOM agents hammering the cluster database. The database then gets a lot of traffic, updates become slow, and eventually the cluster might start going bananas and evicting hosts that do not respond in time. Your logs will start filling up with events 5377 and 1135.
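
To check whether a cluster shows the same symptoms, the event log on each node can be searched for those two event IDs. The following is only a sketch: it assumes both events are written to the System log, that the FailoverClusters module is available, and that remote event log access to the nodes is allowed. The 24-hour window is an arbitrary choice.

```powershell
# Collect events 1135 and 5377 from every node in the cluster.
# Assumption: both event IDs are logged to the System log.
$since = (Get-Date).AddHours(-24)   # arbitrary look-back window

Get-ClusterNode | ForEach-Object {
    Get-WinEvent -ComputerName $_.Name -FilterHashtable @{
        LogName   = 'System'
        Id        = 1135, 5377
        StartTime = $since
    } -ErrorAction SilentlyContinue   # suppress the error when a node has no matches
} | Sort-Object TimeCreated |
    Format-Table TimeCreated, MachineName, Id -AutoSize
```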

GUM cluster synchronous mode

There has been some work under the covers, and the cluster database read/write mode now defaults to synchronous mode in a Hyper-V 2016 cluster. In synchronous mode every node in the cluster processes each update before it is committed, so all nodes have the latest information and can read it locally, and that means less network traffic!

Default Behaviours in Clusters

Windows Server 2012 R2

Get-Cluster | fl DatabaseReadWriteMode

DatabaseReadWriteMode : 1

Windows Server 2016

Get-Cluster | fl DatabaseReadWriteMode

DatabaseReadWriteMode : 0

ref: https://windowsprivatecloud.wordpress.com/about/configure-the-global-update-manager-gum-mode-in-wfc/

Once we changed DatabaseReadWriteMode to 0, the cluster became stable.
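
The change itself is a single property assignment on the cluster object. A minimal sketch, assuming it is run from an elevated PowerShell session on one of the cluster nodes with the FailoverClusters module loaded; it is worth trying on a test cluster first:

```powershell
# Show the current mode (1 on our 2012 R2 Hyper-V cluster)
Get-Cluster | fl DatabaseReadWriteMode

# Switch to mode 0: updates are processed by all nodes, reads are served locally
(Get-Cluster).DatabaseReadWriteMode = 0

# Confirm the new value
Get-Cluster | fl DatabaseReadWriteMode
```

Setting the value back to 1 the same way returns the cluster to the majority read/write behaviour described above.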