Over my time using SCOM 2012 I have found that there is still some confusion about how the core of a SCOM Management Group, now with the Resource Pools, actually functions. You may have heard of the Perfect Failover Service before.
I have delved around for documentation and found that public information for this is not commonplace.
This post from Rob Kuehfus has good information on the changes in topology between SCOM 2007 and SCOM 2012:
It covers Resource Pools nicely, but it still does not explain, in a way I like, how the Management Group Failover Service works.
So here is my explanation of how this works. I am aware that not every permutation and combination will be covered, as I did not wish to drop down a deep Rabbit Hole with every eventuality covered in minutiae. My hope is that this post will equip you, dear reader, to look at your own environment, understand how failover in the All Management Servers Resource Pool works and run through the permutations yourself to see if you are in “a good place”.
OK, let's get started.
SCOM can be crafted with all of its components installed onto one server. SQL instance, management server, console, SSRS – you get the picture. It will likely have lots of issues but it will work. One thing it will not be however is Fault Tolerant!
En passant, I may be a pedant, but there is a distinction between Fault Tolerance (FT) and High Availability (HA). Here I am talking about a level of FT, not about making your whole SCOM solution HA.
Towards the end of Rob Kuehfus's blog post it states:
“Moving forward the Product Group recommendation will always be to have two management servers in a Management Group at all times. By doing this you will always have High Availability for your management group and a much easier recovery during a disaster.”
Right, so I need to make my single server management group more fault tolerant, so I will follow the advice and add a second management server. This will automatically be included in the All Management Servers Resource Pool (assuming I have not changed any membership settings from the defaults). Now I will have my original server with ALL the SCOM roles and an additional second one with only the management server role installed. OK, so am I there now?
My answer is: No, but it is a lot closer!
I have gone from Bad to Good. But now to go from Good to Better….
To get from Good to Better I really need to separate out more of the SCOM roles. SQL is not (in a production level environment at any rate) recommended to be installed onto a management server. So I should split this off. I now have a management group with two management servers and a separate SQL server. This is Better.
Now, how does splitting these roles out help with the Perfect Failover Service?
To help explain this I am going to enter into one of my Analogies.
In the words of Paul Daniels, “You'll like this...”
The Resource Scenario:
Every hour I like to have coffee and biscuits which I share with my office friends. So I set my schedule and on the hour get coffee and biscuits for them. This is great but the treat stops if I am ill or something – no one gets coffee and biscuits.
(The single management server has gone offline so the management group can’t function).
So to mitigate this I am going to get a friend to help (let's call him Ruben). Now Ruben could just be a passive standby. He does not do any coffee and biscuity stuff when I am around, only if I am unavailable.
(This is the principle behind the clustering of a 2007 RMS)
Ideally what I really want is to spread the load rather than just doing the task myself. Having Ruben sitting around “just in case” is a waste. Let's split the task: I will always do the coffee, Ruben will always do the biscuits.
(The idea behind SCOM 2012, with the Resource Pools. The servers in the resource pools share out the management group tasks)
Back to the coffee and biscuit task! The task will run smoothly enough, but what happens when either Ruben or I am unavailable? We need to have some way to confirm with each other that we will (or will not) be able to do our half of the process.
Let's set up a system where on each hour Ruben will contact me to make sure I am on for the coffee and I will confirm with a reply. Likewise I will contact Ruben to confirm he is on for the biscuits and he will reply. (OK, that bit is a little contrived but run with it). If I don’t get a reply from Ruben I will get the biscuits as well as the coffee. And vice versa.
That has solved the problem, yes? Well no, not really. What if we are both still coming for coffee and biscuits and it is just the communication connection between us that is down? We will each turn up with both coffee and biscuits (in some people's worlds I am aware that this would not actually be a problem, but for the sake of this, it is). The job has been done twice, which could cause an issue.
What we need is another reference point, some mechanism by which we can confirm the situation. So we ask Kate to help. She will act as a third point of reference. If I cannot communicate with Ruben I will go to Kate for confirmation and she will try to reach him too. If she can communicate with him she can let me know, and I will continue to bring coffee knowing that Ruben will bring the biscuits. If Kate can’t get to Ruben then I will know I need to do both tasks and bring coffee and biscuits.
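To make the double delivery concrete, here is a minimal Python sketch of the naive two-person rule. It only models the analogy; the names `two_party_rule`, `run_hour`, `me_up`, `ruben_up` and `link_up` are mine and have nothing to do with SCOM itself.

```python
def two_party_rule(own_task, partner_task, partner_replied):
    """Naive rule with no observer: cover the partner's task if they do not reply."""
    if partner_replied:
        return {own_task}
    return {own_task, partner_task}


def run_hour(me_up, ruben_up, link_up):
    """Return what each available person brings for one hourly round."""
    # A reply only arrives if the other person is up AND the link between us works.
    brought = {}
    if me_up:
        brought["me"] = two_party_rule("coffee", "biscuits", ruben_up and link_up)
    if ruben_up:
        brought["ruben"] = two_party_rule("biscuits", "coffee", me_up and link_up)
    return brought


# Ruben is ill: I cover for him, which is what we wanted.
print(run_hour(me_up=True, ruben_up=False, link_up=True))
# Both of us are fine but the link is down: the whole job is done twice.
print(run_hour(me_up=True, ruben_up=True, link_up=False))
```

The second call is the troublesome one: neither of us can tell a broken link from an absent colleague, so we both do everything.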
Kate is our Observer, if you will.
This does need a little extra in the procedure to make it more like Mary Poppins.
If Ruben is actually in the office but cannot communicate with me he will also try Kate. If he cannot get a reply from either of us, then even though he is actually available we have agreed he will stop his coffee and biscuity questing, as he knows I will be doing it. And vice versa.
(Yes, this process does give us the situation that if Kate is unavailable and Ruben and I cannot “talk” we will both stop our half of the overall task, and we will all be sad because no one gets coffee and biscuits. But that is a potential outcome of this process and that is how it is supposed to work. Maybe we need to make Kate’s role Fault Tolerant too, but another Rabbit Hole beckons at this point so I will stop here.)
There is a “Resource Pool” in place with Ruben and me in it, and we have a third “Observer”, Kate, to ensure quorum.
We now have a failover system of sorts in place.
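For anyone who wants to run the permutations rather than reason them through, the agreed procedure can be sketched as a small Python function. This is my own illustration of the analogy under the rules described above; `decide` and its three flags are hypothetical names, not anything SCOM exposes.

```python
from itertools import product


def decide(reach_partner, reach_observer, observer_reaches_partner):
    """Decide what one participant does this hour under the agreed procedure."""
    if reach_partner:
        return "own task"      # partner confirmed, carry on as normal
    if not reach_observer:
        return "stop"          # isolated: assume the partner is covering for us
    if observer_reaches_partner:
        return "own task"      # partner is fine, only the link between us is down
    return "both tasks"        # the observer agrees that the partner is gone


# Walk every permutation of the three flags to check for surprises.
for flags in product([True, False], repeat=3):
    print(flags, "->", decide(*flags))
```

Note that the third flag only matters when the partner is unreachable and the observer is reachable; in every other case the decision is already made.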
The Perfect Failover Service in Operations Manager
If you have got this far we will now apply that to SCOM.
The tasks that a management group once performed only on the RMS in Operations Manager 2007 are now spread over the servers in the All Management Servers Resource Pool in Operations Manager 2012.
All the servers in this pool communicate back and forth with each other to make sure that they are all available to carry out their assigned tasks. This process means that the management servers need to be on a low-latency connection with each other.
If we only have two management servers we need an observer node so we can get a quorum. In a two management server scenario this is the Operations Manager database.
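Why does a pair need a third voter? A rough sketch of the arithmetic, assuming the pool follows a simple "more than half of the voters" majority rule (that rule is my assumption here, and the helper names `votes_needed` and `has_quorum` are purely illustrative):

```python
def votes_needed(voters):
    """Smallest number of voters that is a strict majority."""
    return voters // 2 + 1


def has_quorum(voters, reachable):
    """True if the reachable voters (counting yourself) form a majority."""
    return reachable >= votes_needed(voters)


# Two management servers, no observer: losing either one loses quorum.
print(votes_needed(2), has_quorum(2, 1))   # 2 False
# Two management servers plus the database as observer: one loss is survivable.
print(votes_needed(3), has_quorum(3, 2))   # 2 True
```

With only two voters a majority means both of them, so any single failure stops the pool. Adding the observer as a third voter lets the surviving pair carry on.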
If a management server loses communication with its partner then it will contact the database to verify whether the other management server is out of contact with this observer node as well. If it finds that to be the case it will assume that its partner management server is down and take over the tasks assigned to it.
If the failed management server is actually dead then there is no real problem. If, however, it is still running but, say, its network connection has been broken, then it cannot connect to either its partner or the observer. It will assume that the other management server has taken over all the tasks and so shut itself down (commit suicide).
The database server can exist on one of the two management servers; this still gives fault tolerance for the SCOM services. We have gone from Bad to Good here.
This is however not a recommended configuration and having multiple roles on one management server will increase the risk of any one application failure taking down this management server and hence SQL as well. (No SQL and the management group will just plain not work). Move SQL off and we go from Good to Better.
Best would be things like clustered SQL, redundant network paths and all that other good stuff to get to a truly Highly Available solution.
In summary, to get a level of Fault Tolerance in a SCOM 2012 management group you need:
Two management servers in the All Management Servers Resource Pool and a separate Database Server as the observer node.
Three members of the system to enable quorum – a bit like in Minority Report.