If you manage an Active Directory infrastructure, you know what a DC is. You breathe them, eat with them, sleep with them… OK, that’s too far, but you get my drift. For anyone who thinks I’m talking about a city, let me fill you in. DC is short for (slang for? anyway, it means) domain controller, and it is the nifty big guy on campus. It holds all of the objects that are members of the domain. Think of it as a corporate census of user accounts, computers, printers, shares (psst, this list can get kinda long) within a particular logical e-city.
With virtualization being the force to be reckoned with that is taking over everywhere, we jumped on the train and put up a hub DC for each of our domains at our resource site. There were many reasons for this, but the main one was to use each domain’s hub DC for authentication of the centralized services, backups of system state and DNS zones for DR (disaster recovery) scenarios, LDAP queries; you name it, we used them for it. We were all too used to the remote DCs mysteriously going offline, being shut off, having an app installed on them and, well, yeah… The hub DCs were our safe haven, our point of survival and a blessing we had not had before. Unfortunately, this past weekend they proved to be a point of great disappointment.
Now I blame no one, mind you. This can happen to anyone at any time. But it did suck, for lack of a more professional word. The MSA that housed all of our hub DCs decided it was its time. That took authentication down, stopped replication between sites (and there are about 20) and just wreaked havoc with email and any client trying to do anything (DNS also resides on the DCs). Two of our root DCs were also in this bunch of bananas. Luckily there are four root DCs, and every domain had at least three DCs to itself. This allowed for a solid suture of everything within a couple of days, although Sites & Services is still a little ugly. Services could resume after a reboot, which let them choose a new DC to bow down to and say ‘please can I have some more’. But it was a well-learned lesson in how to do a massive forest cleanup while still keeping things going.
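For anyone who wants to picture that "choose a new DC" step, here is a minimal Python sketch of the idea: walk a list of candidate DCs and take the first one that answers on the LDAP port (389). To be loud about it, this is not how Windows locates a domain controller and not what our services actually run; the function names and the hostnames are made up for illustration, and a successful TCP connect only shows the port is open, not that the DC is healthy.

```python
import socket


def is_reachable(host, port=389, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Port 389 is the standard LDAP port. An open port does not prove the
    DC is healthy; it only shows that something is listening there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def pick_domain_controller(candidates, probe=is_reachable):
    """Return the first candidate that passes the probe, or None.

    `candidates` is an ordered list of hostnames (hypothetical here).
    `probe` can be swapped out so the logic can be exercised offline.
    """
    for host in candidates:
        if probe(host):
            return host
    return None


if __name__ == "__main__":
    # Hypothetical hostnames: the hub DC first, then the remote-site DCs.
    dcs = ["hub-dc01.example.local", "site-dc01.example.local"]
    print(pick_domain_controller(dcs))
```

Putting the hub DC first in the list and the remote DCs after it mirrors the order we would have preferred that weekend. The `probe` argument is there so the fallback logic can be tried without touching a real network.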
I hate to call this outage a good thing because, do not mistake me, it was bad. But looking at the other storage environments, which also hold critical services, this was the one we could lose with the least amount of impact and a limited window of downtime.
So what’s the moral of this story? Mess happens; just be sure you are always on standby with a broom and a plan of attack.