I’m part of a team of seven DBAs, and each team member takes an on-call rotation for one full week starting on Monday at 7:00am EST. While recapping my on-call rotation for the past week with my team, it dawned on me that it’s been a while since I've blogged, and that this could offer some insight into the world of on-call for those who haven't yet had the pleasure.
The week before I started my rotation, there were a lot of questions running through my mind. What does it mean to be on-call? What qualifies as an emergency vs. standard work requested off hours? How do I guarantee I can meet my SLA if I get stuck? As I expected, those questions were answered in the next 7 days.
Being on-call means that you are the first line of defense. When a problem arises, you get the page and make first contact. It’s then up to you to assess the situation and decide what steps to take. During standard office hours, this included an impromptu conversation with the primary DBA and usually resulted in that person resolving the problem and doing the root cause analysis. This is a privilege that you likely won’t have during off hours. Assuming that any problem can happen twice, I made a point of getting together with the DBA once the problem was resolved to understand what actions were taken. Should it happen again off hours or when they are not available, I already have an idea of how to resolve it.
DBAs quickly discover there is never a lack of work requested. When on-call, it is important to identify which requests require immediate attention and which can wait. It’s also important to understand that you, the DBA, and the end user requesting the work may not agree on the level of importance. My first instinct was to look at every email during off hours so that I could assess everything. After one night of sleeping in 30- and 45-minute intervals, I realized this was not effective. This raised the question: “What is coming to me via email and what is coming through as a page?” No! It’s not 1998 with pagers. For my company, the term page refers to a text message sent to the on-call DBA’s BlackBerry. Pages are sent automatically by our monitoring tool or when someone calls the emergency hotline. Once I had reviewed the alerts and confirmed that the actionable ones were reaching me as pages, I was confident that I didn’t have to put eyes on every email.
I’m sure every DBA is briefed on the uptime SLAs for their environment within their first week of being hired. This is a timeline that every DBA keeps in the back of their head, as it can be the line our jobs live and die by. It can also be the single biggest point of stress for a DBA when something is broken. It’s also important to note that for many environments, the uptime SLA is different during critical office hours and non-critical office hours. For this post, I’m focusing on non-critical hours. This is also where having a defined escalation policy is of the utmost importance. No one is happy to get called at 3:30am because you’re stuck, but I can assure you that it is always preferred over taking no action and dealing with the same problem during critical use hours. While no one wants to point out their own shortcomings, I’ve never heard of a DBA getting terminated for waking up their manager, or even the manager’s manager, in the middle of the night. Unfortunately, I can’t say the same for a DBA who takes no action when action is warranted.
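For anyone who wants to see what a defined escalation policy can look like written down, here is a minimal sketch in Python. Everything in it is hypothetical: the roles, the minute thresholds, and the `who_to_call` function are made up for illustration and are not taken from any real policy or tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EscalationStep:
    role: str           # who gets the call (hypothetical role names)
    after_minutes: int  # minutes stuck before this person is called


# Hypothetical chain; real thresholds depend on your own SLA.
ESCALATION_CHAIN: List[EscalationStep] = [
    EscalationStep("primary DBA for the system", 30),
    EscalationStep("DBA manager", 60),
    EscalationStep("manager's manager", 90),
]


def who_to_call(minutes_stuck: int,
                chain: List[EscalationStep] = ESCALATION_CHAIN) -> Optional[str]:
    """Return the most senior role whose threshold has been reached, or None."""
    due = [step for step in chain if minutes_stuck >= step.after_minutes]
    if not due:
        return None  # still within the time you are expected to work it alone
    return max(due, key=lambda step: step.after_minutes).role


if __name__ == "__main__":
    for minutes in (10, 45, 120):
        print(minutes, "->", who_to_call(minutes))
```

The point of writing it down this way is that at 3:30am the decision is already made: once you have been stuck past the agreed threshold, you call, and nobody has to wonder whether it was warranted.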
So what didn’t I think of? How this would impact my home life. After the second night of being on-call, it was suggested to me that if I didn’t want to be hit in the head with a pillow every time I was paged, then maybe I should find other accommodations. This was something I absolutely overlooked. But I considered how I would feel if I were the one with the pillow and no reason to be woken, and decided that using the guest room during rotations was a very doable sacrifice.
Being on-call for the first time or the 100th time can be stressful, but understanding the expectations and having an action plan make it a bit less daunting.