Using Messaging To Reduce MTTR (ChatOps To Fix Things Faster!)
If you aren’t using Messaging for the teams that diagnose and repair system issues, you seriously need to rethink your methodology. Having worked as a Sales Engineer since late 2010, I have gotten to see a lot of business models that depend on IT to deliver either a service or a product. Over time that Application Delivery Chain has gotten more complex and includes more and more moving parts. It is no longer as simple as a web server connected to an app server that is then connected to a database. Both the complexity and the expectations have grown dramatically. With that growth, communicating with context and sharing your firm’s collective knowledge have become essential.
The largest benefit is that the team can see a threaded conversation that is easier to digest and keep current than an e-mail thread. It also has history that can be scanned for the Postmortem. Another plus is that team members who join late don’t have to be verbally brought up to speed – something that can be distracting on a phone bridge. The initial gains come quickly, but for a high-performing team there are some key tips and practices that can make them even more effective.
The first suggestion I’d make is to have a channel dedicated to Emergency Management. Treat it as a very special place. No socializing, no random chatter – it exists strictly for tackling system issues. I’d also suggest a simple name or abbreviation that is clear and won’t be confused with anything else. So “EM” (Emergency Management) works better than “CC” (Command Center) since the latter could be confused with Carbon Copy. “I’ll CC Jim about the performance problems we are seeing” is not as clear as “I’ll EM Jim . . . ”
For this channel have a schedule of who is in charge. Services like PagerDuty are great for organizing this responsibility. When that person comes on shift they should also update the channel description to include their name and the current status. So it might start as “Jim Smith – All Quiet” but then change to “Jim Smith – investigating performance issues in EMEA.” The process of “passing the baton” needs to be clear and documented. The end of the shift is not a free pass to head for the door. Once the new person has come up to speed, the two commanders agree when enough detail has been digested and it is okay to hand over command. At that time, change the channel description and post a message on the channel saying who is taking over.
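To make the handover rule concrete, here is a minimal Python sketch. It is only an illustration: ChannelState and hand_over are hypothetical names of my own, nothing here talks to PagerDuty or to a real chat service, and the description format simply copies the “Jim Smith – All Quiet” example above.

```python
from dataclasses import dataclass


@dataclass
class ChannelState:
    """Hypothetical in-memory stand-in for the EM channel's description."""
    commander: str
    status: str = "All Quiet"

    def description(self) -> str:
        # Same "name – status" shape as the examples in the text.
        return f"{self.commander} – {self.status}"


def hand_over(state, incoming, outgoing_agrees, incoming_agrees):
    """Pass the baton only when BOTH commanders agree it is okay.

    Returns the new channel state and the announcement to post.
    """
    if not (outgoing_agrees and incoming_agrees):
        raise ValueError("both commanders must agree before handing over")
    # The open issue (status) carries over; only the commander changes.
    new_state = ChannelState(commander=incoming, status=state.status)
    announcement = f"{incoming} is taking over command from {state.commander}."
    return new_state, announcement
```

With state = ChannelState("Jim Smith", "investigating performance issues in EMEA"), calling hand_over(state, "Ana Lopez", True, True) returns a state whose description starts with the new commander’s name, plus the message to post. If either flag is False it raises instead, which is the “no free pass to head for the door” rule.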
When you are dealing with a small team the single channel works fine. But what about when you’ve got a giant organization that might have hundreds of people involved with troubleshooting an issue? In that case the manager of each team is the only one who speaks on the main EM channel. Each team also has a separate channel of its own to discuss ideas and decide what is relevant to share on the main channel. It is also okay for the manager to delegate someone to share an observation or theory on the main channel. In the end this will greatly reduce the noise and allow each team to speak with one clear voice.
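The “one clear voice” rule can be sketched as a small routing check. This is just an illustration under my own assumptions: the Team class, the route_message function and the channel names “em” and “em-<team>” are all hypothetical, and a real chat tool would enforce this through its own permission settings.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Team:
    """Hypothetical record of a team, its manager and any delegates."""
    name: str
    manager: str
    delegates: set[str] = field(default_factory=set)


def route_message(author: str, team: Team) -> str:
    """Return the channel a message from `author` belongs on."""
    # Only the manager, or someone the manager delegated, speaks on main.
    if author == team.manager or author in team.delegates:
        return "em"
    # Everyone else discusses on the team's own channel first.
    return f"em-{team.name}"
```

For a Team("database", manager="maria"), a message from maria routes to the main channel, while a message from anyone else on the team routes to the team channel until maria adds them to delegates.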
There also needs to be extreme discipline around the usage of items like @everyone, @channel and @here. The channel leader should be the only one to use those handles. But the rest of the team should be comfortable getting on and DMing (Direct Messaging) the current commander. “Hey @jimsmith – I’m seeing database latency issues for the EMEA region that are impacting EUE. Should we start an EM?” (EUE = End User Experience.) At that point Jim can review the details and decide when to change the channel description or if an @channel message is merited. Things might start with an @here until enough detail is known. This is especially critical since teams are often distributed around the globe, and an @everyone on a key channel will wake people up.
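If your chat tool lets you hook a bot into the channel, the handle rule is easy to check automatically. Here is a minimal sketch, assuming plain-text messages and a known commander username; broadcast_violations is a hypothetical helper of my own, not part of any chat API.

```python
from __future__ import annotations

import re

# The broadcast handles named in the text above.
BROADCAST_HANDLES = ("@everyone", "@channel", "@here")


def broadcast_violations(author: str, text: str, commander: str) -> list[str]:
    """List the broadcast handles `author` used without being commander."""
    if author == commander:
        return []  # the channel leader may use any of them
    # \b keeps "@here" from matching inside a longer word like "@hereafter".
    return [
        handle
        for handle in BROADCAST_HANDLES
        if re.search(re.escape(handle) + r"\b", text)
    ]
```

A direct mention such as “Hey @jimsmith” passes cleanly, while an @here from anyone other than the commander comes back in the list so a bot could nudge the sender to DM the commander instead.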
Tool integration is also key for leveraging Messaging for Emergency Management. Integrations make it easier to avoid duplicated effort and keep context out front. Commands that link to specific tickets in ticketing systems, display relevant charts or restart systems (for folks with authority) can be very helpful. No more asking “how is the load on system23?” Instead a user can use a command to query the system load from the channel. Everyone sees the number, and it is tracked for the Postmortem. Slack has more than 1,000 prebuilt app integrations.
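As a rough idea of what such a command does behind the scenes, here is a small Python sketch. Everything in it is an assumption for illustration: the “/load” command name and the handle_command function are hypothetical, and the actual lookup is passed in as a function because where the load figure comes from (a monitoring API, an agent on the host) depends entirely on your setup.

```python
from typing import Callable


def handle_command(text: str, get_load: Callable[[str], float]) -> str:
    """Turn a chat command like '/load system23' into a channel reply.

    `get_load` is whatever your monitoring stack provides; it should
    raise KeyError for a host it does not know about.
    """
    parts = text.split()
    if len(parts) != 2 or parts[0] != "/load":
        return "usage: /load <host>"
    host = parts[1]
    try:
        load = get_load(host)
    except KeyError:
        return f"unknown host: {host}"
    # The reply is posted to the channel, so it lands in the history.
    return f"load on {host}: {load:.2f}"
```

For a quick trial you can back it with a plain dictionary, for example handle_command("/load system23", {"system23": 3.5}.__getitem__). Because the reply goes to the channel rather than to one person, the number is visible to everyone and stays in the record for the Postmortem.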
One excellent use for this EM channel is writing up the status details as well as deciding what to publicly update and when. The commander ultimately should be the one to make the call, but the team can reach a consensus on the criticality of the event. This is also true for what to disclose and when to send the “all clear” message. A firm’s status page is the public view into their stability, how quickly they deal with issues and how well they communicate. Transparency is important, but you also don’t want to post anything that your competition can use to spread FUD – Fear, Uncertainty and Doubt.
Another minor tip is to consider pinning key items to the channel when there is an event going on. Anyone joining the channel is expected to review those pinned items before speaking up, unless a Manager has delegated them to get some relevant information to the team. (The assumption is that the Manager is familiar with what is pinned.)
A last item to consider is whether you want to leverage external teams – that is, allow people on the channel who are not actual employees. If you depend on third party systems and components it makes sense to have them there – provided that they’ve signed NDAs and are aware of the rules and processes in place. It greatly reduces context switching and removes the risk of “the telephone game.” Slack has a great webinar designed for admins that can give some insight into the features that are relevant for inviting external people. Of course some of the features are specific to Slack only, but either way it is worth the time to watch.
Having been involved with a multitude of firms with a SaaS offering, I’ve gotten to see firsthand how proper communication with context reduces MTTR and improves job satisfaction. If you’ve got other tips, please do let me know.