A ChatOps primer: automation in support operations
Microsoft Teams, Slack and other real-time communication tools may be commonplace now, but using them well certainly isn’t common practice. Often, they’re provided simply for the cultural benefits of engaging a distributed workforce, or worse, relegated to mere watercooler conversation that eventually becomes a dried-up waterhole. Chat platforms can do more than facilitate conversation. If you’ve heard of DevOps, you already know that automation is part of the development pipeline of continuous integration, testing and deployment. Automation can also be a significant feature of operations, and that’s where ChatOps comes in.
What is ChatOps?
Previously, when I’d heard the term ChatOps, I didn’t think of automation; I thought of a support team coming together in a chat window in response to an incident or to chase down an answer. But that’s swarming, and I’ll cover that in a future post. Rather, ChatOps is where reporting and monitoring alerts join other business-as-usual (BAU) service management work on the chat platform. Some of the automated outputs to chat might include checks on environment configuration, rollbacks, and server builds, as well as cost alerts from dynamic cloud infrastructure. With timely access to this information, along with integrated incident tracking and real-time problem-solving, a cross-functional team can quickly deal with any issues that arise, cutting out the back-and-forth between silos that’s endemic to enterprise.
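To make one of those automated outputs concrete, here is a minimal sketch of how a cost alert might be turned into a chat message. The function name, the message wording and the example figures are all hypothetical. The {"text": ...} shape is the simple JSON body that a Slack incoming webhook accepts; other platforms, such as Teams, expect a different payload. The sketch only builds the message and does not send it.

```python
import json


def build_cost_alert(service: str, spend: float, budget: float) -> dict:
    """Build a chat message payload for a cloud cost alert.

    All names and the message format are illustrative assumptions,
    not part of any particular ChatOps tool.
    """
    if budget <= 0:
        raise ValueError("budget must be greater than zero")

    percent_used = spend / budget * 100
    status = "OVER BUDGET" if spend > budget else "within budget"
    text = (
        f"[cost] {service}: ${spend:,.2f} of ${budget:,.2f} "
        f"({percent_used:.0f}%), {status}"
    )
    # A plain {"text": ...} body is what a Slack incoming webhook accepts.
    return {"text": text}


if __name__ == "__main__":
    # Example figures are made up for illustration.
    print(json.dumps(build_cost_alert("build-servers", 1250.0, 1000.0)))
```

Posting the payload would be a single HTTP POST to whatever webhook URL your chat platform issues, which is left out here so the sketch stays runnable without credentials.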
In the interests of balance, there are downsides, such as those described in this post on the HeavyMelon blog. Where ChatOps becomes the norm and replaces familiarity with a service’s native interface, it can cause delays in the event that chat is unavailable. Or, where the script that performs the automation appears to be working but the service at the other end is not, you may be carrying on in blissful ignorance of downtime in your ecosystem. And of course, chat gets noisy at large scale.
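One way to guard against that blissful ignorance is to have the automation report when it last received a confirmed response from the service, rather than only whether the script itself ran. The sketch below is an assumption about how that could look: the function name, the three status labels and the five-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def service_status(
    last_confirmed: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),
) -> str:
    """Classify a service by how recently it actually responded.

    last_confirmed is the time of the last confirmed response from the
    service itself, not the last time the automation script ran.
    """
    if last_confirmed is None:
        return "unknown"  # the service has never confirmed anything
    if now - last_confirmed > max_age:
        return "stale"    # the script may be running, but the service is silent
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(service_status(now - timedelta(minutes=12), now))
```

A bot that posts "stale" or "unknown" to the channel makes the silence visible, instead of letting a healthy-looking script hide an unhealthy service.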
Like anything else, getting value from ChatOps requires a coordinated effort to automate the low-risk, high-volume operational tasks, while not collapsing under the same kind of alert fatigue we’re used to getting from our inboxes and monitoring platforms.