While most of the articles on the ZOLL Pulse Blog address issues with which our customers grapple every day, I’d like to change gears a bit and talk about an issue that, until recently, impacted our ZOLL Online customers once every month – our scheduled ZOLL Online releases. As longtime users are aware, over the past 18 months, our technical teams have invested significant effort to take our monthly scheduled releases from a one-hour outage window each month to releases with no downtime. We’ve gotten great feedback from the ZOLL Online community; achieving zero downtime was a significant milestone in the evolution of the platform for both our customers and our development teams. With a few zero-downtime releases under our belt, I thought I’d take the opportunity to give you a behind-the-scenes look at how we got here; I promise I won’t get too technical.
A Little History
In the early days of ZOLL Online, our leadership team heard one key piece of customer feedback loud and clear: Release more frequently! One of the great things about developing web applications is the ability to have short development and release cycles. Shorter cycles give us the flexibility to respond more quickly to issues, address defects and deliver new functionality on a timeline that helps customers realize the value of the ZOLL Online products that much faster. Developing and hosting applications on the web also eliminates the need for customers to juggle software upgrades. We handle the upgrades, and everyone gets upgraded at once. To that end, our leadership settled on a monthly interval for product releases.
Releasing software that often offers many benefits to our customers, but it originally came with a cost: technical constraints required that we have an outage window each month to properly deploy updates. Our “early adopters” – customers who bravely stepped up to test our first-generation ZOLL Online products – probably remember the days of four-hour outages each month. Thank goodness those days are far behind us! When I joined the ZOLL family, the technical teams had reduced that outage window to just an hour each month. A significant improvement, to be sure, but still an interval that had a real impact on our customers.
Let’s face it – ZOLL Data isn’t producing social media or gaming software intended for casual use. We produce software that our customers use not only in business-critical applications, but more importantly, in life-critical scenarios. We take that responsibility seriously, and have worked hard to provide a platform that meets customer expectations on stability and availability. While our customers loved the frequency at which they were seeing software updates, they were less thrilled with the logistical overhead they had to bear during our releases. It took only a couple of calls and visits for me to see the extent to which our customers were working around our one-hour outages on release night; users needed to be informed, alternative plans for data transmission needed to be in place and access to stored data was limited. In short, the releases were a pain point we couldn’t ignore.
Baby Steps
“We can’t get to zero downtime all at once. We need to take baby steps.” After embracing the challenge of getting to zero downtime, our technical teams quickly started assessing the options. The one constant was that these solutions would require every team and every product that touched ZOLL Online to change their approach to developing and deploying applications. We decided on an approach that allowed us to chip away at the problem, providing some near-term relief as we worked toward the final goal.
Our first step: reducing downtime to 30 minutes. Right out of the gate, we set a stretch goal for the team: cut downtime to 30 minutes in time for the release six weeks away. Our staff jumped in, testing and refining the deployment process, and identifying several opportunities to streamline the tasks involved. Getting to the next milestone – 15-minute deploys – introduced a new level of complexity. The team had taken process optimization as far as it would go, and the next step would require a more technical solution. The team adopted a “blue-green” approach, a deployment model where you keep some servers operational with the current version of your software, deploy the new software to an alternate set of servers, then throw the switch and point users at the servers with the new software. For this stage, we focused on blue-green deploys for our web and application servers, reserving our precious outage window for database updates. We learned some valuable lessons about how to manage sessions and balance the load across our infrastructure as we made the switch. Next stop: zero downtime.
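For readers who like to see an idea in code, here is a minimal Python sketch of the blue-green switch described above. It is an illustration only, not our production tooling: the `Pool` and `Router` classes, the pool names and the version numbers are all hypothetical, and a real load balancer also has to deal with in-flight sessions.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical group of servers that all run the same release."""
    name: str
    version: str
    healthy: bool = True


@dataclass
class Router:
    """Stand-in for a load balancer that sends all traffic to one pool."""
    live: Pool
    idle: Pool

    def deploy_to_idle(self, version: str) -> None:
        # Install the new release on the pool that receives no user traffic.
        self.idle.version = version

    def switch(self) -> None:
        # Refuse to cut over if the updated pool is not healthy.
        if not self.idle.healthy:
            raise RuntimeError(
                f"{self.idle.name} pool is unhealthy; staying on {self.live.name}"
            )
        # Swap roles: the updated pool goes live, the old one becomes the fallback.
        self.live, self.idle = self.idle, self.live

    def handle_request(self) -> str:
        return f"served by {self.live.name} running {self.live.version}"


router = Router(live=Pool("blue", "1.0"), idle=Pool("green", "1.0"))
router.deploy_to_idle("1.1")       # users are still on blue while green is updated
before = router.handle_request()
router.switch()                    # "throw the switch"
after = router.handle_request()
print(before)
print(after)
```

Because the previous pool is left untouched after the switch, rolling back in this sketch is simply a second call to `switch()`.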
The team now faced its largest hurdle: how to update the database while avoiding downtime and without impacting existing sessions. With the light at the end of the tunnel in sight, all of our ZOLL Online teams incorporated “backwards compatibility scripts” into their development toolbox. This approach allows us to make incremental changes to the database in real time, changes that both meet the requirements of the new software and continue to support the current software. It requires a little more up-front planning and design, but it was quickly integrated into our shared software development methodology.
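As a rough illustration of what a backwards-compatible database change can look like, the Python sketch below uses an in-memory SQLite database. Everything in it is an assumption made for the example: the `devices` table, the `display_name` column and the function names are hypothetical and do not describe the actual ZOLL Online schema or scripts. The point is that the change is purely additive, so the current software keeps working while the new software starts using the new column.

```python
import sqlite3

# Hypothetical schema as the current release expects it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO devices (name) VALUES ('unit-7')")


def migrate_add_display_name(db: sqlite3.Connection) -> None:
    # Additive change only: a new nullable column, so statements issued by
    # the current release (which do not mention it) remain valid.
    db.execute("ALTER TABLE devices ADD COLUMN display_name TEXT")
    # Backfill existing rows so the new release sees a value everywhere.
    db.execute("UPDATE devices SET display_name = name WHERE display_name IS NULL")


def old_release_insert(db: sqlite3.Connection, name: str) -> None:
    # The current release knows nothing about display_name.
    db.execute("INSERT INTO devices (name) VALUES (?)", (name,))


def new_release_labels(db: sqlite3.Connection) -> list:
    # The new release prefers display_name but tolerates rows written
    # by the current release after the migration ran.
    rows = db.execute("SELECT COALESCE(display_name, name) FROM devices ORDER BY id")
    return [row[0] for row in rows]


migrate_add_display_name(conn)        # applied while the current release is live
old_release_insert(conn, "unit-8")    # current release keeps writing
labels = new_release_labels(conn)     # new release reads both kinds of rows
print(labels)
```

Removing or renaming the old column would be a separate, later step, taken only once no running software depends on it.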
The Finish Line
July 2017 marked our first official monthly release with no downtime. I say “official” because we had zero-downtime releases for four months leading up to the July release. The team was confident in our approach, but wanted to thoroughly vet our ideas prior to throwing the switch for real. The feedback from the ZOLL Online community was immediate – zero downtime has been a huge hit! Having reached this milestone, the team is ready for the next big challenge as we evolve the ZOLL Online platform for agile delivery of high-value software to our customers. Kudos to our research and development (R&D) and site reliability engineering (SRE) teams for all their hard work, and look for more improvements – appearing monthly!