Canary Deployment: What It Is and How To Use It

Deploying to production can be risky. Despite all the mitigation strategies we put in place—QA specialists, automated test suites, monitoring and alerts, code reviews, static analysis—systems are getting more complex every day. This makes it more likely that a new feature can ripple into other areas of your app, causing unforeseen bugs and eroding the trust customers have in you.

Taking their cue from the miners of old, developers created the idea of canary deployments: releasing a new feature to just a subset of your users or systems. Rollout.io calls this gradual rollout. If we enable a feature within just part of our system, we can monitor any problems it creates. This lets us keep general customer trust high while freeing us to focus on innovation, delivering excellent new features to our customers.

History in Mining

The term “canary deployment” comes from an old coal mining technique. These mines often contained carbon monoxide and other dangerous gases that could kill the miners. Canaries are sensitive to airborne toxins, so miners would use them as early detectors. The birds would often fall victim to these gases before the gases reached the miners. This helped ensure the miners’ safety—one bird dying or falling ill could save multiple humans’ lives. In the same sense, the first part of our system to which we release a new feature acts as our canary: it detects potential bugs and disruption without affecting every other system running.

OK, But How Do I Make This Magic Happen?

The idea itself is straightforward, but there are a lot of nuances in how we should approach deploying these features. Often, we must know ahead of time that we’ll be canary releasing.

Does the Feature Need It?

Canary deployments have a cost. They add noise to your codebase that slows down development. The feature’s release will need to be maintained over a noticeable period of time, so this eats a bit into your team’s capacity. If you want to put a feature in a canary deployment, you need to be able to justify these costs.

Does this feature touch multiple areas of the application? Is this feature highly visible to the customers? Does it have a large impact on the customer base? Is it a relatively complex feature compared to others in the application? These types of questions can help you determine if canary deployment will be worth it.

It probably won’t be worth it to canary deploy a new field on the customer admin screen. But it might be worth it if you’re adding a major uplift to customer shopping carts.

What Will Be Your Canary?

It’s important to know which parts of your system you can use to partition features. There are two areas that commonly make great canaries: users and instances.

By User

Most applications have some concept of user. And most applications also make it easy to get certain pieces of information about the user, such as age, gender, and geographic location. You can query this information when running a feature to see if you should show it to that user.

You could partition by geographical region, showing only your Chinese customers a new feature. Or you could partition on pure percentage, showing only 5% of users the new feature and seeing if your error counts spike or your responsiveness slows down. Try to choose a partition where trust is high or where the loss of customer trust will have a low impact. Perhaps sales in your Bulgarian market are small enough that a bad release won’t hurt the bottom line too much.
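A percentage-based partition works best when it’s deterministic: the same user should always land in the same bucket, so raising the percentage only ever adds users and never flips anyone back off the feature. Here’s a minimal sketch of that idea; the function and feature names are hypothetical.

```python
import hashlib

def in_canary(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) for a feature.

    Hashing the user ID together with the feature name keeps each
    user's bucket stable while varying buckets across features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# At 100% everyone is in; at 0% no one is.
assert in_canary("user-42", "new-cart", 100)
assert not in_canary("user-42", "new-cart", 0)
```

Because the bucket is derived from a hash rather than a random roll per request, a user who sees the feature today won’t mysteriously lose it tomorrow, which keeps the experience consistent while you scale the percentage up.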

Another idea is to create an early adopters program, letting people opt into new features. Doing this ensures that customers expect some level of disruption and will be more willing to overlook problems. Video game companies have been doing this for years.

By Instance

Separating by users is an easy way to start canary deployment. But if your system is large enough, you can consider using your application and service instances as canaries. If you have multiple instances of your application, you can configure a subset of them to have the new feature. This can be especially useful if you have multiple regional data centers. However, this is often less flexible than partitioning by user.

A good partition is a sliding scale or a set of discrete values. You want to avoid partitions that are only on/off so that you can better correlate impacts as you scale up the feature in your system.

What Infrastructure Do I Need?

If you want to implement the ability to canary deploy in your system, there are a lot of options. The system needs to be able to partition the feature in some capacity, based on what you know will be your canary. You also want to ensure you can change this partition at runtime. This can be homegrown, meaning you can just slap in a database table and a class to take in your user context. You can use your load balancers to route traffic based on regional or user headers in the requests. Or you can save some development time and purchase tooling that makes it easy to set up canaries.
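The homegrown route can be small to start. As a sketch (class and feature names are illustrative), the “database table and a class” amounts to a store of rules keyed by feature name, where each rule is a predicate over the user context; because rules are data, they can be changed at runtime without a redeploy.

```python
class FeatureFlags:
    """Tiny homegrown flag store. In a real system the rules would
    live in a database table so they can be changed at runtime."""

    def __init__(self):
        self._rules = {}  # feature name -> predicate over user context

    def set_rule(self, feature, predicate):
        self._rules[feature] = predicate

    def is_enabled(self, feature, user):
        rule = self._rules.get(feature)
        return rule(user) if rule else False  # default: feature off

flags = FeatureFlags()
# Canary rule: only users in the Chinese region see the new cart.
flags.set_rule("new-cart", lambda user: user.get("region") == "CN")

assert flags.is_enabled("new-cart", {"region": "CN"})
assert not flags.is_enabled("new-cart", {"region": "US"})
```

Defaulting unknown features to “off” is a deliberate choice: a missing or misspelled rule then fails safe rather than exposing a half-finished feature to everyone.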

How Do I Know if Something Goes Wrong?

Canary deployments will only be useful to you if you can track their impact on your system. You’ll want to have some level of monitoring or analytics in place in your application. These analytics must correlate to how you’re partitioning your features. For example, if you’re partitioning by users in a region, you should be able to see traffic volume and latency by each region. Some useful analytics are latency, internal error count, traffic volume, memory usage, and CPU usage.
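The key point is that every metric sample needs to carry the partition it came from, or you can’t tell the canary apart from the baseline. A toy sketch of region-tagged latency tracking (names are hypothetical, and a real system would use a metrics library with tags or labels):

```python
from collections import defaultdict

class RegionMetrics:
    """Toy metrics store that tags each latency sample with the
    region it came from, so a canary region can be compared
    against the rest of the fleet."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, region, latency_ms):
        self._samples[region].append(latency_ms)

    def avg_latency(self, region):
        samples = self._samples[region]
        return sum(samples) / len(samples) if samples else 0.0

metrics = RegionMetrics()
for ms in (110, 120, 130):
    metrics.record("cn", ms)  # canary region
for ms in (80, 90, 100):
    metrics.record("us", ms)  # baseline

# The canary region is clearly slower than the baseline -- the
# kind of signal that tells you to pause or roll back the rollout.
assert metrics.avg_latency("cn") > metrics.avg_latency("us")
```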

Fortunately, it’s easy these days to wire in analytics and monitoring. Google Analytics lets you slap JavaScript on a page header. You can grab open source options for no upfront cost, or you can get great capabilities through purchasing commercial products. If you’re on a cloud platform, many of these metrics are built in. It’s usually not worth building it yourself, but you may want to tweak an existing package according to your needs.

When Do I Release the Feature to Everyone?

As I mentioned earlier, canary-deployed features need to be maintained over time. Eventually, we want to remove the partition completely and let everyone use the feature.

Have a roadmap of how you will release the feature ahead of time, even if it’s a generic roadmap you use for all your canary-deployed features. This will give the team a big and visible end date in sight. They won’t be caught off guard when disruptions happen in the system and they have to triage them. Eventually, you can kill the canary and remove the noise from your code or configuration.

The roadmap should have a timeline of not only when it will end but also how you plan to scale the feature. For example, maybe your roadmap is that you’re going to roll out a new product line first to China, then to India, then to all of Asia, and then to the world. Most importantly, it should have a rollback plan that your team members clearly understand and can handle.
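Such a roadmap can even live as data next to the feature flag, so widening the canary is a config change rather than a deploy. A sketch with made-up dates and region codes:

```python
from datetime import date

# Hypothetical staged-rollout plan: each stage widens the canary.
ROADMAP = [
    {"start": date(2019, 3, 1),  "regions": ["CN"]},
    {"start": date(2019, 3, 15), "regions": ["CN", "IN"]},
    {"start": date(2019, 4, 1),  "regions": ["CN", "IN", "SG", "JP"]},
    {"start": date(2019, 5, 1),  "regions": ["*"]},  # everyone
]

def active_regions(today):
    """Return the regions enabled as of `today`; empty before launch."""
    current = []
    for stage in ROADMAP:
        if stage["start"] <= today:
            current = stage["regions"]
    return current

assert active_regions(date(2019, 3, 20)) == ["CN", "IN"]
assert active_regions(date(2019, 2, 1)) == []
```

Rolling back is then just deleting or postponing a stage, which is exactly the kind of simple, well-understood rollback plan the team needs.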

Focus on Achieving Excellence, Not Avoiding Risk

If you implement canary deployments for your features, you’ll feel a significant mental weight lift off of you. You’ll find yourself thinking less about production outages and disruptions. Instead, you’ll think more about how to push that next exciting feature to your customers.

This post was originally published on the rollout.io blog by Mark Henke.

Blue/Green Deployment: What It Is and How it Reduces Your Risk

Having to take your application offline for updates can be a pain. You can mitigate this with consistent, scheduled downtime, but it’s not something that brings delight to customers. What’s more, some sites can lose thousands of dollars per minute they’re down. There are many reasons an app can go down, but deploying or upgrading your application shouldn’t be one of them! We have a tool we can use to ensure that our deployments create no downtime: blue/green deployments.

What It Is: A Few Wonderful Colors

I don’t know who originally decided to use the colors “blue” and “green.” But the gist is this: you have an instance of your application, a green version, in production. You also have a router that routes your user traffic to the app. You need to get a new version, the blue version, out so that your users can get some new goodies. But you want to ensure that if a user goes to look at one of your screens or presses a button, they can still do so—even while you’re deploying blue. If you can quietly deploy blue while green handles all traffic in the meantime, then you can eventually swap out the connections so that everyone stops going to green and goes to blue instead. So you follow these steps:

You start with the green version in production.

Deploy the blue version to the same environment as the green version. Run any smoke tests, as necessary.

Connect router traffic to the new (blue) version alongside the old (green) version.

Disconnect router traffic from the old version.

Decommission the old version from the environment, if necessary.
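The steps above boil down to one atomic operation: the router swaps which backend it points at. A toy in-process sketch (the backends and routes are stand-ins; in practice the swap happens in a load balancer or reverse proxy):

```python
class Router:
    """Toy router: holds a reference to the live backend and swaps
    it atomically, mirroring the blue/green steps above."""

    def __init__(self, live):
        self.live = live

    def handle(self, request):
        return self.live(request)

    def cut_over(self, new_backend):
        self.live = new_backend  # single reference swap: no downtime

green = lambda req: f"green handled {req}"
blue = lambda req: f"blue handled {req}"

router = Router(live=green)
assert router.handle("/cart") == "green handled /cart"

# Deploy blue alongside green and smoke test it directly...
assert blue("/health") == "blue handled /health"
# ...then swap traffic over. Green stays up, ready for rollback.
router.cut_over(blue)
assert router.handle("/cart") == "blue handled /cart"
```

Note that green is never torn down during the swap, which is what makes both the zero-downtime cutover and the instant rollback possible.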

Seems pretty straightforward when it’s broken down, but the devil is in the details. Every platform and language has different ways of approaching blue/green deployments, but most have the capability to do it.

How it Reduces Risk

As noted above, when we blue/green our deployments, we can deploy without creating application downtime. And when we deploy without downtime, we eliminate or reduce quite a few risks that directly affect our business and our development team.

Here’s what you can enjoy when you eliminate your risk with blue/green deployment:

No Surprise Errors

Put yourself in the mind of your users for a moment. Let’s say you want to order an item. You fill out your billing address and your street address, then you go on to enter your payment information. You agree to the shipping fee and uncheck the “receive spam mail” box. Finally, you press that blessed submit button only to get an error message: “Your order could not be submitted at this time. Please try again later.” And all that precious time filling out your information is lost. If you’re lucky, you get a specific error message like “Application is offline for maintenance.” Most of the time, you get the error message equivalent of ¯\_(ツ)_/¯.

When we blue/green our deployments, we never need this maintenance screen. From your user’s viewpoint, there’s a list of items upon one click, and upon the next click, they see that new menu you added. This will keep furious emails about error screens from flooding your inbox. Let’s give users surprise features, not surprise errors!

Go Ahead, Test in Production!

Often, it’s healthy to ensure your pre-production environments are as close to your production environment as possible. As much as we would like prod to be the same as our QA or staging environment, we don’t always get our way. This can cause subtle bugs in our configurations to seep through. With blue/green, it’s no problem; you can test the app while it’s disconnected from main traffic. Your team can even load test it, if you so desire.

You Accommodate Customers Who Shop at Weird Hours

There’s a constant struggle to find that sweet, sweet deployment window—that time when no one cares. This is tricky, as our customer bases are more global than ever. There’s no longer an internationally good time to do a deployment, especially if you work in an enterprise where the business needs to be running around the clock. If you have a customer-facing application, this means a customer who can’t place an order may place it on some other website. You just lost a sale. If you have an internal application, this means an employee can’t do their job and is actively losing your company money.

By blue/green deploying, you ensure your traffic never stops. That customer can place their order just fine without disruption, giving you that sale. That employee overseas can continue to do their job without interruption, saving your company money. The longer your current deploy downtime is, the more valuable this is.

You Get to Sleep Instead of Deploy

We just talked about customers who shop at weird hours. But what about you or your developers—the ones forced to put out fires at those weird hours? Finding the right deployment window can lead to devs doing deployments over the weekend. In extreme cases, it has to be done at four AM or some other absurd hour. I remember being on call and having to wake up because the weekend deployment failed. I was groggy and frustrated, and the furthest thing from my mind was ensuring all the quality checks were in place when I made any fixes. This encourages human error, especially in more manual deployments.

If we apply blue/green, we can deploy whenever we want. More specifically, we can deploy during office hours, when we can bring our full team to bear on any issues that occur. We can deploy while the coffee in our veins is in full effect, giving us that mistake-avoiding brainpower.

Easy Recovery

As much as we like to think we’ve done everything right, sometimes we introduce bugs. We can either spend inordinate amounts of money ensuring deployments will always be defect-free—and still occasionally find them—or we can ensure that when we inevitably find them, we recover quickly and easily. By blue/greening our deployments, we have our older, more stable version of our application waiting to come back online at a moment’s notice, evading the pain of trying to roll back a deployment. This is especially valuable if your deployments have many manual steps.

There Are No Silver Bullets

As great as it is using blue/green deployments to remove downtime, it doesn’t come free. There’s often a cost to supporting two versions of your application at the same time. This ends up significantly affecting your data model but can affect other areas as well. I would only suggest applying blue/green when some of the above risks may apply to your application. If you find that none of them do, go ahead and enjoy those simple swap n’ drop deploys.

The Death of Downtime

Blue/green can be an extremely powerful way to reduce pain and risk in your application lifecycle. If you’re the manager of a development team, I encourage you to assess if any of these risks apply or may apply to your application. If you’re a team member for an application but not the main decision maker, you can use these as selling points to convince your manager to institute zero-downtime deployments. Go ahead. Add a couple steps to your pipeline, and watch as your fears and pain melt away.

This post was originally published on the rollout.io blog by Mark Henke.
