Everything is Continuous

# The pain

As the Ops and Dev teams grew, being asked by the Ops team to test infrastructural changes became a recurring task for the Dev teams.

This didn’t use to be a problem: the teams were smaller, we had fewer features and the infrastructure was much simpler. But these requests became a blocker for both teams, as both have to work on the same task at the same time.

Even though we have Unit Tests, Integration Tests, Acceptance Tests and Smoke Tests, and they run every time there is a new build (Continuous Integration), we still feel the pain of infrastructural changes because these tests are not run for every change made by the Ops team.

Ops working with every single Dev team made the problem even more serious. Ops became a shared resource, and Dev teams ended up blocked by infrastructural work for other Dev teams, even when that work was unrelated to their own.

Even when both teams are in sync to test a change, the Ops side doesn’t have the knowledge to judge how much testing is enough. Development moves fast, and Ops would need to keep up with how every feature works. This is a nice problem to have, but it needs to be solved.

Metaphorical wall between devs and ops..


# Our idea

In a perfect scenario we would have enough people on the Ops side to work side by side with Dev and stay up to date with feature changes. That doesn’t happen, and we can’t expect Ops to know every single feature and application. So the second option is to automate as much as possible, so that problems are caught as quickly as they are on the Dev side, without Ops doing extra work.

Our solution was to add infrastructure to the continuous delivery system. Because we don’t need to know when these changes are made, the tests aren’t triggered by them; instead we use an active wait (polling) and run the necessary tests on a fixed schedule.

TL;DR: We made Smoke tests more concise, running continuously at fixed intervals and easily accessible to everyone.
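To make the "active wait" idea concrete, here is a minimal sketch of a runner that executes a set of Smoke checks at a fixed interval and hands the results to a publisher. It is not our actual runner: the names `run_once`, `run_forever`, `publish` and the check names in the usage note are all made up for illustration.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical type: a smoke check returns True when its happy path works.
SmokeCheck = Callable[[], bool]


def run_once(checks: Dict[str, SmokeCheck]) -> Dict[str, bool]:
    """Run every check once; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def run_forever(
    checks: Dict[str, SmokeCheck],
    interval_seconds: float,
    publish: Callable[[Dict[str, bool]], None],
    sleep: Callable[[float], None] = time.sleep,
    max_rounds: Optional[int] = None,
) -> None:
    """Poll at a fixed period, publishing the results of each round.

    max_rounds exists only so the loop can be stopped in tests;
    leave it as None to keep running indefinitely.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        publish(run_once(checks))
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
```

A call such as `run_forever({"stream_track": check_stream}, 60, print)` would run the (hypothetical) `check_stream` once a minute and print each round's results. Nothing here needs to know when Ops made a change, which is the point of polling.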


Make it visible as fast as possible

# How we got there

A common way to test infrastructure is to run Smoke Tests that make sure the end-to-end path for a feature works. In my current team, that would be something like downloading a music album or streaming a track.

When we reviewed the current tests with the aim of simplifying and slimming them down, we found some issues:

Large number of Smoke tests - This might be a smell that the application is trying to do too much, and it also increases the time needed to assess the state of the application.

Ambiguity between Smoke tests and Acceptance tests - A Smoke test exercises a feature (or edge case) through its happy path. In my opinion, a Smoke test is a specific type of Acceptance test. Smoke tests shouldn’t cover parameter validation, missing data or other cases that require a very specific set-up. Even worse, needing a specific set-up for Production Smoke tests means relying on static test data, as we don’t want dynamic test set-ups in that environment.

Slowness to test - If the Smoke tests are already down to a minimum but running the suite still feels slower than it should, it may be a performance issue. Is the performance being tracked?

Smoke tests live in the Continuous Integration software - This means that Ops has to use yet another tool. They should be able to get information about the state of an application without needing to go to the team.

Configuration is within the application - It is not obvious that, when an application’s stack changes, we can Smoke test that variation without a new build.

Status of Ops changes is not visible - Until an environment’s Acceptance and Smoke tests are run, there is no information about external changes. This actually makes us lose the advantages of Continuous Integration and moves us closer to the concept of a Nightly Build…


Measure it before changing it

Some simple steps we are taking to try to fix this:

Review all Acceptance and Smoke tests to make sure they don’t overlap and that only essential Smoke tests are being performed.

Host information should be available in the runner so new variations can be tested within minutes. This also makes a case for keeping configuration outside a build.
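As a rough illustration of keeping host information outside the build, the sketch below reads a host list from a JSON document and builds the URLs a Smoke test would hit. The config shape, the function names and the `example.test` hosts are assumptions made for this example, not our real configuration.

```python
import json
from typing import Dict, List
from urllib.parse import urlunsplit


def load_targets(config_text: str) -> Dict[str, List[str]]:
    """Parse a JSON document mapping environment names to host lists.

    Assumed shape: {"staging": ["host-a", "host-b:8080"], ...}
    """
    targets = json.loads(config_text)
    if not isinstance(targets, dict):
        raise ValueError("expected a JSON object of environment -> hosts")
    return targets


def smoke_urls(
    targets: Dict[str, List[str]],
    environment: str,
    path: str,
    scheme: str = "https",
) -> List[str]:
    """Build one URL per host so the same test can run against each."""
    return [
        urlunsplit((scheme, host, path, "", ""))
        for host in targets[environment]
    ]
```

With something like this, trying a new stack variation means editing the host list the runner reads, not producing a new build of the application.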

Continuous run of Smoke tests in every environment including Production - It is useful to know that a specific path has broken while the change is still being worked on, and it gives all teams information about the stack as a whole.

Visibility of Smoke test results - Moving closer to the concept of a status dashboard, all teams should be able to check the Smoke test results through an API and on a screen.
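A status API like this can be very small. The sketch below uses only Python's standard library to serve the latest Smoke test results as JSON. The `LATEST` store, the JSON shape, the port and the choice of HTTP 503 for "something is failing" are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Dict, Tuple

# Hypothetical in-memory store: environment -> {check name -> passed?}.
# Whatever runs the Smoke tests is expected to keep it up to date.
LATEST: Dict[str, Dict[str, bool]] = {}


def render_status(latest: Dict[str, Dict[str, bool]]) -> Tuple[int, str]:
    """Return (HTTP status code, JSON body) for the current results.

    No results at all is reported as unhealthy, so an idle runner
    does not look like a green stack.
    """
    healthy = bool(latest) and all(
        all(checks.values()) for checks in latest.values()
    )
    body = json.dumps(
        {"healthy": healthy, "environments": latest}, sort_keys=True
    )
    return (200 if healthy else 503), body


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = render_status(LATEST)
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Blocks forever, answering every GET with the latest results."""
    ThreadingHTTPServer(("", port), StatusHandler).serve_forever()
```

The same JSON can feed a wall screen and be queried by Ops scripts, so nobody has to open the Continuous Integration tool or ask the team.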

Increasing the number of metrics being kept - Measuring API responses and the number of errors will provide more information about response times, error baselines and more.
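To show what measuring could look like alongside the Smoke tests, here is a small sketch that times each check and keeps an error count, from which a median response time and an error rate can be derived. The class name and the summary fields are invented for this example; a real setup would more likely ship these numbers to an existing metrics system.

```python
import time
from statistics import median
from typing import Callable, Dict, List, Optional


class CheckMetrics:
    """Collects durations and failures for one smoke check."""

    def __init__(self) -> None:
        self.durations: List[float] = []
        self.errors = 0

    def record(
        self,
        check: Callable[[], bool],
        clock: Callable[[], float] = time.perf_counter,
    ) -> bool:
        """Run the check once, timing it; exceptions count as errors."""
        start = clock()
        try:
            ok = bool(check())
        except Exception:
            ok = False
        self.durations.append(clock() - start)
        if not ok:
            self.errors += 1
        return ok

    def summary(self) -> Dict[str, Optional[float]]:
        """Return run count, error rate and median duration so far."""
        runs = len(self.durations)
        return {
            "runs": runs,
            "error_rate": self.errors / runs if runs else 0.0,
            "median_seconds": median(self.durations) if runs else None,
        }
```

Once a baseline exists, a sudden jump in `median_seconds` or `error_rate` after an infrastructural change is a signal in its own right, even when every check still passes.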


David Bowie knows about changes

What we have so far:

Ops changes can be more aggressive: Ops don’t have to wait, and we’re not afraid to fail, because failures show up quickly and very early in the infrastructure.

None of this is new, but it was good to carry concepts from the Dev team across the whole stack. It has allowed us to move faster, bringing up more servers and adding new tools without hurting current infrastructural dependencies.