This post is part of my Test-Driven Infrastructure series, which covers TDI from start to finish. If you like this one, check out the others in the series.

It is my belief that any company that generates revenue from software should use Test-Driven Infrastructure principles when building and maintaining its software infrastructure.

I think this is an important issue, and not only because it makes life easier for sys admins and DevOps engineers. It is something the CTO should embrace as standard operating procedure for the entire department so that the business runs more smoothly.

It doesn’t matter what configuration management tools you use. If it’s not tested, it’s technical debt that will bite you sooner or later.

Before I get into my reasons why, first a definition:

What is TDI?

Test-Driven Infrastructure (TDI) is the process of writing tests for the infrastructure and application configuration, and only then building out the infrastructure to satisfy those test assertions.

The components of TDI are:

  1. Write a failing test.
  2. Satisfy the assertion to make the test pass.
  3. Run the tests quickly and frequently in a non-production environment.
  4. Satisfy the failing test with an automated system such as Ansible, Chef, Puppet, Salt, or bash scripts.

If it looks familiar, it’s simply the infrastructure counterpart to Test-Driven Development (TDD).

The popular buzzword for it is “infrastructure as code” or, more generally, DevOps.

Example

Below is an example test written in Serverspec that asserts that nginx should be installed:

describe package('nginx') do
  it { should be_installed }
end

The next step is to follow up the failed test with some configuration management code, such as this Ansible task that ensures nginx is installed:

- name: install nginx
  apt: pkg=nginx state=present

Thus, every change to your infrastructure has the configuration code plus a corresponding test.

As long as this can be run locally, for example against a Vagrant box, you have an automated process that runs in a pre-production environment.

That is TDI.

Now, why should you build your infrastructure this way?

It’s Good For DevOps Engineers

As a DevOps engineer, SRE, or sys admin, if you haven’t built infrastructure using TDI, you don’t know what you’re missing.

The benefits of TDI include:

  • Stress-free changes from developers
  • Refactoring configuration management scripts is a breeze
  • Simple configuration for security hardening
  • Simple configuration for performance tuning

Stress-free Development Changes

TDI means you’ll spend much less time testing changes due to developer requests.

Consider this request from a developer to a DevOps engineer:

“Hey, this new feature I built generates PDFs in Python. I’m using reportlab, which depends on the server having build-essential, libfreetype6-dev, python-dev and python-imaging installed. What should I do?”

The DevOps engineer’s response could be:

“Send a pull request in the DevOps repo for that application. Update the config to install those dependencies along with a test for each one. Then, if all of the project tests pass, I’ll accept it and deploy it when that feature rolls out.”

Then, assuming the project tests run on a CI server, all you have to do is review the change and click ‘Accept.’
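To make "a test for each one" concrete, here is a minimal sketch in plain Ruby rather than Serverspec itself. It assumes a Debian-style host where the installed-package list comes from `dpkg-query -W -f='${Package} ${Status}\n'`; the helper name `missing_packages` and the sample output are hypothetical, made up for illustration.

```ruby
# Packages the developer asked for in the request above.
REQUIRED_PACKAGES = %w[build-essential libfreetype6-dev python-dev python-imaging].freeze

# Hypothetical helper: given text in the assumed format
# "<package> <want> <error> <status>" (one package per line), return the
# required packages that are not reported as fully installed.
def missing_packages(required, dpkg_output)
  installed = dpkg_output.each_line.filter_map do |line|
    name, *status = line.split
    name if status == %w[install ok installed]
  end
  required - installed
end

# Made-up sample output: two packages installed, one removed, one never installed.
sample = <<~OUT
  build-essential install ok installed
  libfreetype6-dev install ok installed
  python-dev deinstall ok config-files
OUT

puts missing_packages(REQUIRED_PACKAGES, sample).inspect
# prints ["python-dev", "python-imaging"]
```

In a Serverspec suite the same idea would normally be written as one `describe package(...)` block per dependency, like the nginx example earlier in this post. The point is that the pull request fails loudly until every dependency is actually present.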

Refactoring

Imagine this scenario: You realize that the way you initially wrote your Ansible roles isn’t ideal, and you’d like to break two Ansible roles into three. This could affect many different server types.

Without tests around your configuration management code, you’ll probably have to manually rebuild and test deploy the affected servers. That could take a while.

With tests, in a matter of seconds you would have assurance that nothing broke.

Security Hardening Configuration

Security configuration for a server is complex, and small changes to a system can have large downstream effects. In fact, Security Misconfiguration is #5 on the OWASP Top 10 Security Issues.

When you refactor and modify your configuration management scripts over time, how can you be 100% sure that your changes didn’t re-introduce a vulnerability or break a compliance rule?

For example, what happens if, in the process of rewriting a configuration script, you accidentally change the permissions on a log file from 600 to 644? Without tests, this could go unnoticed unless every commit gets a detailed code review. Even if a manual review does eventually catch the change, the feedback loop might be anywhere from several minutes to days.

But with tests, you’ll catch that easily and quickly.
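Here is a rough sketch of what such a check boils down to, written in plain Ruby with only the standard library and assuming a POSIX filesystem. The helper names `file_mode` and `assert_mode!` are my own for illustration, and a temporary file stands in for the real log file.

```ruby
require 'tempfile'

# Return a file's permission bits as an octal string such as "600".
def file_mode(path)
  format('%o', File.stat(path).mode & 0o777)
end

# Hypothetical assertion helper: raise if the mode differs from what we expect.
def assert_mode!(path, expected)
  actual = file_mode(path)
  return true if actual == expected

  raise "#{path}: expected mode #{expected}, got #{actual}"
end

# Demonstrate against a throwaway file standing in for the log file.
Tempfile.create('app.log') do |log|
  File.chmod(0o600, log.path)
  assert_mode!(log.path, '600') # passes

  File.chmod(0o644, log.path)   # the accidental change
  begin
    assert_mode!(log.path, '600')
  rescue RuntimeError => e
    puts "caught regression: #{e.message}"
  end
end
```

Serverspec expresses the same assertion with its `be_mode` matcher on a `file` resource, so in practice this is a one-line addition to your spec file.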

Performance Tuning Configuration

If your application gains traction, you’ll probably tackle performance tuning sooner or later. Not just small code optimizations, but the entire stack will be tested and improved.

When you refactor and modify your configuration management scripts over time, how can you be 100% sure that your changes didn’t create a performance regression?

For example, what happens if, in the process of rewriting a configuration script, you accidentally delete the nginx config line that sets worker_connections to 2,000, causing it to fall back to the default of 512? That’s a very subtle change that could be found in a manual code review, but it also might slip through.

An automated test suite would catch it easily though.
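As a sketch of how small that test can be, the plain-Ruby snippet below pulls the directive out of config text and returns nil when it is missing. The function name and the two sample configs are hypothetical; a real suite would read the deployed nginx.conf from the server under test.

```ruby
# Extract the worker_connections value from nginx config text.
# Returns an Integer, or nil when the directive is missing.
def worker_connections(config_text)
  match = config_text.match(/^\s*worker_connections\s+(\d+)\s*;/)
  match && match[1].to_i
end

# Made-up config with the tuned value in place.
tuned = <<~CONF
  events {
      worker_connections 2000;
  }
CONF

# Made-up config after the accidental deletion.
regressed = <<~CONF
  events {
  }
CONF

puts worker_connections(tuned).inspect     # prints 2000
puts worker_connections(regressed).inspect # prints nil
```

A test that expects 2000 fails as soon as the line disappears, long before nginx quietly falls back to its default under production load. In Serverspec, matching on the file's content (`its(:content)`) gets you the same result.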

It’s Good For the Business

If you’re a CTO or an executive in your business, why should you be excited about TDI? Don’t tests just add time to project timelines?

Respond to Change Quickly and Confidently

The fact is, things change, and you need to have confidence that you can respond quickly without impacting your customers or clients.

Consider these types of events:

  • A version of software that you use hits its end-of-life
  • A critical security vuln is found in your open source stack
  • A quick and dirty low-traffic application you built suddenly needs to scale up
  • You sign a big client that requires you to support 10X the traffic you currently handle

If you’ve been in the IT world for a while, you know that these aren’t black swan events. They happen frequently enough that you can’t afford to scramble to accommodate each one as it occurs.

The only constant in the software world is change. If your business relies on making money from a software product, you can guarantee that any system you build is not static.

Put yourself in the shoes of the DevOps engineer in the section above. If a time-sensitive infrastructure project came down the pike, wouldn’t you want them to have a reliable, robust test suite to rely on? Or are you okay with them scrambling through a lot of manual tasks and validations?

Give your developers, sys admins, and DevOps engineers time to develop a TDI process.

Make More $$$

TDI is one component in an overall DevOps process. DevOps as a whole has been shown to have a tremendous benefit for organizations.

The 2014 State of DevOps Report found this result:

Firms with high-performing IT organizations were twice as likely to exceed their profitability, market share and productivity goals.

Everyone can understand those benefits.

While the study didn’t investigate the TDI process in detail, automated testing overall was shown to correlate with higher-performing teams and more satisfied engineers.

Debunking Excuses Against TDI

Here are some excuses I’ve heard about why not to implement TDI and my rebuttal to each.

“My configuration code is declarative, so I don’t need tests”
Unfortunately, I’ve seen this excuse given by the creator of Ansible in a discussion thread. As much as I love Ansible, I strongly disagree with that statement.

This type of thinking misunderstands what we’re testing. You’re not testing that your CM tool works; you’re testing that you’re using it correctly for your own use case. There’s a big difference.

There are several cases where testing will catch bugs and changes that your declarative system will just hum along without complaining:

  • Accidentally changing file permissions
  • Silent failures
  • Typos for file names
  • Missing entire roles or modules

For example, here is a typo I just made myself, that I only caught because of a test:

- name: disable default site
  become: yes
  file:
    path: /etc/apache2/sites-enabled/000-defaultconf
    state: absent
  notify: restart apache

It was discovered by this failing test:

Failures:

  1) php_web File "/etc/apache2/sites-enabled/000-default.conf" should not be symlink
     On host `54.237.93.222'
     Failure/Error: it { should_not be_symlink }
       expected `File "/etc/apache2/sites-enabled/000-default.conf".symlink?` to return false, got true
       sudo -p 'Password: ' /bin/sh -c test\ -L\ /etc/apache2/sites-enabled/000-default.conf

The bug is that I mistyped the conf file name as defaultconf instead of default.conf. Here is the fix:

-    path: /etc/apache2/sites-enabled/000-defaultconf
+    path: /etc/apache2/sites-enabled/000-default.conf

And now my test passes again:

File "/etc/apache2/sites-enabled/000-default.conf"
  should not be symlink

That’s one example of an infrastructure bug caught pre-production thanks to TDI.

“Okay, I’ll test, but just one test to see if the service is running”
The problem with that is that a single check that the service is running can’t tell you whether it’s configured to handle high traffic or hardened against security vulnerabilities.

What are you going to do, run a battery of load tests to confirm that postgres and TCP settings are set properly? Or add a couple lines to your spec file that assert it’s configured?

“I don’t know how to get started” / “It’s too time consuming”
I do agree that it can be overwhelming at first.

However, if you look at the entire lifecycle, you’ll save a lot of time with tests. With testing in place, I can now build an entirely new server and have it deployed within an hour.

To help others out, I’ve created an example DevOps git repo based on Ansible and Serverspec: https://github.com/kday/ansible-from-playbook-to-production

Conclusion

There are many great reasons to implement a TDI process in your company.

It benefits everyone in the company and leads to less overall technical debt.

Test-Driven Infrastructure Resources

  1. Test-Driven Infrastructure with Chef
  2. Monitor-Driven Development Using Ansible
  3. Testing infrastructure with serverspec
  4. Test Driven Development with Ansible
  5. Google Groups: Test Driven Development with Ansible?
  6. Test-Driven Infrastructure (TDI)
  7. Test-Driven Infrastructure Development - PuppetConf 2013
  8. Collection of Test Driven Infrastructure Links
  9. GitHub: Ansible from playbook to production

General DevOps Resources

  1. What is DevOps
  2. 2014 State of DevOps Report (PDF)