Validating ARM Templates with ARM What-if Operations


The ARM template deployment What-if API was first announced publicly at Ignite last year, and it recently became available in public preview. I’ve been keeping a close eye on this feature ever since I first heard about it, well before Ignite, when it was still under NDA.

In a nutshell, compared to the existing ARM template validation capability (Test-AzResourceGroupDeployment, Test-AzDeployment, etc.), the what-if API additionally gives you an overview of which resources would be created, deleted, or modified if your template were deployed. Although the what-if API is still in preview and still has many rough edges, I think now is the time to get my hands dirty and start playing with it. I’ll share my experience and opinions in this post.

What-If API vs Existing ARM Template Validation

Prior to the What-If API, we’ve always had a way to validate our ARM templates. The latest Azure PowerShell module ships commands such as Test-AzResourceGroupDeployment and Test-AzDeployment for validating different types of ARM templates.

You can also use the REST API or the Azure CLI.

This validation does nothing but validate the ARM template syntax, which does not guarantee a successful deployment. The What-If API actually validates your template against your deployment target, so it also detects errors specific to your environment.

For example, it detects that the deployment target has exceeded the maximum deployment quota of 800 (thanks to my friend Alex Verkinderen for providing this screenshot):

image

It will also detect syntax errors in the template:

image

What-If API vs Terraform Plan

If you have used Terraform before, the what-if API will probably remind you of terraform plan, which does exactly what what-if does. The significant difference between the ARM what-if API and terraform plan is that the what-if API does not use state files. This is a huge advantage compared to Terraform.

I’ve never liked Terraform. I’ve used it (because I had to) for AWS and GCP. In my opinion, Terraform and its state files are such a PITA.

When you deploy a Terraform template, it stores the state of the deployment (tfstate) in a folder you specify. When you deploy an updated version of the template, or when you use the “terraform destroy” command to delete previously deployed resources, Terraform compares the request against the state file and figures out what exactly needs to be performed. This method only works well in a small project (where you are the only developer) and in a fairly static environment. In reality, this Terraform feature introduces many problems:

1. Additional admin effort to create and maintain a shared location for Terraform state when working with a team of developers.

If you are part of a dev team, you will need to set up a shared location for storing the Terraform state files for each template, and make sure all your team members are using the shared folder.

2. Modifying resources outside of the Terraform template is like the end of the world.

If you quickly modify your resources using PowerShell, the portal, or any other method, the Terraform state (which is cached offline) will be out of sync with the real state of the resources. Now, think about what happens if you also use Azure Policy to deploy additional resources after your initial Terraform deployments have completed (e.g. policies that use the deployIfNotExists effect to deploy VM extensions or diagnostic settings).

I’ve seen people address this issue by not giving administrators / developers any administrative rights in any environment, to enforce the use of Terraform (in this case, Terraform Enterprise, see my next point). When you are debugging an issue, or when shit hits the fan and you are fixing a production issue, sometimes you just want to quickly make a small change manually. Sorry, I’m afraid you can’t do this if you are operating in a Terraform shop. Once you’ve deployed a pattern with Terraform, you are locked in to it.

3. What about the Enterprise Solution for Terraform?

To tackle the problem of sharing state files between multiple developers, Hashicorp has an enterprise version of Terraform called Terraform Enterprise (TFE). TFE offers a web portal and a set of REST APIs that allow people to upload Terraform templates to a TFE workspace, and it will deploy the template and maintain the state for you. It becomes a central deployment server in your environment. Although it fixes one problem, it introduces other risks: it becomes a single point of failure. If it fails, you won’t be able to deploy / update your cloud environments. Since it also stores templates, secrets, etc., it becomes a great target for attacks, especially in a multi-cloud environment. In a large enterprise, your security team will surely hate this platform.

So, what about the ARM What-if API? It DOES NOT use any kind of offline state files. When you use it to evaluate your ARM template, it compares what’s defined in your template with what’s actually deployed in your Azure environment. This is a HUGE advantage. I’ve heard so many people bragging about how useful the terraform plan output is, and now Microsoft has introduced the same capability in the native ARM APIs, without the complexity of having to maintain offline state files.
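To make the difference concrete, here is a toy Python sketch of the comparison idea. It is not how either Terraform or the What-If API is implemented; the `plan` function, the resource names and the dictionaries are all made up for illustration. It only shows why diffing a template against a cached snapshot can give a different answer from diffing it against the live environment once someone has changed a resource out of band.

```python
def plan(desired, actual):
    """Classify each resource by comparing the desired config with one view of reality."""
    changes = {}
    for name in sorted(set(desired) | set(actual)):
        if name not in actual:
            changes[name] = "Create"
        elif name not in desired:
            # Simplification: real tools may ignore unmanaged resources
            # instead of deleting them, depending on the deployment mode.
            changes[name] = "Delete"
        elif desired[name] != actual[name]:
            changes[name] = "Modify"
        else:
            changes[name] = "NoChange"
    return changes


# Hypothetical template: one VNet and one NSG.
template = {
    "vnet1": {"addressSpace": "10.0.0.0/16"},
    "nsg1": {"rules": ["allow-https"]},
}

# Snapshot recorded at the last deployment (what an offline state file would hold).
cached_state = {
    "vnet1": {"addressSpace": "10.0.0.0/16"},
    "nsg1": {"rules": ["allow-https"]},
}

# Live environment: someone added a rule in the portal after the snapshot was taken.
live_state = {
    "vnet1": {"addressSpace": "10.0.0.0/16"},
    "nsg1": {"rules": ["allow-https", "allow-rdp"]},
}

plan_from_cache = plan(template, cached_state)
plan_from_live = plan(template, live_state)

print("against cached snapshot:", plan_from_cache)
print("against live environment:", plan_from_live)
```

In this toy model the comparison against the stale snapshot reports no changes at all, while the comparison against the live environment flags `nsg1` as `Modify`, which is the manual change that would otherwise go unnoticed.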

What-If in Action

To test it out, I’ve created a simple ARM template that creates the following resources:

  • a VNet with several subnets
  • a Network Security Group (NSG) for the subnets

Because the template had never been deployed, when I validated it using the What-If API via PowerShell (Get-AzResourceGroupDeploymentWhatIfResult), the result showed me the resources that would be created (in green):

image

I then deployed the template, and the deployment completed successfully.

After the initial successful deployment, I updated the template to add a bastion host to the VNet. The following resources were added to the template:

  • Bastion Host
  • Subnet for the Bastion Host
  • Public IP for the Bastion Host
  • NSG for the Bastion Host subnet

I validated the updated template again, and it showed me what would be deleted, created, modified, and left unchanged (represented in different colours and symbols):

image

If you want to programmatically manipulate the changes, you can access them as properties of the result object:

image

image
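If you would rather process the result outside PowerShell, the same idea applies to a what-if result expressed as JSON. The Python sketch below assumes a result shaped as a list of changes, each with a `resourceId` and a `changeType`. The sample data is invented and trimmed down, and the resource IDs are shortened placeholders, so check the field names against the actual response you get before relying on it.

```python
from collections import defaultdict

# Invented sample, loosely modelled on the bastion host scenario above.
what_if_result = {
    "changes": [
        {"resourceId": "rg-demo/bastionHosts/bastion1", "changeType": "Create"},
        {"resourceId": "rg-demo/publicIPAddresses/bastion-pip", "changeType": "Create"},
        {"resourceId": "rg-demo/networkSecurityGroups/bastion-nsg", "changeType": "Create"},
        {"resourceId": "rg-demo/virtualNetworks/vnet1", "changeType": "Modify"},
        {"resourceId": "rg-demo/networkSecurityGroups/nsg1", "changeType": "NoChange"},
    ]
}


def group_by_change_type(result):
    """Return a dict mapping each change type to the list of affected resource IDs."""
    grouped = defaultdict(list)
    for change in result.get("changes", []):
        grouped[change["changeType"]].append(change["resourceId"])
    return dict(grouped)


summary = group_by_change_type(what_if_result)

# Print a short report: one heading per change type, then the resources under it.
for change_type, resource_ids in sorted(summary.items()):
    print(f"{change_type}: {len(resource_ids)}")
    for resource_id in resource_ids:
        print(f"  {resource_id}")
```

Grouping by change type first makes it easy to build checks on top, such as listing only the resources that would be modified.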

Reducing Noise

Since the What-if API is still in preview, it’s not perfect. It is only as good as each Azure Resource Provider’s implementation. You will see some false positives depending on the resource types. For example, in my previous screenshot of the result when adding the bastion host, it showed that all the subnets would be deleted. Obviously, this is not the case. What-If leverages a purpose-built noise reduction service in Azure to calculate the result when you call it. The product group is still working on reducing the false positives. Why noise occurs is explained here: https://github.com/Azure/arm-template-whatif#why-does-noise-occur.

I strongly encourage you to try it out, and to file issues at its GitHub repo if you experience false positives or other bugs: https://aka.ms/whatifissues
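Until the noise is reduced at the source, one possible workaround is to filter the result yourself before acting on it. The Python sketch below is a hypothetical illustration, not the noise reduction service mentioned above. It assumes each change carries a list of changed property paths (called `delta` here), and it drops `Modify` changes whose paths all appear on a hand-maintained list of known-noisy paths. The field names, the sample data and the `KNOWN_NOISY_PATHS` list are all assumptions.

```python
# Hypothetical list of property paths you have decided to treat as noise.
KNOWN_NOISY_PATHS = {"etag", "properties.provisioningState"}


def is_noise(change):
    """A change is noise only if it is a Modify whose changed paths are all known-noisy."""
    if change["changeType"] != "Modify":
        return False
    paths = {item["path"] for item in change.get("delta", [])}
    return bool(paths) and paths <= KNOWN_NOISY_PATHS


def drop_noise(changes):
    """Keep every change that is not classified as noise."""
    return [change for change in changes if not is_noise(change)]


# Invented sample data.
changes = [
    {"resourceId": "vnet1", "changeType": "Modify",
     "delta": [{"path": "etag"}]},
    {"resourceId": "nsg1", "changeType": "Modify",
     "delta": [{"path": "properties.securityRules"}, {"path": "etag"}]},
    {"resourceId": "subnet-a", "changeType": "Delete"},
]

remaining = drop_noise(changes)
print([change["resourceId"] for change in remaining])
```

The filter is deliberately conservative: it only hides modifications, never deletes, so a false-positive delete like the subnets above would still show up and need a manual look.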

What’s Next?

As this API gradually becomes more mature, I will definitely try to incorporate it into my CI/CD pipelines. Once I’ve got something worth showing, I will post my solution here.

If you want to learn more about what-if, check out this YouTube video from Alex Frankel, who’s the PM responsible for this API at Microsoft.
