Terraform best practices

Lock versions! Modules, providers, and Terraform itself. The devil you know is better than the devil you don't. Use either Docker docker run -v $(pwd):/<dir name>/ -w /<dir name>/ --rm -it hashicorp/terraform:light <terraform command> or tfenv. Use a shell alias or bash wrapper to make life easier for the version you standardize on!

Don't forget about .gitignore

terraform.tfstate.backup
terraform.tfvars
.terraform/

Organize based on what is within your control. You cannot control reorgs, for example.

Tag every resource liberally. Owner/contact, service role, environment, biz unit, managed_by=Terraform, etc. No tags, no merge!

Use remote state. Seems obvious, basically a requirement when operating as a team. Secured and backed up! Use state lock to prevent collisions. Don't commit state to git...

Use a data source with TF remote state to share things like IDs across state files. Example:

# stateful service outputs a SQL DB id
output "sqldb_id" {
  value       = azurerm_sql_database.example.id
  description = "Database ID"
}

# stateless service references that ID for configuration
data "terraform_remote_state" "dev_sqldb" {
  backend = "azurerm"

  config = {
    storage_account_name = "terraformsa"
    container_name       = "terraformstate"
    key                  = "development/sqldb.tfstate"
  }
}

# now using in a file
data.terraform_remote_state.dev_sqldb.outputs.sqldb_id

Look at using community modules instead of rolling your own first. Don't reinvent the wheel, look for the most downloaded/used modules. That said, you need to know what is getting created and it shouldn't be a mystery when using community modules. If filling in 50+ parameters (e.g. EKS module), including for features you may not use, it's ok to take a step back and consider building your own module to make it more maintainable and understandable.

When you do create your own modules, strive to keep them clean and stateless.

Apply coding best practices. KISS, DRY, linting, formatting, pull requests with reviewers before merging and CI.

Do:

Use git for managing TF files
KISS & DRY yet Human & Clean (humans read and maintain Terraform)
Functional programming & idempotency

Variables and locals

Variables, their purpose is for settings for module configuration. Make sure to set a default or validation! If used just once, set a default. If used per region/etc, use tfvars file instead.

variable "az_names" {
    type = list(string)
    default = ["use-west-2a"]
}

tfvars:
aws_account = "xxx"
aws_region = "us-west-2"
dc = "foo"
zones = ["a"]
subnet = "bar"

Locals can use expressions and resource arguments so they are dynamic. Used in modules. Do things like enforcement of tag names or bucket names. Consider the local a constant to be relied on. Be aware of cloud provider character limits.

locals{
    bucket_name = "${local.company_name}-protected-${local.suffix}-${var.dc}"
}

Do:

keep all variables in one file
use tfvars where necessary
utilize env vars
use for locals de-hardcoding one-time names, DRY
keep things generic
leave logic to modules, KISS
avoid using locals outside of modules
pass outputs to modules using `data`` sources

Don't:

use multiple locals blocks if not totally necessary, hard to track and maintain
decentralize vars/tfvars, keep them in one file for easier maintenance
ignore env vars

Ugly:

Hardcoded variable values where not necessary

Execution

Do:

use remote execution
use TF apply with a plan file
setup a TF timeout so you aren't waiting forever when something doesn't go right

Don't:

execute locally, keep state where it's supposed to be

Ugly:

execute locally and Ctrl+C to fubar your state

Practices enforcement

Tag resources
Linting (tflint) & formatting (terraform fmt)
Clean and reusable
Documentation
More complex things like notify a FinOps team if expected spend is over a threshold

Do:

pre-merge linters, formatters, and logical checks
GitHub actions/CI pipeline checks
Slack bot drift reporter
main branch = actual environment

Security issues and non-compliant configs if you don't have any practices enforcement.

Structure

Huge state files have a big blast radius, you want small state files. Best:

modules/
  my-cool-service/
    global/
    regional/
root-modules/
  my-cool-service/
    development/
      global/
      us-west-2/
    staging/
      global/
      us-west-2/
    production/
      alpha/
        global/
        us-east-2/
        us-west-2/
      beta/
        global/
        us-east-2/
        us-west-2/
      gamma/
        global/
        us-east-2/
        us-west-2/
  an-equally-cool-service/
    development/
      global/
      us-west-2/
    staging/
      global/
      us-west-2/
    production/
      global/
      us-west-2/

Good:

root-modules/
  my-cool-service/
    development/
        us-west-2/
    staging/
        us-west-2/
    production/
        us-west-2/
  an-equally-cool-service/
    development/
        us-west-2/
    staging/
        us-west-2/
    production/
        us-west-2/

Workspaces

Use-cases:

different environments (??? double check this, think Hashi says not to use for this)
different regions
Same config, different customers/tenants
test a set of changes before modifying prod

Use a TF wrapper to make sure you're using the right workspace!

Tools

Atlantis

Handles TF lock
Manages state and history
Ensures main branch is the actual environment state.

Others

TFLint
TFsec
Infracost
Inframap
ValidIaC
Terratest
Terratag
Terragrunt
Checkov
KICS
Super-Linter

References

https://www.youtube.com/watch?v=U_CsR5ibrOI
https://blog.gitguardian.com/infrastructure-as-code-security-best-practices-cheat-sheet-included/
https://spacelift.io/blog/terraform-state
https://substrate.tools/blog/terraform-best-practices-for-reliability-at-any-scale