How I Fixed a Bug in Terraform's AWS Provider (and Got It Merged Same Day)
In 2022, I was debugging a Terraform apply failure that made no sense. Subnets were failing to create, but only sometimes, and only when IPv6 was involved. Turns out I’d found a race condition in the Terraform AWS Provider itself — a project maintained by HashiCorp with millions of users.
Here’s how I tracked it down, fixed it, and got PR #24685 merged the same day I opened it.
The Symptom
We were rolling out IPv6 across our VPCs. Terraform would create the subnet, but the IPv6 CIDR association would intermittently fail with:
Error: error waiting for EC2 Subnet (subnet-xxx) IPv6 CIDR Block
Retrying the apply usually worked. Classic race condition smell.
Digging In
I started in the AWS Console — the subnet existed, the IPv6 CIDR was associated. So why was Terraform failing?
I pulled the provider source code. The subnet creation flow:
1. Create the subnet
2. Wait for it to become available
3. If IPv6 is configured, associate the CIDR block
4. Wait for the association to complete
Step 4 was the problem. The provider was polling for the association status, but it was checking the wrong attribute. The IPv6 CIDR association has its own lifecycle — associating → associated — but the code was checking the subnet’s top-level state instead.
When AWS was fast, it worked. When the association took an extra second, Terraform saw the subnet as available but the CIDR association was still associating, and it declared failure.
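To make the distinction concrete, here is a minimal sketch of the two signals. The `Subnet` and `IPv6Association` types and both helper functions are hypothetical stand-ins I made up for illustration; they are not the AWS SDK's types or the provider's code.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the EC2 API shapes. The real SDK
// types carry more fields and use pointers.
type IPv6Association struct {
	AssociationID string
	State         string // e.g. "associating", "associated"
}

type Subnet struct {
	ID           string
	State        string // top-level state, e.g. "pending", "available"
	Associations []IPv6Association
}

// subnetState reads the subnet's top-level state. It is the wrong signal
// when the question is whether a CIDR association has finished.
func subnetState(s Subnet) string {
	return s.State
}

// associationState finds one association by ID and returns its own state.
// The boolean is false when no association with that ID exists.
func associationState(s Subnet, associationID string) (string, bool) {
	for _, a := range s.Associations {
		if a.AssociationID == associationID {
			return a.State, true
		}
	}
	return "", false
}

func main() {
	// Snapshot taken in the slow window: subnet ready, association not yet.
	s := Subnet{
		ID:    "subnet-xxx",
		State: "available",
		Associations: []IPv6Association{
			{AssociationID: "assoc-1", State: "associating"},
		},
	}

	state, found := associationState(s, "assoc-1")
	fmt.Println(subnetState(s)) // available
	fmt.Println(state, found)   // associating true
}
```

The two values disagree in exactly the window where the failure showed up, which is why the choice of field matters.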
The Fix
The core issue was in the waiter logic. Simplified, the before/after looked like this:
    // BEFORE: checked subnet state, missed association state
    func waitSubnetIPv6CIDRBlockAssociationCreated(conn *ec2.EC2, id string) error {
        stateConf := &resource.StateChangeConf{
            Pending: []string{ec2.SubnetCidrBlockStateCodeAssociating},
            Target:  []string{ec2.SubnetCidrBlockStateCodeAssociated},
            // But statusSubnetIPv6CIDRBlockAssociation was reading the
            // wrong field: it checked the subnet state, not the
            // association state
            Refresh: statusSubnetIPv6CIDRBlockAssociation(conn, id),
            // (timeout and delay settings omitted)
        }
        _, err := stateConf.WaitForState()
        return err
    }

    // AFTER: explicitly check the IPv6 CIDR block association state
    func statusSubnetIPv6CIDRBlockAssociationState(conn *ec2.EC2, id string) resource.StateRefreshFunc {
        return func() (interface{}, string, error) {
            // Find the specific association by ID and return ITS state,
            // not the subnet's top-level state
            // (lookup elided in this simplified version)
        }
    }
The actual diff was ~50 lines. I forked the provider, wrote the fix, tested it against our infrastructure, and verified the race condition was gone across 50+ applies.
Opening the PR
I filed issue #24681 explaining the bug with reproduction steps, then immediately opened PR #24685 with the fix.
Key things I did to get a fast review:
- Linked the issue with a clear explanation of the race condition
- Included the root cause analysis in the PR description, not just “fixes #24681”
- Kept the diff minimal — only changed what was necessary
- Added tests covering the specific scenario
- Tested against real infrastructure and included the results
A HashiCorp maintainer reviewed it within hours. Merged the same day. Released in provider v4.14.0.
The Ripple Effect
After the fix shipped, I found out this wasn’t just our problem. HashiCorp’s own CI was hitting the same race condition. Other users had been working around it with depends_on hacks or retry scripts.
One fix, committed to the right place, solved it for everyone.
What I Learned
Read the source. When Terraform does something unexpected, the provider code is right there on GitHub. It’s Go, it’s readable, and the answer is usually in the waitForState functions.
File the issue before the PR. Maintainers want context. The issue gives them the “why,” the PR gives them the “how.” Together, they make review fast.
Keep it small. My PR changed ~50 lines. A reviewer can hold that in their head. A 500-line PR sits in the queue.
Contribute upstream. Every workaround you write in your own code is technical debt. If the bug is in the tool, fix the tool. The maintainers want your help — they can’t reproduce every edge case.
The Terraform AWS Provider has 10,000+ contributors. Getting a PR merged feels like a big deal, but the process is surprisingly approachable. Find a bug, understand it, fix it, explain it. That’s it.