The Mystery of the Disappearing Lambda Triggers: A Terraform State Drift Story
We’ve all been there: you start the day feeling good about wrapping up last sprint’s work, coffee in hand, ready to tackle something new. Then comes the Slack notification that changes everything.
“Hey, can you look into why a couple of Lambdas didn’t run two weeks ago?”
The mystery was puzzling: scheduled Lambda functions had missed their executions on a specific day, then resumed normal operation the next day without intervention. These weren’t critical path functions (those would have triggered immediate alerts), but they were important enough that we needed to understand what happened and prevent it from recurring.
Working with a teammate, we discovered that two functions had failed to execute—one triggered by EventBridge rules, another by S3 events. Different triggers, same problem, same day. That pointed to something systemic.
As we investigated the affected Lambda in the console, everything appeared normal. CloudWatch logs showed nothing unusual. CloudTrail events for that timeframe revealed no anomalies. Then, while checking back on the Lambda configuration, we noticed something odd: the trigger had disappeared right before our eyes.
A quick check confirmed our suspicion—a production deployment had just completed. The automated Terraform deployment from our CI/CD pipeline had somehow disconnected the triggers. The infrastructure was all there—EventBridge rules, targets, even the Lambda function itself—but they were no longer wired together.
The Investigation
Digging deeper, we found something interesting:
- ✅ EventBridge rules existed and were enabled
- ✅ EventBridge targets pointed to the correct Lambda ARN
- ❌ Lambda triggers showed nothing in the console
- ❌ Manual invocations from EventBridge failed with “not authorized” errors
This pointed to one thing: missing Lambda permissions.
In AWS, having an EventBridge rule with a target isn’t enough. You also need an aws_lambda_permission resource that explicitly grants EventBridge the right to invoke your Lambda. These are two separate resources:
# The EventBridge rule and target
resource "aws_cloudwatch_event_rule" "my_rule" {
name = "my-scheduled-rule"
schedule_expression = "cron(0 12 * * ? *)"
}
resource "aws_cloudwatch_event_target" "my_target" {
rule = aws_cloudwatch_event_rule.my_rule.name
arn = aws_lambda_function.my_function.arn
}
# The permission (THIS was missing!)
resource "aws_lambda_permission" "allow_eventbridge" {
statement_id = "AllowExecutionFromEventBridge"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.my_function.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.my_rule.arn
}
When we checked the Lambda’s resource policy:
aws lambda get-policy \
--function-name my_scheduled_lambda \
--region us-west-2
The permissions were indeed missing. But the question remained: Why?
The AWS Architecture Problem
[Diagram: the architecture when everything is configured correctly vs. when permissions are missing.]
The Root Cause
After investigation and research, we discovered this is a known behavior in AWS/Terraform interactions:
- When AWS deletes a Lambda function, it automatically deletes all associated permissions
- This is AWS behavior, not a Terraform bug
- When Terraform recreates a Lambda function (during a deployment), AWS silently removes the permissions
- Terraform’s state file still thinks the permissions exist (stale state)
- The permissions don’t show up in the plan when the Lambda is replaced
- They only appear as needing recreation on the next Terraform run
This creates a dangerous window where your infrastructure looks fine in Terraform, but is actually broken in AWS.
But Wait—Why Were Lambdas Being Replaced?
Here’s where the plot thickens. In normal circumstances, Lambda deployments with just code changes should update in-place, not replace the function. So we had to ask: what actually triggered the replacement that caused this mess?
Digging through deployment history and CloudTrail logs revealed a fascinating story of not one, but two separate architectural migrations that both caused Lambda replacements.
The First Migration: Zip Packages to Container Images
The initial replacement happened during a major infrastructure migration. We were moving from:
Before:
- Package type: Zip
- Runtime: nodejs22.x
- Dependencies: EFS mount (/mnt/efs/node_modules)
- Deployment: Upload zip files to S3
After:
- Package type: Image
- Dependencies: Baked into container images
- Deployment: Push to ECR, reference image URI
- Node modules: Included in container layer (/opt/nodejs/node_modules)
This is a breaking change for AWS Lambda. You cannot change the package type from Zip to Image in place; AWS requires a complete delete and recreate. When Terraform executed this migration, the sequence was:
- Terraform deleted the old Zip-based Lambda functions
- AWS automatically deleted all associated permissions
- Terraform created new Image-based Lambda functions
- Terraform did not recreate the permissions in the same run
This was understandable—it was a one-time architectural migration. The real surprise came next.
The Second Issue: The Module Refactoring Migration
After successfully migrating to container images, we noticed something in the deployment logs. During the first few deployments after the migration, Lambdas continued to be replaced instead of updated in-place.
The CloudTrail logs showed a clear pattern:
- October 15: Initial Zip→Image migration (intentional replacements)
- October 22: Lambda replaced (version 105)
- October 30: Lambda replaced again (version 106)
Comparing Terraform plans between these runs revealed what was happening. The plans showed resources switching between:
- module.my_lambda.aws_lambda_function.this_image[0] (will be destroyed)
+ module.my_lambda.aws_lambda_function.this (will be created)
The root cause: During the Zip→Image migration, our Lambda module was also being refactored from a dual-resource pattern to a cleaner single-resource design. The old module had:
# Old module structure (used during migration)
resource "aws_lambda_function" "this" {
count = var.package_type == "Zip" ? 1 : 0
# Zip package configuration
}
resource "aws_lambda_function" "this_image" {
count = var.package_type == "Image" ? 1 : 0
# Container image configuration
}
The new module uses a cleaner single resource:
# New module structure (current)
resource "aws_lambda_function" "this" {
package_type = "Image"
image_uri = var.image_uri
# Single resource handles everything
}
The migration sequence was:
- First deployment (Oct 15): Migrated from Zip to Image, creating this_image[0] resources
- Module update: Refactored to use a single this resource
- Subsequent deployments (Oct 22, 30): Terraform migrating state from this_image[0] to this
During these transitional deployments:
- Terraform saw this_image[0] in the state file
- Current code defined this
- Terraform destroyed this_image[0] and created this
- AWS deleted all permissions when the Lambda was deleted
- Permissions weren't recreated in the same run
After a few deployment cycles, Terraform completed the state migration automatically, and subsequent plans showed the correct behavior: in-place updates.
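Terraform 1.1 and later support a moved block, which records that a resource address was renamed so the existing object is kept instead of being destroyed and recreated. The sketch below shows how the rename from this_image[0] to this could have been declared inside the module. It is an assumption about what might have helped, not what the deployments described here used, and it only avoids a replacement if the rest of the resource configuration is unchanged.

```hcl
# modules/lambda/main.tf (hypothetical addition, not part of the original module)
# Tells Terraform that the object tracked as this_image[0] is now tracked as this,
# so the plan reports a move instead of a destroy followed by a create.
moved {
  from = aws_lambda_function.this_image[0]
  to   = aws_lambda_function.this
}
```

With a block like this in place, the plan for an otherwise unchanged function should report that the resource has moved, and the Lambda's permissions are never touched because the function itself is never deleted.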
The Lesson
What appeared to be a simple permission drift issue was actually a perfect storm of changes:
- AWS behavior: Auto-deleting permissions when Lambdas are deleted
- Planned migration: Zip→Image package type requiring replacement
- Module refactoring: State migration from dual-resource to single-resource pattern
- Transitional period: Multiple deployments needed to fully reconcile state
The replace_triggered_by solution described in the sections below not only fixed the immediate permission drift but also protected us during the state migration period. Even more importantly, it will prevent this issue if we ever need to replace Lambdas again for any reason (VPC changes, etc.).
The bigger lesson: major infrastructure migrations rarely happen in isolation. When multiple changes compound, having defensive infrastructure patterns like replace_triggered_by becomes critical.
Why Terraform Doesn’t Catch This
The issue is that aws_lambda_permission resources don’t automatically detect when the Lambda function they reference has been recreated. Even though the permission references the Lambda, Terraform treats them as independent resources during the replacement operation.
Here's what happens during a deployment that replaces the Lambda:
Terraform Plan:
- aws_lambda_function.this will be replaced
(package_type changed)
Terraform Apply:
1. Delete old Lambda → AWS deletes permissions automatically
2. Create new Lambda → Success!
3. Terraform checks permissions... state says they exist ✓
Next Terraform Run:
- aws_lambda_permission.eventbridge[0] will be created
(drift detected - permission missing in AWS)
Notice the one-run delay? That’s the problem.
The State Drift Timeline
[Sequence diagram: how the state drift occurs.]
The Solution: replace_triggered_by
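Terraform 1.2 introduced a lifecycle meta-argument called replace_triggered_by specifically for handling this class of problems. It forces Terraform to replace a resource whenever a referenced resource is replaced. A reference to a whole resource, as used here, also fires when that resource is updated in place, so the permissions are recreated on ordinary code deployments as well.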
Here’s how we implemented it:
For Permissions Inside the Lambda Module
resource "aws_lambda_permission" "eventbridge_execution_allowed" {
count = var.eventbridge_execution_allowed_arns != null ? length(var.eventbridge_execution_allowed_arns) : 0
statement_id = "AllowExecutionFromEventBridge_${count.index}"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.this.function_name
principal = "events.amazonaws.com"
source_arn = var.eventbridge_execution_allowed_arns[count.index]
# This forces Terraform to recreate permissions when Lambda changes
lifecycle {
replace_triggered_by = [
aws_lambda_function.this
]
}
}
The Module Boundary Problem
However, we hit a snag with S3-triggered Lambdas. We had some permissions defined outside the Lambda module:
# In the main terraform config (outside the module)
resource "aws_lambda_permission" "s3_invoke" {
action = "lambda:InvokeFunction"
function_name = module.my_lambda.lambda_name
principal = "s3.amazonaws.com"
source_arn = aws_s3_bucket.my_bucket.arn
lifecycle {
replace_triggered_by = [
module.my_lambda.aws_lambda_function.this # ❌ This doesn't work!
]
}
}
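The problem: replace_triggered_by can only reference managed resources (or their attributes) declared in the same module. It cannot reach a resource inside a child module, and it cannot reference module outputs. Even if you expose the Lambda resource as an output, you can't use it in replace_triggered_by across module boundaries.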
Visualizing the Module Boundary Issue
[Diagram: a permission defined outside the module cannot reference the Lambda resource inside it.]
The Final Solution: Move Permissions Into the Module
We solved this by moving all permission creation into the Lambda module:
Step 1: Add an optional parameter for S3 buckets
# modules/lambda/variable.tf
variable "s3_execution_allowed_arns" {
description = "List of S3 bucket ARNs allowed to invoke this Lambda"
type = list(string)
default = null
}
Step 2: Create S3 permissions inside the module
# modules/lambda/main.tf
resource "aws_lambda_permission" "s3_execution_allowed" {
count = var.s3_execution_allowed_arns != null ? length(var.s3_execution_allowed_arns) : 0
statement_id = "AllowExecutionFromS3Bucket_${count.index}"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.this.function_name
principal = "s3.amazonaws.com"
source_arn = var.s3_execution_allowed_arns[count.index]
lifecycle {
replace_triggered_by = [
aws_lambda_function.this
]
}
}
Step 3: Update Lambda configurations to use the module parameter
module "my_lambda" {
source = "./modules/lambda"
# ... other parameters ...
s3_execution_allowed_arns = [aws_s3_bucket.my_bucket.arn]
}
# Remove the external aws_lambda_permission resource entirely
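For S3 triggers, the bucket notification has to be created after the invoke permission exists, because S3 validates the destination when the notification is configured. With the permission now living inside the module, one way to express that ordering is a depends_on on the whole module. This is only a sketch: the lambda_arn output and the event type are assumptions for illustration, not taken from the original configuration.

```hcl
resource "aws_s3_bucket_notification" "my_bucket_notification" {
  bucket = aws_s3_bucket.my_bucket.id

  lambda_function {
    lambda_function_arn = module.my_lambda.lambda_arn # hypothetical module output
    events              = ["s3:ObjectCreated:*"]      # assumed event type
  }

  # Wait for everything in the module, including the S3 invoke permission,
  # before asking S3 to validate the Lambda destination.
  depends_on = [module.my_lambda]
}
```

Depending on the whole module is coarser than depending on a single permission, but it avoids exporting the permission resource just to order the notification.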
Before and After Architecture
[Diagram: the architecture before and after moving permissions into the Lambda module.]
The Results
After implementing this solution:
- All Lambda permissions are now co-located with the Lambda resource
- replace_triggered_by works correctly since everything is in the same module
- No more state drift - permissions are recreated in the same run as the Lambda
- Consistent pattern - EventBridge, API Gateway, and S3 permissions all managed the same way
When we run terraform plan and the Lambda needs replacement, we now see:
Terraform will perform the following actions:
# module.my_lambda.aws_lambda_function.this will be replaced
# module.my_lambda.aws_lambda_permission.eventbridge_execution_allowed[0] will be replaced
# module.my_lambda.aws_lambda_permission.s3_execution_allowed[0] will be replaced
All in the same plan! No more one-run delay, no more missing triggers.
Key Takeaways
- AWS automatically deletes Lambda permissions when the Lambda is deleted - this is by design, not a bug
- Terraform doesn't always detect this deletion during the replacement plan - it only shows up on the next run
- replace_triggered_by is the correct solution - but it only works within the same module/configuration
- Module boundaries matter - you can't use replace_triggered_by across module boundaries, even with outputs
- Co-locate dependent resources - keep tightly coupled resources (like Lambdas and their permissions) in the same module
What About Older Terraform Versions?
If you’re stuck on Terraform < 1.2, you have a few options, though none are as clean as replace_triggered_by:
- Document the behavior: Accept the one-run delay and make sure your team knows to run apply twice after Lambda replacements
- Manual tainting: Use terraform taint on permission resources when you know a Lambda will be replaced
- Wrapper scripts: Create automation that handles the two-step apply process
- Use Terraform Cloud: The drift detection features can help catch these issues
That said, if you can upgrade to Terraform 1.2+, it’s worth it just for this feature alone.
Monitoring and Prevention
After this incident, we also set up CloudWatch alarms to catch permission issues faster. We now monitor Lambda invocation failures and compare expected vs actual EventBridge trigger counts. It won’t prevent the issue, but at least we’ll know immediately if something goes wrong.
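As a rough illustration of the kind of alarm described above, the sketch below watches the FailedInvocations metric that EventBridge publishes per rule in the AWS/Events namespace. The alarm name, period, threshold, and SNS topic are assumptions for illustration, not the configuration used after this incident, and it does not cover S3-triggered functions.

```hcl
resource "aws_cloudwatch_metric_alarm" "rule_failed_invocations" {
  alarm_name  = "my-scheduled-rule-failed-invocations" # hypothetical name
  namespace   = "AWS/Events"
  metric_name = "FailedInvocations"

  # One alarm per EventBridge rule; my_rule is the rule from the earlier example
  dimensions = {
    RuleName = aws_cloudwatch_event_rule.my_rule.name
  }

  statistic           = "Sum"
  period              = 300 # assumed 5-minute window
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0 # alarm on any failed invocation

  # No data points simply means nothing failed in the window
  treat_missing_data = "notBreaching"

  alarm_actions = [aws_sns_topic.alerts.arn] # hypothetical SNS topic defined elsewhere
}
```

An alarm like this reacts only after a scheduled invocation has already failed, so it shortens detection time but does not replace the fix itself.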
References
- Terraform replace_triggered_by documentation
- Stack Overflow: Lambda permission recreation causing downtime
- Stack Overflow: Lambda permission replaced every apply
Wrapping Up
The replace_triggered_by solution is defensive infrastructure - it protects against not just this specific issue, but any future scenario where Lambdas need to be replaced. Given how often infrastructure evolves (VPC changes, runtime updates, package type migrations), that peace of mind is worth it.
If you’re managing Lambda functions with Terraform and EventBridge or S3 triggers, implement this pattern before you run into drift issues. Your future self will thank you.