ECS one-shot tasks with ecspresso
June 8, 2026
I recently spent a good amount of time building the operational side of a small AWS project called aws-log-practice.
The application itself is not the interesting part here. The interesting part is the infrastructure and deployment workflow around it:
- ECS Fargate
- RDS for PostgreSQL in private subnets
- Secrets Manager
- ECR
- GitHub Actions with OIDC
- Atlas for database migrations
- ecspresso for ECS service deployment and one-shot tasks
The final shape is this:
- Application and batch images are built in GitHub Actions and pushed to ECR.
- ECS service deployment is handled by ecspresso.
- Operational batch jobs run as ECS one-shot tasks.
- Database migrations are detected in CI, checked for drift, and then applied by an ECS one-shot task.
The key design decision was to never connect to RDS directly from GitHub Actions.
RDS stays private. CI only builds images and asks ECS to run tasks inside the VPC.
The problem I wanted to solve
I had two operational tasks that needed to touch the database.
The first one was DB application user management.
The app should not use the RDS admin user. It should have its own least-privilege user. That user needs to be created, updated when its password changes, and granted permissions for existing and future tables/sequences.
The second one was database migration.
I wanted GitHub Actions to detect newly added migration files and apply them automatically, but only after checking that the current remote RDS schema still matches what the migration history says it should be.
The obvious but bad approach is:
GitHub Actions runner
-> connect directly to RDS
-> run SQL / atlas migrate apply
That would force me to make private RDS reachable from CI somehow. I did not want that.
So I flipped the direction.
GitHub Actions runner
-> build and push an image
-> ask ECS to run a task
-> ECS task connects to RDS inside the VPC
This is the main idea of the design.
Why ecspresso
Terraform is already managing the long-lived infrastructure:
- VPC
- subnets
- ALB
- ECS cluster
- ECR repository
- RDS instance
- Secrets Manager secrets
- IAM roles
- CloudWatch Logs
- GitHub Actions OIDC role
At first, it is tempting to also manage every ECS task definition with Terraform.
But one-shot tasks have values that change frequently:
- image URI
- command
- migration base version
- task definition file used for a specific operation
I did not want to run terraform apply every time I wanted to deploy an app image or execute a one-shot task.
So I split the responsibilities.
| Tool | Responsibility |
|---|---|
| Terraform | Long-lived AWS infrastructure |
| ecspresso | ECS service deployment and task execution |
| GitHub Actions | Build images, push to ECR, run ecspresso |
ecspresso is especially useful because it can read values from Terraform state.
The config is small:
region: ap-northeast-1
cluster: aws-log-practice-dev-ecs-cluster
service: aws-log-practice-dev-ecs-service
service_definition: ecs-service-def.json
task_definition: ecs-task-def.json
timeout: "10m0s"
plugins:
  - name: tfstate
    config:
      url: s3://aws-log-practice-remote-backend-dev/terraform/dev/aws/terraform.state
Then task definitions can reference Terraform outputs and resources:
{
  "image": "{{ must_env `IMAGE_URI` }}",
  "environment": [
    {
      "name": "DB_HOST",
      "value": "{{ tfstate `output.db_primary_host` }}"
    },
    {
      "name": "DB_PORT",
      "value": "{{ tfstate `output.db_port` }}"
    },
    {
      "name": "DB_NAME",
      "value": "{{ tfstate `output.primary_db_name` }}"
    }
  ],
  "executionRoleArn": "{{ tfstate `aws_iam_role.task_execution.arn` }}",
  "taskRoleArn": "{{ tfstate `aws_iam_role.task.arn` }}"
}
This matters a lot.
I do not need to copy the RDS endpoint, DB port, log group name, IAM role ARN, subnet IDs, or security group IDs into GitHub Actions secrets.
Terraform owns infrastructure values. ecspresso reads them. GitHub Actions only passes the image URI and small runtime parameters.
That is the clean separation I wanted.
Why ECS one-shot tasks
DB user synchronization and migrations are not services. They do not need to keep running.
They should start, do one thing, log the result, and exit.
That maps naturally to ECS one-shot tasks.
There are four practical reasons I like this approach.
1. RDS can remain private.
The task runs in the same VPC as RDS. The RDS security group only needs to allow traffic from the ECS task security group.
2. Secrets are read at runtime.
The Docker image does not contain database credentials. GitHub Actions does not receive database passwords. The batch process reads them from Secrets Manager at runtime, using the ECS task role.
3. Logs go to CloudWatch Logs.
The task is short-lived, but the result is still visible in the same logging system as the application.
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "{{ tfstate `aws_cloudwatch_log_group.ecs.name` }}",
      "awslogs-region": "ap-northeast-1",
      "awslogs-stream-prefix": "batch/create-db-app-user"
    }
  }
}
4. Operational code is versioned as an image.
The batch binary, SQL templates, Atlas binary, and migration files are all tied to a specific image tag.
The tag format makes the purpose clear:
app-<short-sha>-<yyyymmddHHMMSS>
batch-create-db-app-user-<short-sha>-<yyyymmddHHMMSS>
batch-migrate-db-<short-sha>-<yyyymmddHHMMSS>
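The tag format above is easy to generate in one place. This is a minimal sketch, not the project's actual build code: the `imageTag` helper is a hypothetical name, and the 7-character short SHA and the UTC timestamp are assumptions based on the example tags shown later in this post.

```go
package main

import (
	"fmt"
	"time"
)

// imageTag builds a tag such as "app-c1d5db9-20260607143000".
// kind is the prefix ("app", "batch-migrate-db", ...), sha is a full or
// short commit SHA, and unixSeconds is the build time.
// Assumptions: 7-character short SHA, timestamp rendered in UTC.
func imageTag(kind, sha string, unixSeconds int64) string {
	short := sha
	if len(short) > 7 {
		short = short[:7]
	}
	// Go's reference layout for yyyymmddHHMMSS.
	stamp := time.Unix(unixSeconds, 0).UTC().Format("20060102150405")
	return fmt.Sprintf("%s-%s-%s", kind, short, stamp)
}

func main() {
	fmt.Println(imageTag("batch-migrate-db", "5602045d9f1c", time.Now().Unix()))
}
```

Keeping the prefix as an explicit argument means the image type stays visible in ECR even though every image shares one repository.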
One binary, multiple commands
The batch program has one entrypoint and multiple commands.
func realMain(ctx context.Context) error {
	if len(os.Args) != 2 {
		return fmt.Errorf("usage: batch [create_db_app_user|check_db_app_user|check_db_migration_drift|apply_db_migration]")
	}
	switch os.Args[1] {
	case "create_db_app_user":
		return CreateDBAppUser(ctx)
	case "check_db_app_user":
		return CheckDBAppUser(ctx)
	case "check_db_migration_drift":
		return CheckDBMigrationDrift(ctx)
	case "apply_db_migration":
		return ApplyDBMigration(ctx)
	default:
		return fmt.Errorf("invalid command: %s", os.Args[1])
	}
}
Each ECS task definition chooses the command:
{
  "entryPoint": ["/batch"],
  "command": ["create_db_app_user"]
}
This keeps the image structure simple. Adding a new operational command means adding a Go function and a task definition.
Creating and checking the application DB user
The application should use only the app DB credential.
The admin credential is reserved for operational tasks like:
- creating the app DB user
- changing the app DB user’s password
- granting permissions
- running migrations
The create_db_app_user task does this:
- Reads the admin credential from Secrets Manager.
- Reads the app credential from Secrets Manager.
- Connects to RDS as the admin user.
- Renders a SQL template.
- Executes the SQL inside a read-write transaction.
The task definition receives secret IDs, not passwords:
{
  "environment": [
    {
      "name": "DB_ADMIN_CREDENTIAL_ID",
      "value": "{{ must_env `DB_ADMIN_CREDENTIAL_ID` }}"
    },
    {
      "name": "DB_APP_CREDENTIAL_ID",
      "value": "{{ must_env `DB_APP_CREDENTIAL_ID` }}"
    }
  ]
}
At runtime, the batch reads AWSCURRENT from Secrets Manager:
input := &secretsmanager.GetSecretValueInput{
	SecretId:     aws.String(secretID),
	VersionStage: aws.String("AWSCURRENT"),
}
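Once the secret value is fetched, it still has to be turned into a username and password. The post does not show the secret's shape, so the following is only a sketch under an assumption: the secret string is a JSON object with `username` and `password` keys. The `dbCredential` type and `parseDBCredential` function are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// dbCredential is a hypothetical shape for the secret value.
// The key names are an assumption; the real secret may differ.
type dbCredential struct {
	Username string `json:"username"`
	Password string `json:"password"`
}

// parseDBCredential decodes a secret string and rejects incomplete values,
// so the task fails before it opens a database connection.
func parseDBCredential(secretString string) (dbCredential, error) {
	var cred dbCredential
	if err := json.Unmarshal([]byte(secretString), &cred); err != nil {
		return dbCredential{}, fmt.Errorf("decode secret: %w", err)
	}
	if cred.Username == "" || cred.Password == "" {
		return dbCredential{}, errors.New("secret must contain username and password")
	}
	return cred, nil
}

func main() {
	cred, err := parseDBCredential(`{"username":"app","password":"s3cret"}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Log the username only, never the password.
	fmt.Println("loaded credential for", cred.Username)
}
```

Failing early on an incomplete secret keeps the error message in CloudWatch Logs about the secret itself rather than about a confusing authentication failure later.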
The SQL template is idempotent:
DO $create_db_app_user$
BEGIN
  IF NOT EXISTS (
    SELECT 1
    FROM pg_catalog.pg_roles
    WHERE rolname = {{ .UsernameLiteral }}
  ) THEN
    CREATE ROLE {{ .UsernameIdent }} WITH LOGIN PASSWORD {{ .PasswordLiteral }};
  ELSE
    ALTER ROLE {{ .UsernameIdent }} WITH LOGIN PASSWORD {{ .PasswordLiteral }};
  END IF;
END
$create_db_app_user$;
GRANT CONNECT ON DATABASE {{ .DatabaseIdent }} TO {{ .UsernameIdent }};
GRANT USAGE ON SCHEMA {{ .SchemaIdent }} TO {{ .UsernameIdent }};
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA {{ .SchemaIdent }} TO {{ .UsernameIdent }};
GRANT USAGE, SELECT, UPDATE ON ALL SEQUENCES IN SCHEMA {{ .SchemaIdent }} TO {{ .UsernameIdent }};
ALTER DEFAULT PRIVILEGES IN SCHEMA {{ .SchemaIdent }} GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO {{ .UsernameIdent }};
ALTER DEFAULT PRIVILEGES IN SCHEMA {{ .SchemaIdent }} GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO {{ .UsernameIdent }};
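The important part is that it grants permissions for both existing objects and future objects. Note that ALTER DEFAULT PRIVILEGES without FOR ROLE only covers objects later created by the role that ran it, which here is the admin user that also runs the migrations.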
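The template takes values that are already quoted (`UsernameIdent`, `PasswordLiteral`, and so on), because statements like CREATE ROLE do not accept bind parameters. The post does not show how those values are produced. This is a minimal sketch with hypothetical helper names `quoteIdent` and `quoteLiteral`, and it assumes `standard_conforming_strings` is on, which is the PostgreSQL default.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent renders a PostgreSQL identifier: wrap it in double quotes and
// double any embedded double quote.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// quoteLiteral renders a PostgreSQL string literal: wrap it in single quotes
// and double any embedded single quote.
// Assumption: standard_conforming_strings is on, so backslashes are literal.
func quoteLiteral(value string) string {
	return `'` + strings.ReplaceAll(value, `'`, `''`) + `'`
}

func main() {
	fmt.Println(quoteIdent("aws_log_practice_dev_db_app"))
	fmt.Println(quoteLiteral("pa'ss"))
}
```

One extra caution for this particular template: the DO body is dollar-quoted, so a value containing the `$create_db_app_user$` tag would end the body early. Rejecting such values before rendering is a cheap safeguard.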
After a migration creates new tables or sequences, I rerun create_db_app_user to sync existing-object privileges.
Then I run check_db_app_user.
That task connects as the application DB user and verifies database, schema, table, and sequence privileges.
A successful run looks like this:
{
  "level": "INFO",
  "msg": "checked db app user privileges",
  "db_name": "aws_log_practice_primary",
  "schema": "public",
  "username": "aws_log_practice_dev_db_app",
  "checks_count": 18
}
This is not just a smoke test. It proves the application credential can actually do what the app needs.
Migration workflow
Migration is more sensitive than normal application deployment.
I wanted CI to apply migrations automatically, but not blindly.
The workflow is:
- Detect newly added migration files.
- Reject changes to existing migration files.
- Build and push a migration image.
- Run drift check in ECS.
- Apply migrations in ECS only if drift check passes.
Here is the simplified GitHub Actions structure:
jobs:
  detect_new_migrations:
    runs-on: ubuntu-latest
    outputs:
      has_added_migration: ${{ steps.detect.outputs.has_added_migration }}
      base_version: ${{ steps.detect.outputs.base_version }}
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0
      - id: detect
        run: scripts/detect-new-migrations.sh "${BASE_REF}" "${GITHUB_SHA}"
  migrate-db:
    needs: [detect_new_migrations]
    if: ${{ needs.detect_new_migrations.outputs.has_added_migration == 'true' || inputs.force_run == true }}
    steps:
      - name: Build and push migration image
        uses: docker/build-push-action@v7
        with:
          context: .
          file: server/cmd/batch/Dockerfile.migration
          platforms: linux/amd64
          push: true
      - name: Check DB migration drift
        working-directory: ecspresso
        run: |
          ecspresso run \
            --config ecspresso.config.yaml \
            --task-def migration/ecs-check-db-migration-drift-task-def.json \
            --wait
      - name: Apply DB migration
        working-directory: ecspresso
        run: |
          ecspresso run \
            --config ecspresso.config.yaml \
            --task-def migration/ecs-apply-db-migration-task-def.json \
            --wait
DB_ADMIN_CREDENTIAL_ID is hardcoded in the workflow:
env:
  AWS_REGION: ap-northeast-1
  ECR_REPOSITORY: aws-log-practice-dev-ecr-repository
  DB_ADMIN_CREDENTIAL_ID: aws-log-practice/dev/db/admin
That is fine because it is not a credential. It is the name of a Secrets Manager secret.
The actual password is read by the ECS task at runtime.
Detecting migration changes
The migration detection script intentionally allows only new migration files.
It inspects Git diff status under db/migrations:
git diff --name-status "${base_ref}" "${head_ref}" -- db/migrations
Only A is accepted.
Changes like M, D, R, or C fail the workflow.
That rule exists because applied migration files should be immutable. If a migration file already applied to RDS is edited in Git, the repository and the actual database history no longer tell the same story.
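The real detection script is shell, but the rule itself is small enough to express as a pure function. This is a sketch of the same "only A is accepted" check in Go, assuming the usual tab-separated `git diff --name-status` output. The `addedMigrations` name is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// addedMigrations parses `git diff --name-status` output and returns the
// added paths. Any status other than "A" (M, D, R100, C75, ...) is an error.
func addedMigrations(nameStatus string) ([]string, error) {
	var added []string
	scanner := bufio.NewScanner(strings.NewReader(nameStatus))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		// Fields are tab-separated: status, path, and a second path for R/C.
		fields := strings.Split(line, "\t")
		if len(fields) < 2 {
			return nil, fmt.Errorf("unexpected line: %q", line)
		}
		if fields[0] != "A" {
			return nil, fmt.Errorf("only added migrations are allowed, got %s for %s", fields[0], fields[1])
		}
		added = append(added, fields[1])
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return added, nil
}

func main() {
	added, err := addedMigrations("A\tdb/migrations/20260607120000_add_orders.sql\n")
	fmt.Println(added, err)
}
```

Renames and copies carry a similarity score in the status field (for example `R100`), so comparing the whole field against `A` rejects them without special cases.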
The script also calculates MIGRATION_BASE_VERSION.
For drift check, I want to compare:
remote RDS schema now
vs
shadow database with all migrations except the newly added ones
So the script subtracts newly added migration files and takes the latest remaining version:
comm -23 \
  <(git ls-files 'db/migrations/*.sql' | sort) \
  <(printf '%s\n' "${ADDED_MIGRATIONS[@]}" | sort) |
  sed -E 's#^.*/([0-9]+)_.+$#\1#' |
  sort |
  tail -n 1
That version becomes MIGRATION_BASE_VERSION.
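To make the subtraction concrete, here is the same calculation as a Go function. It is a sketch, not a port of the actual script: `baseVersion` is a hypothetical name, and unlike the `sed` pipeline it skips file names that do not match the `<digits>_<name>` pattern instead of passing them through.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"sort"
)

// versionPattern captures the leading digits of a migration file name.
var versionPattern = regexp.MustCompile(`^([0-9]+)_.+$`)

// baseVersion returns the highest version among tracked migration files that
// are not in added. It returns "" when every tracked migration is new.
func baseVersion(tracked, added []string) string {
	isAdded := make(map[string]bool, len(added))
	for _, p := range added {
		isAdded[p] = true
	}
	var versions []string
	for _, p := range tracked {
		if isAdded[p] {
			continue
		}
		if m := versionPattern.FindStringSubmatch(path.Base(p)); m != nil {
			versions = append(versions, m[1])
		}
	}
	if len(versions) == 0 {
		return ""
	}
	// Lexicographic order, like `sort | tail -n 1`. This equals numeric order
	// only when versions have a fixed width, such as timestamps.
	sort.Strings(versions)
	return versions[len(versions)-1]
}

func main() {
	tracked := []string{
		"db/migrations/20260101000000_init.sql",
		"db/migrations/20260201000000_orders.sql",
		"db/migrations/20260301000000_stocks.sql",
	}
	added := []string{"db/migrations/20260301000000_stocks.sql"}
	fmt.Println(baseVersion(tracked, added))
}
```

The empty result matters: when all migrations are new there is nothing to replay, and the shadow database should stay empty.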
Drift check with a shadow database
The drift check task has two containers:
- check-db-migration-drift
- shadow-postgres
The batch container waits for the shadow database to become healthy:
{
  "dependsOn": [
    {
      "condition": "HEALTHY",
      "containerName": "shadow-postgres"
    }
  ]
}
The batch then prepares two local databases inside the sidecar PostgreSQL container:
- shadow
- shadow_dev
It applies migrations up to MIGRATION_BASE_VERSION to shadow.
if cfg.MigrationBaseVersion != "" {
	_, err := runAtlas(ctx, cfg.AtlasPath,
		"migrate",
		"apply",
		"--url", cfg.ShadowDatabaseURL,
		"--dir", migrationDirURL(cfg.MigrationDir),
		"--to-version", cfg.MigrationBaseVersion,
	)
	if err != nil {
		return err
	}
}
Then it compares remote RDS with the shadow database:
diff, err := runAtlas(ctx, cfg.AtlasPath,
	"schema",
	"diff",
	"--from", targetURL,
	"--to", cfg.ShadowDatabaseURL,
	"--dev-url", cfg.DevDatabaseURL,
	"--schema", cfg.Schema,
	"--exclude", "atlas_schema_revisions",
	"--format", "{{ sql . }}",
)
if strings.TrimSpace(diff) != "" {
	return fmt.Errorf("database schema drift detected")
}
If drift exists, migration apply stops.
That is the safety gate.
When drift exists, the log contains the SQL diff:
-- Drop "orders" table
DROP TABLE "public"."orders";
-- Drop "stocks" table
DROP TABLE "public"."stocks";
At that point, a human should inspect why the remote schema differs from migration history.
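The snippets in this section call a `runAtlas` helper whose body is not shown. The following is one plausible shape for it, written as a sketch: the signature is inferred from the call sites, and `driftDetected` is a hypothetical name for the emptiness check. The arguments are deliberately left out of the error message because they include database URLs with credentials.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os/exec"
	"strings"
)

// runAtlas runs the Atlas binary and returns its stdout. On failure the
// error carries stderr so the cause shows up in the task log. The arguments
// are not echoed because they can contain database URLs with passwords.
func runAtlas(ctx context.Context, atlasPath string, args ...string) (string, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, atlasPath, args...)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("run atlas: %w: %s", err, strings.TrimSpace(stderr.String()))
	}
	return stdout.String(), nil
}

// driftDetected reports whether the SQL diff contains any statement.
// Whitespace-only output counts as "no drift".
func driftDetected(diff string) bool {
	return strings.TrimSpace(diff) != ""
}

func main() {
	out, err := runAtlas(context.Background(), "atlas", "version")
	if err != nil {
		fmt.Println("atlas not available:", err)
		return
	}
	fmt.Print(out)
}
```

Using `exec.CommandContext` means a cancelled or timed-out context also stops the Atlas process, which fits a task that should exit promptly on failure.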
Migration image
The migration image includes:
- batch binary
- Atlas binary
- atlas.hcl
- db/schema.sql
- db/migrations
The Dockerfile looks like this:
FROM golang:1.26.4 AS builder
WORKDIR /src/server/cmd/batch
COPY server/cmd/batch/go.mod server/cmd/batch/go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod go mod download -x
COPY server/cmd/batch/ ./
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 GOOS=linux go build -o /artifacts/batch ./
FROM debian:13-slim AS atlas
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates curl \
    && rm -rf /var/lib/apt/lists/*
RUN curl -sSf https://atlasgo.sh | sh -s -- -y --no-install -o /usr/local/bin/atlas \
    && chmod +x /usr/local/bin/atlas
FROM debian:13-slim
WORKDIR /app
COPY --from=builder /artifacts/batch /batch
COPY --from=atlas /usr/local/bin/atlas /usr/local/bin/atlas
COPY atlas.hcl /app/atlas.hcl
COPY db/schema.sql /app/db/schema.sql
COPY db/migrations /app/db/migrations
CMD ["/batch"]
I actually hit this error once:
fork/exec /usr/local/bin/atlas: permission denied
The fix was simple: mark the downloaded binary as executable with chmod +x.
RUN curl -sSf https://atlasgo.sh | sh -s -- -y --no-install -o /usr/local/bin/atlas \
    && chmod +x /usr/local/bin/atlas
This is the kind of operational failure that is easy to debug when task logs are centralized in CloudWatch Logs.
IAM boundaries
There are three separate IAM identities involved.
GitHub Actions OIDC role
- Push images to ECR
- Read Terraform state from S3
- Register ECS task definitions
- Run ECS tasks
- Deploy ECS services
- Read CloudWatch Logs during ecspresso execution
ECS task execution role
- Pull images from ECR
- Write container logs to CloudWatch Logs
- Inject task definition secrets when needed
ECS task role
- Read admin/app DB secrets from Secrets Manager at runtime
One important point:
An ECS task does not need an IAM permission like “RDS write” to run SQL against PostgreSQL.
It needs network access to RDS and valid database credentials.
Database authorization is handled by PostgreSQL roles and grants, not IAM.
CI/CD shape
The final CI/CD shape is:
GitHub Actions
├─ detect migration changes
├─ build image with buildx
├─ push image to ECR
└─ ecspresso run
   ├─ read Terraform state
   ├─ register ECS task definition
   └─ run ECS task in private subnet
      ├─ read Secrets Manager at runtime
      ├─ connect to private RDS
      ├─ write logs to CloudWatch Logs
      └─ exit 0 or 1
For the backend application image, the workflow is similar:
GitHub Actions
├─ build backend image
├─ push app-<short-sha>-<timestamp> to ECR
└─ ecspresso deploy
Batch and app images share one ECR repository, but the tag prefix makes the type obvious:
aws-log-practice-dev-ecr-repository:app-c1d5db9-20260607143000
aws-log-practice-dev-ecr-repository:batch-create-db-app-user-a4b8d7f-20260607121122
aws-log-practice-dev-ecr-repository:batch-migrate-db-5602045-20260607135231
Lessons learned
I hit a few concrete issues while building this.
ecr:GetAuthorizationToken requires resource *.
ECR login failed with:
not authorized to perform: ecr:GetAuthorizationToken on resource: *
That action cannot be scoped to a single repository ARN.
Secrets Manager runtime reads belong to the task role.
If the container process calls Secrets Manager, the permission belongs on the ECS task role, not only the task execution role.
ECS health check retries have a limit.
I initially set the shadow-postgres health check retries to 12. ECS rejected the task definition. The value had to be 10 or less.
AWS environment variables can override AWS_PROFILE.
I saw InvalidAccessKeyId even though I passed AWS_PROFILE=taichi-aws-log-practice.
The cause was stale AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, or AWS_SESSION_TOKEN in the shell environment.
Unsetting them fixed it.
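This kind of mistake is easy to check for before running anything against AWS. Here is a small sketch of such a preflight check. `shadowingVars` is a hypothetical helper, not part of any AWS tool, and it relies only on the behavior described above: static credential variables in the environment win over `AWS_PROFILE`.

```go
package main

import (
	"fmt"
	"os"
)

// credentialEnvVars are the static-credential variables that take precedence
// over a profile selected through AWS_PROFILE.
var credentialEnvVars = []string{
	"AWS_ACCESS_KEY_ID",
	"AWS_SECRET_ACCESS_KEY",
	"AWS_SESSION_TOKEN",
}

// shadowingVars returns the credential variables that are set to a non-empty
// value while AWS_PROFILE is also set. lookup has the shape of os.LookupEnv,
// which keeps the function testable without touching the real environment.
func shadowingVars(lookup func(string) (string, bool)) []string {
	if v, ok := lookup("AWS_PROFILE"); !ok || v == "" {
		return nil
	}
	var found []string
	for _, name := range credentialEnvVars {
		if v, ok := lookup(name); ok && v != "" {
			found = append(found, name)
		}
	}
	return found
}

func main() {
	if vars := shadowingVars(os.LookupEnv); len(vars) > 0 {
		fmt.Println("unset these before relying on AWS_PROFILE:", vars)
	}
}
```

The check prints variable names only, so it is safe to run in a shared terminal or a CI log.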
What I like about the result
The design is not complicated once the responsibility boundaries are clear.
- Terraform creates long-lived infrastructure.
- ecspresso turns Terraform state into ECS task definitions.
- GitHub Actions builds images and starts ECS operations.
- Secrets Manager values are read only at runtime.
- RDS stays in private subnets.
- Migration apply is gated by drift check.
Instead of moving CI closer to the database, I moved the database operation into ECS, where the database already is.
That is the core idea.
There are still things I want to improve.
The ecspresso task definitions have some duplication. The task role could also be split into application, batch, and migration roles for stricter least privilege.
But the current setup already gets the important parts right:
- no DB password in Git
- no DB password in Docker build
- no DB password in GitHub Actions
- no public RDS access
- migration drift check before apply
- operational logs in CloudWatch Logs
For this project, that is a good foundation.