For a long time, services have relied on the traditional username/password model to authenticate against databases. Most developer frameworks support this pattern well, providing built-in mechanisms to read credentials from environment variables, configuration files, or class-level settings. This makes for a simple, smooth developer experience, but there is a trade-off: the real challenge is not using credentials, it's distributing and protecting them.
In our case, we followed a common approach:
- Credentials were stored in AWS Secrets Manager.
- The secret was fetched at service startup.
- Amazon RDS handled periodic secret rotation and supported on-demand rotation when needed.
This setup worked "fine" until, while running at low scale, we hit an isolated error:

```
SQLException. Message: Connect failed to authenticate: reached max connection retries
```
From the log alone, we didn’t have much context. To understand what happened, we correlated events from CloudTrail, application logs, and ECS task history to reconstruct the sequence of events:
```mermaid
sequenceDiagram
    participant ECS as ECS Cluster
    participant Pod as Service Pod
    participant Secrets as AWS Secrets Manager
    participant RDS as Amazon RDS
    ECS->>Pod: Service scales up due to heavy load
    activate Secrets
    Note over Secrets: Password rotation started
    loop Retry until max attempts reached
        Pod->>Secrets: Fetch credentials
        Secrets-->>Pod: Returns stale credentials
        Pod->>RDS: Attempt connection
        RDS-->>Pod: Authentication failed
    end
    Pod-->>ECS: Startup failed:<br/>Max retry attempts reached
    Note over Secrets: Password rotation ends:<br/>DB and Secrets in sync
    deactivate Secrets
    ECS->>Pod: Pod scheduled for recreation
```
After reviewing the timeline, several key observations stood out:
- The error came from a single pod; overall service health remained “ok”.
- No other services were affected.
- During rotation, there’s a brief window where the database password and the stored secret can be out of sync.
- At larger scale, frequent secret fetches, connection retries, and rotation events can increase the risk of hitting API rate limits, add cost, and introduce throttling.
- Most importantly, this occurred during a scaling event. The service needed additional capacity to handle load, but the new pod failed to start and had to be replaced — delaying the exact moment when extra resources were needed most.
Given these insights, the team decided to revisit the authentication flow. Some proposed options included:
- Extending the rotation interval (for example, from days to weeks or months) to reduce how often this window could occur.
- Improving retry logic with exponential backoff or jitter.
Both were reasonable workarounds, but they felt like patches. Extending rotation intervals weakens security posture, while more aggressive retries make it harder to distinguish rotation issues from real configuration or connectivity problems — especially as the number of databases and application users grows.
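For context, the retry improvement the team discussed would look roughly like this: a minimal sketch of exponential backoff with "full jitter", where the class name and the delay constants are illustrative, not our production values.

```java
import java.util.Random;
import java.util.function.Supplier;

// Sketch of retry with exponential backoff and "full jitter".
// The base/max delays below are illustrative, not production values.
public class BackoffRetry {

    // Delay for an attempt: a random slice of min(maxDelay, base * 2^attempt).
    public static long delayForAttempt(int attempt, long baseDelayMs, long maxDelayMs, Random random) {
        long capped = Math.min(maxDelayMs, baseDelayMs * (1L << attempt));
        return (long) (random.nextDouble() * capped);
    }

    // Runs the action, sleeping with jittered backoff between failed attempts.
    public static <T> T withRetries(Supplier<T> action, int maxAttempts) throws InterruptedException {
        Random random = new Random();
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                Thread.sleep(delayForAttempt(attempt, 100, 5_000, random));
            }
        }
        throw last;
    }
}
```

The jitter spreads retries out in time, so a fleet of pods that all hit the rotation window does not hammer Secrets Manager and RDS in lockstep.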
Credentials are simple, but never free
Traditional database authentication has a few well-known issues:
- Long-lived credentials increase blast radius
- Rotation adds operational complexity. RDS can rotate the master password, but for custom database users we still need to manage rotation logic ourselves.
- Secrets distribution becomes part of the runtime path
- Debugging auth failures often involves multiple systems
Even with managed services like Secrets Manager, credentials remain static artifacts we must fetch, cache, and protect.
Complexity in our systems is inevitable, but depending on the context (the team, the architecture, the tools), we can move it around to harden the parts that matter most.
The question I wanted to answer was simple:
> Can we remove static database passwords with minimal friction?
From credentials to roles
Role-Based Access Control (RBAC) isn't new: it pushes credential complexity down to a layer that tools handle transparently. It's a common pattern in infrastructure, but not as widely used in application development (I suspect this is because there is still a barrier between infra and development teams, but that is perhaps another post).
For our case, IAM database authentication replaces static passwords with short-lived auth tokens, generated on demand. The complexity moves from password distribution and rotation to the IAM layer.
```mermaid
sequenceDiagram
    participant Pod as Service Pod
    participant RDS as Amazon RDS
    participant IAM
    Pod-->>Pod: Generate IAM auth token (SigV4)
    Pod->>RDS: Connect using username + token
    RDS->>IAM: Validate token and permissions
    IAM-->>RDS: Signature valid
    RDS-->>Pod: Connection successful
```
On paper, it offers:
- No stored database passwords
- Automatic expiration
- IAM-based access control
- Easier credential revocation
But there’s always a catch.
The common concern I hear is:
> That sounds good, but isn't it slower or more complex?
The experiment setup
I built a small Java application that connects to RDS in two ways:
- Traditional approach
  - Fetch credentials from Secrets Manager
  - Use JDBC with username/password
- IAM authentication
  - Generate an auth token using AWS SDK
  - Use the token as the database password
The goal wasn’t to optimize performance to the extreme — just to observe realistic connection behavior.
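The timings below came from simple wall-clock measurements around each step. A minimal sketch of that kind of timing wrapper, with illustrative names and no AWS dependency (the measured operation is passed in as a supplier):

```java
import java.util.function.Supplier;

// Minimal timing wrapper: runs an operation and reports how long it took.
// The class and record names are illustrative.
public class Timed {
    public record Result<T>(T value, long elapsedMs) {}

    public static <T> Result<T> measure(Supplier<T> op) {
        long start = System.nanoTime();
        T value = op.get();                                   // run the measured step
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new Result<>(value, elapsedMs);
    }
}
```

In the experiment, the suppliers wrapped the secret fetch, the token generation, and the JDBC connect, so both paths were measured the same way.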
Measuring the cost
Secrets-based authentication
```
Secret fetch time: 1260 ms
Connection time: 780 ms
SecretsManagerClient closed.
```
Total overhead is split between:
- Network call to Secrets Manager
- Database connection establishment
IAM-based authentication
```
Token generation time: 877 ms
Connection time (JDBC connect): 760 ms
Connection closed.
```
Here, the cost shifts slightly:
- Token generation replaces secret retrieval
- Database connection time remains roughly the same
What changed
- No stored database passwords
- No secret rotation logic
- Access control moved to roles
- Easier auditing and revocation
From a security and operational standpoint, this was a clear win for our use case.
What didn’t change much
- Database connection latency stayed similar
- Application code changes were minimal
- Developers still pass a username and “password” (now a token)
This was important: adopting IAM auth did not require a mental model shift for application teams.
The code impact (smaller than expected)
Another concern is:
> This will require a big refactor.
In reality, the change was localized:
```java
// Before: read the password from Secrets Manager
private static String getPropertyFromSecret(String prop) {
    SecretsManagerClient smClient = SecretsManagerClient.builder()
            .region(Region.AP_NORTHEAST_1)
            .credentialsProvider(DefaultCredentialsProvider.create())
            .build();
    GetSecretValueRequest req = GetSecretValueRequest.builder().secretId(DB_SECRET_NAME).build();
    JSONObject json = new JSONObject(smClient.getSecretValue(req).secretString());
    return json.getString(prop);
}

// After: generate a short-lived IAM auth token on demand
private static String generateAuthToken() {
    RdsUtilities utilities = RdsUtilities.builder()
            .region(Region.AP_NORTHEAST_1)
            .credentialsProvider(DefaultCredentialsProvider.create())
            .build();
    GenerateAuthenticationTokenRequest tokenRequest = GenerateAuthenticationTokenRequest.builder()
            .hostname(DB_HOST).port(DB_PORT).username(DB_USER).build();
    return utilities.generateAuthenticationToken(tokenRequest);
}

// String password = getPropertyFromSecret("password");
String password = generateAuthToken();
Connection conn = DriverManager.getConnection(jdbcUrl, DB_USER, password);
```
The main difference is where the password comes from, not how the connection works.
Adoption matters more than elegance
From a purely technical perspective, IAM authentication is not revolutionary.
What makes it valuable is this combination:
- Reduced credential risk
- Minimal developer friction
- Clear operational boundaries
Security improvements that are hard to adopt usually fail. This one doesn’t need heroics.
When IAM authentication makes sense
It’s a good fit if:
- You already run workloads on AWS
- You use IAM roles for compute (ECS, EC2, EKS, Lambda)
- You want to reduce long-lived credentials
Tip: For high-connection or serverless workloads, combine it with RDS Proxy — it handles token generation and refresh transparently, with no extra code.
It might not be ideal if:
- You connect from outside AWS
- You rely heavily on connection pooling without token refresh logic
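The pooling caveat is worth illustrating: a pool configured once at startup keeps reusing the same "password", but IAM auth tokens expire after 15 minutes. A minimal sketch of a refresh-aware token supplier, where the class name and structure are my own and the inner supplier would wrap `RdsUtilities.generateAuthenticationToken(...)` in real code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical sketch: caches the current token and regenerates it once the
// ttl elapses. Pass a ttl shorter than the real 15-minute token lifetime so
// the token is refreshed before it actually expires. The tokenSupplier is
// pluggable, so this class itself has no AWS dependency.
public class RefreshingTokenSupplier implements Supplier<String> {
    private final Supplier<String> tokenSupplier;
    private final Duration ttl;
    private String cached;
    private Instant expiresAt = Instant.MIN;

    public RefreshingTokenSupplier(Supplier<String> tokenSupplier, Duration ttl) {
        this.tokenSupplier = tokenSupplier;
        this.ttl = ttl;
    }

    @Override
    public synchronized String get() {
        if (Instant.now().isAfter(expiresAt)) {
            cached = tokenSupplier.get();          // generate a fresh token
            expiresAt = Instant.now().plus(ttl);   // remember when to refresh next
        }
        return cached;
    }
}
```

A pool that calls such a supplier when it opens new physical connections avoids handing out expired tokens; RDS Proxy makes even this unnecessary, since it sits in front of the pool.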
Context matters.
Final thoughts
This experiment didn’t magically speed up database connections. That wasn’t the goal.
What it did was remove an entire class of risk without making the system harder to operate.
For me, that’s the kind of trade-off worth making: small changes, measurable impact, and fewer things that can fail at 3 a.m.