For a long time, services have relied on the traditional username/password model to authenticate against databases. Most developer frameworks support this pattern well, providing built-in mechanisms to read credentials from environment variables, configuration files, or class-level settings. This makes for a simple, smooth developer experience, but there is a trade-off: the real challenge is not using credentials, it's distributing and protecting them.
In our case, we followed a common approach:
- Credentials were stored in AWS Secrets Manager.
- The secret was fetched at service startup.
- Amazon RDS handled periodic secret rotation and supported on-demand rotation when needed.
This setup worked "fine" until, while running at low scale, we hit an isolated error:

```
SQLException. Message: Connect failed to authenticate: reached max connection retries
```
From the log alone, we didn’t have much context. To understand what happened, we correlated events from CloudTrail, application logs, and ECS task history to reconstruct the sequence of events:
```mermaid
sequenceDiagram
    participant ECS as ECS Cluster
    participant Pod as Service Pod
    participant Secrets as AWS Secrets Manager
    participant RDS as Amazon RDS
    ECS->>Pod: Service scales up due to heavy load
    activate Secrets
    Note over Secrets: Password rotation started
    loop Retry until max attempts reached
        Pod->>Secrets: Fetch credentials
        Secrets-->>Pod: Returns stale credentials
        Pod->>RDS: Attempt connection
        RDS-->>Pod: Authentication failed
    end
    Pod-->>ECS: Startup failed:<br/>Max retry attempts reached
    Note over Secrets: Password rotation ends:<br/>DB and Secrets in sync
    deactivate Secrets
    ECS->>Pod: Pod scheduled for recreation
```
After reviewing the timeline, several key observations stood out:
- The error came from a single pod; overall service health remained “ok”.
- No other services were affected.
- During rotation, there’s a brief window where the database password and the stored secret can be out of sync.
- At larger scale, frequent secret fetches, connection retries, and rotation events can increase the risk of hitting API rate limits, add cost, and introduce throttling.
- Most importantly, this occurred during a scaling event. The service needed additional capacity to handle load, but the new pod failed to start and had to be replaced — delaying the exact moment when extra resources were needed most.
Given these insights, the team decided to revisit the authentication flow. Some proposed options included:
- Extending the rotation interval (for example, from days to weeks or months) to reduce how often this window could occur.
- Improving retry logic with exponential backoff or jitter.
Both were reasonable workarounds, but they felt like patches. Extending rotation intervals weakens security posture, while more aggressive retries make it harder to distinguish rotation issues from real configuration or connectivity problems — especially as the number of databases and application users grows.
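For context, the retry improvement the team discussed would look roughly like this: a minimal sketch of exponential backoff with "full jitter", where the class name and the delay constants are illustrative, not our production values.

```java
import java.util.Random;
import java.util.function.Supplier;

// Sketch of retry with exponential backoff and "full jitter".
// The base/max delays below are illustrative, not production values.
public class BackoffRetry {

    // Delay for an attempt: a random slice of min(maxDelay, base * 2^attempt).
    public static long delayForAttempt(int attempt, long baseDelayMs, long maxDelayMs, Random random) {
        long capped = Math.min(maxDelayMs, baseDelayMs * (1L << attempt));
        return (long) (random.nextDouble() * capped);
    }

    // Runs the action, sleeping with jittered backoff between failed attempts.
    public static <T> T withRetries(Supplier<T> action, int maxAttempts) throws InterruptedException {
        Random random = new Random();
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                Thread.sleep(delayForAttempt(attempt, 100, 5_000, random));
            }
        }
        throw last;
    }
}
```

The jitter spreads retries out in time, so a fleet of pods that all hit the rotation window does not hammer Secrets Manager and RDS in lockstep.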
Credentials are simple, but never free
Traditional database authentication has a few well-known issues:
- Long-lived credentials increase blast radius
- Rotation adds operational complexity. RDS can rotate the master password, but for custom database users we still need to manage rotation logic ourselves.
- Secrets distribution becomes part of the runtime path
- Debugging auth failures often involves multiple systems
Even with managed services like Secrets Manager, credentials remain static artifacts we must fetch, cache, and protect.
Complexity in our systems is inevitable, but depending on the context (the team, the architecture, the tools), we can move it around to harden the parts that matter most.
The question I wanted to answer was simple:
> Can we remove static database passwords with minimal friction?
From credentials to roles
Role-Based Access Control (RBAC) isn't new: it pushes credential complexity down to a layer that tools handle transparently. It's a common pattern in infrastructure, but not as widely used in application development (I suspect this is because there is still a barrier between infra and development teams, but that is perhaps another post).
For our case, IAM database authentication replaces static passwords with short-lived auth tokens, generated on demand. The complexity moves from password distribution and rotation to the IAM layer.
```mermaid
sequenceDiagram
    participant Pod as Service Pod
    participant RDS as Amazon RDS
    participant IAM
    Pod-->>Pod: Generate IAM auth token (SigV4)
    Pod->>RDS: Connect using username + token
    RDS->>IAM: Validate token and permissions
    IAM-->>RDS: Signature valid
    RDS-->>Pod: Connection successful
```
On paper, it offers:
- No stored database passwords
- Automatic expiration
- IAM-based access control
- Easier credential revocation
But there’s always a catch.
The common concern I hear is:
> That sounds good, but isn't it slower or more complex?
The experiment setup
I built a small Java application that connects to RDS in two ways:
- Traditional approach
  - Fetch credentials from Secrets Manager
  - Use JDBC with username/password
- IAM authentication
  - Generate an auth token using AWS SDK
  - Use the token as the database password
The goal wasn’t to optimize performance to the extreme — just to observe realistic connection behavior.
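The timings below came from simple wall-clock measurements around each step. A minimal sketch of that kind of timing wrapper, with illustrative names and no AWS dependency (the measured operation is passed in as a supplier):

```java
import java.util.function.Supplier;

// Minimal timing wrapper: runs an operation and reports how long it took.
// The class and record names are illustrative.
public class Timed {
    public record Result<T>(T value, long elapsedMs) {}

    public static <T> Result<T> measure(Supplier<T> op) {
        long start = System.nanoTime();
        T value = op.get();                                   // run the measured step
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new Result<>(value, elapsedMs);
    }
}
```

In the experiment, the suppliers wrapped the secret fetch, the token generation, and the JDBC connect, so both paths were measured the same way.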
Measuring the cost
Secrets-based authentication
```
Secret fetch time: 1260 ms
Connection time: 780 ms
SecretsManagerClient closed.
```
Total overhead is split between:
- Network call to Secrets Manager
- Database connection establishment
IAM-based authentication
```
Token generation time: 877 ms
Connection time (JDBC connect): 760 ms
Connection closed.
```
Here, the cost shifts slightly:
- Token generation replaces secret retrieval
- Database connection time remains roughly the same
What changed
- No stored database passwords
- No secret rotation logic
- Access control moved to roles
- Easier auditing and revocation
From a security and operational standpoint, this was a clear win for our use case.
What didn’t change much
- Database connection latency stayed similar
- Application code changes were minimal
- Developers still pass a username and “password” (now a token)
This was important: adopting IAM auth did not require a mental model shift for application teams.
The code impact (smaller than expected)
Another concern is:
> This will require a big refactor.
In reality, the change was localized:
```java
// Before: read the password from Secrets Manager
private static String getPropertyFromSecret(String prop) {
    SecretsManagerClient smClient = SecretsManagerClient.builder()
            .region(Region.AP_NORTHEAST_1)
            .credentialsProvider(DefaultCredentialsProvider.create())
            .build();
    GetSecretValueRequest req = GetSecretValueRequest.builder().secretId(DB_SECRET_NAME).build();
    JSONObject json = new JSONObject(smClient.getSecretValue(req).secretString());
    return json.getString(prop);
}

// After: generate a short-lived IAM auth token on demand
private static String generateAuthToken() {
    RdsUtilities utilities = RdsUtilities.builder()
            .region(Region.AP_NORTHEAST_1)
            .credentialsProvider(DefaultCredentialsProvider.create())
            .build();
    GenerateAuthenticationTokenRequest tokenRequest = GenerateAuthenticationTokenRequest.builder()
            .hostname(DB_HOST).port(DB_PORT).username(DB_USER).build();
    return utilities.generateAuthenticationToken(tokenRequest);
}

// String password = getPropertyFromSecret("password");
String password = generateAuthToken();
Connection conn = DriverManager.getConnection(jdbcUrl, DB_USER, password);
```
The main difference is where the password comes from, not how the connection works.
Adoption matters more than elegance
From a purely technical perspective, IAM authentication is not revolutionary.
What makes it valuable is this combination:
- Reduced credential risk
- Minimal developer friction
- Clear operational boundaries
Security improvements that are hard to adopt usually fail. This one doesn’t need heroics.
When IAM authentication makes sense
It’s a good fit if:
- You already run workloads on AWS
- You use IAM roles for compute (ECS, EC2, EKS, Lambda)
- You want to reduce long-lived credentials
Tip: For high-connection or serverless workloads, combine it with RDS Proxy — it handles token generation and refresh transparently, with no extra code.
It might not be ideal if:
- You connect from outside AWS
- You rely heavily on connection pooling without token refresh logic
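The pooling caveat is worth illustrating: a pool configured once at startup keeps reusing the same "password", but IAM auth tokens expire after 15 minutes. A minimal sketch of a refresh-aware token supplier, where the class name and structure are my own and the inner supplier would wrap `RdsUtilities.generateAuthenticationToken(...)` in real code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical sketch: caches the current token and regenerates it once the
// ttl elapses. Pass a ttl shorter than the real 15-minute token lifetime so
// the token is refreshed before it actually expires. The tokenSupplier is
// pluggable, so this class itself has no AWS dependency.
public class RefreshingTokenSupplier implements Supplier<String> {
    private final Supplier<String> tokenSupplier;
    private final Duration ttl;
    private String cached;
    private Instant expiresAt = Instant.MIN;

    public RefreshingTokenSupplier(Supplier<String> tokenSupplier, Duration ttl) {
        this.tokenSupplier = tokenSupplier;
        this.ttl = ttl;
    }

    @Override
    public synchronized String get() {
        if (Instant.now().isAfter(expiresAt)) {
            cached = tokenSupplier.get();          // generate a fresh token
            expiresAt = Instant.now().plus(ttl);   // remember when to refresh next
        }
        return cached;
    }
}
```

A pool that calls such a supplier when it opens new physical connections avoids handing out expired tokens; RDS Proxy makes even this unnecessary, since it sits in front of the pool.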
Context matters.
Final thoughts
This experiment didn’t magically speed up database connections. That wasn’t the goal.
What it did was remove an entire class of risk without making the system harder to operate.
For me, that’s the kind of trade-off worth making: small changes, measurable impact, and fewer things that can fail at 3 a.m.