ECS vs EKS: What Problems Do You Want to Operate?

A practical comparison from production: upgrades, cronjobs, permissions, AWS limits, and the real cost of keeping a platform alive.

In production, ECS vs EKS is rarely decided with a feature table.

It is decided when you need to deploy a change on a Friday.
When a cronjob runs more times than expected.
When AWS starts returning Rate exceeded.
When you need to upgrade hundreds of nodes.
When a development team needs to move fast without understanding every piece of the platform.

In theory, the comparison is simple:

ECS is simpler and more integrated with AWS
EKS is managed Kubernetes
Kubernetes is more portable
ECS has less operational surface area

All of that is true.

But it is also incomplete.

The real question is not which one has more features.
The real question is: what kind of problems do you want to operate every day?

ECS feels closer to AWS

ECS has one quality that matters a lot in production: it feels like a natural part of AWS.

You define services, tasks, roles, security groups, load balancers, logs, autoscaling, secrets, and permissions as infrastructure code. If you use Terraform, Pulumi, CloudFormation, or CDK, a large part of the system lives in the same mental model.

You can also manage deployments with tools like ecspresso, which make ECS comfortable to operate without having to build a huge platform around it.

In many teams that already live inside AWS, the practical result is that the development team can move faster.

Not because ECS is magic.

But because there are fewer concepts between “I want to run this service” and “the service is running”.

You do not need to explain namespaces, service accounts, ingress controllers, CRDs, Helm charts, admission controllers, node groups, taints, tolerations, kubectl, contexts, kubeconfigs, and the small universe of YAML that appears around Kubernetes.

That does not mean ECS has no complexity.

It means a lot of that complexity stays inside AWS and is expressed through resources the team is probably already using.

EKS opens more doors

EKS, on the other hand, gives you Kubernetes.

And Kubernetes is a huge platform.

That can be a real advantage.

You can use operators, service meshes, controllers, more sophisticated autoscalers, different runtimes, multi-cloud patterns, standard observability tools, policies with admission controllers, GitOps, more diverse workloads, and an ecosystem that does not depend only on AWS.

When you need that flexibility, EKS can feel much more robust.

Not because ECS is weak, but because Kubernetes lets you build a much richer platform on top.

The problem is that this richness is not free.

Everything you add to the cluster also becomes something someone has to understand, upgrade, monitor, and debug when it fails.

In EKS, you are not just running applications.

You are running a platform.

Upgrades do not feel the same

One of the biggest differences shows up over time.

In ECS, you are not thinking about upgrading the cluster in the same way.

With Fargate, AWS takes care of a lot of that work. There are task maintenance events, usually called task retirement, where AWS periodically replaces tasks to apply platform or security updates. Those replacements move your tasks around, so you still need readiness, health checks, and well-configured deployments, but planning a full node upgrade is not a normal part of your life.

In EKS, upgrades are a different story.

Upgrading Kubernetes can feel like a complex ritual:

checking version compatibility
upgrading the control plane
upgrading add-ons like VPC CNI, CoreDNS, and kube-proxy
reviewing CRDs and controllers
validating Helm charts
upgrading node groups
draining nodes
taking care of PodDisruptionBudgets
reviewing workloads that depend on deprecated APIs
monitoring that autoscaling does not do something strange in the middle of the process
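
One step in that list, reviewing workloads that depend on deprecated APIs, is easy to sketch. This is a minimal illustration, not a replacement for a real scanner: the table of removed APIs is a small subset I am confident about, and the manifests and their names are hypothetical.

```python
# Sketch: flag manifests that use API versions removed at or before a target
# Kubernetes release. REMOVED_APIS is an illustrative subset, not a full list;
# check the official deprecation guide before an upgrade.

# (apiVersion, kind) -> minor version of Kubernetes 1.x in which it was removed
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): 22,
    ("batch/v1beta1", "CronJob"): 25,
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): 26,
}


def find_removed_apis(manifests, target_minor):
    """Return (name, apiVersion, kind) for manifests that break at target_minor."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in is not None and target_minor >= removed_in:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((name, key[0], key[1]))
    return findings


# Hypothetical manifests, already parsed into dicts.
manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob",
     "metadata": {"name": "nightly-report"}},
    {"apiVersion": "batch/v1", "kind": "CronJob",
     "metadata": {"name": "cleanup"}},
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget",
     "metadata": {"name": "api-pdb"}},
]

# Which of these would block an upgrade to Kubernetes 1.25?
blocking = find_removed_apis(manifests, target_minor=25)
```

Here `blocking` contains the old CronJob and the old PodDisruptionBudget. Running the same check against 1.24 returns nothing, which is exactly why this review has to be repeated for every upgrade.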

If you have a few nodes, it is manageable.

If you have hundreds and hundreds, it stops being a task you can simply launch and forget.

AWS has been improving the experience. EKS Auto Mode reduces part of the heavy lifting and changes the conversation around node management quite a bit.

But even then, I do not think an EKS upgrade is something you can leave running in the background while you go get coffee.

Not if production matters.

Cronjobs: where ECS shows its edges

ECS has no native CronJob resource like Kubernetes does.

If you want scheduled tasks, you usually end up using EventBridge Scheduler or EventBridge Rules to call RunTask.

This works.

But it introduces another operational boundary.

You are no longer operating only ECS. You are also operating the integration between EventBridge, IAM, ECS, networking, quotas, and the RunTask API.
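
To make that boundary visible, here is a sketch of what a single schedule has to carry. It only builds a plain dictionary and sends nothing to AWS. The field names follow the EventBridge Scheduler CreateSchedule request as I understand it, so verify them against the current API reference. Every ARN, subnet, and security group ID is a placeholder.

```python
# Sketch: the pieces one scheduled ECS task touches, expressed as the request
# body for an EventBridge Scheduler schedule. Nothing here calls AWS.


def build_schedule(name, cron_expression, cluster_arn, task_definition_arn,
                   scheduler_role_arn, subnets, security_groups, dlq_arn):
    return {
        "Name": name,
        "ScheduleExpression": f"cron({cron_expression})",
        # Lets the scheduler spread invocations instead of firing on the dot.
        "FlexibleTimeWindow": {"Mode": "FLEXIBLE", "MaximumWindowInMinutes": 5},
        "Target": {
            "Arn": cluster_arn,              # ECS: the cluster that runs the task
            "RoleArn": scheduler_role_arn,   # IAM: must allow ecs:RunTask and iam:PassRole
            "EcsParameters": {
                "TaskDefinitionArn": task_definition_arn,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {    # Networking: subnets and security groups
                    "awsvpcConfiguration": {
                        "Subnets": subnets,
                        "SecurityGroups": security_groups,
                        "AssignPublicIp": "DISABLED",
                    }
                },
            },
            # Quotas and throttling: what happens when RunTask is rejected.
            "RetryPolicy": {"MaximumRetryAttempts": 3,
                            "MaximumEventAgeInSeconds": 900},
            "DeadLetterConfig": {"Arn": dlq_arn},
        },
    }


# Placeholder identifiers, for illustration only.
schedule = build_schedule(
    name="nightly-report",
    cron_expression="0 2 * * ? *",
    cluster_arn="arn:aws:ecs:eu-west-1:111122223333:cluster/example",
    task_definition_arn="arn:aws:ecs:eu-west-1:111122223333:task-definition/report:1",
    scheduler_role_arn="arn:aws:iam::111122223333:role/example-scheduler",
    subnets=["subnet-aaaa1111"],
    security_groups=["sg-bbbb2222"],
    dlq_arn="arn:aws:sqs:eu-west-1:111122223333:example-dlq",
)
```

One job, and the definition already references a cluster, a task definition, an IAM role, a subnet, a security group, a retry policy, and a queue. Each of them is a place where the schedule can fail without the job itself ever running.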

And that is where very real failures show up.

I once had a bug where around 15 cronjobs were launched at the same time. They were not heavy jobs individually, but together they hit AWS API limits and the launches started failing.

Some RunTask calls failed with responses like:

Service unavailable
Rate exceeded
ThrottlingException

That kind of incident is a good reminder: serverless or managed does not mean limits disappear.

It only means the limits live somewhere else.
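
One cheap way to avoid that kind of burst is to stop scheduling everything on the same minute. The sketch below derives a stable minute offset from each job name. It is an illustration of the idea, not the fix applied in that incident. The job names, the 15-minute window, and the helper names are all assumptions.

```python
import hashlib


def stagger_minute(job_name, window_minutes=15):
    """Map a job name to a stable minute in [0, window_minutes)."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_minutes


def staggered_cron(job_name, hour, window_minutes=15):
    """Build an EventBridge-style cron expression with a per-job minute."""
    minute = stagger_minute(job_name, window_minutes)
    return f"cron({minute} {hour} * * ? *)"


# Hypothetical job names: 15 jobs that used to start at exactly 02:00.
job_names = [f"job-{index:02d}" for index in range(15)]
schedules = {name: staggered_cron(name, hour=2) for name in job_names}
```

The offset is derived from a hash instead of a random number so that a job keeps the same start time across deployments. Two jobs can still land on the same minute, so this reduces the burst but does not replace retries or concurrency limits.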

In Kubernetes, a CronJob lives inside the cluster. You can control concurrency policy, deadlines, retries, history limits, and observe the behavior with the same tools you use for the rest of the cluster.
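
For comparison, this is roughly where those knobs live. The sketch builds a CronJob manifest as a Python dictionary and serializes it to JSON, which kubectl accepts as well as YAML. The field names are the standard batch/v1 ones. The job name, image, and numeric values are placeholders, not recommendations.

```python
import json


def build_cronjob(name, schedule, image):
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",     # skip a run if the previous one is still active
            "startingDeadlineSeconds": 120,    # give up on a run that starts too late
            "successfulJobsHistoryLimit": 3,   # finished jobs kept for inspection
            "failedJobsHistoryLimit": 5,
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 2,             # retries before the job is marked failed
                    "activeDeadlineSeconds": 900,  # hard limit on total job runtime
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{"name": name, "image": image}],
                        }
                    },
                }
            },
        },
    }


# Placeholder name and image.
manifest = build_cronjob("nightly-report", "0 2 * * *", "example.com/reports:1.0")
manifest_json = json.dumps(manifest, indent=2)
```

Everything that needs separate design work on the ECS side, such as concurrency, deadlines, retries, and history, is a field on one object here.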

In ECS, you need to design that part explicitly:

retries with backoff
concurrency limits
job idempotency
alerts for RunTask failures
DLQs if the flow needs them
quotas reviewed before growth
separation between critical and non-critical schedules
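
The first item, retries with backoff, can be sketched in a few lines. This is a minimal version with full jitter, written against a hypothetical launcher so it runs without AWS. `LaunchError`, the set of retryable codes, and the default delays are assumptions to adapt to whatever your client actually raises.

```python
import random
import time


class LaunchError(Exception):
    """Hypothetical error raised by the launcher; `code` mimics an API error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


# Assumption: the codes worth retrying. Adjust to what your client returns.
RETRYABLE_CODES = {"ThrottlingException", "ServiceUnavailable"}


def launch_with_backoff(launch, max_attempts=5, base_delay=1.0, max_delay=30.0,
                        sleep=time.sleep, rng=random.random):
    """Call launch() until it succeeds, retrying retryable errors with jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return launch()
        except LaunchError as err:
            last_attempt = attempt == max_attempts
            if err.code not in RETRYABLE_CODES or last_attempt:
                raise
            # Exponential ceiling, capped, then a random fraction of it.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * ceiling)


# Demo with a fake launcher that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_launch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LaunchError("ThrottlingException")
    return "task-started"


recorded_delays = []
result = launch_with_backoff(flaky_launch, sleep=recorded_delays.append,
                             rng=lambda: 1.0)
```

In the demo, sleeping is replaced by a list append and the random factor is pinned to 1.0, so the delays are the raw ceilings of 1 and 2 seconds. In real use the jitter matters: without it, 15 throttled jobs retry at the same instant and get throttled again.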

The common mistake is thinking EventBridge plus ECS is exactly equivalent to a Kubernetes CronJob.

It is not.

You can operate it well, but you need to treat it as a distributed integration, not as a native ECS feature.

Permissions: simple does not mean easy

In ECS, almost everything is managed as infrastructure code:

task roles
execution roles
IAM policies
security groups
secrets
log groups
load balancers
target groups

That is usually a huge advantage.

The mental model is closer to AWS: this task needs to read this secret, write to this bucket, publish to this queue, and talk to this database.

The problem is that IAM is still IAM.

You can make permissions too broad.
You can mix up the execution role and the task role.
You can forget permissions to pull images or write logs.
You can break a deployment because the task cannot read a secret.
You can have the network configuration right in the application and wrong in the security groups.
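
The execution role and task role mix-up is the one I would check first. The sketch below is a deliberately simplified lint: it compares flat sets of action names and ignores wildcards, resources, and conditions, which a real policy evaluation cannot do. The startup actions are the ones the ECS agent uses to pull from ECR and write to CloudWatch Logs. The example roles and the application's needs are hypothetical.

```python
# Actions needed before the container starts. They belong on the execution role.
STARTUP_ACTIONS = {
    "ecr:GetAuthorizationToken",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
}


def check_roles(execution_actions, task_actions, app_actions, uses_secrets):
    """Return human-readable problems found in a task's two roles."""
    problems = []
    required = set(STARTUP_ACTIONS)
    if uses_secrets:
        # Secrets injected through the task definition are read at startup.
        required.add("secretsmanager:GetSecretValue")
    for action in sorted(required - set(execution_actions)):
        problems.append(f"execution role is missing {action}")
    for action in sorted(set(app_actions) - set(task_actions)):
        hint = ""
        if action in execution_actions:
            hint = " (granted to the execution role instead)"
        problems.append(f"task role is missing {action}{hint}")
    return problems


# Hypothetical task: the S3 permission was attached to the wrong role and the
# secret permission was forgotten.
problems = check_roles(
    execution_actions=STARTUP_ACTIONS | {"s3:PutObject"},
    task_actions=set(),
    app_actions={"s3:PutObject"},
    uses_secrets=True,
)
```

The two findings fail at different moments. The missing secret permission stops the task before it starts. The misplaced S3 permission lets it start and then fails at runtime, inside the application.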

ECS reduces layers, but it does not remove responsibility.

In EKS, the model becomes wider.

You have IAM, but also Kubernetes RBAC, service accounts, IRSA or EKS Pod Identity, namespaces, policies, admission controllers, and secrets inside the cluster.

That can give you more control.

It also gives you more places to make mistakes.

A pod can have the right permissions in Kubernetes but not in AWS.
Or the right permissions in AWS but not in Kubernetes.
Or network access blocked by a NetworkPolicy.
Or a chart that creates resources with broader permissions than you expected.
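
With IRSA, the link between the two worlds is a string match, and that is where the first two cases usually come from. The sketch builds the IAM trust policy and the matching ServiceAccount as dictionaries. The policy structure and the `eks.amazonaws.com/role-arn` annotation are the standard IRSA ones. The account ID, OIDC provider, namespace, and names are placeholders.

```python
def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Trust policy that lets one Kubernetes service account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{oidc_provider}:aud": "sts.amazonaws.com",
                f"{oidc_provider}:sub":
                    f"system:serviceaccount:{namespace}:{service_account}",
            }},
        }],
    }


def service_account_manifest(namespace, name, role_arn):
    """Kubernetes side: the annotation that points the pod at the IAM role."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


def subject_allowed(trust_policy, oidc_provider, namespace, service_account):
    """True if the policy trusts exactly this namespace and service account."""
    expected = f"system:serviceaccount:{namespace}:{service_account}"
    for statement in trust_policy["Statement"]:
        conditions = statement.get("Condition", {}).get("StringEquals", {})
        if conditions.get(f"{oidc_provider}:sub") == expected:
            return True
    return False


# Placeholder identifiers.
OIDC = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE"
policy = irsa_trust_policy("111122223333", OIDC, "payments", "worker")
account = service_account_manifest(
    "payments", "worker", "arn:aws:iam::111122223333:role/example-worker")

production_ok = subject_allowed(policy, OIDC, "payments", "worker")
staging_ok = subject_allowed(policy, OIDC, "payments-staging", "worker")
```

The same chart deployed to a `payments-staging` namespace gets a valid ServiceAccount and valid RBAC, and is still denied by AWS, because the trust policy names a different subject.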

EKS is not just “adding YAMLs”.

It is building a full configuration management layer: connection to the cluster, credentials, permissions, templates, validations, drift detection, GitOps or pipelines capable of applying changes without depending on someone having the right context on their machine.

And once you bring CI/CD into the picture, that layer also needs its own workflows, tests, access patterns, and networking.

If you run EKS on private nodes, connecting your CI/CD to the cluster in a secure and maintainable way becomes a topic on its own.

That is powerful.

But it is also work.

The common failures are not the same

In ECS, common problems tend to live around AWS integration:

tasks that do not start because a permission is missing in the execution role
images that cannot be pulled from ECR
secrets referenced incorrectly
health checks that are too aggressive
target groups killing tasks before they are ready
ENI issues or networking limits in Fargate
autoscaling based on a metric that does not represent real load
schedules launching too many tasks at the same time
RunTask quotas, API throttling, or insufficient capacity
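
The health check items in that list usually come down to arithmetic. The sketch below is a rough model, not the exact load balancer behavior: it assumes a target is marked unhealthy after a fixed number of consecutive failed checks, and that the service's health check grace period is added on top. The parameter names mirror the target group and ECS service settings. The numbers are made up.

```python
def seconds_until_marked_unhealthy(interval_seconds, unhealthy_threshold):
    """Approximate time for a non-responding target to be marked unhealthy."""
    return interval_seconds * unhealthy_threshold


def startup_is_safe(startup_seconds, grace_period_seconds,
                    interval_seconds, unhealthy_threshold):
    """True if the app is ready before the health check can kill the task."""
    budget = grace_period_seconds + seconds_until_marked_unhealthy(
        interval_seconds, unhealthy_threshold)
    return startup_seconds < budget


# Hypothetical service that needs 45 seconds to become ready.
without_grace = startup_is_safe(startup_seconds=45, grace_period_seconds=0,
                                interval_seconds=10, unhealthy_threshold=2)
with_grace = startup_is_safe(startup_seconds=45, grace_period_seconds=60,
                             interval_seconds=10, unhealthy_threshold=2)
```

Without a grace period, the task has about 20 seconds before it is replaced, so a 45-second startup turns into a restart loop. With a 60-second grace period, the same task has room to come up.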

In EKS, common problems tend to live around the platform:

nodes that do not drain properly because PodDisruptionBudgets are poorly defined
workloads without reasonable requests and limits
HPA reacting late or scaling on the wrong metric
IP exhaustion with the VPC CNI
outdated add-ons
charts with copied values nobody fully understands
CRDs blocking upgrades
controllers failing and affecting many services
permissions split between Kubernetes and AWS
pipelines that need private access to the cluster
too much logic hidden in YAMLs or Helm templates
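
Two of those items can be checked with arithmetic before they bite. The sketch assumes a PodDisruptionBudget expressed with minAvailable, and the VPC CNI in its default mode without prefix delegation. The m5.large figures of 3 network interfaces with 10 IPv4 addresses each come from the AWS per-instance limits, so check them for your instance types.

```python
def allowed_disruptions(healthy_pods, min_available):
    """Pods a drain may evict right now without violating the budget."""
    return max(0, healthy_pods - min_available)


def max_pods_per_node(network_interfaces, ipv4_per_interface):
    """Pod capacity of a node when every pod takes a secondary IP address.

    One address per interface is reserved for the interface itself. The +2
    accounts for host-network pods that do not consume a pod IP.
    """
    return network_interfaces * (ipv4_per_interface - 1) + 2


# A budget equal to the replica count leaves nothing to evict: the drain hangs.
blocked = allowed_disruptions(healthy_pods=3, min_available=3)
drainable = allowed_disruptions(healthy_pods=3, min_available=2)

# m5.large: 3 network interfaces, 10 IPv4 addresses each.
m5_large_pods = max_pods_per_node(network_interfaces=3, ipv4_per_interface=10)
```

A budget with minAvailable equal to the replica count allows zero evictions, so the node never finishes draining. An m5.large tops out at 29 pods no matter how much CPU and memory is left on it.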

The difference is not that one fails and the other does not.

The difference is where you need to look when something fails.

Speed vs operational surface area

ECS lets many teams move quickly because it reduces the amount of platform they need to understand.

For common web services, workers, APIs, small jobs, and teams already deep in AWS, ECS with Fargate can be a great decision.

It forces you to solve fewer things.

And in production, that is a virtue.

EKS becomes more attractive when you need more than running containers:

heterogeneous workloads
the Kubernetes ecosystem
operators
more advanced platform patterns
more control over scheduling
organizational portability
an internal platform shared by many teams

But EKS requires a team in constant monitoring mode.

Not necessarily an army, but real ownership.

Someone has to take care of upgrades.
Someone has to understand nodes.
Someone has to review add-ons.
Someone has to observe the cluster, not just the applications.
Someone has to maintain the paved road so developers do not end up fighting Kubernetes every day.

Without that, EKS can become an elegant way to distribute complexity to every team.

So, which one should you choose?

If the question is “which one is better?”, I think the honest answer is: it depends on what you are willing to operate.

ECS is a good choice when you want AWS to absorb more complexity and your priority is moving services to production with less platform around them.

EKS is a good choice when Kubernetes is part of the strategy, not just a sophisticated way to run containers.

The trap is choosing EKS because it looks more robust without accepting the operational cost.

The other trap is choosing ECS thinking that because it is simpler, you no longer need operational design.

Both choices fail.

ECS still needs good practices: limits, retries, idempotency, health checks, observability, careful IAM, and reviewed quotas.

EKS still needs humility: planned upgrades, clear ownership, automation, policies, living documentation, and a platform that does not force every developer to become a Kubernetes expert.

In the end, the decision is rarely just technical.

It is about the team, the product, the operational maturity, and the kind of problems you are willing to carry every day.

Because production does not care which orchestrator you chose.

It only cares whether someone knows how to operate it.