prepme.io

AWS DevOps Interview Questions and Answers

This guide is for engineers preparing for AWS-focused DevOps, SRE, platform, and cloud-infrastructure interviews — from mid-level up to staff. An AWS DevOps interview is rarely a trivia quiz about service names. Interviewers want to see that you can reason about trade-offs (ECS vs EKS, NAT Gateway vs VPC endpoints, CloudFormation vs Terraform), design for failure, write least-privilege IAM, and stay calm when production is on fire. The questions reward concrete experience: a real command you ran, the gotcha that bit you once, the cost line item you cut.

Below you get the concepts an AWS DevOps engineer is expected to know cold — compute and containers, networking and security, infrastructure-as-code and CI/CD, then observability, cost, and live scenario thinking — followed by a focused set of real interview questions with model answers and a short prep FAQ. Treat the model answers as the skeleton of what to say out loud, then make each one yours with a story from your own infra.

What an AWS DevOps interview actually tests

Beyond service definitions, an AWS DevOps interview probes four loops: how you provision (IaC and state management), how you ship (CI/CD, rollout strategy), how you secure (IAM, network boundaries, secrets), and how you operate (observability, scaling, incident response, cost). A strong candidate connects these — for example, explaining how a blue/green deploy interacts with target-group health checks, the IAM permissions the deploy role needs, and the CloudWatch alarm that triggers an automatic rollback.

Seniority changes the lens. Mid-level questions are mechanical ('how does an Auto Scaling group replace an unhealthy instance?'). Senior and staff questions are about judgment and blast radius ('we run 40 microservices on EKS, one team wants cluster-admin — what do you do, and how do you keep the cluster multi-tenant-safe?'). Answer with the decision and the reasoning, not just the feature that exists.

The fastest way to sound senior is to volunteer the failure mode. When you describe a design, name what breaks: the single-AZ NAT Gateway that becomes a SPOF, the IAM wildcard that quietly grants iam:PassRole, the missing Terraform state lock that lets two pipelines corrupt state. Interviewers hire people who think about the unhappy path before it happens.

  • Provision: Terraform/CloudFormation, remote state, drift, modules.
  • Ship: CodePipeline/GitHub Actions, immutable artifacts, rollout strategy, rollback.
  • Secure: IAM least-privilege, IRSA, security groups, secrets, encryption.
  • Operate: CloudWatch/observability, autoscaling, failover, on-call, cost.

Compute and containers: EC2, ECS vs EKS

EC2 is raw compute; the DevOps value is in what wraps it — Auto Scaling groups, launch templates, mixed instances policies, and Spot for cost. Know that launch configurations are legacy (new accounts can no longer create them — use launch templates), how an ASG's health-check type (EC2 vs ELB) decides whether it replaces an instance, and how Spot interruptions are handled with capacity rebalancing plus the two-minute interruption notice (and the earlier rebalance recommendation). For stateful nodes, understand the EBS volume lifecycle and why instance-store data is lost on stop or terminate.

ECS vs EKS is the single most common container question. ECS is AWS-native, lower operational overhead, with deep IAM integration via task roles, and two launch types: EC2 (you manage the instances) and Fargate (serverless, you manage nothing below the task). EKS is managed Kubernetes — you get the full CNCF ecosystem (Helm, operators, CRDs, GitOps with Argo CD/Flux), portability across clouds, and a larger talent pool, at the cost of an API server you pay for hourly plus the operational weight of upgrades, add-ons, and the VPC CNI. Pick ECS/Fargate when the team is small and the workload is straightforward; pick EKS when you need Kubernetes primitives, portability, or already have Kubernetes expertise.

On EKS specifically, interviewers go deep. Be ready on: the VPC CNI assigning real VPC IPs to pods (and the IP-exhaustion gotcha on small subnets, mitigated with prefix delegation or a secondary CIDR); IRSA (IAM Roles for Service Accounts) and the newer EKS Pod Identity, both of which give pods scoped AWS credentials instead of node-wide instance-profile creds; managed node groups vs self-managed vs Fargate profiles; and Karpenter vs the Cluster Autoscaler for node provisioning. Know that kubectl talks to the Kubernetes API server, but AWS-to-Kubernetes authorization is mapped through EKS access entries (the recommended approach; the older aws-auth ConfigMap is deprecated).

  • ECS Fargate: no node management, per-task IAM, fastest to operate.
  • EKS: full Kubernetes, GitOps, portability — but you own upgrades and add-ons.
  • IRSA (or EKS Pod Identity) gives pods least-privilege AWS access — no node-wide creds.
  • VPC CNI gives pods VPC IPs — watch subnet IP exhaustion; use prefix delegation.
  • Karpenter often beats the Cluster Autoscaler for fast, bin-packed provisioning.

Networking and security: VPC, security groups, IAM least-privilege

A VPC question usually unfolds into a whiteboard: public and private subnets across multiple AZs, an internet gateway for public ingress, a NAT gateway (one per AZ for HA) for private-subnet egress, and route tables wiring it together. Know that security groups are stateful (return traffic is allowed automatically) and operate at the ENI level, while network ACLs are stateless (you must allow both directions) and operate at the subnet level — and that you almost always prefer security groups, reserving NACLs for coarse subnet-wide deny rules. A frequent gotcha: a single NAT gateway saves money but is an AZ-level SPOF and a real cost line item; VPC endpoints (gateway endpoints for S3/DynamoDB, interface endpoints for most other services) cut NAT data-processing charges and keep traffic off the public internet.

IAM least-privilege is where senior candidates separate themselves. The story you want to tell: start from deny-by-default, grant the narrowest action/resource/condition, prefer roles over long-lived access keys, and use IAM roles for EC2 (instance profiles) and IRSA/Pod Identity for pods so nothing carries static credentials. Know the difference between identity-based and resource-based policies, why iam:PassRole is dangerous (it lets a principal hand a more-privileged role to a service), and how permission boundaries and SCPs (in AWS Organizations) cap what even an admin can do. Mention IAM Access Analyzer to surface unintended external access and CloudTrail to audit who did what.

Secrets and encryption round it out. Secrets Manager (built-in rotation, cross-service integration) vs SSM Parameter Store (cheaper, SecureString backed by KMS) is a common compare. Be fluent on KMS: customer-managed keys vs AWS-managed keys, key policies vs IAM, envelope encryption, and encryption in transit (TLS, ACM-managed certs on an ALB) vs at rest (EBS, S3 SSE-KMS, RDS). The recurring theme: credentials should be short-lived, scoped, and auditable.

  • Security groups = stateful, ENI-level; NACLs = stateless, subnet-level.
  • NAT gateway per AZ for HA; add VPC endpoints to cut NAT cost and exposure.
  • Prefer IAM roles + IRSA/Pod Identity over static keys; treat iam:PassRole as high-risk.
  • Cap blast radius with permission boundaries and Organization SCPs.
  • Secrets Manager for rotation; Parameter Store SecureString for cheap KMS-backed config.

Infrastructure-as-code and CI/CD on AWS

Terraform vs CloudFormation is a guaranteed question. CloudFormation is AWS-native, deeply integrated (drift detection, StackSets across accounts/regions, automatic rollback on a failed update), and has no separate state to manage because AWS holds it. Terraform is multi-cloud, has a richer module ecosystem and HCL ergonomics, and — critically — manages its own state file, which you must put in a remote backend (S3) with locking so concurrent applies can't corrupt it. With Terraform 1.10+ you can use S3-native locking (use_lockfile); the older pattern used a DynamoDB table, now deprecated in favor of the lock file. Mention plan/apply discipline, terraform import for existing resources, and why you never change infrastructure by hand once it is under IaC (drift). CDK and Pulumi come up as 'IaC in a real language' alternatives.

For CI/CD, you should be able to design a pipeline end to end. A typical AWS-native flow: source (CodeConnections/GitHub) to build and test (CodeBuild) to artifact (ECR for images, S3 for bundles) to deploy (CodeDeploy/ECS/EKS) orchestrated by CodePipeline; many teams instead use GitHub Actions or GitLab CI as the orchestrator and call AWS via OIDC federation, so no long-lived AWS keys live in the runner. Know the rollout strategies: rolling, blue/green (two target groups behind an ALB; shift traffic, keep the old fleet for instant rollback), and canary (shift a small percentage, watch alarms, then proceed). Also know how CodeDeploy implements them: blue/green for EC2, ECS, and Lambda, with canary and linear traffic-shifting configurations available for ECS and Lambda only.

The senior signal in CI/CD is safety and reproducibility: build an immutable artifact once and promote that same artifact across environments (build once, deploy many), scope pipeline permissions per stage, wire automatic rollback to CloudWatch alarms, and put infrastructure changes through the same plan-review-apply gate as code. For Kubernetes specifically, GitOps (Argo CD/Flux) makes cluster state a reconciled function of Git, so a rollback is a git revert and drift is auto-corrected.

  • Terraform: multi-cloud, own state — use an S3 remote backend with locking.
  • S3-native locking (use_lockfile, TF 1.10+) replaces the deprecated DynamoDB lock table.
  • CloudFormation: AWS-native, StackSets, no state to manage, auto-rollback.
  • Federate CI runners to AWS via OIDC — stop storing long-lived access keys.
  • Build once, promote the same immutable artifact; wire rollback to CloudWatch alarms.

Observability, cost, and scenario questions

Observability is the three pillars plus AWS plumbing: metrics (CloudWatch, custom metrics, EMF), logs (CloudWatch Logs, log groups, retention, subscription filters to OpenSearch/Kinesis), and traces (X-Ray, or OpenTelemetry via ADOT exporting to CloudWatch or Amazon Managed Grafana). Know the difference between alerting on symptoms (latency and error rate, the SLO breach the user feels) and alerting on causes (CPU, for example), and be able to define an SLI, SLO, and error budget out loud. For EKS, mention Container Insights and the ADOT stack, and that you scrape Prometheus metrics and ship them to Amazon Managed Service for Prometheus with Amazon Managed Grafana on top.

Cost questions are increasingly common because cloud bills hurt. Have a concrete toolkit: right-size with Compute Optimizer, cover the steady baseline with Savings Plans or Reserved Instances and run fault-tolerant or batch workloads on Spot, set S3 lifecycle policies and storage classes (Intelligent-Tiering, Glacier), kill idle NAT gateways and unattached EBS volumes and Elastic IPs, use VPC endpoints to dodge NAT data charges, and tag everything so Cost Explorer can attribute spend per team. The mature answer ties cost to architecture: the cheapest request is the one you never make, so caching and cutting cross-AZ chatter often beat raw right-sizing.

Scenario questions are the real differentiator: 'p99 latency just doubled, walk me through it' or 'a deploy went out and error rate spiked, what now?' Answer with a method, not a guess: stop the bleed first (roll back or scale out), establish a timeline from the deploy/change log, narrow from metrics to logs to traces, form a hypothesis, test it, and only then fix forward. Name the AWS levers: ASG or HPA scaling to add capacity, target-group connection draining (the deregistration delay), CloudWatch alarms, and the CodeDeploy rollback. Close with the follow-up: a blameless postmortem, an alert that would have caught it sooner, and an action item to prevent recurrence.

  • Alert on the SLO/symptom (latency, errors), not just CPU; know SLI/SLO/error budget.
  • Cost levers: Savings Plans/RIs + Spot, right-sizing, S3 lifecycle, kill idle NAT/EBS/EIP.
  • Incident method: stop the bleed, build a timeline, metrics then logs then traces, then fix.
  • Roll back before you debug forward when a recent change is the prime suspect.
  • Tag for cost attribution; finish incidents with a blameless postmortem and an action item.
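
The SLI/SLO/error-budget vocabulary above comes down to a little arithmetic. The sketch below assumes a request-based availability SLI; the function names and the sample numbers are illustrative and do not come from any AWS API.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Failed requests the SLO tolerates over the measurement window."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget already spent; 1.0 means exhausted."""
    return failed_requests / error_budget(slo_target, total_requests)


def burn_rate(slo_target: float, observed_error_rate: float) -> float:
    """Budget spend speed relative to plan; 1.0 uses it up exactly at window end."""
    return observed_error_rate / (1.0 - slo_target)


if __name__ == "__main__":
    # Hypothetical month: 99.9% SLO, 1,000,000 requests, 250 failures so far.
    print(round(error_budget(0.999, 1_000_000)))             # allowed failures
    print(round(budget_consumed(0.999, 1_000_000, 250), 2))  # share of budget used
    print(round(burn_rate(0.999, 0.01), 1))                  # 1% errors vs a 0.1% budget
```

A burn rate well above 1.0 is the kind of symptom-based signal worth paging on, since it says users are being hurt faster than the SLO allows. The sketch assumes an SLO target below 1.0; a 100% target has no budget to divide by.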

Common interview questions & answers

ECS vs EKS — when would you choose each?

Choose ECS, especially Fargate, when the team is small, the workload is straightforward, and you want minimal operational overhead with deep native IAM via task roles — there is no control plane to upgrade and nothing below the task to manage. Choose EKS when you need Kubernetes primitives (CRDs, operators, Helm, GitOps with Argo CD/Flux), multi-cloud portability, or you already have Kubernetes expertise, accepting the cost of the hourly managed control plane plus owning upgrades, add-ons, and the VPC CNI. I frame it as a build-vs-operate trade: ECS minimizes what you operate, EKS maximizes what you can express.

What is IRSA, and how does it differ from EKS Pod Identity?

IRSA — IAM Roles for Service Accounts — lets a specific Kubernetes service account assume a scoped IAM role through the cluster's OIDC provider, so a pod gets short-lived, least-privilege AWS credentials instead of inheriting the node's instance-profile permissions. Without it, every pod on a node shares the node role, which violates least-privilege and widens blast radius. For IRSA you create an OIDC provider for the cluster, write a trust policy that scopes the role to a namespace and service account, and annotate the service account with the role ARN. EKS Pod Identity is the newer alternative that does the same job through an EKS agent and a simpler association API — no per-cluster OIDC provider and easier to reuse a role across clusters. Either way the goal is the same: each workload gets exactly the permissions it needs.
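
The trust policy described above can be sketched as a plain document builder. This is a minimal illustration, not the only valid shape: the account ID, OIDC provider path, namespace, and service-account name are hypothetical placeholders, and the helper name is invented for this example.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy that lets exactly one Kubernetes service account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Scope to one namespace and one service account.
                f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    policy = irsa_trust_policy(
        "111122223333",
        "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE0123456789",
        "payments",
        "payments-api",
    )
    print(json.dumps(policy, indent=2))
```

The service account then carries the eks.amazonaws.com/role-arn annotation pointing at the role. Using StringEquals on the sub claim, rather than a wildcard match, is what keeps other service accounts in the cluster from assuming the same role.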

Explain the difference between security groups and network ACLs.

Security groups are stateful and operate at the ENI/instance level — if you allow inbound traffic, the return traffic is automatically permitted, and they support allow rules only. Network ACLs are stateless and operate at the subnet level — you must explicitly allow both inbound and outbound directions, and they support both allow and deny rules, evaluated in rule-number order. In practice I default to security groups for almost all access control and reserve NACLs for coarse, subnet-wide deny rules, for example blocking a known-bad CIDR. The classic bug is forgetting that a stateless NACL needs an outbound rule for the ephemeral return ports.
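
The stateful/stateless difference is easier to explain with a toy model. The sketch below is a deliberate simplification that matches on port only (no CIDRs or protocols); the class and function names are made up for illustration and are not an AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int
    action: str      # "allow" or "deny"
    port_from: int
    port_to: int


def nacl_allows(rules, port: int) -> bool:
    """First matching rule in ascending rule-number order wins; no match means deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action == "allow"
    return False


def nacl_round_trip(inbound, outbound, server_port: int, client_port: int) -> bool:
    """Stateless: the request and the reply are evaluated independently."""
    return nacl_allows(inbound, server_port) and nacl_allows(outbound, client_port)


def sg_round_trip(allowed_inbound_ports, server_port: int) -> bool:
    """Stateful: an allowed inbound request implies the reply is allowed."""
    return server_port in allowed_inbound_ports


if __name__ == "__main__":
    inbound = [NaclRule(100, "allow", 443, 443)]
    no_outbound = []                                       # the classic bug
    ephemeral_out = [NaclRule(100, "allow", 1024, 65535)]  # reply ports opened
    print(nacl_round_trip(inbound, no_outbound, 443, 50123))
    print(nacl_round_trip(inbound, ephemeral_out, 443, 50123))
    print(sg_round_trip({443}, 443))
```

The first NACL case fails even though inbound 443 is allowed, because nothing permits the reply to the client's ephemeral port. The security group needs only the inbound rule.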

How do you manage Terraform state safely on a team?

Never use local state for shared infrastructure. Configure a remote backend — an S3 bucket with versioning enabled for the state file, and state locking so two concurrent applies can't corrupt it. On Terraform 1.10+ I use S3-native locking (use_lockfile), which relies on S3 conditional writes; the older approach used a DynamoDB lock table, now deprecated. Encrypt the bucket with KMS, restrict access with bucket and IAM policies, and split state by environment or service to limit blast radius. Then enforce a plan-review-apply workflow in CI so changes are reviewed as a plan before they touch real resources, and use terraform import rather than recreating anything that already exists.
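
As a rough illustration of the backend settings described above, the sketch below emits the configuration in Terraform's JSON syntax (a file such as backend.tf.json). The bucket name, state key, and region are hypothetical placeholders and the helper name is invented; bucket versioning and access policies are set on the bucket itself, not in this block.

```python
import json
from typing import Optional


def s3_backend_config(bucket: str, key: str, region: str,
                      kms_key_id: Optional[str] = None) -> dict:
    """Terraform JSON-syntax backend block: S3 state with native lock file."""
    s3 = {
        "bucket": bucket,
        "key": key,            # one key per environment/service limits blast radius
        "region": region,
        "encrypt": True,
        "use_lockfile": True,  # S3-native locking (Terraform 1.10+)
    }
    if kms_key_id:
        s3["kms_key_id"] = kms_key_id
    return {"terraform": {"backend": {"s3": s3}}}


if __name__ == "__main__":
    # Placeholder names for illustration only.
    config = s3_backend_config(
        "example-tf-state", "prod/network/terraform.tfstate", "eu-west-1"
    )
    print(json.dumps(config, indent=2))
```

Generating a separate key per environment or service from one helper is one way to keep the state split consistent across a team.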

Design a CI/CD pipeline to deploy a containerized service to EKS.

On a merge to main, CI builds and tests, then builds an immutable image tagged by commit SHA and pushes it to ECR. The CI runner authenticates to AWS via OIDC federation, so there are no stored access keys. For deploy I prefer GitOps: the pipeline bumps the image tag in a Git config repo, and Argo CD reconciles the cluster to that desired state, which makes rollback a git revert and auto-corrects drift. I gate progression environment-to-environment by promoting the same artifact, run a canary or blue/green rollout while watching CloudWatch or Prometheus alarms, and trigger an automatic rollback if the error budget burns too fast. Pipeline permissions are scoped per stage so a build stage can never touch production.
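
The "no stored access keys" step rests on a role trust policy for the CI provider's OIDC tokens. The sketch below assumes GitHub Actions as the runner (the answer above does not name one); the account ID and repository are placeholders, and the helper name is invented for this example.

```python
import json

GITHUB_OIDC = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Trust policy letting workflows on one repo and branch assume the deploy role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{GITHUB_OIDC}:aud": "sts.amazonaws.com",
                # Pin to one repository and one branch, not the whole organization.
                f"{GITHUB_OIDC}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
            }},
        }],
    }


if __name__ == "__main__":
    # "example-org/payments-api" is a hypothetical repository.
    print(json.dumps(github_oidc_trust_policy("111122223333", "example-org/payments-api"), indent=2))
```

Pinning the sub claim to a branch is what stops a pull request from a feature branch or a fork from assuming the production deploy role.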

Why is iam:PassRole dangerous, and how do you contain it?

iam:PassRole lets a principal hand an IAM role to an AWS service — for example attaching a role to an EC2 instance, a Lambda function, or an ECS task. If a user can pass a role more privileged than themselves, that is a privilege-escalation path: they effectively gain those permissions through the service. You contain it by scoping the Resource of the PassRole permission to specific role ARNs, adding an iam:PassedToService condition so the role can only be handed to the intended service, and capping everything with permission boundaries and Organization SCPs. Auditing PassRole calls in CloudTrail and running IAM Access Analyzer rounds it out.
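
The containment described above (specific role ARNs plus an iam:PassedToService condition) might look like this as a policy document. The role ARN and the helper name are hypothetical; treat it as a sketch of the shape, not a complete deploy policy.

```python
import json


def scoped_pass_role_policy(role_arns, service: str) -> dict:
    """Allow iam:PassRole only for named roles, and only to one AWS service."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": list(role_arns),  # never "*"
            "Condition": {"StringEquals": {"iam:PassedToService": service}},
        }],
    }


if __name__ == "__main__":
    # Placeholder role ARN for illustration.
    policy = scoped_pass_role_policy(
        ["arn:aws:iam::111122223333:role/payments-task-role"],
        "ecs-tasks.amazonaws.com",
    )
    print(json.dumps(policy, indent=2))
```

With this statement a deploy pipeline can attach the task role to ECS tasks, but cannot hand it to Lambda or EC2, and cannot pass any other role at all.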

How does an Auto Scaling group keep a service healthy, and how do you scale it?

An ASG maintains desired capacity across multiple AZs using a launch template; if an instance fails its health check, the ASG terminates and replaces it. The health-check type matters: EC2 health checks only catch instance-level failure, so for a web service you set the type to ELB so that an instance failing the load balancer's target-group health check is also replaced. For scaling, target-tracking policies (for example, keep average CPU or ALB request-count-per-target at a target) are the cleanest; step scaling handles sharp spikes and scheduled scaling handles predictable patterns. I also use a mixed instances policy with Spot for cost, with capacity rebalancing so interruptions are handled gracefully ahead of the two-minute notice.
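
A useful mental model for target tracking is proportional scaling: capacity moves by the ratio of the observed metric to the target. The sketch below is only that approximation, under the assumption that the metric scales inversely with capacity; the real service also applies cooldowns, instance warm-up, and more conservative scale-in, and the function name is invented.

```python
import math


def target_tracking_desired(current_capacity: int, metric_value: float,
                            target_value: float, min_size: int, max_size: int) -> int:
    """Approximate desired capacity, clamped to the group's min and max size."""
    raw = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # Hypothetical group: 4 instances, CPU target 50%, bounds 2..10.
    print(target_tracking_desired(4, 80.0, 50.0, 2, 10))   # load above target: scale out
    print(target_tracking_desired(4, 20.0, 50.0, 2, 10))   # load below target: scale in to the floor
    print(target_tracking_desired(4, 500.0, 50.0, 2, 10))  # capped at max size
```

Being able to walk through this arithmetic out loud, including the min/max clamp, usually lands better than reciting policy types.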

How do you cut a high AWS bill without hurting reliability?

Start with attribution: tag resources and use Cost Explorer to find the top line items, which are usually compute, NAT data processing, and storage. Right-size with Compute Optimizer, then cover the steady baseline with Savings Plans or Reserved Instances and run fault-tolerant or batch work on Spot. Add VPC endpoints for S3, DynamoDB, and high-traffic services to dodge NAT data charges, set S3 lifecycle policies and Intelligent-Tiering, and reap idle resources — unattached EBS volumes, idle NAT gateways, unused Elastic IPs. The architectural framing is that the cheapest request is the one you never make, so caching and reducing cross-AZ chatter often beat raw right-sizing — and none of this should touch your redundancy: keep multi-AZ and your reserved headroom intact.
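
The NAT-versus-endpoint argument is a quick back-of-the-envelope calculation. In the sketch below every rate is a placeholder passed in by the caller, not a quoted AWS price; check the current pricing page for your region before using real numbers. Gateway endpoints for S3 and DynamoDB carry no additional charge, so only interface endpoints are modeled.

```python
def nat_processing_cost(gb_per_month: float, nat_rate_per_gb: float) -> float:
    """Monthly NAT gateway data-processing charge for the given traffic."""
    return gb_per_month * nat_rate_per_gb


def interface_endpoint_cost(gb_per_month: float, rate_per_gb: float,
                            hourly_rate_per_az: float, az_count: int,
                            hours_per_month: int = 730) -> float:
    """Monthly interface-endpoint charge: data processing plus per-AZ hourly fee."""
    return gb_per_month * rate_per_gb + hourly_rate_per_az * az_count * hours_per_month


if __name__ == "__main__":
    # Placeholder inputs: 10 TB/month to one service, endpoints in 3 AZs.
    traffic_gb = 10_000
    nat = nat_processing_cost(traffic_gb, nat_rate_per_gb=0.045)
    endpoint = interface_endpoint_cost(
        traffic_gb, rate_per_gb=0.01, hourly_rate_per_az=0.01, az_count=3
    )
    print(f"NAT processing: {nat:.2f}")
    print(f"Interface endpoint: {endpoint:.2f}")
    print(f"Difference: {nat - endpoint:.2f}")
```

The fixed hourly fee means an interface endpoint only pays off above some traffic volume, which is why the answer above says to add them for high-traffic services, not for everything.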

p99 latency on a service just doubled after a deploy. Walk me through it.

First I stop the bleed: if the deploy is the obvious suspect, I roll it back or shift canary traffic away before debugging forward, because reducing user impact beats finding root cause. Then I build a timeline from the deploy and change logs to correlate the spike with a specific change. I narrow top-down — dashboards and metrics to localize which service or dependency, then logs for errors, then X-Ray traces to find the slow span (a new N+1 query, a saturated connection pool, a downstream throttle). Once I have a confirmed hypothesis I fix forward, then close with a blameless postmortem and an alert or test that would have caught it earlier.

How do you handle a region or AZ failure for a critical service?

At the AZ level the baseline is multi-AZ by design: subnets and an ASG or EKS node group spread across at least two AZs, a NAT gateway per AZ, and RDS Multi-AZ or Aurora with replicas, so recovery from a single AZ loss is automatic. For a region failure you choose a strategy by RTO/RPO and budget: backup-and-restore is cheapest and slowest, pilot light keeps a minimal core warm, warm standby runs a scaled-down full stack, and active-active routes traffic to both regions via Route 53 with health checks. I say explicitly that the data layer is the hard part: it needs cross-region replication and a tested database failover. I also stress that DR is only real if you actually run game days against it.

What's the difference between Secrets Manager and SSM Parameter Store?

Both store configuration and can encrypt values with KMS, but Secrets Manager is purpose-built for secrets: native automatic rotation, tight integration with RDS and other services, and resource policies for cross-account sharing — at a higher per-secret cost. SSM Parameter Store is cheaper and great for general config; its SecureString type gives you KMS-backed encryption, and standard parameters are effectively free. I use Parameter Store for app config and non-rotating values, and Secrets Manager when I need built-in rotation or first-class secret features. Either way the app fetches at runtime via an IAM role — secrets never live in the image or in environment files committed to Git.

How do pods get networking on EKS, and what goes wrong at scale?

The AWS VPC CNI assigns each pod a real IP from the VPC subnet via secondary IPs on the node's ENIs, which gives native VPC networking and security-group reachability but means pods consume your subnet's IP space directly. At scale the classic failure is IP exhaustion: pods stop scheduling because the subnet ran out of addresses, or the node hits its ENI/IP-per-instance limit. Mitigations are prefix delegation (each secondary-IP slot on an ENI holds a /28 prefix instead of a single IP), a larger or secondary CIDR for the pod subnets, and choosing instance types whose ENI limits match the pod density you want. It is worth knowing that alternatives like Cilium exist if you need richer network policy or different IP semantics.
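
The per-node limit can be worked out from the instance's ENI numbers. The sketch below follows the commonly cited max-pods arithmetic for the VPC CNI; the ENI figures are example values for an m5.large, the 110 cap is the usual recommendation for smaller instances, and the function names are invented for illustration.

```python
def max_pods_secondary_ip(enis: int, ips_per_eni: int) -> int:
    """Each ENI's primary IP is reserved; +2 covers host-network pods."""
    return enis * (ips_per_eni - 1) + 2


def max_pods_prefix_delegation(enis: int, ips_per_eni: int, cap: int = 110) -> int:
    """Each secondary slot holds a /28 (16 addresses), capped at a recommended max."""
    return min(cap, enis * (ips_per_eni - 1) * 16 + 2)


if __name__ == "__main__":
    # Example values for an m5.large: 3 ENIs, 10 IPv4 addresses per ENI.
    print(max_pods_secondary_ip(3, 10))
    print(max_pods_prefix_delegation(3, 10))
```

Prefix delegation lifts the per-node ceiling, but every /28 still comes out of the subnet, so it does not fix a subnet that is simply too small.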

Practice this for real, from your target job

Reading about it only gets you so far. Paste a job description into prepme and get hands-on k3s/Terraform labs, auto-graded exams, and an architecture round — generated for that exact role and scored 0–100. Generating a briefing is free.

FAQ

How should I prepare for an AWS DevOps interview?

Split prep into the four loops interviewers test — provision, ship, secure, operate — and make sure you can speak to each with a concrete story. Do hands-on work, not just reading: stand up a VPC, break and fix an EKS deployment, write a least-privilege IAM policy, and run a terraform apply against real resources so your answers come from muscle memory rather than memorization.

How much do I need to know about Kubernetes for an AWS DevOps role?

If the role mentions EKS, expect Kubernetes to be a major part of the interview — pods, services, deployments, autoscaling, RBAC, and EKS specifics like IRSA and the VPC CNI. For ECS-only shops you can lean lighter on Kubernetes, but container fundamentals (images, networking, health checks, rollout strategies) are non-negotiable either way.

Are AWS certifications enough to pass the interview?

Certifications help you cover breadth and signal baseline knowledge, but they rarely carry an interview on their own. The questions that decide the outcome are scenario- and judgment-based — debugging a latency spike, designing for AZ failure, scoping IAM — and those reward hands-on experience and clear reasoning far more than a certificate.

How do I answer scenario and troubleshooting questions well?

Lead with a method instead of jumping to a guess: stop user impact first, build a timeline from recent changes, then narrow from metrics to logs to traces before forming and testing a hypothesis. Naming the specific AWS levers (rollback, autoscaling, alarms, draining) and closing with a prevention follow-up is what makes the answer sound senior.

What mistakes make candidates fail AWS DevOps interviews?

The common ones are reciting service definitions without trade-offs, ignoring failure modes and blast radius, hand-waving on IAM and networking specifics, and never mentioning cost. The fix is to volunteer the gotcha and the unhappy path in every design answer — that is the signal interviewers are actually listening for.

Can I practice realistic AWS DevOps scenarios before the interview?

Yes. On prepme.io you paste a real AWS DevOps job description and get a free briefing tailored to that exact stack — hands-on Kubernetes and Terraform labs you debug in the browser, auto-graded ABCD exams, and an architecture design round, each scored 0–100 with specific feedback. Generating and viewing the briefing (plus free AI Company Coverage research on the employer) is free; the hands-on practice is $20/month and you can cancel anytime. Practicing on scenarios drawn from the actual role beats grinding a generic question bank.
