prepme.io
DevOps Engineer (First DevOps Hire)
Vetric · senior
Hands-on labs
Terragrunt Multi-Env EKS Bootstrap · easy
Stand up a Terragrunt-wrapped Terraform layout that provisions a minimal EKS cluster config across dev/staging/prod with shared modules. Focus on DRY structure, remote-state separation, and IAM boundaries — no actual apply needed, `plan` must succeed cleanly.
Enter lab →
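One plausible shape for the DRY layout this lab asks for, with thin per-environment wrappers around a shared `modules/eks` (bucket, table, cluster, and path names are all hypothetical):

```hcl
# live/terragrunt.hcl — root config every environment includes
remote_state {
  backend = "s3"
  config = {
    bucket         = "example-tf-state"                                  # hypothetical bucket
    key            = "${path_relative_to_include()}/terraform.tfstate"   # one state key per env path
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-tf-locks"                                  # state locking
  }
}

# live/dev/eks/terragrunt.hcl — a thin per-environment wrapper
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules//eks"   # same shared module in staging/prod
}

inputs = {
  cluster_name = "vetric-dev"
  environment  = "dev"
}
```

With this shape, `terragrunt plan` in each `live/<env>/eks/` directory resolves its own isolated state while reusing one module, which is exactly the remote-state separation the lab grades on.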
KEDA Autoscaling for an SQS-Driven Scraper Worker · medium
You have a Kubernetes Deployment that consumes scrape jobs from an SQS queue. Install KEDA on a local kind cluster, wire a ScaledObject that scales workers from 0→50 based on queue depth, and validate behavior under a synthetic burst. Includes Prometheus scraping KEDA metrics for visibility.
Enter lab →
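The core of that lab is a single `ScaledObject`. A minimal sketch, assuming KEDA is installed and the queue URL, Deployment name, and auth reference are placeholders you'd replace:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scraper-worker
spec:
  scaleTargetRef:
    name: scraper-worker            # the Deployment consuming the queue
  minReplicaCount: 0                # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/scrape-jobs  # placeholder
        queueLength: "5"            # target visible messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: keda-aws-auth         # TriggerAuthentication, e.g. backed by IRSA
```

`queueLength` is the per-replica target, so a burst of 250 visible messages would drive the fleet toward the 50-replica ceiling.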
GitOps Delivery with ArgoCD + Progressive Rollout · hard
Build an end-to-end GitOps pipeline: GitHub Actions builds and pushes an image, writes the new tag to a Helm values file in an infra repo, and ArgoCD auto-syncs to a kind cluster with a canary rollout via Argo Rollouts. You'll debug a broken sync and add a failure-based auto-rollback.
Enter lab →
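The handoff in the middle of that pipeline, CI writing the new tag into the infra repo so ArgoCD can sync it, can be sketched as one workflow step (org, repo, paths, and secret names are placeholders; assumes `yq` v4 on the runner):

```yaml
# After the build-and-push step in the app repo's GitHub Actions workflow
- name: Bump image tag in the infra repo (Argo CD picks up the commit)
  env:
    TAG: ${{ github.sha }}
  run: |
    git clone "https://x-access-token:${{ secrets.INFRA_TOKEN }}@github.com/example-org/infra.git"
    cd infra
    yq -i '.image.tag = strenv(TAG)' charts/scraper/values.yaml
    git config user.name "ci-bot"
    git config user.email "ci@example.com"
    git commit -am "scraper: deploy ${TAG}"
    git push
```

Keeping the write in a separate infra repo is what makes the rollback story clean: reverting the values commit reverts the deploy.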
Design interviews
3 JD-grounded scenarios
easy
Design Vetric's baseline AWS landing zone and EKS platform that all future scraping/data services will run on, given this is the first DevOps hire walking into an unmanaged environment.
As the first DevOps hire, Day-1 decisions on account/IaC/EKS layout lock in the next several years — JD explicitly calls this out.
• AWS-only (JD specifies 5+ years AWS, EKS-centric)
• inferred: 3 environments (dev/staging/prod) across 2 AWS accounts minimum
• inferred: team <30 engineers today, growing
• Serve customers in 20+ countries — assume us-east-1 primary, eu-west-1 secondary
• +2 more
Open board →
medium
Design the autoscaling and scheduling strategy for Vetric's scraper worker fleet on EKS: workloads are bursty SQS/Kafka consumers hitting external targets, with per-customer isolation and strict egress-IP requirements.
Vetric's core product IS large-scale scraping pipelines — this is the single most load-bearing system the DevOps hire will own.
• inferred: 10k–100k concurrent scrape workers at peak, idle near-zero off-peak
• inferred: mixed job durations from seconds to hours
• Must use KEDA or similar event-driven autoscaler (JD bonus)
• Per-customer egress IP pools (inferred from 'public data infrastructure' domain)
• +2 more
Open board →
hard
Design an end-to-end observability + incident response platform for Vetric's data pipelines so a single on-call engineer can detect, triage, and resolve customer-impacting regressions across billions of records/day without drowning in alerts.
JD makes observability leadership a top responsibility and the customer base (public safety, cyber) means reliability engineering maturity is existential, not optional.
• inferred: billions of records processed daily across 20+ countries
• Stack constrained to Prometheus/Grafana/ELK/CloudWatch family (JD)
• SLAs tied to mission-critical customers in cybersecurity/public safety — false negatives cost lives
• Small team — one DevOps, handful of backend engineers on rotation
• +2 more
Open board →
Troubleshooting drills
2 scenarios — run them as interactive practice
medium
An ArgoCD sync is stuck in 'Progressing' for 20 minutes on a production app. How do you diagnose it?
Run drill →
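A plausible first pass at that drill, assuming the `argocd` CLI and cluster access (app, namespace, and label names are placeholders):

```shell
argocd app get my-app                           # which resources are Progressing vs Healthy?
argocd app resources my-app                     # narrow it to the stuck resource
kubectl -n prod rollout status deploy/my-app    # is a Deployment waiting on unready pods?
kubectl -n prod get pods -l app.kubernetes.io/instance=my-app   # Pending / ImagePullBackOff / CrashLoopBackOff?
kubectl -n prod describe pod <stuck-pod>        # events: failing probes, unschedulable, pull errors
kubectl -n argocd logs sts/argocd-application-controller | grep my-app   # controller-side sync errors
```

Common culprits behind a long "Progressing": a Deployment that never reaches ready (bad probe, bad image), a hook Job that hasn't completed, or a Service/Ingress waiting on a load balancer that will never provision.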
medium
An EKS node group is seeing random pod OOMKills at ~70% memory utilization. How do you investigate?
Run drill →
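The key insight in that drill is that OOMKills at ~70% node utilization usually mean individual containers are hitting their own memory limits, not that the node is out of memory. A sketch of the investigation (assumes `jq` and metrics-server; names in angle brackets are placeholders):

```shell
# Find recently OOM-killed containers (exit code 137, reason OOMKilled)
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled")
  | "\(.metadata.namespace)/\(.metadata.name)"'

kubectl -n <ns> describe pod <pod>             # compare the container memory limit to usage at kill time
kubectl -n <ns> top pod <pod> --containers     # live usage vs limit (metrics-server)
kubectl describe node <node> | grep -A 10 "Allocated resources"   # requests vs allocatable
# On the node itself: journalctl -k | grep -i oom   (cgroup OOM events name the victim)
```

If limits are the trigger, the fix is right-sizing requests/limits (or removing a too-tight limit); if the kernel OOM killer is involved at the node level, look at overcommit from requests set far below real usage.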
Stack
15 mentioned · 4 inferred
AWS (EKS, EC2, IAM, VPC) · Kubernetes · Terraform · OpenTofu · Terragrunt · GitHub Actions · Jenkins · ArgoCD · Prometheus · Grafana · ELK Stack · CloudWatch · Bash / Python / JavaScript · Amazon ECS · KEDA · Kafka or SQS (event-driven) · Helm · Docker · S3 / data lake storage
Likely questions
  • architecture · hard
    Walk me through how you'd design multi-tenant EKS clusters for data-heavy scraping workloads that burst from 10 to 500 worker pods within minutes.
    Practice
  • systems · medium
    How would you structure a Terraform + Terragrunt monorepo to manage dev/staging/prod across multiple AWS accounts with shared modules?
    Practice
  • troubleshooting · medium
    An ArgoCD sync is stuck in 'Progressing' for 20 minutes on a production app. How do you diagnose it?
    Practice
  • behavioral · medium
    You're the first DevOps hire. What are your first 30/60/90 days and how do you prioritize when everything needs attention?
    Practice
  • architecture · hard
    Design a KEDA-based autoscaler for a Kafka/SQS-consumer pipeline that processes bursty URL-scrape jobs with strict cost ceilings.
    Practice
  • security · medium
    What's your approach to IAM least-privilege when giving EKS pods access to S3 and DynamoDB — IRSA vs node roles vs Pod Identity?
    Practice
  • scripting · easy
    Write a Bash or Python script that rotates IAM access keys older than 90 days across multiple AWS accounts and posts results to Slack.
    Practice
  • architecture · hard
    How would you design end-to-end observability (metrics, logs, traces) for a pipeline ingesting billions of web records per day while controlling Prometheus cardinality?
    Practice
  • troubleshooting · medium
    An EKS node group is seeing random pod OOMKills at ~70% memory utilization. How do you investigate?
    Practice
  • behavioral · easy
    How do you convince engineering leadership to adopt a standard (e.g., GitOps via ArgoCD) when teams already ship via ad-hoc scripts?
    Practice
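For the scripting question above, the core worth practicing is the age-filtering logic. A minimal sketch: listing keys per account (boto3 `iam.list_access_keys()`) and posting to Slack would wrap around this function, and the data shape shown is an assumption mirroring boto3's response entries.

```python
from datetime import datetime, timedelta, timezone

def keys_to_rotate(keys, max_age_days=90, now=None):
    """Return the AccessKeyIds older than max_age_days.

    `keys` mirrors entries from boto3's iam.list_access_keys():
    dicts with "AccessKeyId" and a timezone-aware "CreateDate".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [k["AccessKeyId"] for k in keys if k["CreateDate"] < cutoff]
```

Keeping the pure logic separate from the AWS and Slack calls also makes the script unit-testable, which is an easy point to score in the interview.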
Culture
  • Bootstrapped, profitable — expect pragmatism and long-term thinking over hype-chasing.
  • First DevOps hire with full technical authority — high ownership, low guardrails, you define the playbook.
  • Small but sharp team — broad scope, expect to context-switch between IaC, clusters, pipelines, and observability.
  • Mission-critical customers (cybersecurity, public safety) — reliability and production discipline are non-negotiable.
  • Globally-serving infra across 20+ countries — reliability, scale, and SLAs matter from day one.
  • Engineering-led culture that values 'discipline' and 'quality' — expect rigorous review and thoughtful architectural debate.
From the bank
4 for this stack
  • We run Terraform across ~40 modules. What's your strategy for dependency ordering, state isolation, and avoiding drift?
  • Write a Bash one-liner that tails every pod's logs in a namespace and highlights any line containing 'ERROR' or '5xx'.
  • Tell me about a time a deploy went wrong in production. What was the blast radius, how did you recover, and what did you change afterwards?
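One plausible shape for the log-tailing one-liner in the bank (namespace via `$NS`; the 5xx pattern depends on your access-log format, so treat the regex as an assumption):

```shell
kubectl get pods -n "$NS" -o name \
  | xargs -r -P 0 -I {} kubectl logs -f --prefix -n "$NS" {} 2>/dev/null \
  | grep --line-buffered --color=always -E 'ERROR|" 5[0-9]{2} '
```

`--prefix` keeps the pod name on every line and `--line-buffered` stops grep from sitting on output while streams interleave; tools like `stern` are the usual production answer to the same problem.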
Browse all →
Original job description
DevOps Engineer
Engineering · Tel Aviv, Israel · Senior · Full-time
Description
What is Vetric?

Vetric builds large-scale public data infrastructure.

We provide data pipelines that collect, structure, and deliver high-volume public web data for mission-critical companies operating in cybersecurity, public safety and digital risk protection.

Our systems power platforms that detect bad actors, uncover impersonation and fraud, identify coordinated manipulation, and help public safety organizations respond faster to real-world risks.

We don’t build dashboards, and we don’t sell surface-level insights.

We build stable, production-grade data flows that become part of our customers’ core products, with real impact: saving lives and protecting major organizations from bad actors.

Operating globally, we serve industry leaders across more than 20 countries who rely on us for scale, reliability, and depth.



Why Vetric?

Vetric is profitable from day one (fully bootstrapped - we haven’t raised external funding), and we’re building foundational technology - not chasing trends. Because this is infrastructure that matters, we operate with engineering discipline, strong ownership, and long-term thinking.

We’re at a true inflection point: the team is now large enough to require real infrastructure, yet still small enough that what you build will define how things work for the next several years.

The infrastructure matters, and so does how we operate internally. You’ll be working with a sharp, focused team that takes engineering discipline seriously and is intentionally building an organization that matches the quality of its product.



Position Overview

We are seeking a DevOps Engineer to lead and own the entire DevOps function at Vetric. 

As our first DevOps hire, you won’t just maintain systems: you will set the vision, establish best practices, and build the foundation of our infrastructure strategy for years to come. This is a unique opportunity to step into an impactful role with full technical authority, influencing architectural decisions and guiding how our engineering teams deliver, scale, and secure our large-scale, data-intensive platform.



Responsibilities:

Define and drive Vetric’s infrastructure strategy across all environments
Architect and operate Kubernetes clusters at production scale with a focus on reliability, resilience, and data-heavy workloads
Lead the adoption of Infrastructure as Code (Terraform, OpenTofu, Terragrunt) and establish automation standards
Implement modern CI/CD pipelines (GitHub Actions, Jenkins, ArgoCD, or similar)
Champion observability, monitoring, and reliability engineering practices
Build and optimize infrastructure that powers large-scale, data-driven pipelines at massive scale
Serve as the technical authority for all DevOps matters, influencing and aligning engineering teams
Partner with engineering leadership to shape infrastructure roadmaps and technology choices
Requirements
Qualifications:

5+ years of deep, hands-on AWS experience (EKS, EC2, networking, IAM, scaling)
Proven success in senior DevOps / Cloud Engineering leadership roles
Expert knowledge of Terraform and modern IaC tools (OpenTofu, Terragrunt)
Strong Kubernetes expertise at scale (design, scaling, optimization)
Experience running high-scale, production-grade environments handling large data volumes
Excellent communication skills with the ability to influence, guide, and align teams
Solid scripting/automation skills (Bash, JavaScript, Python, or similar)
Familiarity with cloud-native monitoring & logging stacks (Prometheus, Grafana, ELK, CloudWatch, etc.)


We’d be lucky if you have:

Experience with Amazon ECS
Proficiency with GitHub or similar platforms (GitLab, Bitbucket, etc.)
Exposure to event-driven architectures and autoscaling frameworks (KEDA or similar)
prepme.io — DevOps interview prep