Zus Coffee

Regional Site Reliability Engineer (SRE)

Position Responsibilities

1. Production Reliability & Availability

Ensure high availability and performance of production services across multiple regions
Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets
Lead incident response and root cause analysis (RCA) for production issues
Improve system resilience through fault tolerance, redundancy, and graceful degradation

2. Cloud Infrastructure & Platform Operations

Operate and optimize containerized services running on AWS ECS
Manage cloud infrastructure including:

AWS ECS
Application Load Balancers
Auto Scaling
CloudWatch
VPC networking

Ensure reliable deployment pipelines and infrastructure consistency

3. Microservices Reliability Engineering

Support and optimize Go and Node.js microservices
Improve service performance, scalability, and fault tolerance
Implement health checks, circuit breakers, and retry strategies
Collaborate with development teams to improve service architecture

4. Observability & Monitoring

Implement and maintain observability systems, including:

Metrics
Logging
Distributed tracing

Build dashboards and alerts to detect system issues early
Improve monitoring using tools such as:

Prometheus / Grafana
AWS CloudWatch
OpenTelemetry

5. CI/CD Automation

Build and maintain CI/CD pipelines for microservices
Automate infrastructure and operational tasks using:

Infrastructure as Code (Terraform / CloudFormation)
Scripts or internal tooling

Improve deployment reliability and reduce manual intervention

6. Incident Management & Operational Excellence

Participate in on-call rotations
Drive blameless postmortems
Implement preventive actions to eliminate recurring incidents
Continuously improve operational runbooks and response processes

Qualifications & Experience

4+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
Experience supporting distributed microservice architectures
Experience operating high-traffic production systems

Technical Skills

1. Cloud & Infrastructure Management

Primary Platform: Extensive experience with AWS ecosystem management.
Compute & Orchestration: Hands-on expertise in AWS ECS (Fargate & EC2 launch types) and Docker containerization.
Networking: Proficient in VPC configuration, Application Load Balancers (ALB), and Network Load Balancers (NLB).
Scaling: Experienced in managing Auto Scaling to keep high-traffic production environments stable and responsive.

2. Programming & Automation

Application Support: Proven ability to support and optimize distributed microservices written in Go (Golang) and Node.js.
Automation & Scripting: Skilled in developing internal tools and automation scripts using Go, Python, and Bash.
Infrastructure as Code (IaC): Experienced in automating infrastructure using Terraform or CloudFormation.

Additional:

Experience managing multi-region infrastructure
Experience operating high-scale microservices systems
Knowledge of service mesh architectures
Experience with AWS ECR, Lambda, or event-driven architecture
Experience with cost optimization in AWS
