Regional Site Reliability Engineer (SRE)

Position Responsibilities

Production Reliability & Availability

  • Ensure high availability and performance of production services across multiple regions
  • Define and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets
  • Lead incident response and root cause analysis (RCA) for production issues
  • Improve system resilience through fault tolerance, redundancy, and graceful degradation

Cloud Infrastructure & Platform Operations
  • Operate and optimize containerized services running on AWS ECS
  • Manage cloud infrastructure including:
    • AWS ECS
    • Application Load Balancers
    • Auto Scaling
    • CloudWatch
    • VPC networking
  • Ensure reliable deployment pipelines and infrastructure consistency

Microservices Reliability Engineering
  • Support and optimize Go and Node.js microservices
  • Improve service performance, scalability, and fault tolerance
  • Implement health checks, circuit breakers, and retry strategies
  • Collaborate with development teams to improve service architecture

Observability & Monitoring
  • Implement and maintain observability systems, including:
    • Metrics
    • Logging
    • Distributed tracing
  • Build dashboards and alerts to detect system issues early
  • Improve monitoring using tools such as:
    • Prometheus / Grafana
    • AWS CloudWatch
    • OpenTelemetry

CI/CD Automation
  • Build and maintain CI/CD pipelines for microservices
  • Automate infrastructure and operational tasks using:
    • Infrastructure as Code (Terraform / CloudFormation)
    • Scripts or internal tooling
  • Improve deployment reliability and reduce manual intervention

Incident Management & Operational Excellence
  • Participate in on-call rotations
  • Drive blameless postmortems
  • Implement preventive actions to eliminate recurring incidents
  • Continuously improve operational runbooks and response processes

Qualifications & Experience
  • 4+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
  • Experience supporting distributed microservices architectures
  • Experience operating high-traffic production systems

Technical Skills:

Cloud & Infrastructure Management
  • Primary Platform: Extensive experience with AWS ecosystem management.
  • Compute & Orchestration: Hands-on expertise in AWS ECS (Fargate & EC2 launch types) and Docker containerization.
  • Networking: Proficient in VPC configuration, Application Load Balancers (ALB), and Network Load Balancers (NLB).
  • Scaling: Experienced in managing Auto Scaling for high-traffic production environments.

Programming & Automation
  • Application Support: Proven ability to support and optimize distributed microservices written in Go (Golang) and Node.js.
  • Automation & Scripting: Skilled in developing internal tools and automation scripts using Go, Python, and Bash.
  • Infrastructure as Code (IaC): Experienced in automating infrastructure using Terraform or CloudFormation.

Additional:
  • Experience managing multi-region infrastructure
  • Experience operating high-scale microservices systems
  • Knowledge of service mesh architectures
  • Experience with AWS ECR, Lambda, or event-driven architecture
  • Experience with cost optimization in AWS
