SRE

Wits Innovation Lab | TS, India | April 17, 2026

Job Description

About the Role

We are looking for a highly skilled Site Reliability Engineer (SRE) to manage and scale our on-premise payments infrastructure. You will work on-site, in an environment spanning virtual machines and containerized workloads on bare metal, ensuring high availability, security, and performance for mission-critical systems.

Key Responsibilities

  • Operate and optimize virtualized environments (VMs) and containerized workloads (Docker on bare metal)
  • Manage and scale middleware systems like:
      • Nginx (traffic routing, reverse proxy, load balancing)
      • Redis (caching, HA setup)
      • Kafka (streaming, partitioning, fault tolerance)
  • Build and maintain CI/CD pipelines using Jenkins
  • Manage infrastructure and application configurations using Git-based version control
  • Ensure high availability, resilience, and performance tuning across systems
  • Work on Linux system administration (RHEL/CentOS/Ubuntu)
  • Implement and maintain automation frameworks using:
      • Ansible
      • Shell scripting
  • Manage and troubleshoot networking components:
      • TCP/IP, DNS, load balancing
      • Firewalls, WAF policies
      • Akamai
  • Handle security and compliance requirements
  • Maintain accurate inventory and asset management systems
  • Participate in incident response, RCA, and system reliability improvements
  • Collaborate with application, security, and DevOps teams

Required Skills & Qualifications

Core Infrastructure

  • Strong hands-on experience with Linux system administration
  • Experience managing on-prem data center environments
  • Solid understanding of:
      • Virtualization (VMware / KVM or similar)
      • Bare metal provisioning

Containers & Middleware

  • Experience running Docker in production (non-Kubernetes setups preferred)
  • Strong operational knowledge of:
      • Nginx
      • Redis
      • Kafka
      • RDBMS
      • Java

Observability, Alerting & Reliability

  • Design and manage observability platforms:
      • Elastic Stack (ELK)
      • Grafana / Prometheus stack
  • Build and maintain:
      • Metrics, logs, and tracing pipelines
      • Dashboards for system health and business KPIs
  • Develop intelligent alerting strategies:
      • Reduce noise (alert fatigue)
      • Improve signal quality
  • Build correlation mechanisms / alert aggregation systems to:
      • Reduce MTTD (Mean Time to Detect)
      • Reduce MTTR (Mean Time to Recover)
  • Drive proactive monitoring and anomaly detection
  • Lead incident response, debugging, and RCA with data-driven insights

CI/CD & Version Control

  • Hands-on experience with:
      • Git (branching strategies, code reviews, infra-as-code workflows)
      • Jenkins (pipeline creation, build automation, deployment orchestration)

Networking & Security

  • Good understanding of:
      • Networking fundamentals (L3/L4 concepts)
      • Firewalls and WAF (rule tuning, debugging)
  • Experience handling secure production environments

Automation

  • Hands-on experience with:
      • Ansible
      • Shell scripting (bash)

Operations

  • Experience with:
      • Monitoring, alerting, and logging systems
      • Incident management & RCA
      • Capacity planning

Preferred Qualifications (Good to Have)

  • Experience in UPI / Payments domain
  • Understanding of:
      • High TPS systems
      • Low latency architecture
  • Exposure to:
      • Ceph / SAN / storage systems
      • HA/DR design patterns
  • Knowledge of observability stacks (Prometheus, ELK, etc.)
  • Experience working in regulated environments (PCI-DSS, RBI guidelines)

Pay: ₹600,000.00 - ₹1,700,000.00 per year

Work Location: In person
