SRE

Wits Innovation Lab · TS, India · April 17, 2026

Job Description

About the Role

We are looking for a highly skilled Site Reliability Engineer (SRE) to manage and scale our on-premises payments infrastructure. You will work in an on-site environment spanning virtual machines and containerized workloads on bare metal, ensuring high availability, security, and performance for mission-critical systems.

Key Responsibilities

Operate and optimize virtualized environments (VMs) and containerized workloads (Docker on bare metal)
Manage and scale middleware systems like:
Nginx (traffic routing, reverse proxy, load balancing)
Redis (caching, HA setup)
Kafka (streaming, partitioning, fault tolerance)
Build and maintain CI/CD pipelines using Jenkins
Manage infrastructure and application configurations using Git-based version control
Ensure high availability, resilience, and performance tuning across systems
Work on Linux system administration (RHEL/CentOS/Ubuntu)
Implement and maintain automation frameworks using:
Ansible
Shell scripting
Manage and troubleshoot networking components:
TCP/IP, DNS, load balancing
Firewalls, WAF policies
Akamai
Handle security and compliance requirements
Maintain accurate inventory and asset management systems
Participate in incident response, RCA, and system reliability improvements
Collaborate with application, security, and DevOps teams

Required Skills & Qualifications

Core Infrastructure

Strong hands-on experience with Linux system administration
Experience managing on-prem data center environments
Solid understanding of:
Virtualization (VMware / KVM or similar)
Bare metal provisioning

Containers & Middleware

Experience running Docker in production (non-Kubernetes setups preferred)
Strong operational knowledge of:
Nginx
Redis
Kafka
RDBMS
Java

Observability, Alerting & Reliability

Design and manage observability platforms:
Elastic Stack (ELK)
Grafana / Prometheus stack
Build and maintain:
Metrics, logs, and tracing pipelines
Dashboards for system health and business KPIs
Develop intelligent alerting strategies:
Reduce noise (alert fatigue)
Improve signal quality
Build correlation mechanisms / alert aggregation systems to:
Reduce MTTD (Mean Time to Detect)
Reduce MTTR (Mean Time to Recover)
Drive proactive monitoring and anomaly detection
Lead incident response, debugging, and RCA with data-driven insights

CI/CD & Version Control

Hands-on experience with:
Git (branching strategies, code reviews, infra-as-code workflows)
Jenkins (pipeline creation, build automation, deployment orchestration)

Networking & Security

Good understanding of:
Networking fundamentals (L3/L4 concepts)
Firewalls and WAF (rule tuning, debugging)
Experience handling secure production environments

Automation

Hands-on experience with:
Ansible
Shell scripting (bash)

Operations

Experience with:
Monitoring, alerting, and logging systems
Incident management & RCA
Capacity planning

Preferred Qualifications (Good to Have)

Experience in UPI / Payments domain
Understanding of:
High TPS systems
Low latency architecture
Exposure to:
Ceph / SAN / storage systems
HA/DR design patterns
Knowledge of observability stacks (Prometheus, ELK, etc.)
Experience working in regulated environments (PCI-DSS, RBI guidelines)

Pay: ₹600,000.00 - ₹1,700,000.00 per year

Work Location: In person
