Job Description:
Your Role
As a Site Reliability Engineer-I, you will support services that are offered as a Managed Service/SaaS, which means total ownership of the solution, including securing it and keeping it up and running at all times, remains with us. This is a critical healthcare application that must stay available 24x7, 365 days a year, and the stakes are high. We are seeking an experienced engineering team member who brings a combination of Ops/Support and site reliability experience and, most importantly, a “can-do” attitude and a strong sense of ownership.
A Day in the Life
In this role, you will be responsible for various pillars of SRE: Deployment, Reliability, Scalability, Service Availability (SLA/SLO/SLI), Performance, Cost, etc.
Lead the production rollout of new releases and emergency patches using CI/CD pipelines, and continuously improve those pipelines. Establish a solid production promotion/change management process with a strong quality gate, working across Dev/QA teams.
Roll out a solid observability stack across the components of the tech stack to proactively detect, and distinguish between, outages and service degradation before the customer notifies us.
Apply strong analytical skills to understand production system metrics, drive change, optimize system utilization and drive cost efficiency.
Scale the platform up and down automatically to handle peak-season scenarios.
Understand the end-to-end platform architecture and how to perform triage/RCA quickly and effectively using data points from the observability tool chain.
Work towards reducing the number of alerts and escalations to the next-level teams (Dev/DevOps).
Be part of the 24x7 on-call Production Support team.
Lead the monthly operations review with the executive team. Topics include, but are not limited to, Platform/Application/Infrastructure KPIs: uptime, RCA, CAP (Corrective Action Plan) and PAP (Preventive Action Plan), security reports, and audit reports.
Operate and manage production and staging cloud platforms, covering both Ops (executing and automating runbooks/SOPs, maintaining uptime/SLAs) and Site Reliability Engineering.
Collaboration is key to this role: you will work across a spectrum of teams (Dev, DevOps, QA, Customer Success, etc.) to derive RCA/5 Whys analyses and drive product improvements.
Ensure that the platform is secured per the guidelines established by the CISO, e.g., protect against DDoS attacks by implementing a WAF, perform vulnerability and patch management, and install required security agents.
Lead least-privilege RBAC for the various production services and the tool chain.
Build and execute the Disaster Recovery plan.
Participate as a key stakeholder in case of IR (Incident Response).
What You Need
Technical
Solid experience with at least one cloud (AWS, Azure, or GCP), with an automation focus, is a MUST. Certification is an advantage.
Experience building reliable, scalable, and performant systems in production. This requires significant engineering experience and risk evaluation.
Experience with a log/metrics/tracing tool chain is a MUST, along with strong analytical skills to interpret data points for understanding platform behavior and performing RCA.
Hands-on experience with Kubernetes and Linux is a MUST.
Programming experience with a scripting language, e.g. Python, is a MUST.
Must be good at writing and structuring documents, whether process documentation or RCAs.
Experience working in a 24x7 Production environment with process focus is preferred.
Experience with ticketing systems and incident management is preferred.
Security background and security first approach mindset is preferred.
Experience with CI/CD pipelines and tool chains is preferred.
Hands-on experience with a few of these (Kafka, Postgres, Snowflake, etc.) is preferred.
Personality Traits
Must be able to keep a cool head under pressure and not take any shortcuts during production issues.
Collaboration and solid communication skills are critical to this role. Possesses excellent verbal and written communication skills and the ability to interact professionally with a diverse group of developers, product owners, and subject matter experts.
Strong cross-functional collaboration skills, relationship-building skills, and the ability to achieve results without direct reporting relationships.
Ability to quickly identify and drive to the optimal solution when presented with a series of constraints.
Excellent judgment, analytical thinking, and problem-solving skills.
Self-motivated individual who possesses excellent time management and organizational skills.
Strong sense of personal responsibility and accountability for delivering high-quality work.
Preferred Skills:
Multi-cloud - AWS, Azure, GCP
Distributed Compute - Kubernetes (EKS/AKS), Containerization
Persistence stores - Postgres, MongoDB
Data Warehousing - Snowflake, Databricks
Messaging - Kafka
CI/CD - Jenkins, ArgoCD, GitOps
Observability - Elasticsearch, Prometheus, Jaeger, New Relic, etc.
About Company: