As a Site Reliability Engineer (SRE), you will maintain and improve our platform's reliability, availability, and performance, leveraging Azure as the core cloud platform and using industry-leading tools. You will work closely with cross-functional teams to design, implement, and maintain resilient systems, automating wherever possible to streamline operations and minimize downtime. Your expertise will be instrumental in proactively identifying and resolving potential issues before they impact our customers, and you will contribute to the continuous improvement of our infrastructure and processes.
Key Responsibilities:
- Analyze reliability challenges and develop automated solutions for incident resolution.
- Work with development teams to improve applications' operational features for faster MTTD, MTTR, and auto-recovery.
- Lead the establishment of SLIs, SLOs, error budgets, and policies, and work with the respective engineers to instrument and visualize services so that peer engineers and developers gain greater insight into operational performance (observability).
- Identify, track, and address Toil.
- Conduct post-mortems.
- Identify and implement continuous improvement in various facets of production operations.
- Offer advanced technical support for cross-product issues and incidents.
- Leverage SRE tooling to develop, implement, and deliver on the SRE mission.
- Conduct chaos testing.
- Identify, define, and implement new tools and technologies to improve the quality and efficiency of distributed platforms.
- Drive the reliability and supportability aspects of cloud services, including change management, triage of customer escalations, remediation plans, playbooks, and automation.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Engage in and improve the whole lifecycle of services from inception and design through deployment, operation, and refinement.
Qualifications:
- 4+ years of experience in reliability engineering
- 2+ recent years of experience with Azure systems
- Advanced knowledge of New Relic ecosystem.
- Working knowledge of monitoring and APM tools such as Azure App Insights, Grafana, and Selenium
- Knowledge of networking and troubleshooting latency, connectivity, and performance
- Experience working with IaC with Terraform and CaC with Ansible.
- Familiarity with one or more databases, such as SQL Server, MongoDB, and PostgreSQL
- Hands-on experience with SRE practices and with writing and running chaos engineering experiments.
- Preferred experience with C#, .NET, and PowerShell, Python, or Golang
- Experience with containerization.
- Experience in High Availability and distributed systems.
- Proficient in Linux and Windows administration, troubleshooting, and support
- Experience with Azure DevOps
- Excellent debugging skills across a variety of integrated platforms.
StarCompliance Background Checks
All positions require pre-employment screening because employees may have access to highly sensitive and confidential information involving finance and compliance. Candidates must be trustworthy and have a heightened sensitivity to protecting confidential financial and professional information. To be eligible for employment with StarCompliance, candidates must undergo a rigorous background investigation with checks including, but not limited to, criminal record history, consumer credit, employment history, qualifications, and education checks.
Equal Opportunity Employer Statement
We prohibit discrimination and harassment of any kind based on race, sex, religion, sexual orientation, national origin, disability, genetic information, pregnancy, gender identity or expression, marital/civil union/domestic partnership status, veteran status, or any other protected characteristic as outlined by country, state, or local laws. This policy applies to all employment practices within our organisation, including hiring, recruiting, promotion, termination, layoff, recall, leave of absence, compensation, benefits, training, and apprenticeship.
StarCompliance makes hiring decisions based solely on qualifications, merit, and business needs at the time. For more information, please request a copy of our Equal Opportunities Policy.