Job Type: Full-time
Description
Reliability and Observability Lead Engineer
Who we are:
Cognosos is at the forefront of integrating AI and ML into networked devices to provide solutions that range from protecting hospital staff to smoothing logistics at major automobile manufacturers. Our best-in-class solutions are lightweight, thoughtfully engineered, and backed by a team that takes pride in its work. Our hard work and innovation have been recognized on the Inc. 5000 and the Atlanta Pacesetters list of fastest-growing private companies, and with honors including the Technology Association of Georgia’s Top 40 Innovative Companies, the Merit Gold Award, and AutoTech Breakthrough.
Learn more about Cognosos's mission at www.cognosos.com.
Cognosos is seeking a Reliability and Observability Lead Engineer responsible for the overall monitoring and observability of our products. The right candidate is obsessed with product reliability and quality improvements and has experience identifying critical product metrics, defining processes, building information-dense dashboards, and translating those metrics into actionable alerts.
If you’re looking for a highly challenging position with the opportunity to advance your technology career in the areas of cloud IoT, machine learning, cloud computing, and security, then look closely at this position.
Responsibilities:
- Work with the executive team to define SLOs for the Cognosos platform and service
- Define relevant SLIs that support those SLOs
- Build processes and tools to monitor uptime and measure the SLA compliance of the overall service, as well as of individual components (hardware, network, software)
- Provide dashboards and periodic reports on system performance and availability
- Work with the hardware engineering, platform engineering, and field services teams to define and implement notifications, alarms, and escalation procedures
Qualifications:
- 5+ years of site reliability engineering experience, preferably at a hardware/software company
- Background in statistical quality control techniques
- Experience defining, implementing, and advocating for platform observability objectives
- Experience with AWS, Prometheus, Grafana, Python, and MySQL (required)
- Familiarity with Elasticsearch and Kibana
Preferred Skills:
- Experience with New Relic or a similar APM platform
- Experience with Opsgenie or a similar on-call alerting tool
- Experience with hardware and/or mobile app monitoring
- Knowledge of the software development life cycle (SDLC) and Agile methodologies
- Strong leadership skills and experience leading and mentoring a team
Benefits and Perks:
We are pleased to offer:
- Competitive salaries
- Unlimited vacation so you can rest and recharge
- Full benefits program (Health, Dental, Vision, 401(k), life and disability insurance)
- Paid parking at our Atlanta office
- Opportunity for equity participation
- Volunteer opportunities
- Weekly catered lunches
Whether it’s virtual happy hours, company-wide contests or quarterly cultural outings, we are always looking for ways to keep our employees happy and engaged with their teammates.