SRE specialized in High Performance Computing/AI

Scaleway | Paris, France | Lille | Hybrid, Onsite
This job is no longer open
Founded in 1999, Scaleway is the cloud subsidiary of the Iliad group, one of Europe's leading telecommunications companies. Our mission is to foster a more responsible digital industry by helping developers and businesses build, deploy, and scale applications on any infrastructure.
From our offices in Paris and Lille, we refine the Scaleway cloud ecosystem every day, and we are its first users. Our roughly 25,000 customers choose us for our multi-AZ redundancy, our smooth user experience, our carbon-neutral data centers, and our native tools for managing multi-cloud architectures. Our products include fully managed solutions for bare metal, containerization, and serverless architectures, offering a responsible choice in cloud computing. Join our dynamic team of nearly 600 employees from diverse backgrounds, in a stimulating, international environment that combines technical excellence, creativity, and sharing.

About the job


With teraflops of computing power available to Scaleway customers, we are looking for an SRE to join our new team specialized in HPC (High Performance Computing). We are deploying several clusters, and a single cluster can rank among the top 15 systems listed in the Top500 (https://www.top500.org). Reporting to our Engineering Manager, Emerick Mounoury, you will be responsible for the deployment and health of the components of our multiple HPC clusters, which are built on Nvidia hardware. We expect you to have a strong background in HPC environments and system administration, along with some DevOps experience and knowledge of SRE best practices. Our systems evolve constantly, and the tools we use to monitor them and ensure their resilience need to evolve accordingly.

Minimum qualifications


  • Experience in system programming using at least one of these languages: Python, Bash, Go, etc.
  • Demonstrated ability to troubleshoot production system failures
  • A positive mindset and desire to work with a team
  • Passion for automation and incremental improvements on tooling
  • Experience with Linux systems based on Debian and CentOS derivatives
  • Experience with batch job schedulers like Slurm, OAR, SGE
  • Good understanding of computer networks: TCP/IP, DNS, load balancing, IPv6, firewalls, InfiniBand, VLANs/partitions, …
  • Storage knowledge: large pools, NAS, S3, …
  • Experience with Nvidia, CUDA, MPI
  • Good command of English

Preferred qualifications


  • Ability to meticulously identify and solve any kind of bug in any codebase
  • Experience with infrastructure-as-code and continuous deployment
  • Experience dealing with physical hardware automation
  • Experience with monitoring and logging systems
  • Experience handling account management (LDAP)
  • Knowledge of at least one cloud platform and related use-cases
  • Experience as an OSS contributor and/or maintainer
  • Knowledge of AI / LLM / ML / neural networks

Responsibilities


  • Create or optimize existing tools & documentation that will help identify, diagnose, and solve production incidents, automating as much as possible
  • Troubleshoot high-impact issues by working with multiple Engineering teams (Storage, Network, Hardware)
  • Take on-call responsibilities, mitigate issues encountered in production and answer our customers in real time
  • Ensure a high quality of service for our customers by leveraging observability and monitoring technologies
  • Manage the life cycle of HPC clusters in production and take part in escalating hardware and software issues to our suppliers
  • Empower your teammates to swiftly integrate and deploy software components across our systems
  • Help implement best practices for stability, resiliency, scalability, security, and performance across our systems

Technical Stack


  • Python/Bash
  • MySQL
  • S3 API, Lustre, NAS
  • Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana
  • Ansible, Salt
  • GitLab, Nexus
  • Ubuntu, Debian, CentOS
  • Nvidia hardware and software
  • MPI, Module, AI software
  • Slurm
  • K8s
  • Jira, Confluence, Slack, GSuite

Location


This position is based in our offices in Paris or Lille (France).

Recruitment Process  


  • Screening call - 30 mins with the recruiter
  • Manager Interview - 45 mins
  • Technical Interviews - 1h 30 mins
  • HR Interview - 45 mins
  • Offer sent - 48 hours

If you don't tick every box, don't hesitate to apply anyway. Don't limit yourself to a job description; you never know!

Life at Scaleway

Thrive Here & What We Value

  • Dynamic, diverse team culture
  • International excellence, creativity, sharing
  • Multi-AZ redundancy, user-friendly experience, carbon-neutral data centers, native cloud management tools
  • Emphasis on technical skills, creativity, collaboration
  • Flexible work options and remote opportunities
  • Supportive environment for growth and development
  • Commitment to responsibility and sustainability in digital industry
  • Passionate about future of cloud computing
  • Diverse team with self-serve revenue focus