Staff / Senior Staff Data Engineer, Real-World Data

This job is no longer open

About Us

Valo Health is a technology company that integrates human-centric data and AI-powered technology to accelerate the creation of life-changing drugs. Valo was created with the belief that the drug discovery and development process can and should be faster and less expensive, with a higher success rate. We use models early to fail less often as we reinvent drug discovery and development from the ground up. Disease doesn’t wait, so neither can we.

We are a multi-disciplinary team of experts in science, technology, and pharmaceuticals united in our mission to achieve better drugs for patients, faster.

Valo is committed to hiring diverse talent, prioritizing growth and development, fostering an inclusive environment, and bringing together a group of different experiences, backgrounds, and voices to work together. We achieve the widest-ranging impact when we leverage our broad backgrounds and perspectives.

Valo’s machine learning and AI capabilities are built on high-quality, high-density human-centric data from multiple sources: that’s where you come in!

About the Role

As a Staff / Senior Staff Data Engineer, you will join the data engineering core in the Translational Data Sciences group, working with data scientists and engineers building powerful computational tools and answering critical scientific questions about patients, diseases, and drug development.

In this role, you will lead the development, roadmapping, and execution of complex initiatives to transform real-world data (e.g., electronic medical records, biomarkers and biomedical imaging, and text notes) into analysis-ready data products for internal teams.

To do so, you will partner with a diverse set of scientists, engineers, and domain experts across traditional industry boundaries. The primary downstream use cases of these data are longitudinal deep learning models of patient trajectories and knowledge graph integration for target identification, statistical genetics, and multi-omics modeling.

What You'll Do...

Build, maintain, and extend data transformation pipelines and systems to ingest and harmonize third-party EHR data into Valo’s data ecosystems
Define Valo’s EHR data models and pipelines (Spark, SQL) in a centralized data ecosystem and semi-isolated cloud environments
Work closely with data providers and in-house data users to integrate third-party EHR data with Valo’s standardized data
Maintain and extend data integration (standardization & harmonization) & data quality processes to improve quality, reliability, and FAIRness
Ensure conceptual accuracy and generalizability of data: do standardized derived features represent clinical concepts in repeatable ways?
Simplify how data scientists access, transform, and use their data
Promote consistent data usage patterns, including version management, shared ontologies & data dictionaries
Support internal data users both directly and by composing demos, how-tos, and reference documentation
Provide technical leadership within the translational data engineering team
Simplify how data engineers build, maintain, and extend their data pipelines
Advise colleagues on data transformations and database design
Provide critical feedback and encourage best practices within the data engineering team
Participate in the creation and maintenance of technical documentation

What You Bring...

Bachelor’s degree + 8 (Staff) / 10 (Senior Staff) years of experience, MS + 6/8 years, or PhD + 5/7 years, in computer science, information systems, or data science
5+ years of experience in a technical role in:
SWE / DE: data ingestion, streaming technologies, troubleshooting data pipelines (e.g., Prefect, Airflow), and implementing CI/CD practices
Production programming experience in Python and SQL; cloud compute and big data tools, e.g., Spark
3+ years of experience in a professional role gathering requirements and understanding the goals of customers and data users
Demonstrated experience scoping projects, determining timelines and milestones, delivering end-to-end projects
Technical project management experience (scoping, defining milestones & timelines) a plus
Experience with EHR/EMR data and medical coding ontologies (e.g., ICD, ATC, LOINC, SNOMED)
Nice to have: experience with sparse longitudinal records, e.g., customer / log data with historical ontologies (about the concepts, distinct from data provenance), and with qualitative data and coding structures
Experience with data engineering best practices and testing methodologies (data provenance, collaborative development using source control management (Git), code versioning, reproducibility, etc.)