DevOps Engineer - Machine Learning
CoMind
At CoMind, we are developing non-invasive neuromonitoring technology that will usher in a new era of clinical brain monitoring. In joining us, you will help create cutting-edge technologies that improve how we diagnose and treat brain disorders, ultimately improving and saving the lives of patients across the world.
The Role
CoMind is seeking a skilled DevOps Engineer to join our dynamic Research Data Science team and lead the orchestration of a robust ML training pipeline on AWS. This role is critical to enabling scalable training and testing of a range of ML models on large volumes of an entirely new form of clinical neuromonitoring data.
Responsibilities:
Architect and implement a scalable solution to support the Research Data Science team in running a wide range of machine learning pipelines, including model training, evaluation, and inference
Create a CI/CD pipeline for building containers from in-house Python packages, running integration tests, and publishing to AWS ECR
Set up Amazon ECS tasks or AWS Batch jobs to run containers stored in AWS ECR
Establish a robust configuration management system to store, version, and retrieve configurations associated with multiple machine learning workflows
Implement robust error handling and monitoring solutions to ensure timely debugging across the pipeline with centralised logging and error reporting
Implement cost monitoring solutions to track and manage compute costs across different runs, building dashboards to provide insights into resource usage and cost optimisation
Ensure security and data protection are integrated into the pipelines by applying AWS best practices for security protocols and data management
Monitor and manage the team's compute resources, including both cloud (AWS) and on-premise GPU nodes, ensuring efficient use and scalability
Implement Infrastructure as Code (IaC) to set up and manage the pipeline architecture, using Terraform, AWS CloudFormation, or similar tools
Skills & Experience:
Git for version control (GitHub, Bitbucket, or similar), including experience managing versioned infrastructure-as-code (IaC) repositories
CI/CD pipelines for automating workflows, including experience with integration testing and containerisation pipelines
Experience managing and orchestrating complex cloud workflows (e.g., ECS tasks, AWS Batch jobs), with a focus on event-driven and parallel processing
Infrastructure as Code (IaC) experience (e.g., Terraform, AWS CloudFormation) for creating, maintaining, and scaling cloud infrastructure
Docker for containerisation, including experience containerising machine learning workflows and publishing containers to repositories such as AWS ECR
Benefits:
Company equity plan
Company pension scheme
Private medical, dental and vision insurance
Group life assurance
Comprehensive mental health support and resources
Unlimited holiday allowance (+ bank holidays)
Hybrid working (3 days in-office)
Quarterly work-from-anywhere policy
Weekly lunches
Breakfast and snacks provided