Job Description
Are you ready to be the operational backbone of cutting-edge AI innovation? TELUS Digital is seeking a highly skilled SRE DevOps Engineer to join our elite AI Automation team in Metro Manila. In this high-impact role, you will bridge the gap between development and operations, ensuring the reliability, scalability, and security of our high-stakes Generative AI (GenAI) solutions.
You will work in close collaboration with our AI/ML Developers to deploy, monitor, and optimize complex models and infrastructure. We are looking for an engineer who thrives in fast-paced environments, excels at troubleshooting distributed systems, and is passionate about automating the software delivery lifecycle.
If you are an infrastructure enthusiast with a focus on cloud-native technologies and AI integration, this is your opportunity to shape the future of automation on a global scale. Join TELUS Digital and help us maintain the performance standards that power our most sophisticated AI projects.
Responsibilities
- Design, build, and maintain scalable infrastructure to support high-stakes GenAI applications and ML pipelines.
- Implement and manage CI/CD pipelines to streamline deployment workflows for AI/ML models.
- Monitor system performance, implement observability tools, and troubleshoot production issues to ensure 99.9% uptime.
- Automate operational tasks, infrastructure provisioning, and configuration management using Infrastructure as Code (IaC) tools.
- Collaborate with AI/ML Developers to optimize containerized applications using Kubernetes.
- Manage cloud infrastructure (AWS/GCP/Azure) with a focus on cost optimization and high availability.
- Establish robust security practices and compliance standards for AI data handling and model serving.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- 3+ years of professional experience in Site Reliability Engineering (SRE) or DevOps roles.
- Proficiency in cloud infrastructure management (AWS, GCP, or Azure) and container orchestration with Kubernetes.
- Strong experience with IaC tools such as Terraform, Pulumi, or Ansible.
- Solid scripting skills in Python, Go, or Bash for automation purposes.
- Experience managing CI/CD tools (e.g., GitHub Actions, GitLab CI, or Jenkins).
- Understanding of the MLOps lifecycle, model serving platforms, and vector databases is a significant advantage.
- Excellent analytical, problem-solving, and communication skills in a global team environment.