Job Description
Join RealPage Philippines as a Site Reliability Engineer (SRE) - Operations and play a pivotal role in ensuring the stability, performance, and scalability of our mission-critical systems. As part of our global Reliability Engineering team, you will collaborate with cross-functional teams to optimize infrastructure, automate operational tasks, and drive continuous improvement in system reliability.
This is a unique opportunity to work in a dynamic, fast-paced environment where innovation meets operational excellence. Whether you're troubleshooting complex issues, designing resilient architectures, or implementing cutting-edge monitoring solutions, your work will directly impact the reliability and performance of our platforms.
If you are passionate about reliability engineering, thrive in collaborative settings, and enjoy solving challenging problems, we want to hear from you!
Responsibilities
- Ensure the stability, performance, and scalability of production systems through proactive monitoring, incident response, and root cause analysis.
- Design, implement, and maintain automated solutions for deployment, monitoring, and incident management.
- Collaborate with development and operations teams to improve system reliability, reduce downtime, and enhance user experience.
- Develop and maintain documentation for operational procedures, runbooks, and best practices.
- Participate in on-call rotations to provide 24/7 support for critical systems and services.
- Optimize system performance by identifying bottlenecks, tuning configurations, and implementing efficiency improvements.
- Drive the adoption of SRE principles, including error budgets, SLIs, SLOs, and SLAs, across the organization.
- Stay up to date with industry trends and emerging technologies to continuously improve our reliability engineering practices.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar operational role.
- Strong proficiency in scripting and automation using languages such as Python, Bash, or Go.
- Experience with cloud platforms (AWS, GCP, or Azure) and containerization technologies (Docker, Kubernetes).
- Familiarity with monitoring and observability tools (Prometheus, Grafana, ELK Stack, or similar).
- Solid understanding of networking, security, and infrastructure-as-code (Terraform, Ansible, etc.).
- Excellent problem-solving skills and the ability to troubleshoot complex system issues under pressure.
- Strong communication and collaboration skills to work effectively with cross-functional teams.