SambaNova Systems
Cloud Site Reliability Engineer
San Jose, California, United States | mid | Added 2 days ago
About this role
As a Cloud Site Reliability Engineer at SambaNova Systems, you'll ensure the reliability and performance of the company's AI Inferencing Service, a role central to its operational excellence. You will focus on maintaining high availability and low-latency responses while collaborating on solutions that improve service efficiency.
What you'll do
- Oversee the production inferencing service for availability and performance
- Participate in a balanced 24/7 on-call rotation
- Develop advanced monitoring and alerting systems
- Implement performance optimizations and auto-scaling policies
- Manage cloud infrastructure using IaC tools
- Handle capacity planning and forecasting of infrastructure needs
What they're looking for
- Cloud infrastructure management
- Monitoring and alerting tools (e.g., Prometheus, Grafana)
- Performance optimization
- Automation and CI/CD practices
- Incident management
- Capacity planning
- Collaboration and communication
- Problem-solving
Benefits
- Support for work-life balance in on-call rotations
- Opportunities for automation and system design improvements
- Participation in a cutting-edge AI technology environment
- Shared ownership of service reliability
- Professional development in high-performance computing