SambaNova Systems

Cloud Site Reliability Engineer

San Jose, California, United States · Mid-level · Added 2 days ago

About this role

As a Cloud Site Reliability Engineer at SambaNova Systems, you'll keep the company's AI Inferencing Service reliable and performant. The work centers on maintaining high availability and low-latency responses, and on collaborating on solutions that improve service efficiency.

What you'll do

  • Oversee the production inferencing service for availability and performance
  • Participate in a balanced 24/7 on-call rotation
  • Develop advanced monitoring and alerting systems
  • Implement performance optimizations and auto-scaling policies
  • Manage cloud infrastructure using IaC tools
  • Handle capacity planning and forecasting of infrastructure needs

What they're looking for

  • Cloud infrastructure management
  • Monitoring and alerting tools (e.g., Prometheus, Grafana)
  • Performance optimization
  • Automation and CI/CD practices
  • Incident management
  • Capacity planning
  • Collaboration and communication
  • Problem-solving

Benefits

  • Support for work-life balance in on-call rotations
  • Opportunities for automation and system design improvements
  • Participation in a cutting-edge AI technology environment
  • Shared ownership of service reliability
  • Professional development in high-performance computing