StackAV
Site Reliability Engineer
Pittsburgh, PA or Remote · Mid-level · Added 2 days ago
About this role
Stack is seeking a Site Reliability Engineer to ensure the performance and reliability of its autonomous systems infrastructure. The role involves optimizing compute platforms for large-scale workloads and collaborating with other teams to improve platform efficiency and reliability.
What you'll do
- Manage scheduling and execution of large-scale workloads on Kubernetes.
- Troubleshoot job failures for users.
- Collaborate across teams to enhance platform capabilities.
- Improve system reliability through automation.
- Document processes to create a comprehensive knowledge base.
- Participate in on-call support for production services.
What they're looking for
- Understanding of Linux and networking
- Experience with Kubernetes orchestration
- Ability to guide performance scaling
- Familiarity with cloud-native tools
- Strong debugging skills
- Effective communication in diverse teams
Benefits
- Remote work option
- Collaborative culture
- Innovative environment
- Commitment to diversity and inclusion
- Opportunities for growth