StackAV

Site Reliability Engineer

Pittsburgh, PA or Remote (Remote) | Mid | Added 2 days ago

About this role

Stack is seeking a Site Reliability Engineer to ensure the performance and reliability of its autonomous systems infrastructure. The role involves optimizing compute platforms for large-scale workloads and collaborating with other teams to improve platform efficiency and reliability.

What you'll do

  • Manage scheduling and execution of large-scale workloads on Kubernetes.
  • Troubleshoot job failures for users.
  • Collaborate across teams to enhance platform capabilities.
  • Improve system reliability through automation.
  • Document processes to create a comprehensive knowledge base.
  • Participate in on-call support for production services.

What they're looking for

  • Understanding of Linux and networking
  • Experience with Kubernetes orchestration
  • Ability to guide performance scaling
  • Familiarity with cloud-native tools
  • Strong debugging skills
  • Effective communication in diverse teams

Benefits

  • Remote work option
  • Collaborative culture
  • Innovative environment
  • Commitment to diversity and inclusion
  • Opportunities for growth