Astera

Site Reliability Engineer

Emeryville HQ (Remote) · $100k–$300k · Full-time · Mid-level · Added 2 days ago

About this role

Astera's Neuro-AI program seeks a Site Reliability Engineer to manage and optimize the digital infrastructure supporting cutting-edge AI research. You'll own compute resource access, monitoring, auto-scaling, and automation across a modern cloud-native stack, ensuring researchers have reliable, efficient access to the tools they need.

What you'll do

  • Manage compute resource allocation and access control across third-party cloud platforms
  • Monitor cluster health and resource utilization with observability tools
  • Design and implement auto-scaling solutions based on research demand patterns
  • Automate operational processes to increase infrastructure efficiency
  • Ensure reproducible and deterministic deployment environments for research
  • Maintain clear operational documentation and boundaries for handoff to other engineers

What they're looking for

  • Kubernetes and container orchestration
  • Infrastructure automation (Ansible or similar)
  • Observability and monitoring (Prometheus, Grafana)
  • Linux systems administration
  • Python scripting
  • Cloud networking and distributed systems knowledge
  • Docker and containerization
  • Access control and security management