Astera
Site Reliability Engineer
Emeryville HQ (Remote)$100k–$300kfulltimemidAdded 2 days ago
About this role
Astera's Neuro-AI program seeks a Site Reliability Engineer to manage and optimize the digital infrastructure supporting cutting-edge AI research. You'll own compute resource access, monitoring, auto-scaling, and automation across a modern cloud-native stack, ensuring researchers have reliable, efficient access to the tools they need.
What you'll do
- Manage compute resource allocation and access control across third-party cloud platforms
- Monitor cluster health and resource utilization with observability tools
- Design and implement auto-scaling solutions based on research demand patterns
- Automate operational processes to increase infrastructure efficiency
- Ensure reproducible and deterministic deployment environments for research
- Maintain clear operational documentation and boundaries for handoff to other engineers
What they're looking for
- Kubernetes and container orchestration
- Infrastructure automation (Ansible or similar)
- Observability and monitoring (Prometheus, Grafana)
- Linux systems administration
- Python scripting
- Cloud networking and distributed systems knowledge
- Docker and containerization
- Access control and security management
Opens the official application on the employer’s site. No login required.