Blaxel
Site Reliability Engineer
San Francisco · $175k–$250k · Full-time · Mid-level · Added 2 days ago
About this role
Blaxel is seeking a Site Reliability Engineer to build and operate the core infrastructure powering its AI compute platform, ensuring ultra-low-latency performance and exceptional reliability at scale. You'll architect observability systems, lead incident response, design automation to eliminate operational toil, and collaborate with infrastructure and development teams to keep AI workloads running smoothly.
What you'll do
- Architect and operate the 25 ms cold-start compute engine and core infrastructure systems
- Build and evolve observability stacks (metrics, traces, logs) with SLO/SLI monitoring
- Lead incident response with root cause analysis, post-mortems, and systemic fixes
- Design self-healing, automated systems to reduce toil and enable scaling
- Perform stress testing, chaos engineering, and performance benchmarking across compute, networking, and storage layers
- Own infrastructure-layer security practices including sandboxed compute and network isolation
What they're looking for
- Go, Rust, or Python programming
- Linux systems, networking fundamentals, and distributed systems
- Bare-metal server and datacenter operations (PXE, IPMI, RAID, SR-IOV)
- Kubernetes or container orchestration
- Observability tools (Prometheus, Grafana, ELK, Datadog)
- CI/CD pipeline management (GitHub Actions, GitLab CI, Jenkins)
- AWS or GCP cloud platforms
- Incident management and debugging