Blaxel

Site Reliability Engineer

San Francisco | $175k–$250k | Full-time | Mid-level | Added 2 days ago

About this role

Blaxel is looking for a Site Reliability Engineer to build and operate the core infrastructure powering an AI compute platform, with a focus on ultra-low-latency performance and reliability at scale. You'll architect observability systems, lead incident response, design automation that eliminates operational toil, and work with infrastructure and development teams to keep AI workloads running smoothly.

What you'll do

Architect and operate the 25ms cold-start compute engine and core infrastructure systems
Build and evolve observability stacks (metrics, traces, logs) with SLO/SLI monitoring
Lead incident response with root cause analysis, post-mortems, and systemic fixes
Design self-healing, automated systems to reduce toil and enable scaling
Perform stress testing, chaos engineering, and performance benchmarking across compute, networking, and storage layers
Own infrastructure-layer security practices including sandboxed compute and network isolation

What they're looking for

Go, Rust, or Python programming
Linux systems, networking fundamentals, and distributed systems
Bare-metal server and datacenter operations (PXE, IPMI, RAID, SR-IOV)
Kubernetes or container orchestration
Observability tools (Prometheus, Grafana, ELK, Datadog)
CI/CD pipeline management (GitHub Actions, GitLab CI, Jenkins)
AWS or GCP cloud platforms
Incident management and debugging
