Cognition

Site Reliability Engineer

San Francisco · $260k–$300k · Full-time · Mid-level · Added 2 days ago

About this role

Cognition, maker of Devin (AI software engineer) and Windsurf (AI IDE), seeks an SRE to own production reliability and platform engineering for products used by hundreds of thousands of developers. You'll define SLOs, lead incident response, build CI/CD infrastructure, and ensure systems scale reliably as the company grows.

What you'll do

Define and own SLOs, SLIs, and error budgets for production systems with comprehensive monitoring and alerting
Lead incident response, run blameless postmortems, and build runbooks for sustainable on-call operations
Build and maintain deployment pipelines, release infrastructure, and internal developer tooling
Manage cloud infrastructure through infrastructure-as-code practices and eliminate configuration drift
Model capacity needs, forecast resource requirements, and optimize system performance proactively
Partner with engineering teams to embed reliability and security considerations into product architecture

What they're looking for

Experience managing production systems at scale using SLOs and error budgets
Strong software engineering fundamentals and hands-on experience writing production code
Cloud infrastructure expertise (AWS, GCP, or Azure)
Kubernetes and container orchestration
Infrastructure as code (Terraform or equivalent)
CI/CD pipeline design and deployment automation
Systems observability, monitoring, and alerting design
Incident command and root cause analysis

Benefits

Base salary: $260,000–$300,000
Significant early-stage equity
High ownership in a small, talented team
Work with cutting-edge AI products used at scale
High-trust environment with autonomy to set reliability standards
Culture that treats reliability as engineering craft
