Cognition
Site Reliability Engineer
San Francisco | $260k–$300k | Full-time | Mid-level | Added 2 days ago
About this role
Cognition, maker of Devin (AI software engineer) and Windsurf (AI IDE), seeks an SRE to own production reliability and platform engineering for products used by hundreds of thousands of developers. You'll define SLOs, lead incident response, build CI/CD infrastructure, and ensure systems scale reliably as the company grows.
What you'll do
- Define and own SLOs, SLIs, and error budgets for production systems, backed by comprehensive monitoring and alerting
- Lead incident response, run blameless postmortems, and build runbooks for sustainable on-call operations
- Build and maintain deployment pipelines, release infrastructure, and internal developer tooling
- Manage cloud infrastructure through infrastructure-as-code practices and eliminate configuration drift
- Model capacity needs, forecast resource requirements, and optimize system performance proactively
- Partner with engineering teams to embed reliability and security considerations into product architecture
What they're looking for
- Experience managing production systems at scale using SLOs and error budgets
- Strong software engineering fundamentals and hands-on experience writing production code
- Cloud infrastructure expertise (AWS, GCP, or Azure)
- Kubernetes and container orchestration
- Infrastructure as code (Terraform or equivalent)
- CI/CD pipeline design and deployment automation
- Systems observability, monitoring, and alerting design
- Incident command and root cause analysis
Benefits
- Base salary: $260,000–$300,000
- Significant early-stage equity
- High ownership in a small, talented team
- Work with cutting-edge AI products used at scale
- High-trust environment with autonomy to set reliability standards
- Culture that treats reliability as engineering craft