Cognition

Site Reliability Engineer

San Francisco | $260k–$300k | Full-time | Mid-level | Added 2 days ago

About this role

Cognition, maker of Devin (AI software engineer) and Windsurf (AI IDE), seeks an SRE to own production reliability and platform engineering for products used by hundreds of thousands of developers. You'll define SLOs, lead incident response, build CI/CD infrastructure, and ensure systems scale reliably as the company grows.

What you'll do

  • Define and own SLOs, SLIs, and error budgets for production systems, backed by comprehensive monitoring and alerting
  • Lead incident response, run blameless postmortems, and build runbooks for sustainable on-call operations
  • Build and maintain deployment pipelines, release infrastructure, and internal developer tooling
  • Manage cloud infrastructure through infrastructure-as-code practices and eliminate configuration drift
  • Model capacity needs, forecast resource requirements, and optimize system performance proactively
  • Partner with engineering teams to embed reliability and security considerations into product architecture

What they're looking for

  • Experience managing production systems at scale using SLOs and error budgets
  • Strong software engineering fundamentals and hands-on coding experience
  • Cloud infrastructure expertise (AWS, GCP, or Azure)
  • Kubernetes and container orchestration
  • Infrastructure as code (Terraform or equivalent)
  • CI/CD pipeline design and deployment automation
  • Systems observability, monitoring, and alerting design
  • Incident command and root cause analysis

Benefits

  • Base salary: $260,000 - $300,000
  • Significant early-stage equity
  • High ownership in a small, talented team
  • Work with cutting-edge AI products used at scale
  • High-trust environment with autonomy to set reliability standards
  • Culture that treats reliability as engineering craft