KAYAK

Infrastructure Operations Engineer

Concord Office · Full-time · Mid-level · Added today

About this role

KAYAK seeks an Infrastructure Operations Engineer to support development and production environments at its Concord, MA office. You'll manage infrastructure incidents, monitor system health, collaborate with engineering teams, and drive operational improvements for a high-scale travel search platform.

What you'll do

  • Triage and resolve infrastructure tickets from developers and business teams with clear communication
  • Lead root cause analysis on production incidents and implement corrective and preventive actions
  • Monitor infrastructure health and performance using tools like LogicMonitor, Kibana, and Elasticsearch
  • Develop operational runbooks, SOPs, and post-incident documentation
  • Participate in 24/7 on-call rotations supporting critical production alerts
  • Automate repetitive operational tasks using Bash and Python scripting

What they're looking for

  • Linux systems administration (RHEL, CentOS, Ubuntu)
  • Bash shell scripting and Python automation
  • Monitoring and observability tools (LogicMonitor, Datadog, Prometheus)
  • Log aggregation and analysis (ELK stack, Kibana, Elasticsearch)
  • Cloud infrastructure experience (AWS, GCP, or Azure)
  • Containerization and orchestration (Docker, Kubernetes)
  • Incident management and ticketing systems (Jira, ServiceNow, PagerDuty)
  • Infrastructure-as-code tools (Terraform, Ansible, Chef)

Benefits

  • Work with a leading travel search platform processing billions of queries
  • Collaborate across software engineering, security, and platform teams
  • Contribute to infrastructure improvement projects and modernization initiatives
  • Opportunity to build institutional knowledge and improve team efficiency
  • Hands-on role with exposure to the latest infrastructure technologies

Likely interview questions

  • Walk us through your experience with production incident response and root cause analysis (RCA). Can you describe a specific incident you handled, what went wrong, and how you prevented it from happening again?
  • Tell us about your experience with monitoring and observability tools. Which tools have you used (LogicMonitor, Datadog, Prometheus, ELK stack, etc.), and how have you used them to proactively identify infrastructure issues?