KAYAK

Infrastructure Operations Engineer

Concord Office · Full-time · Mid-level · Added today

About this role

KAYAK seeks an Infrastructure Operations Engineer to support development and production environments from its Concord, MA office. You'll manage infrastructure incidents, monitor system health, collaborate with engineering teams, and drive operational improvements for a high-scale travel search platform.

What you'll do

Triage and resolve infrastructure tickets from developers and business teams with clear communication
Lead root cause analysis on production incidents and implement preventive and corrective actions
Monitor infrastructure health and performance using tools like LogicMonitor, Kibana, and Elasticsearch
Develop operational runbooks, SOPs, and post-incident documentation
Participate in 24/7 on-call rotations supporting critical production alerts
Automate repetitive operational tasks using Bash and Python scripting

What they're looking for

Linux systems administration (RHEL, CentOS, Ubuntu)
Bash shell scripting and Python automation
Monitoring and observability tools (LogicMonitor, Datadog, Prometheus)
Log aggregation and analysis (ELK stack, Kibana, Elasticsearch)
Cloud infrastructure experience (AWS, GCP, or Azure)
Containerization and orchestration (Docker, Kubernetes)
Incident management and ticketing systems (Jira, ServiceNow, PagerDuty)
Infrastructure-as-code tools (Terraform, Ansible, Chef)

Benefits

Work with a leading travel search platform processing billions of queries
Collaborate across software engineering, security, and platform teams
Contribute to infrastructure improvement projects and modernization initiatives
Opportunity to build institutional knowledge and improve team efficiency
Hands-on role with exposure to the latest infrastructure technologies

Likely interview questions

Walk us through your experience with production incident response and root cause analysis (RCA). Can you describe a specific incident you handled, what went wrong, and how you prevented it from happening again?
Tell us about your experience with monitoring and observability tools. Which tools have you used (LogicMonitor, Datadog, Prometheus, ELK stack, etc.), and how have you used them to proactively identify infrastructure issues?

Describe your hands-on experience with Linux systems administration and shell scripting. What repetitive operational tasks have you automated, and what was the impact?
How have you collaborated with software engineering and security teams during infrastructure changes or deployments? Give an example of a cross-functional project you worked on.
What's your experience with cloud infrastructure (AWS, GCP, or Azure)? Can you describe a scenario where you had to troubleshoot or optimize cloud resources?
Tell us about your experience with containerization and orchestration technologies like Docker or Kubernetes in a production environment. What challenges have you faced?
How do you approach ticket triage and prioritization when you have multiple infrastructure requests coming in from different teams? Walk us through your process and how you communicate status.

