KAYAK
Infrastructure Operations Engineer
Concord Office, full-time, mid-level, added today
About this role
KAYAK seeks an Infrastructure Operations Engineer to support development and production environments in its Concord, MA office. You'll manage infrastructure incidents, monitor system health, collaborate with engineering teams, and drive operational improvements for a high-scale travel search platform.
What you'll do
- Triage and resolve infrastructure tickets from developers and business teams with clear communication
- Lead root cause analysis on production incidents and implement corrective and preventive actions
- Monitor infrastructure health and performance using tools like LogicMonitor, Kibana, and Elasticsearch
- Develop operational runbooks, SOPs, and post-incident documentation
- Participate in 24/7 on-call rotations supporting critical production alerts
- Automate repetitive operational tasks using Bash and Python scripting
What they're looking for
- Linux systems administration (RHEL, CentOS, Ubuntu)
- Bash shell scripting and Python automation
- Monitoring and observability tools (LogicMonitor, Datadog, Prometheus)
- Log aggregation and analysis (ELK stack, Kibana, Elasticsearch)
- Cloud infrastructure experience (AWS, GCP, or Azure)
- Containerization and orchestration (Docker, Kubernetes)
- Incident management and ticketing systems (Jira, ServiceNow, PagerDuty)
- Infrastructure-as-code tools (Terraform, Ansible, Chef)
Benefits
- Work with a leading travel search platform processing billions of queries
- Collaborate across software engineering, security, and platform teams
- Contribute to infrastructure improvement projects and modernization initiatives
- Opportunity to build institutional knowledge and improve team efficiency
- Hands-on role with exposure to the latest infrastructure technologies
Likely interview questions
- Walk us through your experience with production incident response and root cause analysis (RCA). Can you describe a specific incident you handled, what went wrong, and how you prevented it from happening again?
- Tell us about your experience with monitoring and observability tools. Which tools have you used (LogicMonitor, Datadog, Prometheus, ELK stack, etc.), and how have you used them to proactively identify infrastructure issues?