Hippocratic AI

LLM Inference Engineer

Palo Alto · Full-time · Mid-level · Added today

About this role

Hippocratic AI, a healthcare-focused generative AI company, seeks an LLM Inference Engineer to optimize and scale its large language model serving infrastructure. You'll design distributed inference architectures, implement advanced optimization techniques, and ensure efficient deployment of safety-critical clinical AI systems.

What you'll do

Design and implement multi-node serving architectures for distributed LLM inference
Optimize multi-LoRA serving systems and apply quantization techniques (FP4/FP6)
Implement speculative decoding and latency optimization strategies
Develop disaggregated serving solutions with optimized caching for prefill and decoding
Benchmark and improve system performance across deployment scenarios and GPU types
Collaborate with healthcare and AI experts to ensure production-grade performance

What they're looking for

LLM inference optimization at scale
Distributed serving architectures for large language models
Quantization techniques for transformer models
Speculative decoding and draft model implementation
Python and C++ programming
CUDA programming and GPU optimization
Custom CUDA kernel development (nice-to-have)
Open-source LLM frameworks (vLLM, SGLang, TensorRT-LLM)

Benefits

Work on safety-focused healthcare AI at a well-funded startup ($404M total funding)
Collaborate with physicians, AI pioneers, and researchers from top institutions
Five-day-per-week in-office role in Palo Alto supporting strong team culture
Opportunity to shape category-defining healthcare AI technology
Backed by leading healthcare and AI investors (a16z, Kleiner Perkins, CapitalG)

Likely interview questions

Walk us through a specific LLM inference optimization project you've built—what techniques did you use, what were the performance bottlenecks, and how did you measure success?
Describe your hands-on experience with quantization techniques like FP4/FP6. How have you balanced model quality with reduction in footprint, and what trade-offs did you encounter?
Tell us about your experience with speculative decoding and draft models. How would you implement or optimize this for a multi-LoRA serving system?
Have you worked with distributed serving architectures for LLMs across multiple nodes? What challenges did you face with prefill/decoding disaggregation and caching strategies?
Describe your experience with CUDA programming and GPU optimization. Can you give an example of a custom kernel or low-level optimization you've implemented?
Have you contributed to or worked with open-source inference frameworks like vLLM, SGLang, or TensorRT-LLM? What motivated those contributions?
How would you approach benchmarking and profiling an LLM serving system across different GPU types and deployment scenarios to identify optimization opportunities?
