Distyl AI
AI Engineer, Evaluation
San Francisco (Remote) · $150k–$250k · Full-time · Mid-level · Added 2 days ago
About this role
Distyl AI seeks an AI Evaluation Engineer to design and build evaluation frameworks that drive iterative improvements to production AI systems. You'll write Python code to create test suites, evaluation pipelines, and LLM-based graders that measure system quality against real-world customer needs across industries like telecom, healthcare, and manufacturing.
What you'll do
- Design evaluation frameworks and define quality metrics aligned with customer business objectives and domain constraints
- Build and maintain golden test cases, regression suites, and evaluation pipelines in Python using both manual and AI-assisted test generation
- Develop offline and online evaluation systems that directly inform prompt design, model selection, and deployment decisions
- Define and calibrate LLM-based graders, aligning automated judgments with expert human assessments
- Investigate divergences between evaluation signals and real-world outcomes to maintain signal quality
- Collaborate with forward-deployed engineers, architects, and domain experts to ensure evaluation guides system development
What they're looking for
- Python software engineering (2+ years)
- Evaluation and experimentation frameworks
- Test case design and test suite development
- LLM grading and calibration
- Data-driven system iteration
- Proficiency with AI tools for testing and debugging
- Cross-functional communication with subject matter experts
- Systems thinking across prompts, agents, and deployment
Benefits
- Base salary $150K–$250K depending on experience and location
- Work on mission-critical AI systems affecting hundreds of millions of consumer interactions
- Backing by top-tier investors including Lightspeed, Khosla Ventures, and Coatue
- Profitable enterprise AI company with 100% customer deployment success rate