Distyl AI

AI Engineer, Evaluation

San Francisco (Remote) · $150k–$250k · Full-time · Mid-level · Added 2 days ago

About this role

Distyl AI seeks an AI Evaluation Engineer to design and build evaluation frameworks that drive iterative improvements to production AI systems. You'll write Python code to create test suites, evaluation pipelines, and LLM-based graders that measure system quality against real-world customer needs across industries like telecom, healthcare, and manufacturing.

What you'll do

Design evaluation frameworks and define quality metrics aligned with customer business objectives and domain constraints
Build and maintain golden test cases, regression suites, and evaluation pipelines in Python using both manual and AI-assisted test generation
Develop offline and online evaluation systems that directly inform prompt design, model selection, and deployment decisions
Define and calibrate LLM-based graders, aligning automated judgments with expert human assessments
Investigate divergences between evaluation signals and real-world outcomes to maintain signal quality
Collaborate with forward-deployed engineers, architects, and domain experts to ensure evaluation guides system development

What they're looking for

Python software engineering (2+ years)
Evaluation and experimentation frameworks
Test case design and test suite development
LLM grading and calibration
Data-driven system iteration
AI tool proficiency for testing and debugging
Cross-functional communication with subject matter experts
Systems thinking across prompts, agents, and deployment

Benefits

Base salary $150K–$250K depending on experience and location
Work on mission-critical AI systems affecting hundreds of millions of consumer interactions
Backing by top-tier investors including Lightspeed, Khosla Ventures, and Coatue
Profitable enterprise AI company with 100% customer deployment success rate
