agi-inc

Research Engineer - Evals

San Francisco Office · Full-time · Mid-level · Added 2 days ago

About this role

Join a stealth AI startup building consumer-grade AGI agents to develop the evaluation infrastructure that gates all model and agent releases. You'll own the eval harness across capability, behavior, and user experience—setting the standard for what ships and protecting it from product pressure.

What you'll do

Build and maintain eval suites for model capabilities, agent behavior, regressions, and human-rated rubrics
Create dashboards and tooling to accelerate researcher experiments and inform leadership decisions
Define and defend the quality bar for what's ready to ship
Instrument real-user behavior on real devices to inform product decisions
Advise research and product teams on measurement strategies for non-deterministic systems
Translate eval results into language that partners and OEMs can hold the company accountable to

What they're looking for

Evaluating non-deterministic systems and agentic behavior
Long-horizon task measurement and tool-use assessment
Multilingual AI behavior evaluation
Metrics design and preventing gaming of metrics
Dashboard and instrumentation tooling
On-device performance measurement
Benchmark and evaluation system design
Communication across research, product, and partnerships

Benefits

Competitive cash compensation and meaningful equity
Top-tier relocation and immigration support
In-person role in San Francisco office
Work with elite founders and researchers from Stanford, OpenAI, and DeepMind
Shape research roadmap through measurement insights
Direct influence on product releases and company standards
