Physical Evals

Evaluations in the actual physical world.

May 23, 2026

A physical evaluation tests an AI system in the actual physical world — not a simulator, not a sandbox, not a virtual environment dressed up as one. The point is to measure how well AI can do real things in real places.

An orchard owner has birds eating the fruit. She sets up a few cameras and a drone, brings them online safely so any agent can be invited to take a slot on the system, and poses the question: who can keep the birds off the fruit best? That setup is a physical eval. It has cameras, a drone, an orchard, and birds, none of them simulated, and outcomes are measured against what matters to the orchard.

leaderboard · week 12 · live

  #  operator         fruit saved  cost
  1  Owl-3B           94%          $0.18/h
  2  Hummingbird v2   89%          $0.21/h
  3  FlockSentinel    84%          $0.31/h
  4  human baseline   71%
  5  RoboScarecrow    63%          $0.09/h

A physical eval, sketched: an orchard with perimeter cameras and a deterrent drone, and a live leaderboard of operators competing to keep the birds off the fruit. Operators here are illustrative.

In this document, a few threads are developed:

  1. What physical evals are. A definition, what they’re not (simulators, sim-to-real benchmarks, curated demos), virtual environments as virtual gyms and physical evals as a final exam.
  2. An anatomy. An initial draft of components a physical eval needs, good practices.
  3. Safety for physical evals. Letting anyone on the internet drive real hardware is its own adversarial-security challenge.
  4. An open movement for physical evals. Simple protocols, strong safety standards and economical setups could lead to a Cambrian explosion of physical evals, where anyone can bring their physical challenge online.
  5. Physical evals as a market. One can set up an eval to delegate the selection of the right AI model/algorithm to competing participants.

What physical evals are

A physical eval is an evaluation of an AI system carried out in the actual physical world. The system being measured operates a real environment (fruit trees, a wet bench, a warehouse cell, a field plot) through sensors and actuators connected to the internet, so any agent can take a slot, attempt the task, and submit a score.

In principle, most problems in the physical world could be turned into a challenge that surfaces the state of the art of AI on that problem. In this way, physical evals can act as a forcing function to saturate evaluations in the real world. Saturate as in: take the measurable outcome to its ceiling. The eval defines the ceiling; the participants find out how close they can get.

What they’re not

The final exam

A physical eval isn’t where models get trained; it’s where they get tested.

A virtual simulation is like a virtual gym. Due to the high cost of interacting with the real world, it is likely that all the learning — model fitting, policy iteration, RL rollouts, fine-tuning, ablations, sweeps — will happen somewhere cheaper: a simulator, a virtual environment, a closed in-house testbed. Some of the gyms people are using today: OpenAI Gym / Gymnasium, MuJoCo, PyBullet, Isaac Gym / Isaac Sim, DeepMind Lab, Habitat, AI2-THOR, CARLA, AirSim, Genesis. Participants are free to use whichever virtual world or gym they like — there is a whole landscape of simulators specifically built for this.

By contrast, a physical eval is like a final exam. Build and train wherever, gather data however, iterate as much as needed, and then submit to the physical eval to see how the work holds up against the real world.

The gap between a virtual world and the physical one, the sim-to-real gap, contains everything the simulator didn’t model. Wind that doesn’t blow the way it does in the sim. Lighting the renderer didn’t predict. Mechanical wear, sensor noise, calibration drift, the way birds actually respond to a drone rather than the way an idealised model of a bird does. The physical eval catches all of it because it is run in the physical world.

Examples

The cards below are sketches of what a small handful of physical evals could look like across very different domains.

Anatomy of a physical eval

The diagram below sketches one possible anatomy for a physical eval (it may not be complete, but take it as a useful starting point).

Orchard example, headline readout: FRUIT SAVED 94% ↑ 3

  1. Environment: the orchard
  2. Sensors: perimeter cameras
  3. Action space: drone, deterrents
  4. Metric: fruit saved, plus secondaries
  5. Guardrails: geofence, no-fly buffers
  6. Open access: remote operator input
  7. Governance: who sets the rules

Components of a physical eval.

This list is not final. Different domains will surface components not named here — calibration drift, biological containment, human-in-the-loop sign-off, regulatory constraints — and the right abstraction is going to settle as people actually build the things.
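
As a minimal sketch, the seven components in the diagram could be written down as a small machine-readable spec. The class name `PhysicalEvalSpec`, its field names, and the `missing_components` helper are all hypothetical; nothing here is a proposed standard, only one way to make an incomplete eval definition visible early.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalEvalSpec:
    """Hypothetical container for the seven components in the diagram."""
    environment: str
    sensors: tuple[str, ...]
    action_space: tuple[str, ...]
    primary_metric: str
    secondary_metrics: tuple[str, ...]
    guardrails: tuple[str, ...]
    open_access: str
    governance: str

    def missing_components(self) -> list[str]:
        """Return the names of components that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# Values taken from the orchard diagram; secondaries are not named there,
# so they are deliberately left empty rather than invented.
orchard = PhysicalEvalSpec(
    environment="the orchard",
    sensors=("perimeter cameras",),
    action_space=("drone", "deterrents"),
    primary_metric="fruit saved",
    secondary_metrics=(),
    guardrails=("geofence", "no-fly buffers"),
    open_access="remote operator input",
    governance="who sets the rules",
)

print(orchard.missing_components())  # ['secondary_metrics']
```

A check like this would flag that the orchard sketch still has to name its secondary metrics before anyone takes a slot.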

Challenges for physical evals

The following are some of the harder design problems that don’t have clean answers yet — and that any serious physical eval effort will have to confront.

Goodharting. Any eval with a numeric target invites unintended ways to hit it — with the wrinkle that in a physical eval the unintended ways can cause real-world harm. An agent optimising fruit saved might drive the deterrent so aggressively that birds and orchard workers avoid the area: the metric goes up, the orchard becomes unusable.
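
One hedged way to picture a counter-measure: score the primary metric only when secondary measurements stay inside agreed bounds. The sketch below assumes two hypothetical secondaries (`worker_access` and `deterrent_duty`) and made-up thresholds; it does not remove Goodharting, it only moves the gaming pressure onto whatever the secondaries fail to capture.

```python
from dataclasses import dataclass


@dataclass
class SlotOutcome:
    fruit_saved: float     # primary metric, fraction in [0, 1]
    worker_access: float   # hypothetical: fraction of working hours the orchard stayed usable
    deterrent_duty: float  # hypothetical: fraction of the slot the deterrent was active


def guarded_score(outcome, min_worker_access=0.9, max_deterrent_duty=0.3):
    """Return the primary metric, or 0.0 if any secondary bound is violated."""
    if outcome.worker_access < min_worker_access:
        return 0.0
    if outcome.deterrent_duty > max_deterrent_duty:
        return 0.0
    return outcome.fruit_saved


# Illustrative numbers only.
careful = SlotOutcome(fruit_saved=0.89, worker_access=0.98, deterrent_duty=0.12)
aggressive = SlotOutcome(fruit_saved=0.97, worker_access=0.40, deterrent_duty=0.85)

print(guarded_score(careful))     # 0.89
print(guarded_score(aggressive))  # 0.0
```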

Non-stationarity. The physical world changes whether or not an eval is running. An orchard in week one is not the same orchard in week twelve: season, weather, and pest population all shift. A wet-lab bench drifts as reagent batches age. Field plots evolve. Comparing scores across time is therefore hard, sometimes impossible.

Sequential contamination. Each participant leaves a trace for the next. In a wetlab this is problematic — reagents consumed, cultures disturbed, hardware worn — but the problem is general: stock depleted in a warehouse cell, soil compacted on a field plot, bird behaviour shifted by a heavy deterrence week. Sequential slots work for environments with a natural or cheap reset; they don’t work for environments where state accumulates.

Latency as a confound. A participant operating remotely over the internet sees the environment through a sensor stream and acts through a command channel, both of which have variable latency. Two agents with identical policies but different network conditions will produce different results. This is especially visible in fast-moving environments — a drone avoiding a collision, a robot arm catching a falling object.
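
The claim that identical policies diverge under different network conditions can be illustrated with a toy model. The sketch below is an assumption-laden simplification: a one-dimensional target moving at constant speed, a proportional controller, and a fixed observation delay measured in control steps. Real latency is variable and also affects the command channel, which this ignores.

```python
def mean_tracking_error(delay_steps, steps=200, speed=0.1, gain=0.5):
    """Chase a target moving at constant speed, seeing it `delay_steps` late.

    Returns the mean absolute distance between agent and target.
    """
    position = 0.0
    total_error = 0.0
    for t in range(steps):
        target_now = speed * t
        # The agent only sees where the target was `delay_steps` ago.
        target_seen = speed * max(t - delay_steps, 0)
        # Same policy in every run: move a fixed fraction toward the observed target.
        position += gain * (target_seen - position)
        total_error += abs(target_now - position)
    return total_error / steps


low_latency = mean_tracking_error(delay_steps=1)
high_latency = mean_tracking_error(delay_steps=10)
print(low_latency < high_latency)  # True
```

The policy is byte-for-byte the same in both runs; only the delay differs, and the score differs with it. An eval that wants to compare policies rather than connections would at least need to record latency alongside the score.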

Observer effect. The sensors required to score an eval change what is being measured. A camera rig that watches a field plot for pest activity may deter the pests on its own. A flow sensor on a reagent line changes the thermal environment of the bench. In some domains the effect is negligible; in others it will corrupt the primary metric.

Safety for physical evals

A physical eval that anyone on the internet can operate is, by construction, a public attack surface on a real-world physical system. The participant at any given slot might be a well-behaved research team, an AI agent following a poorly-aligned policy, or a person who wants to break things on purpose. The eval has to keep working — usefully, openly, safely — across all three.

Somebody not fully trusted is about to make the drone, the sprayer, the autoclave do something for the next twenty minutes — what’s the worst that can happen, and how is it bounded?

The attack surface

A useful first pass is to categorise harms by who pays the cost:

  1. Harm to the eval itself. The drone crashes, the cell line dies, the robot arm jams. Cheap if the guardrails work — the operator resets, the leaderboard absorbs the failure.
  2. Harm to the surrounding environment. Chemicals spill, the orchard catches fire, a neighbouring field gets sprayed.
  3. Harm to humans. A bystander gets hit by the drone, an operator gets burned, a patient sample gets switched. This is the category that matters most, and the hardest to bound. The lines between these categories also blur in practice: chemical drift is “environment” until a bystander walks through it.
  4. Information harms. Footage of bystanders or proprietary processes leaves the eval site; the eval is used as a covert surveillance platform; sensor streams are exfiltrated.
  5. Generation of dangerous artifacts. The wet-lab cell is steered toward synthesising something harmful; the sprayer drone is weaponised; the autoclave is used to destroy evidence.

Categories 1–3 are about what can happen during a slot. Categories 4–5 are about what can leave the eval afterwards. They want different defences, and a serious eval needs both.

Defences worth building (AI GEN)

None of the following is a finished answer. They are moves worth trying, evaluating, and writing up for physical evals:

Most of these are borrowed from adjacent fields — public cloud security, scientific-facility time-sharing (telescope nights, beamline schedules), bug-bounty programs, robotics-safety standards. None of them have been worked out in detail for a public, openly-instrumented physical system that AI agents are also supposed to operate. That’s a research agenda in itself.
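
To make the guardrail idea concrete, here is a minimal sketch of a command filter that sits between the remote participant and the drone, using the geofence and no-fly buffers named in the anatomy diagram. The `Box` type, the `check_waypoint` function, the local metre-based coordinate frame, and every threshold are assumptions for illustration; a real deployment would also need hardware-level fail-safes that do not depend on this software path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in local metres (hypothetical site frame)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def grown(self, margin):
        """Return a copy enlarged by `margin` metres on every side."""
        return Box(self.x_min - margin, self.y_min - margin,
                   self.x_max + margin, self.y_max + margin)


def check_waypoint(x, y, altitude, geofence, no_fly_zones,
                   buffer_m=5.0, max_altitude_m=20.0):
    """Return (allowed, reason) for a requested drone waypoint."""
    if not geofence.contains(x, y):
        return False, "outside geofence"
    if not 0.0 <= altitude <= max_altitude_m:
        return False, "altitude out of range"
    for zone in no_fly_zones:
        if zone.grown(buffer_m).contains(x, y):
            return False, "inside no-fly buffer"
    return True, "ok"


# Illustrative site layout, not a real orchard.
orchard = Box(0, 0, 100, 60)
farmhouse = Box(80, 40, 100, 60)

print(check_waypoint(50, 30, 10, orchard, [farmhouse]))   # (True, 'ok')
print(check_waypoint(78, 38, 10, orchard, [farmhouse]))   # (False, 'inside no-fly buffer')
print(check_waypoint(120, 30, 10, orchard, [farmhouse]))  # (False, 'outside geofence')
```

The design choice worth noting is that the filter rejects by default and returns a reason, so a refused command can be logged and shown to the participant rather than silently dropped.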

An open movement for physical evals

Physical evals could become an open-source ecosystem: environments cheap to set up, easy to fork, open to anyone with a problem worth measuring.

The rest of this section traces the arc: where things have been (Prior art), what an open ecosystem looks like in practice (Open at every layer), what it would have to cost (What’s the Raspberry Pi of a physical eval?).

Prior art

Physical-world AI competitions aren’t new. The DARPA Grand Challenge put autonomous vehicles in the Mojave; the DARPA Robotics Challenge put humanoids through disaster-response courses; the Amazon Picking Challenge ran in warehouse mock-ups for several years; RoboCup has been running its soccer leagues since 1997, which arguably makes it the longest-lived physical eval in continuous operation, and the one with the most literature on what makes it work and what it ends up measuring. The Indy Autonomous Challenge and Roborace have put driverless cars on real circuits.

What these have in common: each was (or is) a sponsor-led, time-limited event with closed protocols and bespoke infrastructure. They produced brilliant moments and a small library of papers; they were expensive to build and harder to reproduce.

What’s the Raspberry Pi of a physical eval?

The hard constraint on all of this is cost. A DARPA-class eval needs millions of dollars and a multi-year program; even a modest research-grade one runs into expensive sensors, networking, fail-safe hardware, and the human labour to keep it operating. That ceiling is what makes physical evals rare today — and rare evals can’t be the basis of an ecosystem.

So one of the most important questions this community can keep returning to is the one in the heading, where the Raspberry Pi is a stand-in for “the cheapest plausible build”. The Raspberry Pi did this for hobbyist computing; what’s the equivalent for physical evals? What’s the bill of materials that brings a credible, instrumented, openable physical eval down to the cost of a serious hobby project? Probably some mix of commodity sensors, a single-board computer for the control loop, an open scheduling service for time-share, off-the-shelf safety hardware, and a reference orchestration stack that everyone forks. If the answer ends up being “a few hundred dollars and a weekend,” the ecosystem can actually form. If it stays at “a few hundred thousand and a six-month build,” it stays a fantasy.

Open at every layer

To further lower the cost of setting up a physical eval safely, we need to look at off-the-shelf hardware and an open-source stack.

Public verifiability

The safety section above focuses on protecting the physical environment from adversarial participants. There is a symmetric problem that gets less attention: protecting participants — and the public — from adversarial eval runners.

In an open world where anyone can wire a field, a lab bench, or a warehouse cell to the internet and declare it a physical eval, the operator controls the sensors, the scoring pipeline, and in a way, the ground truth. A dishonest operator can inflate results for a preferred team, suppress evidence of harm, or fabricate the physical record entirely. If physical evals are going to carry weight — as procurement signals, safety certifications, or policy inputs — the data they produce has to be trustworthy independent of whether the runner is trustworthy.

This is a valuable research direction in its own right. Some threads worth pulling:

Combinations are likely to be necessary, and the right combination will vary by domain. What a wet-lab needs to prove that a synthesis actually ran differs from what an orchard needs to prove that a drone actually flew a slot. Building this layer, call it physical eval verification, is at least as important as building the evals themselves, because it is what makes the results publicly verifiable.
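
As one illustrative ingredient (an assumption of this sketch, not a method the text prescribes), an eval runner could keep its sensor and action log as a hash chain, so that editing an earlier record changes every later hash. The function names and record fields below are hypothetical. Note the limits: this only makes after-the-fact edits detectable if the latest hash is published somewhere the runner cannot rewrite, and it says nothing about whether the sensors reported the truth in the first place.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def append_record(log, payload):
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    log.append({"prev": prev_hash, "payload": payload, "hash": digest})


def verify(log):
    """Recompute every hash and check that the chain is unbroken."""
    prev_hash = GENESIS
    for entry in log:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative records only.
log = []
append_record(log, {"t": 0, "sensor": "cam-1", "birds_seen": 4})
append_record(log, {"t": 1, "action": "drone_sweep"})
append_record(log, {"t": 2, "sensor": "cam-1", "birds_seen": 0})
print(verify(log))  # True

log[0]["payload"]["birds_seen"] = 0  # simulate a runner editing history
print(verify(log))  # False
```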

A Darwinian ecosystem

Not every physical eval will be a good one. Some will be hard, some trivial. Some will be well-structured; some will be a mess. Some will scale to many participants; some will only ever host one team at a time. That’s fine — even desirable. The shape of “what makes a good physical eval” is going to emerge from people building them, breaking them, and learning what each one actually measured.

Sketched, a public registry for such an ecosystem might look like this:

physevals.io · open registry · 126 evals · 43 accepting slots

Physical eval registry

  Eval                   Domain                                        Status        Metric                 Teams
  Orchard pest defence   agriculture · outdoor · Greenfields UK        open          fruit saved / week     18
  pH adjustment bench    chemistry · indoor · benchtop                 open          ΔpH from target        11
  Indoor vertical farm   agriculture · controlled environment          2 slots left  g / kWh / cycle        6
  Pick-and-pack cell     logistics · warehouse robotics                open          correct orders / hour  23
  Outdoor sprayer drone  agri-robotics · field · safety-vetted access  coming soon   pest Δ / ml sprayed

  Submit an eval: open spec · CC-BY · any domain

updated 26 May · specs CC-BY. physevals.io is imagined.

More than evals

Every execution of a physical eval produces something beyond a score: a timestamped record of sensor readings, actions taken, and outcomes observed, all under conditions that were defined in advance and held constant across participants. That record has value on its own.
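
A minimal sketch of what such a record could look like on disk, assuming one JSON object per line and a hypothetical `Step` type with the three parts named above (sensor readings, actions, outcomes). The field names and example values are illustrative, not a proposed format.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Step:
    timestamp: float   # seconds since the start of the slot
    observation: dict  # sensor readings
    action: dict       # what the agent did
    outcome: dict      # what was observed afterwards


def write_trajectory(steps, stream):
    """Write one JSON object per line so the record can be appended to and streamed."""
    for step in steps:
        stream.write(json.dumps(asdict(step), sort_keys=True) + "\n")


def read_trajectory(stream):
    """Read the steps back, skipping blank lines."""
    return [Step(**json.loads(line)) for line in stream if line.strip()]


# Illustrative trajectory only.
steps = [
    Step(0.0, {"cam-1": {"birds_seen": 4}}, {"drone": "sweep_north"}, {"birds_seen": 1}),
    Step(30.0, {"cam-1": {"birds_seen": 1}}, {"drone": "hold"}, {"birds_seen": 1}),
]

buffer = io.StringIO()
write_trajectory(steps, buffer)
buffer.seek(0)
print(read_trajectory(buffer) == steps)  # True
```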

The most immediate use is data collection. A team that runs an agent on the orchard eval for a week doesn’t just get a leaderboard position — they accumulate labelled trajectories in a real environment that would be expensive to stage deliberately. Even failed attempts are informative: a drone that misses a bird on Tuesday has documentation of exactly what the environment looked like and what the agent did.

For some categories of problem a further step is worth considering: repurposing the eval environment as a training environment. The orchard is already instrumented. The slot system already handles scheduling. If the cost of running episodes is low enough (bird deterrence is essentially free to attempt; wet-lab synthesis is not), the same infrastructure can run RL rollouts between evaluation windows. The environment that scores a model on Monday can help train the next version by Friday.

This doesn’t collapse the distinction between training and testing. Eval integrity still requires held-out conditions, independent scoring, and participants who didn’t design the environment. But the hardware doesn’t have to be idle between eval slots, and the data generated during evaluation doesn’t have to be discarded. For operators willing to share trajectories under open licences, a physical eval site becomes something closer to a living dataset — one that grows richer every time a new agent takes a slot.

Physical evals as a market

Why should one set up a physical eval? Setting up a physical eval can be a way to crowdsource intelligence for an unsolved problem.

The eval ecosystem can act as a market for intelligence: any participant (human, agent, team, company, hobbyist) can take a slot, attempt to saturate the metric, and submit. The leaderboard answers the AI-selection question by revealing whose approach actually delivers in the physical world.

Physical evals can be a way for problem-owners to delegate AI selection to a market. This is the same shift that happened with bug bounties. A company didn’t have to predict who the best vulnerability researchers were; it had to publish the surface and the rules, and the market sorted itself out. The grower doesn’t pick a model. The hospital doesn’t pick a model. The factory doesn’t pick a model. They pick a problem worth instrumenting and let the world’s AI builders compete to be the answer. As more domains follow suit, an aggregate picture emerges of where AI is actually good.

Get in touch

If any of this resonates, please write. Three good reasons:

DM @iamnotnicola on X.

Let’s turn more of the physical world into something AI can be measured against — and use that to point AI at problems that actually matter.

Acknowledgements

This was written by Nicola Greco with the support of AI. It was brainstormed as part of ARIA’s Scaling Trust programme, in collaboration with Alex Obadia.