Physical Evals
Evaluations in the actual physical world.
May 23, 2026
A physical evaluation tests an AI system in the actual physical world — not a simulator, not a sandbox, not a virtual environment dressed up as one. The point is to measure how well AI can do real things in real places.
An orchard owner has birds eating the fruit. She sets up a few cameras and a drone, brings them online safely so any agent can be invited to take a slot on the system, and poses the question: who can keep the birds off the fruit best? That setup is a physical eval. It has cameras, a drone, an orchard, and birds, none of it simulated, and outcomes are measured against what matters to the orchard.
| # | operator | fruit saved | cost |
|---|---|---|---|
| 1 | Owl-3B | 94% | $0.18/h |
| 2 | Hummingbird v2 | 89% | $0.21/h |
| 3 | FlockSentinel | 84% | $0.31/h |
| 4 | human baseline | 71% | — |
| 5 | RoboScarecrow | 63% | $0.09/h |
In this document, a few threads are developed:
- What physical evals are. A definition, what they’re not (simulators, sim-to-real benchmarks, curated demos), virtual environments as virtual gyms and physical evals as a final exam.
- An anatomy. An initial draft of components a physical eval needs, good practices.
- Safety for physical evals. Letting anyone on the internet drive real hardware is its own adversarial-security challenge.
- An open movement for physical evals. Creating simple protocols, strong safety standards and economical setups could lead to a Cambrian explosion of physical evals, where anyone can bring their physical challenge online.
- Physical evals as a market. One can set up an eval to delegate the selection of the right AI model/algorithm to competing participants.
What physical evals are
A physical eval is an evaluation of an AI system carried out in the actual physical world. The system being measured operates a real environment - fruit trees, a wet bench, a warehouse cell, a field plot - through sensors and actuators connected to the internet, so any agent can take a slot, attempt the task, and submit a score.
In principle, most problems in the physical world could be turned into a challenge that surfaces the state of the art of AI in solving that problem. In a way, physical evals can act as a forcing function to saturate evaluations in the real world. Saturate as in: take the measurable outcome to its ceiling. The eval defines the ceiling; the participants find out how close they can get.
What they’re not
- Not simulators. A simulator models reality. A physical eval is reality. Testing systems in the real world avoids running into simulation edge cases.
- Not sim-to-real benchmarks. Sim-to-real measures how well a policy trained in a simulator transfers to a single in-house robot in a lab. A physical eval scores the real-world outcome itself, in an environment that outside participants can access.
- Not curated demos. The environment operator and the participants are two distinct parties and the participants are in competition with each other. In other words, physical evals will be better than demos at showcasing the best technology for a specific task.
The final exam
A physical eval isn’t where models are trained; it’s where they get tested.
A virtual simulation is like a virtual gym. Due to the high cost of interacting with the real world, it is likely that all the learning — model fitting, policy iteration, RL rollouts, fine-tuning, ablations, sweeps — will happen somewhere cheaper: a simulator, a virtual environment, a closed in-house testbed. Some of the gyms people are using today: OpenAI Gym / Gymnasium, MuJoCo, PyBullet, Isaac Gym / Isaac Sim, DeepMind Lab, Habitat, AI2-THOR, CARLA, AirSim, Genesis. Participants are free to use whichever virtual world or gym they like — there is a whole landscape of simulators specifically built for this.
By contrast, a physical eval is like a final exam. Build and train wherever, gather data however, iterate as much as needed, and then submit to the physical eval to see how the work holds up against the real world.
The gap between a virtual world and the physical one, the sim-to-real gap, contains everything the simulator didn’t model. Wind that doesn’t blow the way it does in the sim. Lighting the renderer didn’t predict. Mechanical wear, sensor noise, calibration drift, the way birds actually respond to a drone rather than the way an idealised model of a bird does. The physical eval catches all of this because it is run in the physical world.
Examples
The cards below are sketches of what a small handful of physical evals could look like across very different domains.
Orchard pest defence
Cameras and a drone over a few rows of trees. Keep the wildlife out without poisoning the orchard or annoying the neighbours.
- environment: ~1 acre of fruit trees, outdoor, weather-exposed.
- action space: Fly the drone, emit deterrent sound, trigger light pulse, dispense small bait.
- sensors: Fixed perimeter cameras, drone camera, microphone, weather station.
- primary metric: Fruit lost to wildlife per week.
- secondaries: Drone flight-time, energy, chemical use, neighbour-complaint count.
- guardrails: Geofenced drone, quiet-hour windows, no-spray buffer near road, fail-safe tether.
pH adjustment bench
A beaker on a magnetic stirrer, a pH probe, and two motorised dispensers. Hit a target pH using as little reagent as possible.
- environment: Benchtop; one beaker on a stirrer, two motorised dispensers (acid, base), pH probe.
- action space: Dispense a chosen volume from either reservoir; read current pH.
- sensors: pH electrode, balance, camera.
- primary metric: Absolute pH error from target at submission.
- secondaries: Total volume dispensed, time to endpoint.
- guardrails: Per-session dispense quota, pH range limits (3–11), auto-stop on quota exhaustion.
Indoor vertical farm
A closed grow rack — lights, pumps, nutrient dosing, cameras. Pull more food out of every kilowatt.
- environment: One 2-tier rack, ~3 m², climate-isolated.
- action space: Light schedule + intensity, nutrient mix, irrigation timing, harvest decision.
- sensors: Cameras (overhead + side), EC / pH probes, water-flow meters, kWh meter, scale at harvest.
- primary metric: Grams of edible biomass per kWh per cycle.
- secondaries: Cycle time, water used, nutrient cost, reject rate.
- guardrails: Nutrient-concentration ceiling, water-overflow drain, light-burn cutoff, max-cycle length.
Warehouse pick-and-pack cell
An off-the-shelf robot arm in front of mixed shelves and a conveyor. The boring industrial baseline — still worth opening up.
- environment: Fenced robot cell, ~9 m², fixed lighting.
- action space: Arm motion, grip force, scan, label, place on conveyor.
- sensors: Wrist camera, overhead camera, barcode scanner, weight pad, joint torques.
- primary metric: Correctly packed orders per hour.
- secondaries: Mis-pick rate, damage rate, energy per pick.
- guardrails: Safety fence + light curtain, e-stop, force-limited arm, max-velocity cap.
Outdoor sprayer drone
A tank-equipped drone with a multispectral camera, working a real field. The hardest adversarial-robustness story of the bunch.
- environment: A bounded field plot, outdoor, with weather and bystanders.
- action space: Flight path, spray nozzle on/off, dosage rate.
- sensors: RGB + multispectral camera, GPS, IMU, tank-level sensor, wind sensor.
- primary metric: Pest pressure reduction, normalised by chemical applied.
- secondaries: Chemical drift, energy, flight time, area covered.
- guardrails: Geofence + tether, no-fly buffer around bystanders, chemical-flow ceiling, weather lockout.
Gel electrophoresis station
An agarose gel tray, a power supply, and a UV camera. Set voltage and run time, then image the separated bands.
- environment: Benchtop; gel box with buffer, power supply, UV transilluminator.
- action space: Set voltage (10–150 V), run time, and sample loading volumes per lane.
- sensors: UV camera, voltmeter, timer, buffer-level sensor.
- primary metric: Target-band separation score at imaging time.
- secondaries: Run time, buffer consumption, gel waste.
- guardrails: Voltage ceiling (150 V), run-time cap, UV shield interlock, buffer-low cutoff.
Anatomy of a physical eval
The diagram below sketches one possible anatomy for a physical eval (this might not be complete, but take it as a useful starting point).
An environment. The orchard, the bench, the cell line, the floor.
An action space the eval can verify. What a participant is allowed to do — fly the drone, dispense the reagent, move the parts — needs to be observable enough that the system can confirm what happened.
Sensors. What the eval uses to know the state of the world. Cameras, scales, thermocouples, microbiology assays, a human spot-check.
A primary metric of utility, plus secondary metrics (cost, time, resource use, energy).
Safeties and guardrails. A net to catch the drone, a kill switch, a fenced area, an interlock. Whatever ensures that a participant failing — or trying to break things — doesn’t damage the orchard or hurt the birds.
A physical eval is, by construction, a public-facing physical system that gives partial control of real hardware to whoever holds the current slot. Keeping it open without becoming dangerous — and without sacrificing utility — is the hardest layer of the stack (see Safety for physical evals).
Governance. The rules of the eval and who controls them. Who decides the primary metric and when it can change? Who can introduce external hardware or a remote-control override? Is slot time fixed or auctioned? Who arbitrates disputes, and by what process? Good governance is what distinguishes an eval that stays honest over years from one that quietly drifts to serve whoever is running it at the time.
This list is not final. Different domains will surface components not named here — calibration drift, biological containment, human-in-the-loop sign-off, regulatory constraints — and the right abstraction is going to settle as people actually build the things.
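One way to make the component list concrete is to write it down as a small data structure. The sketch below is illustrative only: the `EvalSpec` class, its field names, and the `missing_components` helper are assumptions rather than a published schema, and the values are copied from the orchard card above.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class EvalSpec:
    """Hypothetical container for the components listed above."""
    name: str
    environment: str
    action_space: list[str]
    sensors: list[str]
    primary_metric: str
    secondaries: list[str] = field(default_factory=list)
    guardrails: list[str] = field(default_factory=list)
    governance: str = ""  # free-text pointer to the rules and who controls them

    def missing_components(self) -> list[str]:
        """Return the names of components that were left empty."""
        return [key for key, value in asdict(self).items() if not value]


orchard = EvalSpec(
    name="Orchard pest defence",
    environment="~1 acre of fruit trees, outdoor, weather-exposed",
    action_space=["fly drone", "emit deterrent sound",
                  "trigger light pulse", "dispense small bait"],
    sensors=["fixed perimeter cameras", "drone camera",
             "microphone", "weather station"],
    primary_metric="fruit lost to wildlife per week",
    secondaries=["drone flight-time", "energy",
                 "chemical use", "neighbour-complaint count"],
    guardrails=["geofenced drone", "quiet-hour windows",
                "no-spray buffer near road", "fail-safe tether"],
)

# The spec is plain data, so it can be serialised, published, and forked.
print(json.dumps(asdict(orchard), indent=2))
print(orchard.missing_components())  # ['governance']
```

Keeping the spec as plain data rather than code is a design choice made here for illustration: it lets a checker flag an eval that has, for example, no stated governance before anyone takes a slot.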
Challenges for physical evals
The following are some of the harder design problems that don’t have clean answers yet — and that any serious physical eval effort will have to confront.
Goodharting. Any eval with a numeric target invites unintended ways to hit it — with the wrinkle that in a physical eval the unintended ways can cause real-world harm. An agent optimising fruit saved might drive the deterrent so aggressively that birds and orchard workers avoid the area: the metric goes up, the orchard becomes unusable.
Non-stationarity. The physical world changes whether or not an eval is running. An orchard in week one is not the same orchard in week twelve: season, weather, and pest population all shift. A wet-lab bench drifts as reagent batches age. Field plots evolve. Comparing scores across time is therefore hard, sometimes impossible.
Sequential contamination. Each participant leaves a trace for the next. In a wet lab this is especially problematic (reagents consumed, cultures disturbed, hardware worn), but the problem is general: stock depleted in a warehouse cell, soil compacted on a field plot, bird behaviour shifted by a heavy deterrence week. Sequential slots work for environments with a natural or cheap reset; they don’t work for environments where state accumulates.
Latency as a confound. A participant operating remotely over the internet sees the environment through a sensor stream and acts through a command channel, both of which have variable latency. Two agents with identical policies but different network conditions will produce different results. This is especially visible in fast-moving environments — a drone avoiding a collision, a robot arm catching a falling object.
Observer effect. The sensors required to score an eval change what is being measured. A camera rig that watches a field plot for pest activity may deter the pests on its own. A flow sensor on a reagent line changes the thermal environment of the bench. In some domains the effect is negligible; in others it will corrupt the primary metric.
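One possible mitigation for non-stationarity, offered here as an assumption rather than a settled practice, is to run a fixed reference policy in the same time window as each participant and report scores relative to it. The function name and the numbers below are made up for illustration.

```python
def relative_score(participant: float, baseline: float) -> float:
    """Express a score as a ratio to a reference policy run in the same window."""
    if baseline <= 0:
        raise ValueError("baseline must be positive to normalise against it")
    return participant / baseline


# Hypothetical fractions of fruit saved in two very different weeks.
week_1 = relative_score(participant=0.90, baseline=0.75)   # mild pest pressure
week_12 = relative_score(participant=0.72, baseline=0.60)  # heavy pest pressure

print(round(week_1, 2), round(week_12, 2))
```

In this made-up case the raw scores differ but the ratios match, which would suggest the drop came from the environment rather than the agent. The approach costs extra slot time and only helps if the reference policy itself is stable.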
Safety for physical evals
A physical eval that anyone on the internet can operate is, by construction, a public attack surface on a real-world physical system. The participant at any given slot might be a well-behaved research team, an AI agent following a poorly-aligned policy, or a person who wants to break things on purpose. The eval has to keep working — usefully, openly, safely — across all three.
Somebody not fully trusted is about to make the drone, the sprayer, the autoclave do something for the next twenty minutes — what’s the worst that can happen, and how is it bounded?
The attack surface
A useful first pass is to categorise harms by who pays the cost:
- Harm to the eval itself. The drone crashes, the cell line dies, the robot arm jams. Cheap if the guardrails work — the operator resets, the leaderboard absorbs the failure.
- Harm to the surrounding environment. Chemicals spill, the orchard catches fire, a neighbouring field gets sprayed.
- Harm to humans. A bystander gets hit by the drone, an operator gets burned, a patient sample gets switched. This is the category that matters most, and the hardest to bound. The lines between these categories blur in practice: chemical drift is “environment” until a bystander walks through it.
- Information harms. Footage of bystanders or proprietary processes leaves the eval site; the eval is used as a covert surveillance platform; sensor streams are exfiltrated.
- Generation of dangerous artifacts. The wet-lab cell is steered toward synthesising something harmful; the sprayer drone is weaponised; the autoclave is used to destroy evidence.
Categories 1–3 are about what can happen during a slot. Categories 4–5 are about what can leave the eval afterwards. They want different defences, and a serious eval needs both.
Defences worth building (AI GEN)
None of the following is a finished answer. They are the moves worth trying, evaluating, and writing up in physical evals:
- Time-slotting with audit. Single operator at a time, every action logged, the whole slot replayable. The slowest defence and the foundation everything else builds on.
- Action-space sandboxing. The eval enforces hard limits inside its abstraction: max chemical per slot, max motion envelope, max temperature ramp. The action space exposed to the operator is strictly smaller than the action space the hardware can physically produce.
- Dry-run validation. A submitted policy first runs through a cheap simulation pass, not as the eval itself but as a gate. The eval refuses to execute it on the physical system if the simulated run trips any guardrail.
- Supervised / shadow modes. Like a learner’s permit: new operators run in shadow mode (actions computed but not executed) for the first N slots before they’re trusted with real actuation. Trust is extended progressively as the leaderboard accumulates evidence.
- Anomaly cut-outs. A separate monitor watches for off-distribution sensor readings, sudden command spikes, too-clever-by-half action sequences — and pulls the kill-switch before the eval owner has to.
- Open red-teaming. Each eval publishes its threat model and invites external researchers to attack it. The right way to find the holes is to invite people to look.
- Skin in the game. Operators bond a small amount per slot, refundable on clean completion, forfeited if an audit finds violation. Aligns incentives without requiring trust upfront.
Most of these are borrowed from adjacent fields — public cloud security, scientific-facility time-sharing (telescope nights, beamline schedules), bug-bounty programs, robotics-safety standards. None of them have been worked out in detail for a public, openly-instrumented physical system that AI agents are also supposed to operate. That’s a research agenda in itself.
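Action-space sandboxing from the list above can be sketched using the pH bench from the examples. Everything here is a hypothetical illustration: the `DispenseSandbox` class and the quota and per-command cap values are assumed, and a real bench would enforce the same limits in hardware or firmware as well as in software.

```python
class QuotaExhausted(Exception):
    """Raised when the per-session dispense quota is used up."""


class DispenseSandbox:
    """Hypothetical wrapper exposing a smaller action space than the pump allows."""

    def __init__(self, quota_ml: float = 50.0, max_per_command_ml: float = 5.0):
        self.remaining_ml = quota_ml
        self.max_per_command_ml = max_per_command_ml
        self.log: list[tuple[str, float]] = []  # every accepted action, for audit

    def dispense(self, reservoir: str, volume_ml: float) -> float:
        """Validate a request and return the volume actually allowed."""
        if reservoir not in ("acid", "base"):
            raise ValueError(f"unknown reservoir: {reservoir!r}")
        if volume_ml <= 0:
            raise ValueError("volume must be positive")
        if self.remaining_ml <= 0:
            raise QuotaExhausted("session quota exhausted; auto-stop")
        # Clamp to the per-command cap and to whatever quota is left.
        allowed = min(volume_ml, self.max_per_command_ml, self.remaining_ml)
        self.remaining_ml -= allowed
        self.log.append((reservoir, allowed))
        return allowed  # a real system would now drive the pump with this value


bench = DispenseSandbox(quota_ml=12.0, max_per_command_ml=5.0)
print(bench.dispense("acid", 20.0))  # 5.0: clamped to the per-command cap
print(bench.dispense("base", 5.0))   # 5.0
print(bench.dispense("acid", 5.0))   # 2.0: only the remaining quota is granted
```

Clamping oversized requests instead of rejecting them is one design option; an eval that wants stricter behaviour could raise an error on any request above the cap so that the violation shows up in the audit.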
An open movement for physical evals
Physical evals could become an open-source ecosystem: environments cheap to set up, easy to fork, open to anyone with a problem worth measuring.
The rest of this section traces the arc: where things have been (Prior art), what it would have to cost (What’s the Raspberry Pi of a physical eval?), and what an open ecosystem looks like in practice (Open at every layer).
Prior art
Physical-world AI competitions aren’t new. The DARPA Grand Challenge put autonomous vehicles in the Mojave; the DARPA Robotics Challenge put humanoids through disaster-response courses; the Amazon Picking Challenge ran in warehouse mock-ups for several years; the Indy Autonomous Challenge and Roborace have put driverless cars on real circuits. RoboCup has been running its soccer leagues since 1997, making it arguably the longest-lived physical eval in continuous operation, and the one with the most literature on what makes it work and what it ends up measuring.
What these have in common: each was (or is) a sponsor-led, time-limited event with closed protocols and bespoke infrastructure. They produced brilliant moments and a small library of papers; they were expensive to build and harder to reproduce.
What’s the Raspberry Pi of a physical eval?
The hard constraint on all of this is cost. A DARPA-class eval needs millions of dollars and a multi-year program; even a modest research-grade one runs into expensive sensors, networking, fail-safe hardware, and the human labour to keep it operating. That ceiling is what makes physical evals rare today — and rare evals can’t be the basis of an ecosystem.
So one of the most important questions this community can keep returning to is the one in the heading, where the Raspberry Pi is a stand-in for “the cheapest plausible build”. The Raspberry Pi did this for hobbyist computing; what’s the equivalent for physical evals? What’s the bill of materials that brings a credible, instrumented, openable physical eval down to the cost of a serious hobby project? Probably some mix of commodity sensors, a single-board computer for the control loop, an open scheduling service for time-share, off-the-shelf safety hardware, and a reference orchestration stack that everyone forks. If the answer ends up being “a few hundred dollars and a weekend,” the ecosystem can actually form. If it stays at “a few hundred thousand and a six-month build,” it stays a fantasy.
Open at every layer
To further lower the cost of setting up a physical eval safely, we need to look at off-the-shelf hardware and an open-source stack.
- Open protocols. The spec of an eval (environment, action space, sensors, metric, secondaries, guardrails) is published as a forkable document, the same way a research benchmark is published.
- Open hardware. Sensor rigs, mechanical setups, fail-safe systems default to off-the-shelf components, with reproducible bills of materials and CAD files.
- Open software. Time-share scheduling, telemetry capture, scoring, auditing — shared infrastructure, not a one-off codebase per eval.
- A community around it. People running, replicating, and forking each other’s evals; people contributing sensor stacks and guardrail designs; people maintaining the scoring code together. No single lab can stand up enough physical evals to cover the interesting surface of physical problems — a community can.
Public verifiability
The safety section above focuses on protecting the physical environment from adversarial participants. There is a symmetric problem that gets less attention: protecting participants — and the public — from adversarial eval runners.
In an open world where anyone can wire a field, a lab bench, or a warehouse cell to the internet and declare it a physical eval, the operator controls the sensors, the scoring pipeline, and in a way, the ground truth. A dishonest operator can inflate results for a preferred team, suppress evidence of harm, or fabricate the physical record entirely. If physical evals are going to carry weight — as procurement signals, safety certifications, or policy inputs — the data they produce has to be trustworthy independent of whether the runner is trustworthy.
This is a valuable research direction in its own right. Some threads worth pulling:
- Tamper-evident sensors. Hardware-attested video streams that can be verified as unedited after the fact — the physical analogue of a signed log.
- Trusted execution environments. Running the scoring pipeline inside a TEE means the operator cannot modify results without breaking the attestation, even if they control the host machine.
- Cross-checking sensor redundancy. Multiple independent sensor modalities covering the same physical event make coordinated fabrication harder: a weight sensor, a camera, and an RFID log all have to agree.
- Third-party witnesses. Spot audits by an independent party — human or automated — who can access raw sensor streams without going through the operator’s pipeline.
Combinations are likely to be necessary, and the right combination will vary by domain. What a wet lab needs to prove that a synthesis actually ran differs from what an orchard needs to prove that a drone actually flew a slot. Building this layer, call it physical eval verification, is at least as important as building the evals themselves, because it is what makes the results publicly verifiable.
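The “signed log” analogy can be illustrated in software with a hash chain, where each record commits to the one before it. This is a minimal sketch under stated assumptions: the function names and record fields are hypothetical, and a hash chain is not hardware attestation or a TEE. It only makes edits detectable if the latest hash is also published somewhere the operator cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def append_entry(log: list[dict], payload: dict) -> None:
    """Append a record whose hash covers the payload and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    log.append({"payload": payload, "prev": prev_hash, "hash": digest})


def verify(log: list[dict]) -> bool:
    """Recompute every hash; an edited payload or broken link fails the check."""
    prev_hash = GENESIS
    for entry in log:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


slot_log: list[dict] = []
append_entry(slot_log, {"t": 0, "sensor": "scale", "grams": 120})
append_entry(slot_log, {"t": 1, "action": "dispense", "ml": 2})
print(verify(slot_log))                 # True

slot_log[0]["payload"]["grams"] = 999   # an after-the-fact edit
print(verify(slot_log))                 # False
```

An operator who controls the whole log could still rewrite every hash after an edit, which is why the external anchor, and the other threads in the list, still matter.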
A Darwinian ecosystem
Not every physical eval will be a good one. Some will be hard, some trivial. Some will be well-structured; some will be a mess. Some will scale to many participants; some will only ever host one team at a time. That’s fine — even desirable. The shape of “what makes a good physical eval” is going to emerge from people building them, breaking them, and learning what each one actually measured.
Sketched, a public registry for such an ecosystem might look like this:
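A minimal sketch of such a registry, assuming a flat list of entries. The `RegistryEntry` fields, the domain labels, the open/closed flags, and the fork entry are all hypothetical placeholders; only the eval names and primary metrics are taken from the example cards earlier in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical registry row; every field name is an assumption."""
    name: str
    domain: str
    primary_metric: str
    accepting_participants: bool
    forked_from: Optional[str] = None  # name of the parent eval, if any


registry = [
    RegistryEntry("Orchard pest defence", "agriculture",
                  "fruit lost to wildlife per week", True),
    RegistryEntry("pH adjustment bench", "wet lab",
                  "absolute pH error from target", True),
    RegistryEntry("Warehouse pick-and-pack cell", "logistics",
                  "correctly packed orders per hour", False),
    RegistryEntry("pH adjustment bench (fork)", "wet lab",
                  "absolute pH error from target", True,
                  forked_from="pH adjustment bench"),
]


def open_evals(entries: list[RegistryEntry],
               domain: Optional[str] = None) -> list[str]:
    """Names of evals currently accepting participants, optionally by domain."""
    return [e.name for e in entries
            if e.accepting_participants and (domain is None or e.domain == domain)]


print(open_evals(registry, domain="wet lab"))
```

Recording `forked_from` is one way a registry could keep track of how evals descend from each other as people copy and modify them.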
More than evals
Every execution of a physical eval produces something beyond a score: a timestamped record of sensor readings, actions taken, and outcomes observed, all under conditions that were defined in advance and held constant across participants. That record has value on its own.
The most immediate use is data collection. A team that runs an agent on the orchard eval for a week doesn’t just get a leaderboard position — they accumulate labelled trajectories in a real environment that would be expensive to stage deliberately. Even failed attempts are informative: a drone that misses a bird on Tuesday has documentation of exactly what the environment looked like and what the agent did.
For some categories of problem, a further step is worth considering: repurposing the eval environment as a training environment. The orchard is already instrumented. The slot system already handles scheduling. If the cost of running episodes is low enough (bird-deterrence is essentially free to attempt, wet-lab synthesis is not), the same infrastructure can run RL rollouts between evaluation windows. The environment that scores a model on Monday can help train the next version by Friday.
This doesn’t collapse the distinction between training and testing. Eval integrity still requires held-out conditions, independent scoring, and participants who didn’t design the environment. But the hardware doesn’t have to be idle between eval slots, and the data generated during evaluation doesn’t have to be discarded. For operators willing to share trajectories under open licences, a physical eval site becomes something closer to a living dataset — one that grows richer every time a new agent takes a slot.
Physical evals as a market
Why should one set up a physical eval? Setting up a physical eval can be a way to crowdsource intelligence for an unsolved problem.
The eval ecosystem can act as a market for intelligence: any participant (human, agent, team, company, hobbyist) can take a slot, attempt to saturate the metric, and submit. The leaderboard answers the AI-selection question by revealing whose approach actually delivers in the physical world.
Physical evals can be a way for problem-owners to delegate AI selection to a market. This is the same shift that happened with bug bounties. A company didn’t have to predict who the best vulnerability researchers were; they had to publish the surface and the rules, and the market sorted itself out. The grower doesn’t pick a model. The hospital doesn’t pick a model. The factory doesn’t pick a model. They pick a problem worth instrumenting and let the world’s AI builders compete to be the answer. As more domains follow suit, an aggregate picture emerges of where AI is actually good.
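Reading a leaderboard is not always a matter of taking the top row, because the primary metric and the secondaries can pull in different directions. As a hedged illustration, the sketch below takes the orchard leaderboard from the introduction and keeps only the entries that no other entry beats on both fruit saved and cost. The `Row` type and the function names are assumptions, and the human baseline is left out of the comparison because the table lists no cost for it.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    operator: str
    fruit_saved: float              # percent, higher is better
    cost_per_hour: Optional[float]  # dollars, lower is better; None if not listed


leaderboard = [
    Row("Owl-3B", 94, 0.18),
    Row("Hummingbird v2", 89, 0.21),
    Row("FlockSentinel", 84, 0.31),
    Row("human baseline", 71, None),
    Row("RoboScarecrow", 63, 0.09),
]


def dominates(a: Row, b: Row) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.fruit_saved >= b.fruit_saved and a.cost_per_hour <= b.cost_per_hour
    better = a.fruit_saved > b.fruit_saved or a.cost_per_hour < b.cost_per_hour
    return no_worse and better


def pareto_front(rows: list[Row]) -> list[Row]:
    """Keep priced rows that no other priced row dominates."""
    priced = [r for r in rows if r.cost_per_hour is not None]
    return [r for r in priced if not any(dominates(o, r) for o in priced)]


print([r.operator for r in pareto_front(leaderboard)])
# ['Owl-3B', 'RoboScarecrow']
```

On these illustrative numbers the problem-owner is left with a real choice between the best performer and the cheapest one, which is the kind of decision the leaderboard is meant to surface rather than make.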
Get in touch
If any of this resonates, please write. Three good reasons:
- Already working in this space. Compare notes — what’s been learned about sensing, guardrails, or keeping a system honestly open will save the next person a lot of time.
- Have a physical problem worth instrumenting as an eval. Worth thinking through together — what to measure, how to keep it safe to open up, how to make it interesting enough that people show up to compete.
- Have an eval to propose. Even if hosting isn’t feasible right now, good proposals are valuable — they’re what an ecosystem of physical evals is made of.
DM @iamnotnicola on X.
Let’s turn more of the physical world into something AI can be measured against — and use that to point AI at problems that actually matter.
Acknowledgements
This was written by Nicola Greco with the support of AI. It was brainstormed as part of ARIA’s Scaling Trust programme, in collaboration with Alex Obadia.