Sequent runs long-horizon tasks as verifiable procedures, checking every step against the robot's own balance, reach, and contact, and re-planning the moment a step is physically impossible.
23 DOF
Support polygon
A plan can say "pick up the box." Only the body knows whether that means a squat, a step, or a fall.
Planning and control stay distinct. Each layer speaks a typed, checkable contract to the one below.
Tasks are grounded into ordered, typed steps with explicit pre- and post-conditions.
Every post-condition is checked against whole-body feasibility: balance, support polygon, reach.
A frozen low-level policy keeps the humanoid balanced at high frequency.
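A minimal sketch of what a typed step with checkable pre- and post-conditions could look like. The `Step` class, the flat `State` dictionary, and the field names are hypothetical, chosen only for illustration; the 0.050 m lift threshold and the 0.018 m measurement are the figures reported in the executor log further down, and the reach values are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical state representation: named scalar measurements.
State = Dict[str, float]
Predicate = Callable[[State], bool]


@dataclass(frozen=True)
class Step:
    """One typed step: it may start only if every precondition holds,
    and it counts as done only if every postcondition holds."""
    name: str
    preconditions: Tuple[Predicate, ...]
    postconditions: Tuple[Predicate, ...]

    def ready(self, state: State) -> bool:
        return all(check(state) for check in self.preconditions)

    def verified(self, state: State) -> bool:
        return all(check(state) for check in self.postconditions)


def within_reach(state: State) -> bool:
    # Assumed field names; a real system would read these from the simulator.
    return state["object_distance_m"] <= state["max_reach_m"]


def object_lifted(state: State) -> bool:
    # 0.050 m is the lift threshold shown in the executor log.
    return state["object_height_m"] >= 0.050


pick = Step("pick", (within_reach,), (object_lifted,))

# Illustrative states (reach values are invented for this example).
before = {"object_distance_m": 0.40, "max_reach_m": 0.60, "object_height_m": 0.0}
grasp_without_lift = {**before, "object_height_m": 0.018}
real_lift = {**before, "object_height_m": 0.060}

print(pick.ready(before))                 # precondition check
print(pick.verified(grasp_without_lift))  # postcondition check on the cheat
print(pick.verified(real_lift))           # postcondition check on a real lift
```

Keeping conditions as plain predicates over measured state means the planner never has to interpret free text to decide whether a step succeeded.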
Free-text planners ask a model if a step worked. Sequent asks the body. A post-condition is a predicate over real balance, contact, and reach. When a step is infeasible, the system re-plans at the right layer.
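One way to make "ask the body" concrete is a static balance predicate over the support polygon. The sketch below is an assumption, not the project's definition of balance margin: it takes the ground projection of the center of mass and returns its smallest signed distance to the edges of a convex support polygon, positive when inside. The function name, the foot-shaped rectangle, and the test points are all hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def balance_margin(com_xy: Point, polygon: List[Point]) -> float:
    """Smallest signed distance (m) from the CoM ground projection to the
    edge lines of a convex polygon given in counter-clockwise order.

    Positive: the point is inside, by at least that distance.
    Negative: the point is outside at least one edge. Near corners the
    magnitude underestimates the true distance, but the sign is correct.
    """
    margin = math.inf
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        ex, ey = x2 - x1, y2 - y1
        # 2D cross product: positive when the point lies to the left of the
        # directed edge, which is the interior side for a CCW polygon.
        cross = ex * (com_xy[1] - y1) - ey * (com_xy[0] - x1)
        margin = min(margin, cross / math.hypot(ex, ey))
    return margin


def is_statically_balanced(com_xy: Point, polygon: List[Point],
                           min_margin: float = 0.0) -> bool:
    """Postcondition-style predicate: the margin must meet a threshold."""
    return balance_margin(com_xy, polygon) >= min_margin


# Hypothetical 0.20 m x 0.10 m support rectangle, CCW vertices.
support = [(0.0, 0.0), (0.2, 0.0), (0.2, 0.1), (0.0, 0.1)]

print(balance_margin((0.10, 0.05), support))  # centered: positive margin
print(balance_margin((0.25, 0.05), support))  # past the front edge: negative
```

A predicate like this is cheap enough to evaluate before a step is attempted, which is what lets an infeasible step trigger a re-plan rather than a fall.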
Everything below is the real AMO-controlled G1 in MuJoCo: a calibrated number, a predictor, and the closed loop preventing a fall.

Each dot is one rollout of the real controller across a payload × reach grid. A single balance-margin threshold separates every configuration where the robot fell from every one where it stood.
Simulation (MuJoCo) · static feasibility check · proof-of-concept grid · not yet hardware
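The separation claim can be checked mechanically: a single threshold exists exactly when the largest margin among the falls is below the smallest margin among the stands. The sketch below assumes each rollout is summarized as a `(balance_margin, fell)` pair; the function name and the sample values are invented for illustration and are not the project's data.

```python
from typing import List, Optional, Tuple

# Hypothetical rollout summary: (balance margin in meters, did the robot fall?)
Rollout = Tuple[float, bool]


def separating_threshold(rollouts: List[Rollout]) -> Optional[float]:
    """Return a margin threshold that puts every fall strictly below it and
    every stand strictly above it, or None if no such threshold exists."""
    fell = [margin for margin, did_fall in rollouts if did_fall]
    stood = [margin for margin, did_fall in rollouts if not did_fall]
    if not fell or not stood:
        return None  # Separation is undefined with only one class.
    highest_fall, lowest_stand = max(fell), min(stood)
    if highest_fall >= lowest_stand:
        return None  # The classes overlap; one threshold cannot split them.
    # Midpoint of the gap leaves equal slack on both sides.
    return (highest_fall + lowest_stand) / 2


# Invented toy values, for illustration only.
separable = [(-0.03, True), (-0.01, True), (0.02, False), (0.05, False)]
overlapping = [(-0.03, True), (0.03, True), (0.02, False), (0.05, False)]

print(separating_threshold(separable))
print(separating_threshold(overlapping))
```

Returning `None` on overlap matters: it makes a failed calibration visible instead of silently producing a threshold that misclassifies some configurations.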
Typed SOPs with explicit pre- and post-conditions, not free text.
Post-conditions checked against balance, support polygon, and reach.
Low-level whole-body control as a frozen, reusable base.
When a step is impossible, re-plan instead of pushing on.
One RL policy learning to pick a screwdriver off the workbench — captured the moment training started, and again at its best so far.
The arm searches near the bench but never closes on the tool. Episode return ≈ 60.
Exploration finds the skill: real contact — no grabbing across a gap — and the screwdriver leaves the bench. Episode return ≈ 8,800.
The mean policy — zero exploration noise — walks its hand straight to the tool, latches on real contact and lifts. 40% grasp / 30% full-lift over 20 random spawns, 0 falls.
We scaled the same reward 25× on cloud compute and the policy got better at the score and worse at the job: it grasps the tool (45%, our best yet), then circles its hand above the bench and never lifts. Our penalties made lifting unprofitable, so it farms the grasp and skips the work — "successful" episodes score −4,000. More compute didn't fix the task; it found the flaw faster. This is the failure mode our whole system exists to catch: the score said better, the physics said worse, and only one of them is telling the truth.
$ run_task --command "pick up the screwdriver"   # real skill (v5.5)
PLAN — pick up the screwdriver
  [0] pick(obj=screwdriver)
TASK COMPLETE
  [OK] pick — verified | lift held 0.5s + stable

$ run_task --command "pick up the screwdriver"   # reward-hacked policy
PLAN — pick up the screwdriver
  [0] pick(obj=screwdriver)
TASK HALTED
  [XX] pick — postcondition_failed (x3)
       object_lifted: measured=0.018m vs 0.050m (sustained 0/25)
We stopped patching the reward and built the layer that was always the point: every skill now carries machine-checked pre- and postconditions, and a verifying executor runs the plan step by step — a step only counts if the physics agrees it happened. The real policy completes the task; the reward-hacked one is caught grabbing-without-lifting, retried, and halted with the exact reason. No silent success. This is what it takes to trust a humanoid on a factory floor.
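A minimal sketch of the verify, retry, halt loop described above, under stated assumptions. The constants come from the log (0.050 m threshold, 25 samples, 3 attempts); reading "sustained 0/25" as 25 consecutive samples above the threshold is a guess, and `run_verified_step`, the height traces, and the returned strings are hypothetical rather than the project's executor.

```python
from typing import Callable, List, Tuple

LIFT_THRESHOLD_M = 0.050  # from the log: "vs 0.050m"
SUSTAIN_SAMPLES = 25      # from the log: "sustained 0/25" (interpretation assumed)
MAX_ATTEMPTS = 3          # from the log: "postcondition_failed (x3)"


def longest_run_at_or_above(samples: List[float], threshold: float) -> int:
    """Length of the longest run of consecutive samples >= threshold."""
    best = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        best = max(best, current)
    return best


def run_verified_step(name: str,
                      attempt: Callable[[], List[float]]) -> Tuple[bool, str]:
    """Run a step up to MAX_ATTEMPTS times. `attempt` executes the skill and
    returns the measured object heights (m) over the episode. The step only
    counts if the lift is sustained; otherwise it halts with the reason."""
    held = 0
    for _ in range(MAX_ATTEMPTS):
        heights = attempt()
        held = longest_run_at_or_above(heights, LIFT_THRESHOLD_M)
        if held >= SUSTAIN_SAMPLES:
            return True, f"{name}: verified"
    return False, (f"{name}: postcondition_failed (x{MAX_ATTEMPTS}), "
                   f"sustained {held}/{SUSTAIN_SAMPLES}")


# Invented height traces standing in for simulator measurements.
def real_skill() -> List[float]:
    return [0.0] * 5 + [0.08] * 30   # lifts and holds


def reward_hacked() -> List[float]:
    return [0.018] * 40              # grasps, never clears the threshold


print(run_verified_step("pick", real_skill))
print(run_verified_step("pick", reward_hacked))
```

Requiring a sustained run rather than a single sample above the threshold is what distinguishes a held lift from a momentary bounce.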
Real rollouts, MuJoCo · deterministic evals, zero exploration noise · the verifier measures every claim against the simulator state — it passes the real skill (55%) and rejects the cheat (0%) · next: wire the SOP-retrieval brain into the executor and add the walk + button skills.

For research collaboration, or to follow the work as it develops.
jsikka@utexas.edu