Simulated humanoid loco-manipulation

A ticket, a plan, a robot.

An operator describes a plant-floor incident in plain language. A fine-tuned retriever finds the matching procedure among 1,012 SOPs, a planner turns it into a sequence of skills, and a simulated Unitree G1 runs that sequence in one continuous simulation: walk, pick, carry, press, lever. A verifier checks each step against measured simulator state and reports any step the physics does not confirm as failed.

Jatin Sikka · Sequent Robotics · 2026


One continuous simulation · walk → pick → carry 2.8 m → place → press → rest · all criteria pass

One continuous simulation, no scene cuts. The box is not dropped and the robot does not fall. The RL press reaches 34.5 mm with no re-pressing (PUMPS=0), and every criterion is measured during the run.

01 · Overview

System overview.

The system has two parts joined by a typed contract: a brain that turns language into a sequence of skills, and a body, a Unitree G1 with an added gripper in MuJoCo, that executes it. A verifier checks every precondition and postcondition from simulator state rather than from the policy’s own report. It measures the outcome whether a controller or a learned policy produced the motion.

Pipeline: ticket → fine-tuned bi-encoder over 1,012 SOPs → faithful planner → verifier/executor (measuring from mjData) → G1 body skills (walk, pick, press, lever)
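As a rough illustration of the typed contract and the verifier described above, the following sketch judges each step only from state read back from the simulator. Every name here (`SkillStep`, `StepResult`, `run_plan`, the `read_state` and `execute` callbacks, the state keys) is hypothetical and not taken from the project's code; the real verifier reads MuJoCo's mjData, which is stubbed out as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of measured simulator quantities (stand-in for mjData).
SimState = Dict[str, float]
Check = Callable[[SimState], bool]


@dataclass(frozen=True)
class SkillStep:
    """One plan step: a skill name plus checks on measured state."""
    skill: str
    precondition: Check
    postcondition: Check


@dataclass(frozen=True)
class StepResult:
    skill: str
    passed: bool
    reason: str


def run_plan(steps: List[SkillStep],
             read_state: Callable[[], SimState],
             execute: Callable[[str], None]) -> List[StepResult]:
    """Run every step and report any step the measured state does not confirm."""
    results: List[StepResult] = []
    for step in steps:
        if not step.precondition(read_state()):
            results.append(StepResult(step.skill, False, "precondition not met"))
            continue
        # Whatever the controller or policy reports about itself is ignored.
        execute(step.skill)
        if step.postcondition(read_state()):
            results.append(StepResult(step.skill, True, "postcondition confirmed"))
        else:
            results.append(StepResult(step.skill, False, "postcondition not confirmed"))
    return results
```

Because `execute` returns nothing, the only route to a pass is a postcondition that holds in the state read after the motion, regardless of whether a controller or a learned policy produced it.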

02 · The brain

Retrieval over 1,012 SOPs.

The benchmark has 1,012 SOPs across 46 equipment families with within-family hard negatives, and 2,338 incidents written in operator language that avoids the SOP’s title words. On the held-out test set, the fine-tuned MiniLM bi-encoder scores 0.584 R@1 against 0.301 for the TF-IDF keyword baseline. Two negative results also held: a larger bert-base retriever trained from scratch did not beat TF-IDF, and an off-the-shelf cross-encoder reranker lowered R@1 from 0.584 to 0.442, so it is not used.

Method                                 R@1     R@5     MRR
TF-IDF (lexical baseline)              0.301   0.535   0.408
Fine-tuned MiniLM bi-encoder (ours)    0.584   0.851   0.697
+ off-the-shelf reranker               0.442   0.833   0.604

Full 1,012-SOP index, 269-incident held-out test, reproduced locally. The reranker lowered R@1 from 0.584 to 0.442 and is not used.

Figure: grouped bar chart of R@1, R@5, and MRR for TF-IDF, the fine-tuned bi-encoder, and the bi-encoder plus reranker.
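For readers who want to recompute the three columns, here is a minimal sketch of how R@k and MRR are conventionally derived from the rank of the correct SOP for each incident. It assumes exactly one gold SOP per incident, which the text suggests but does not state; the function names are illustrative, and this is not the project's evaluation script.

```python
from typing import Dict, List, Sequence


def rank_of_gold(ranked_ids: Sequence[str], gold_id: str) -> int:
    """1-based rank of the gold SOP in a ranked list, or 0 if it is absent."""
    for position, sop_id in enumerate(ranked_ids, start=1):
        if sop_id == gold_id:
            return position
    return 0


def retrieval_metrics(ranks: List[int], ks: Sequence[int] = (1, 5)) -> Dict[str, float]:
    """Recall@k and mean reciprocal rank over one rank per query (0 = miss)."""
    n = len(ranks)
    if n == 0:
        raise ValueError("need at least one query")
    metrics = {f"R@{k}": sum(1 for r in ranks if 0 < r <= k) / n for k in ks}
    # A miss contributes a reciprocal rank of 0.
    metrics["MRR"] = sum(1.0 / r for r in ranks if r > 0) / n
    return metrics


# Four toy queries: gold at rank 1, rank 2, rank 6, and not retrieved.
print(retrieval_metrics([1, 2, 6, 0]))
```

With a single gold document per query, R@1 is simply the fraction of incidents whose top-ranked SOP is the correct one, which is why it is the headline number in the table.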

03 · The skills

Controllers for kinematic skills, RL for contact skills.

The split follows from the results in section 05. For geometric tasks such as walking to a pose or servoing a gripper, a controller is used. For contact-dominated tasks such as a spring-loaded button or a latched lever, RL is used.

Pick

Controller · DLS-IK + force-gated latch

walk → pick · 11.9 cm held lift

The latch engages only when measured pad forces confirm contact on both sides, so it cannot grab across a gap. 11 of 12 held lifts in validation.
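The gating rule can be sketched in a few lines. This is an illustration under assumptions, not the project's controller: the class name, the per-pad force inputs in newtons, and the 1.0 N default threshold are all hypothetical.

```python
class ForceGatedLatch:
    """Latch that engages only when both gripper pads measure contact force."""

    def __init__(self, min_force_n: float = 1.0) -> None:
        # Hypothetical threshold; the source does not give a value.
        self.min_force_n = min_force_n
        self.engaged = False

    def update(self, left_pad_force_n: float, right_pad_force_n: float) -> bool:
        """Feed one step of measured pad forces; return the latch state."""
        both_in_contact = (left_pad_force_n >= self.min_force_n
                           and right_pad_force_n >= self.min_force_n)
        if not self.engaged and both_in_contact:
            self.engaged = True
        return self.engaged

    def release(self) -> None:
        self.engaged = False
```

Requiring force on both pads is what rules out latching across a gap: a single pad touching the box, or no contact at all, never satisfies the condition.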

Press

RL · PPO + success-gated curriculum

walk → press · 8/8 det · PUMPS=0

A learned reach of about 19 cm from the rest pose, with no IK seed. The press reaches 29 to 40 mm and holds, deterministic across 8 of 8 starts. Pressing once is enforced by ending the episode on success.

Lever (partial)

RL · BC → DAgger → PPO + entropy crush

lever · diagnostic · ~50% · jerky

The motion is jerky and reliability is about 50%. The cause is a per-episode arm-reach bias the policy cannot observe. The fix is identified but not yet retrained.

Ticket loop

ticket → retriever → faithful plan → robot

G7 · ticket → walk + press · SOP rank 1

A plain-language ticket goes through the trained 1,012-SOP retriever, the planner produces the sequence, and the robot executes it. Steps the robot cannot perform, such as wait, read, or notify, are shown as captions.

04 · The full run

One ticket, five actions.

An SOP written for this demo is retrieved at rank 1 over the full corpus (0.897) and its five-action plan runs in one take. The recording below is take 3 of 3, which is the weakest of the three. Pick and the 2.8 m carry pass. The place drifts, the press misses, and the lever springs back, and the verifier reports each of these as failed.

Finale · ticket → brain → robot · one take · verdict: partially resolved

Retrieval rank 1 (0.897); pick and carry pass. In this take the place drifted, the press reached 0.0 mm, and the lever moved from 0.636 to 0.953 rad against a success threshold below 0.25 rad. Take 1 passed pick, place, and a 33.8 mm press and failed only the lever.
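To make the take-3 verdict concrete, the sketch below applies the quoted lever threshold to the measured final angle and folds the per-step outcomes into one label. The 0.25 rad threshold, the 0.953 rad measurement, and the label "partially resolved" come from the text; the aggregation rule and the other two labels ("resolved", "unresolved") are assumptions for illustration.

```python
from typing import Dict

LEVER_SUCCESS_BELOW_RAD = 0.25  # success threshold quoted in the text


def lever_passed(final_angle_rad: float) -> bool:
    """The lever step passes only if the measured final angle is under threshold."""
    return final_angle_rad < LEVER_SUCCESS_BELOW_RAD


def verdict(step_passed: Dict[str, bool]) -> str:
    """Assumed rule: all pass, some pass, or none pass."""
    if not step_passed:
        raise ValueError("no steps to judge")
    if all(step_passed.values()):
        return "resolved"
    if any(step_passed.values()):
        return "partially resolved"
    return "unresolved"


# Take 3 as described: pick and carry pass; place, press, and lever fail.
take3 = {
    "pick": True,
    "carry": True,
    "place": False,
    "press": False,
    "lever": lever_passed(0.953),
}
print(verdict(take3))  # partially resolved
```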

05 · What failed and why

RL grasping and reward hacking.

RL grasping did not work and was replaced by a controller. Across about 10 reward variants, deterministic verified success was 0. Training metrics reported 30 to 60% “lift”, but the videos showed the box being squeezed out of the grip and moving upward, with the height check triggering while the box was airborne. The replacement is an IK plus force-gated-latch controller, which held 11 of 12 lifts. RL is used for the press and lever instead. Each of the exploits below was found by watching the rendered rollout, not from a metric.

parking

Proximity reward near the button led the policy to hover 5 cm short and collect that reward indefinitely. Fix: narrow the band from 0.20 to 0.05 m and double the distance penalty.

seat farming

The lever’s seat gives about 0.17 rad of free travel, and a linear progress reward paid the policy for sitting in it. Fix: re-baseline the reward past the seat and make progress convex.
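One way to express that fix is a reward that is zero inside the seat and grows convexly beyond it. In this sketch only the 0.17 rad seat comes from the text; `full_travel_rad`, the quadratic exponent, and the convention that `travel_rad` is measured from the lever's starting position are assumptions.

```python
def lever_progress_reward(travel_rad: float,
                          seat_rad: float = 0.17,
                          full_travel_rad: float = 0.70,
                          exponent: float = 2.0) -> float:
    """Progress reward in [0, 1], re-baselined past the seat and convex.

    travel_rad: lever travel measured from its start position (assumed convention).
    full_travel_rad: hypothetical travel at which the task counts as complete.
    """
    if full_travel_rad <= seat_rad:
        raise ValueError("full_travel_rad must exceed seat_rad")
    useful = (travel_rad - seat_rad) / (full_travel_rad - seat_rad)
    useful = min(max(useful, 0.0), 1.0)  # free travel inside the seat earns nothing
    return useful ** exponent            # exponent > 1 makes progress convex
```

With an exponent above 1, the first half of the useful travel earns less than half of the reward, so resting just past the seat stays unattractive compared with finishing the motion.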

pumping

A per-step depth reward made re-pressing profitable: press, release, press again. Fix: end the episode on success.
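A small sketch of how pumping could be counted from a depth trace, and why ending the episode at the first success removes it. The definition of a pump used here (each press after the first, counted as a rising crossing of a depth threshold) and the 29 mm threshold in the example are assumptions; the text does not define how PUMPS is computed.

```python
from typing import List, Sequence


def count_pumps(depths_mm: Sequence[float], pressed_mm: float) -> int:
    """Count re-presses: rising crossings of pressed_mm after the first one."""
    presses = 0
    was_pressed = False
    for depth in depths_mm:
        is_pressed = depth >= pressed_mm
        if is_pressed and not was_pressed:
            presses += 1
        was_pressed = is_pressed
    return max(presses - 1, 0)


def truncate_at_success(depths_mm: Sequence[float], pressed_mm: float) -> List[float]:
    """Keep the trace only up to the first step that reaches pressed_mm."""
    for i, depth in enumerate(depths_mm):
        if depth >= pressed_mm:
            return list(depths_mm[: i + 1])
    return list(depths_mm)


pumping_trace = [0.0, 30.0, 5.0, 31.0, 2.0, 33.0]
print(count_pumps(pumping_trace, 29.0))                             # 2
print(count_pumps(truncate_at_success(pumping_trace, 29.0), 29.0))  # 0
```

Once the episode ends at the first success there is no later step in which a second press could collect reward, so the count is zero by construction.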

air-swinging

With pressing-once in place, motion after the press carried no penalty, so the arm kept moving while the hold counter ran. Fix: a post-press stillness penalty and a dip-tolerant hold counter.
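The two parts of that fix might look like the sketch below. The class and function names, the step-based tolerance, and the penalty weight are hypothetical; the source states only that the hold counter tolerates dips and that motion after the press is penalized.

```python
from typing import Sequence


class DipTolerantHold:
    """Hold counter that survives short dips below the pressed threshold."""

    def __init__(self, required_steps: int, max_dip_steps: int) -> None:
        self.required_steps = required_steps
        self.max_dip_steps = max_dip_steps
        self.held_steps = 0
        self.dip_steps = 0

    def update(self, pressed: bool) -> bool:
        """Feed one step; return True once the hold requirement is met."""
        if pressed:
            self.held_steps += 1
            self.dip_steps = 0
        else:
            self.dip_steps += 1
            if self.dip_steps > self.max_dip_steps:
                self.held_steps = 0  # dip lasted too long: start the hold over
        return self.held_steps >= self.required_steps


def stillness_penalty(joint_velocities: Sequence[float],
                      pressed: bool,
                      weight: float = 0.1) -> float:
    """Non-positive reward term that penalizes arm motion while pressed."""
    if not pressed:
        return 0.0
    return -weight * sum(v * v for v in joint_velocities)
```

Dipped steps do not count toward the hold in this version; they only avoid resetting it, which keeps a brief sensor dip from erasing an otherwise steady press.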

flailing

The anti-jerk penalty was out-earned by fast motion that still bumped the button (jerk 0.83). Fix: a low-pass filter on the command, which constrains the motion directly (jerk 0.001).
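A first-order low-pass filter on the command is one simple way to constrain motion directly rather than through a penalty. The sketch below uses an exponential moving average and a third-finite-difference jerk measure; the filter form, `alpha`, and the jerk definition are assumptions, since the text does not say which filter was used or how the reported jerk values were computed.

```python
from typing import List, Sequence


def low_pass(commands: Sequence[float], alpha: float) -> List[float]:
    """Exponential moving average: y[t] = alpha * u[t] + (1 - alpha) * y[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    filtered: List[float] = []
    for u in commands:
        previous = filtered[-1] if filtered else u  # start at the first command
        filtered.append(alpha * u + (1.0 - alpha) * previous)
    return filtered


def mean_abs_jerk(signal: Sequence[float], dt: float = 1.0) -> float:
    """Mean absolute third finite difference of a sampled signal."""
    if len(signal) < 4:
        raise ValueError("need at least four samples")
    thirds = [signal[i + 3] - 3.0 * signal[i + 2] + 3.0 * signal[i + 1] - signal[i]
              for i in range(len(signal) - 3)]
    return sum(abs(t) for t in thirds) / (len(thirds) * dt ** 3)


# A command that flips every step, before and after filtering.
raw = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smooth = low_pass(raw, alpha=0.1)
print(mean_abs_jerk(raw), mean_abs_jerk(smooth))
```

The design point matches the text: a penalty can be out-earned by reward elsewhere, while a filter bounds how quickly the command can change no matter what the policy outputs.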

ballistic pop

Any instantaneous lift signal is satisfied by launching the box. Fix: removed RL; the verifier uses a sustained held-state check instead.
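The difference between an instantaneous lift signal and a sustained held-state check can be shown on two toy traces. The function names, the per-step `in_grip` flag, and the lift height and step count used below are hypothetical; the text says only that the verifier requires a sustained held state.

```python
from typing import Sequence


def instantaneous_lift(heights_m: Sequence[float], min_lift_m: float) -> bool:
    """Exploitable check: true if the box is ever above the lift height."""
    return any(h >= min_lift_m for h in heights_m)


def sustained_hold(heights_m: Sequence[float],
                   in_grip: Sequence[bool],
                   min_lift_m: float,
                   min_steps: int) -> bool:
    """True only if the box stays lifted and gripped for min_steps in a row."""
    run = 0
    for height, gripped in zip(heights_m, in_grip):
        if gripped and height >= min_lift_m:
            run += 1
            if run >= min_steps:
                return True
        else:
            run = 0
    return False


# A launched box: it clears the lift height only while airborne.
launched_h = [0.00, 0.05, 0.15, 0.20, 0.10, 0.00]
launched_g = [True, True, False, False, False, False]
# A held box: it stays above the lift height while gripped.
held_h = [0.00, 0.05, 0.12, 0.12, 0.12, 0.12]
held_g = [True] * 6

print(instantaneous_lift(launched_h, 0.10))             # True
print(sustained_hold(launched_h, launched_g, 0.10, 3))  # False
print(sustained_hold(held_h, held_g, 0.10, 3))          # True
```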


06 · Limitations

Current limitations.

Simulation only. Everything runs in MuJoCo. No hardware transfer has been attempted.
Lever is a partial skill. About 50% reliable across stances and visibly jerky. The fix, adding the arm-reach bias to the observation, is designed but not yet retrained.
The finale SOP was written for this demo. Retrieval runs over the full corpus, but that one procedure was written to use all five skills.
Single-generator corpus. The SOPs and incidents share one generation process, so real operator language will differ from the held-out test.

07 · Reference

Paper.

The full write-up covering the system, the retrieval benchmark, the RL results, and the complete list of reward hacks is in the repo. All numbers in it were measured by us.


@misc{sikka2026sequent,
  author = {Jatin Sikka},
  title  = {Sequent: SOP-Driven, Physics-Verified Loco-Manipulation on a Simulated Humanoid},
  year   = {2026},
  note   = {https://sequentrobotics.com}
}