World Archive · June 2026

The density advantage: labels per minute that actually train policies

Why annotation density beats raw video hours for IL and VLA fine-tuning.

When a frontier robotics team scopes an egocentric corpus, the procurement slide usually leads with hours of video. That metric is easy to count and hard to defend. A five-minute factory clip with no temporal labels, no hand tracks, and no object IDs is not five minutes of training signal — it is five minutes of preprocessing debt. What moves imitation-learning and VLA fine-tuning is annotation density: how many supervised decision boundaries you get per minute of footage, and how many modalities align on each boundary.

This essay argues that density — not raw duration — is the variable that separates corpora teams can fine-tune on Monday from corpora they spend a quarter converting. We ground the claim in public benchmarks, then show how the Mono India Workplace sample stacks a full label contract on top of that density.

Hours are the wrong headline metric

Industry-scale ego datasets optimize for reach. Ego4D ships thousands of hours with rich narration and benchmark tasks, but manipulation supervision is often narration-heavy and temporally coarse — excellent for video understanding, uneven as a source of short-horizon policy phases without additional segmentation work. Build AI's Egocentric-100K advertises massive video scale; public-facing material emphasizes collection throughput, not dense per-frame manipulation labels. You get volume; you do not get a ready-made IL interface.

Kitchen-centric corpora tell a different story. In EPIC-Kitchens, annotators labeling dense activity in familiar environments routinely produce on the order of 80–100 verb–noun segments per five minutes of active cooking, a high bar for temporal granularity. That is the right comparison class for manipulation: segments per minute, not terabytes on disk.

World Archive's Mono sample is small by design (nine clips, ~48 minutes), but it carries 218 human-reviewed action segments across that runtime. That is ~4.5 segments per minute corpus-wide, or about 23 per five minutes, with individual clips ranging from ~17 to ~29 segments per five-minute unit depending on task complexity (ironing and catering run denser than long-cycle paint prep). Median segment length is ~8 seconds: short enough that each row names a single manipulation phase ("insert shuttle tube," "press iron," "mix batter"), long enough to contain observable hand–object state change.
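The density arithmetic is simple enough to check by hand, but a small helper keeps the five-minute normalization consistent when comparing clips of different lengths. The sketch below uses the corpus totals quoted above; the per-clip ids, durations, and counts are invented for illustration and are not values from the sample.

```python
from collections import Counter


def per_five_minutes(n_segments: int, duration_sec: float) -> float:
    """Normalize a segment count to a five-minute (300 s) unit."""
    return n_segments * 300.0 / duration_sec


# Corpus-wide totals quoted in the text: 218 segments over ~48 minutes.
corpus_rate = per_five_minutes(218, 48 * 60)
print(f"{218 / 48:.2f} segments/min, {corpus_rate:.1f} per five minutes")

# Hypothetical per-clip data (ids, durations, and counts are made up).
clip_durations = {"clip_a": 300.0, "clip_b": 330.0}
segment_clip_ids = ["clip_a"] * 18 + ["clip_b"] * 30

counts = Counter(segment_clip_ids)
per_clip = {
    clip_id: per_five_minutes(counts[clip_id], duration)
    for clip_id, duration in clip_durations.items()
}
print(per_clip)
```

Normalizing per clip rather than only corpus-wide matters because a single long, sparse clip can pull the average down while the dense clips still carry most of the supervision.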

Same five minutes of video. Twenty-five verb–noun boundaries versus zero. Per-frame 21-point hand tracks and object boxes on top. The training utility is not linear in duration; it scales with labeled decision density.

What density buys you in a policy stack

Consider two onboarding paths for a VLA team:

Path A — raw factory ego video at industry scale. Sprint zero writes segmenters or pays annotators to recover phase boundaries. Hand pose is re-estimated; object tracks are bootstrapped; contact is inferred; consent and QA are unknown. Every experiment starts with a custom converter.

Path B — the same five minutes with dense labels already aligned. Each segment carries start/end seconds and a combined task string. Hand keypoints exist per frame with explicit `source: estimated_2d`. Object boxes arrive at ~1 Hz with track IDs and segment context. Hand boxes and hand–object contact are derived but versioned. Metadata records blur, manipulation density, CI pass/reject, device, and mount. Consent for commercial AI training is documented before capture.

Path B does not eliminate modeling work — no public mono corpus is sim-ready or force-labeled — but it removes the alignment tax. Losses can weight human segment boundaries differently from model boxes. Task strings on LeRobot frames come from the covering segment at each timestamp. You train instead of parse.
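As an illustration of what "the covering segment at each timestamp" can mean in practice, here is a minimal sketch of a lookup that maps a frame time to a task string. It assumes segments are sorted by start time and do not overlap; the tuple layout, the `task_at` helper, the timings, and the 30 fps frame rate are all hypothetical and not the sample's actual schema or conversion code.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical segment rows: (start_sec, end_sec, task), sorted and non-overlapping.
# Task strings echo examples from the essay; the timings are invented.
SEGMENTS = [
    (0.0, 7.5, "insert shuttle tube"),
    (7.5, 16.0, "press iron"),
    (18.0, 25.0, "mix batter"),
]
STARTS = [seg[0] for seg in SEGMENTS]


def task_at(t: float) -> Optional[str]:
    """Return the task of the segment covering time t, or None if t falls in a gap."""
    i = bisect_right(STARTS, t) - 1  # last segment starting at or before t
    if i < 0:
        return None
    start, end, task = SEGMENTS[i]
    return task if start <= t < end else None


# Label every frame of a 30-second clip at an assumed 30 fps.
fps = 30
frame_tasks = [task_at(k / fps) for k in range(30 * fps)]
```

Treating segment intervals as half-open (`start <= t < end`) avoids double-labeling a frame that lands exactly on a boundary, and returning `None` in gaps lets a training loop decide whether to skip or mask unlabeled frames.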

That is the density argument in one line: one minute of multi-layer labels beats ten minutes of unlabeled ego video for IL/VLA fine-tuning, because gradient steps need named phases, hand state, and object referents — not just pixels.

The stack exists to multiply density, not decorate it

We ship eight annotation layers on Mono. They are not a checklist; each layer adds per-minute supervisory surface:

1. Action segments (human): primary temporal density; ~8s average phase length.
2. Captions (human): clip-level intent for VLM conditioning and diligence.
3. 21-point hand keypoints (model): dense per-frame end-effector proxies.
4. Object bounding boxes (model + rules): ~1 Hz boxes with labels, track IDs, segment context.
5. Hand boxes (derived): ROI-friendly hand regions without re-deriving landmarks.
6. Hand–object contact (derived): overlap-based grasp/affordance timing, flagged as geometric proxy.
7. Metadata + QA (human + automated): filter before training; manipulation-density scores per clip.
8. Consent + delivery manifest (human legal): commercial training scope attached to the distribution.
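To make the two derived layers concrete, here is a minimal sketch of how a hand box could be computed from 21 keypoints and how an overlap-based contact flag could follow from it. This is one plausible construction, not the pipeline used for the sample: the padding, the "fraction of hand box covered" criterion, and the 0.1 threshold are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def hand_box(keypoints: List[Tuple[float, float]], pad: float = 0.0) -> Box:
    """Axis-aligned box around 2D hand keypoints, optionally padded."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)


def intersection_area(a: Box, b: Box) -> float:
    """Area of overlap between two boxes; zero when they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def in_contact(hand: Box, obj: Box, min_overlap: float = 0.1) -> bool:
    """Geometric proxy: is at least min_overlap of the hand box covered by the object box?"""
    hand_area = (hand[2] - hand[0]) * (hand[3] - hand[1])
    if hand_area <= 0:
        return False
    return intersection_area(hand, obj) / hand_area >= min_overlap


# Synthetic 21-point hand and two synthetic object boxes.
keypoints = [(100.0 + i, 200.0 + 2 * i) for i in range(21)]
hb = hand_box(keypoints, pad=5.0)
touching = in_contact(hb, (110.0, 220.0, 200.0, 300.0))
far = in_contact(hb, (300.0, 300.0, 400.0, 400.0))
```

Because the flag depends only on 2D box geometry, it can fire when a hand passes in front of an object without touching it, which is why such a layer is best treated as a timing proxy rather than ground-truth contact.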

Provenance is explicit: human versus model versus derived. Teams can audit regressions, scope losses, and report benchmarks without a "100% annotated" badge that collapses under diligence.
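One way explicit provenance can be used in training is to scale each layer's loss by how much its labels are trusted. The sketch below is a framework-free illustration with plain floats; the layer names, the weight values, and the `weighted_loss` helper are hypothetical choices, not a recommendation from the dataset authors.

```python
from typing import Dict

# Hypothetical trust weights: human labels count most, derived proxies least.
PROVENANCE_WEIGHTS: Dict[str, float] = {"human": 1.0, "model": 0.5, "derived": 0.25}


def weighted_loss(per_layer_loss: Dict[str, float], provenance: Dict[str, str]) -> float:
    """Weighted mean of per-layer losses, with weights chosen by label provenance."""
    total = 0.0
    weight_sum = 0.0
    for layer, loss in per_layer_loss.items():
        w = PROVENANCE_WEIGHTS[provenance[layer]]
        total += w * loss
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Illustrative layer names and loss values.
provenance = {"segments": "human", "hand_keypoints": "model", "contact": "derived"}
losses = {"segments": 0.8, "hand_keypoints": 0.4, "contact": 1.2}
combined = weighted_loss(losses, provenance)
print(round(combined, 4))
```

The same lookup works for ablations: setting a provenance weight to zero drops every layer of that kind from the objective without touching the data loader.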

We do not claim metric 3D, torque sensing, or dense per-frame COCO-style object annotation. Object boxes are sparse by honest design. Contact is geometric, not force-based. Density here means temporal phases plus aligned modalities, not fake fullness.

Benchmark snapshot (segments per five minutes)

| Corpus | Typical scale | Manipulation segment density (order of magnitude) | Per-frame hands / objects (public) |
| --- | --- | --- | --- |
| Ego4D | Thousands of hours | Coarse for manipulation; narration-first | Partial subsets, task-dependent |
| Build AI Egocentric-100K | 100K+ hours (video-first) | Minimal public temporal labels | Not shipped as dense IL stack |
| EPIC-Kitchens (dense kitchen) | ~100 hours class | ~80–100 / 5 min in active segments | Hands + objects in research releases |
| World Archive Mono sample | ~48 min (9 clips) | ~17–29 / 5 min per clip | Full stack on every clip |

We are not competing on hours. We are competing on labels per minute you can load without a conversion project — and on real-economy workplaces (factories, catering, auto repair) that kitchen corpora under-sample.

From density to code

Metadata and segments on Hub:

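```python
from datasets import load_dataset

# Clip-level metadata and human-reviewed action segments are separate configs.
clips = load_dataset("WorldArchive/mono-india-workplace-sample", "clips", split="train")
segments = load_dataset("WorldArchive/mono-india-workplace-sample", "segments", split="train")

# Total runtime in minutes, summed from per-clip durations.
total_min = sum(c["duration_sec"] for c in clips) / 60
print(f"{len(segments)} segments across {total_min:.1f} minutes")
```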

Policy-oriented tensors with segment-aligned task strings:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("WorldArchive/mono-india-workplace-lerobot")
```

Browse previews in the Dataset Viewer or the data-explorer Space.

Closing

Raw video hours are a supply-chain metric. Annotation density is a training metric. Frontier labs do not lack footage; they lack aligned phases, hands, objects, and consent on the minutes that matter. The Mono sample is a stress test: nine clips, 218 segments, ~8 seconds average phase length, eight layers deep — enough to tell in an afternoon whether your stack ingests real-economy manipulation or needs another quarter of labeling.

Questions, commercial training licenses, or Pro-tier capture specs: shubham@worldarchive.co · book a call.