*World Archive technical essay · June 2026*
When a frontier robotics team scopes an egocentric corpus, the procurement slide usually leads with hours of video. That metric is easy to count and hard to defend. A five-minute factory clip with no temporal labels, no hand tracks, and no object IDs is not five minutes of training signal — it is five minutes of preprocessing debt. What moves imitation-learning and VLA fine-tuning is annotation density: how many supervised decision boundaries you get per minute of footage, and how many modalities align on each boundary.
This essay argues that density — not raw duration — is the variable that separates corpora teams can fine-tune on Monday from corpora they spend a quarter converting. We ground the claim in public benchmarks, then show how the Mono India Workplace sample stacks a full label contract on top of that density.
Hours are the wrong headline metric
Industry-scale ego datasets optimize for reach. Ego4D ships thousands of hours with rich narration and benchmark tasks, but manipulation supervision is often narration-heavy and temporally coarse — excellent for video understanding, uneven as a source of short-horizon policy phases without additional segmentation work. Build AI's Egocentric-100K advertises massive video scale; public-facing material emphasizes collection throughput, not dense per-frame manipulation labels. You get volume; you do not get a ready-made IL interface.
Kitchen-centric corpora tell a different story. EPIC-Kitchens annotators, labeling dense activity in familiar environments, routinely produce on the order of 80–100 verb–noun segments per five minutes of active cooking, a high bar for temporal granularity. That is the right comparison class for manipulation: segments per minute, not terabytes on disk.
World Archive's Mono sample is small by design (nine clips, ~48 minutes), but it carries 218 human-reviewed action segments across that runtime. That is ~4.5 segments per minute corpus-wide, or roughly 23 per five minutes, with individual clips ranging from ~17 to ~29 segments per five-minute unit depending on task complexity (ironing and catering run denser than long-cycle paint prep). Median segment length is ~8 seconds: short enough that each row names a single manipulation phase ("insert shuttle tube," "press iron," "mix batter"), long enough to contain an observable hand–object state change.
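The corpus-wide figures can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in this essay (218 segments, roughly 48 minutes); the function name is illustrative and not part of any released tooling.

```python
def segments_per_minute(n_segments: int, runtime_min: float) -> float:
    """Average number of labeled segments per minute of footage."""
    if runtime_min <= 0:
        raise ValueError("runtime_min must be positive")
    return n_segments / runtime_min


# Figures quoted in the essay: 218 segments over roughly 48 minutes.
per_minute = segments_per_minute(218, 48.0)
per_five_minutes = per_minute * 5

print(f"{per_minute:.2f} segments/min, {per_five_minutes:.1f} per five minutes")
# prints "4.54 segments/min, 22.7 per five minutes"
```

The corpus-wide value of about 23 segments per five minutes sits inside the quoted per-clip range of ~17 to ~29.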
Same five minutes of video. Twenty-five verb–noun boundaries versus zero. Per-frame 21-point hand tracks and object boxes on top. The training utility is not linear in duration; it scales with labeled decision density.
What density buys you in a policy stack
Consider two onboarding paths for a VLA team:
Path A — raw factory ego video at industry scale. Sprint zero writes segmenters or pays annotators to recover phase boundaries. Hand pose is re-estimated; object tracks are bootstrapped; contact is inferred; consent and QA are unknown. Every experiment starts with a custom converter.
Path B — the same five minutes with dense labels already aligned. Each segment carries start/end seconds and a combined task string. Hand keypoints exist per frame with explicit `source: estimated_2d`. Object boxes arrive at ~1 Hz with track IDs and segment context. Hand boxes and hand–object contact are derived but versioned. Metadata records blur, manipulation density, CI pass/reject, device, and mount. Consent for commercial AI training is documented before capture.
Path B does not eliminate modeling work — no public mono corpus is sim-ready or force-labeled — but it removes the alignment tax. Losses can weight human segment boundaries differently from model boxes. Task strings on LeRobot frames come from the covering segment at each timestamp. You train instead of parse.
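Looking up the covering segment for a frame timestamp might look like the sketch below. The segment rows, the half-open `[start, end)` convention, and the `task_at` helper are assumptions made for illustration; the released conversion may handle boundaries and gaps differently.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical segment rows: (start_sec, end_sec, task string), sorted by start
# and assumed not to overlap.
SEGMENTS = [
    (0.0, 7.5, "insert shuttle tube"),
    (7.5, 16.0, "press iron"),
    (18.0, 25.0, "mix batter"),
]
STARTS = [start for start, _, _ in SEGMENTS]


def task_at(timestamp_sec: float) -> Optional[str]:
    """Return the task string of the segment covering the timestamp, if any."""
    # Index of the last segment whose start is <= timestamp.
    idx = bisect_right(STARTS, timestamp_sec) - 1
    if idx < 0:
        return None
    start, end, task = SEGMENTS[idx]
    # Treat segments as half-open intervals [start, end).
    return task if start <= timestamp_sec < end else None
```

Frames that fall in a gap between segments (17.0 s in this toy data) get `None`, so a training loop can decide explicitly whether to skip them or assign a default task.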
That is the density argument in one line: one minute of multi-layer labels beats ten minutes of unlabeled ego video for IL/VLA fine-tuning, because gradient steps need named phases, hand state, and object referents — not just pixels.
The stack exists to multiply density, not decorate it
We ship eight annotation layers on Mono. They are not a checklist; each layer adds per-minute supervisory surface:
1. Action segments (human): primary temporal density; ~8s average phase length.
2. Captions (human): clip-level intent for VLM conditioning and diligence.
3. 21-point hand keypoints (model): dense per-frame end-effector proxies.
4. Object bounding boxes (model + rules): ~1 Hz boxes with labels, track IDs, segment context.
5. Hand boxes (derived): ROI-friendly hand regions without re-deriving landmarks.
6. Hand–object contact (derived): overlap-based grasp/affordance timing, flagged as geometric proxy.
7. Metadata + QA (human + automated): filter before training; manipulation-density scores per clip.
8. Consent + delivery manifest (human legal): commercial training scope attached to the distribution.
Provenance is explicit: human versus model versus derived. Teams can audit regressions, scope losses, and report benchmarks without a "100% annotated" badge that collapses under diligence.
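One way explicit provenance can feed into training is a per-source loss weight. The sketch below is an assumption about how a team might use the human/model/derived tags; the weight values and the `weighted_loss` helper are hypothetical and not part of the dataset or any library.

```python
from typing import List, Tuple

# Hypothetical weights per label provenance; tune these for your own stack.
PROVENANCE_WEIGHTS = {"human": 1.0, "model": 0.5, "derived": 0.25}


def weighted_loss(terms: List[Tuple[str, float]]) -> float:
    """Weighted mean of (provenance, loss value) pairs."""
    total_weight = sum(PROVENANCE_WEIGHTS[source] for source, _ in terms)
    if total_weight == 0:
        raise ValueError("no loss terms with non-zero weight")
    weighted_sum = sum(PROVENANCE_WEIGHTS[source] * value for source, value in terms)
    return weighted_sum / total_weight


# A human segment-boundary loss counts twice as much as a model box loss here.
example = weighted_loss([("human", 2.0), ("model", 4.0)])
```

Keeping the weights in one table makes it easy to report which provenance mix a benchmark number was trained on.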
We do not claim metric 3D, torque sensing, or dense per-frame COCO-style object labels. Object boxes are sparse by honest design. Contact is geometric, not force-based. Density here means temporal phases plus aligned modalities, not fake fullness.
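To make "geometric, not force-based" concrete, here is a minimal sketch of how a hand box and an overlap-based contact flag could be derived from 2D keypoints and an object box. The box format, the padding, and the 10% overlap threshold are assumptions for illustration, not the rule used to produce the shipped layers.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def hand_box(keypoints: Sequence[Tuple[float, float]], pad: float = 0.0) -> Box:
    """Tight box around 2D hand keypoints, optionally padded on every side."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs) - pad, min(ys) - pad, max(xs) + pad, max(ys) + pad)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0.0 if they do not intersect)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def in_contact(hand: Box, obj: Box, min_fraction: float = 0.1) -> bool:
    """Geometric proxy: contact if enough of the hand box overlaps the object box."""
    hand_area = (hand[2] - hand[0]) * (hand[3] - hand[1])
    if hand_area <= 0:
        return False
    return overlap_area(hand, obj) / hand_area >= min_fraction
```

A flag like this says only that the boxes overlap in the image plane; it cannot distinguish a grasp from a hand passing in front of an object, which is why the layer is labeled a proxy.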
Benchmark snapshot (segments per five minutes)
| Corpus | Typical scale | Manipulation segment density (approx. segments per 5 min) | Per-frame hands / objects (public) |
|---|---|---|---|
| Ego4D | Thousands of hours | Coarse for manipulation; narration-first | Partial subsets, task-dependent |
| Build AI Egocentric-100K | 100K+ hours (video-first) | Minimal public temporal labels | Not shipped as dense IL stack |
| EPIC-Kitchens (dense kitchen) | ~100 hours | ~80–100 / 5 min during active cooking | Hands + objects in research releases |
| World Archive Mono sample | ~48 min (9 clips) | ~17–29 / 5 min per clip | Full stack on every clip |
We are not competing on hours. We are competing on labels per minute you can load without a conversion project — and on real-economy workplaces (factories, catering, auto repair) that kitchen corpora under-sample.
From density to code
Metadata and segments on the Hub:

```python
from datasets import load_dataset

# Clip-level metadata and human-reviewed action segments.
clips = load_dataset("WorldArchive/mono-india-workplace-sample", "clips", split="train")
segments = load_dataset("WorldArchive/mono-india-workplace-sample", "segments", split="train")

# Total runtime in minutes, summed from per-clip durations in seconds.
total_minutes = sum(c["duration_sec"] for c in clips) / 60
print(len(segments), "segments across", round(total_minutes, 1), "minutes")
```

Policy-oriented tensors with segment-aligned task strings:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Frames carry task strings taken from the covering segment.
ds = LeRobotDataset("WorldArchive/mono-india-workplace-lerobot")
```

Browse previews in the Dataset Viewer or the data-explorer Space.
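Once segment rows are loaded, per-clip density and median segment length can be recomputed locally. The sketch below works on plain dictionaries; the keys `clip_id`, `start_sec`, and `end_sec` are assumed names for illustration and may not match the actual column names in the `segments` config.

```python
from collections import defaultdict
from statistics import median


def density_report(segments, clip_durations_sec):
    """Per-clip segments per five minutes, plus the median segment length.

    segments: iterable of dicts with hypothetical keys clip_id, start_sec, end_sec.
    clip_durations_sec: dict mapping clip_id to a positive duration in seconds.
    """
    counts = defaultdict(int)
    lengths = []
    for seg in segments:
        counts[seg["clip_id"]] += 1
        lengths.append(seg["end_sec"] - seg["start_sec"])

    # Segments per minute for each clip, scaled to a five-minute unit.
    per_five_min = {
        clip_id: counts[clip_id] / (duration / 60.0) * 5.0
        for clip_id, duration in clip_durations_sec.items()
    }
    median_length = median(lengths) if lengths else None
    return per_five_min, median_length


# Toy rows standing in for real data.
toy_segments = [
    {"clip_id": "a", "start_sec": 0.0, "end_sec": 8.0},
    {"clip_id": "a", "start_sec": 8.0, "end_sec": 14.0},
    {"clip_id": "b", "start_sec": 0.0, "end_sec": 10.0},
]
toy_density, toy_median = density_report(toy_segments, {"a": 60.0, "b": 120.0})
```

With real data, the duration mapping could be built from the `duration_sec` field used in the snippet above, keyed by whatever identifier links clips to segments.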
Closing
Raw video hours are a supply-chain metric. Annotation density is a training metric. Frontier labs do not lack footage; they lack aligned phases, hands, objects, and consent on the minutes that matter. The Mono sample is a stress test: nine clips, 218 segments, ~8 seconds average phase length, eight layers deep — enough to tell in an afternoon whether your stack ingests real-economy manipulation or needs another quarter of labeling.
Questions, commercial training licenses, or Pro-tier capture specs: shubham@worldarchive.co · book a call.