Multimodal Data Collection for Home Robotics: A Practical Demonstration with Laundry Sorting
We present a novel approach to collecting multimodal data that integrates vision, language, and action, using gripper-mounted GoPro cameras for robotic learning.
Our methodology demonstrates a scalable, low-cost pipeline for capturing and structuring datasets that align human explanations with real-world visual and manipulation data.
Focusing on the domain of laundry sorting, we showcase how real-world data can be used to develop robust, generalizable robotic policies for household automation.
This study highlights the value of such datasets for embodied AI and addresses challenges in data alignment and scalability.
Robotic Gripper & Camera: GoPro Hero 10 mounted on a custom 3D-printed gripper.
Field of View: Wide fisheye lens ensures comprehensive visual coverage (an undistortion sketch follows this list).
Contextual Capture: Static third-person cameras to provide workspace context.
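Because the fisheye lens trades geometric fidelity for coverage, frames are typically undistorted before downstream use. Below is a minimal OpenCV sketch; the intrinsic matrix K and distortion coefficients D are placeholder values and would come from a one-time checkerboard calibration of the specific camera and lens mode.

```python
import cv2
import numpy as np

# Placeholder intrinsics; in practice these come from a one-time
# checkerboard calibration of the specific camera and lens mode.
K = np.array([[870.0, 0.0, 960.0],
              [0.0, 870.0, 540.0],
              [0.0, 0.0, 1.0]])
D = np.array([0.05, -0.01, 0.0, 0.0]).reshape(4, 1)  # assumed fisheye coeffs

def undistort_frame(frame):
    """Rectify one fisheye frame to a conventional pinhole view."""
    h, w = frame.shape[:2]
    new_K = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(
        K, D, (w, h), np.eye(3), balance=0.5)
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), new_K, (w, h), cv2.CV_16SC2)
    return cv2.remap(frame, map1, map2, interpolation=cv2.INTER_LINEAR)
```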
Data was collected across a series of laundry-sorting sessions in which items were separated into “white” and “colored” categories. Each session involved three synchronized streams (a sample session manifest is sketched after this list):
Gripper view recording: The GoPro captured first-person perspectives of the task, simulating the robot’s visual input.
Scene view recording: A third-person camera recorded the entire workspace, providing contextual information about the gripper’s movements and interactions.
Narration: Verbal explanations were provided in real-time, describing each decision (e.g., “This is a white shirt, so it goes in the white pile”).
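To make the three streams concrete, here is a minimal sketch of how one session might be indexed; the field names and paths are illustrative assumptions, not the dataset's actual layout.

```python
# Illustrative manifest for one recording session; field names and
# paths are hypothetical, not the dataset's published schema.
from dataclasses import dataclass

@dataclass
class SessionManifest:
    session_id: str
    gripper_video: str   # first-person GoPro recording
    scene_video: str     # static third-person workspace recording
    audio_track: str     # real-time verbal narration
    language: str        # ISO 639-1 code of the narration

session = SessionManifest(
    session_id="laundry_0001",
    gripper_video="sessions/laundry_0001/gripper.mp4",
    scene_video="sessions/laundry_0001/scene.mp4",
    audio_track="sessions/laundry_0001/narration.wav",
    language="en",
)
```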
Audio recordings were transcribed by human annotators and aligned precisely with video frames.
Annotations include:
Contextual explanations of sorting decisions.
Exact timestamps marking the start of each verbal comment.
Action intervals corresponding to video segments.
This structured dataset enables detailed multimodal analysis, combining visual, linguistic, and action data; a minimal annotation record is sketched below.
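The sketch below shows one way such a record could be represented and how a timestamped comment maps onto a frame interval; the names and the 30 fps figure are assumptions for illustration, not the dataset's published schema.

```python
# Illustrative annotation record; names are hypothetical.
from dataclasses import dataclass

@dataclass
class Annotation:
    text: str        # transcribed explanation of the sorting decision
    start_s: float   # timestamp (seconds) where the comment begins
    end_s: float     # end of the action interval it describes

def to_frame_span(ann: Annotation, fps: float = 30.0) -> tuple[int, int]:
    """Convert an annotation's time interval to inclusive frame indices."""
    return round(ann.start_s * fps), round(ann.end_s * fps)

ann = Annotation("This is a white shirt, so it goes in the white pile.",
                 start_s=12.4, end_s=15.1)
print(to_frame_span(ann))  # (372, 453) at 30 fps
```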
Our dataset provides comprehensive multimodal data designed specifically for training robotic systems on household tasks:
Videos: 5,000
Hours: 1,000
Instructions: 50,000
Supported languages: 20
Home robots must adapt to diverse households, and multilingual data ensures our models understand and act on real, everyday language, not just English.
Our system is designed to support multilingual training, including Arabic and Spanish, reflecting real-world diversity and accessibility in household environments; a sketch of filtering sessions by narration language follows.
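A minimal sketch of selecting sessions by narration language; the (session_id, language) records below are hypothetical examples.

```python
# Illustrative filter over per-session narration languages.
sessions = [("laundry_0001", "en"), ("laundry_0002", "ar"),
            ("laundry_0003", "es"), ("laundry_0004", "en")]

def filter_by_language(records, languages=("ar", "es")):
    """Keep session IDs whose narration language is in the given set."""
    return [sid for sid, lang in records if lang in languages]

print(filter_by_language(sessions))  # ['laundry_0002', 'laundry_0003']
```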
Selecting a Lightweight 6/7-DOF Robotic Arm for the Go1 Vision-Based Manipulation System
| Arm | DOF | Payload Capacity | Reach | Weight | Repeatability |
| --- | --- | --- | --- | --- | --- |
| Unitree Z1 | 6 | 3–5 kg (Z1 Pro model) | 740 mm | 4.3 kg | ±0.1 mm |
| Kinova Gen3 | 6 | 2 kg (full-range continuous), 4 kg (mid-range continuous) | 891 mm | 7.2 kg | ±0.1 mm |
| Franka Research 3 | 7 | 3 kg | 855 mm | ~18 kg | ±0.1 mm |
| Universal Robots UR10e | 6 | 12.5 kg | 1300 mm | 33.5 kg | ±0.05 mm |
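As a quick illustration of the selection logic, the sketch below filters the table's candidates by what a quadruped of the Go1's class can plausibly carry; the 5 kg carry budget is an assumed figure, not a published specification of the system, and payload_kg uses each arm's best-case figure from the table.

```python
# Illustrative arm-selection filter over the comparison table above.
# The 5 kg carry budget is an assumption, not a published spec.
ARMS = [
    {"name": "Unitree Z1",             "dof": 6, "payload_kg": 5.0,  "weight_kg": 4.3},
    {"name": "Kinova Gen3",            "dof": 6, "payload_kg": 4.0,  "weight_kg": 7.2},
    {"name": "Franka Research 3",      "dof": 7, "payload_kg": 3.0,  "weight_kg": 18.0},
    {"name": "Universal Robots UR10e", "dof": 6, "payload_kg": 12.5, "weight_kg": 33.5},
]

def feasible(arms, carry_budget_kg=5.0):
    """Keep arms light enough to mount on the quadruped, best payload first."""
    fits = [a for a in arms if a["weight_kg"] <= carry_budget_kg]
    return sorted(fits, key=lambda a: a["payload_kg"], reverse=True)

print([a["name"] for a in feasible(ARMS)])  # ['Unitree Z1']
```

Under the assumed budget only the Unitree Z1 passes the weight filter, which is consistent with the emphasis on a lightweight arm in the heading above.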