Hand-Steer Sim — Real-Time Gesture Teleoperation for Mobile Robots
Hand-Steer Sim is a webcam-driven teleoperation stack that translates hand gestures into ROS /cmd_vel commands in real time. MediaPipe landmarks feed a 1k-parameter MLP for four static actions and a 6k-parameter LSTM for steering trajectories, achieving >99% test accuracy with ~13 ms end-to-end latency on GPU. One launch or Docker run controls a Gazebo (or real) differential-drive robot—no joystick required.
GitHub Repo · Final Report (PDF)
1. Overview & Motivation
Hand-Steer Sim is a gesture-based robot teleoperation system that uses a simple webcam (or RealSense) to control differential-drive robots in real time—no joystick required.
It combines MediaPipe hand tracking with lightweight neural models to recognize both static commands (e.g., Stop, Speed Up) and steering gestures (e.g., Turn Left). Outputs are published to /cmd_vel for use in Gazebo or on real robots.
This project was developed as a solo capstone for EE417: Computer Vision at Sabancı University (Spring 2025) with the goal of making robot teleop:
- Cheaper – no special hardware
- Smarter – hybrid gesture control
- Faster – 13 ms latency on GPU
- Reproducible – Docker & training notebooks included
2. System Design
Hand-Steer Sim is structured as a modular ROS Noetic pipeline, with each major component—vision input, gesture inference, command fusion, and velocity output—implemented as an independent ROS node. This modularity allows easy deployment, testing, and integration with both simulation and real robots.
Gesture-to-Velocity Pipeline
- Camera Input – Streams 960×540 RGB frames at 30 FPS
- Landmark Extraction – MediaPipe Hands detects 21 hand keypoints per frame
- Dual-Branch Inference:
  - Static MLP → detects one-shot gestures: Stop, Holding Wheel, Speed Up, Speed Down
  - Dynamic LSTM → processes 16-frame MCP trajectories for: Turn Left, Turn Right, Forward
- Gesture Fusion – Steering gestures only apply when Holding Wheel is detected (see the sketch after this list)
- ROS Mapping – Translates gestures into geometry_msgs/Twist velocity commands
- Actuation – Commands sent to Gazebo or a real robot via /cmd_vel
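A minimal sketch of how this gated fusion could map gesture labels to a geometry_msgs/Twist, assuming the gesture topics carry plain string labels and using illustrative velocity step sizes (the real hsim_wheel2twist node may differ):

```python
# Hypothetical sketch of the gated fusion logic (not the project's actual source).
import rospy
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class GatedFusion:
    def __init__(self):
        self.linear = 0.0          # current linear velocity (m/s)
        self.angular = 0.0         # current angular velocity (rad/s)
        self.wheel_held = False    # set by the "Holding Wheel" static gesture
        self.pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/gesture/static", String, self.on_static)
        rospy.Subscriber("/gesture/dynamic", String, self.on_dynamic)

    def on_static(self, msg):
        g = msg.data
        if g == "Stop":
            self.linear, self.angular, self.wheel_held = 0.0, 0.0, False
        elif g == "Holding Wheel":
            self.wheel_held = True
        elif g == "Speed Up":
            self.linear += 0.05    # illustrative step size
        elif g == "Speed Down":
            self.linear -= 0.05
        self.publish()

    def on_dynamic(self, msg):
        if not self.wheel_held:    # gate: steering only while the wheel is held
            return
        g = msg.data
        if g == "Turn Left":
            self.angular += 0.1    # illustrative step size
        elif g == "Turn Right":
            self.angular -= 0.1
        elif g == "Forward":
            self.angular = 0.0
        self.publish()

    def publish(self):
        cmd = Twist()
        cmd.linear.x = self.linear
        cmd.angular.z = self.angular
        self.pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("gated_fusion_sketch")
    GatedFusion()
    rospy.spin()
```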

Figure – Full architecture showing ROS nodes and topic flow.
ROS Node Breakdown
Node | Role | Publishes / Subscribes |
---|---|---|
hsim_camera_pub | Streams camera frames | /image_raw |
hsim_steer_sign | Landmark detection + MLP/LSTM inference | /gesture/static, /gesture/dynamic |
hsim_wheel2twist | Gated fusion + velocity output | /cmd_vel |
gazebo_ros_control | Simulated diff-drive robot controller | /cmd_vel input |
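For orientation, a camera publisher in the spirit of hsim_camera_pub could be as simple as the following (a sketch using OpenCV and cv_bridge; the topic name and 960×540 @ 30 FPS follow the table and pipeline above, everything else is assumed):

```python
# Minimal webcam publisher sketch (assumed implementation, not the project's source).
import cv2
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

def main():
    rospy.init_node("hsim_camera_pub_sketch")
    pub = rospy.Publisher("/image_raw", Image, queue_size=1)
    bridge = CvBridge()
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 960)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 540)
    rate = rospy.Rate(30)                       # 30 FPS, matching the pipeline spec
    while not rospy.is_shutdown():
        ok, frame = cap.read()
        if ok:
            pub.publish(bridge.cv2_to_imgmsg(frame, encoding="bgr8"))
        rate.sleep()

if __name__ == "__main__":
    main()
```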
Launching the System
```bash
roslaunch hand_steer_sim sign_control.launch \
    control_mode:=steering \
    show_image:=true \
    use_gpu:=true
```
This command starts the full loop:
- Webcam or RealSense input
- Gesture recognition
- Fusion and velocity mapping
- Gazebo simulation (optional)
You can toggle between static-only and steering control via `control_mode`.
3. Models & Data
Hand-Steer Sim uses a dual-branch neural architecture to recognize two gesture types:
Static gestures from a single frame, and
Dynamic gestures from short motion trajectories.
Each branch is optimized for speed and low memory, allowing real-time deployment on CPU or GPU—even in simulation.
Neural Models
Static Gesture Classifier (MLP)
- Input: 42-D vector (21 hand landmarks × 2D, wrist-normalized and scaled)
- Classes: Stop, Holding Wheel, Speed Up, Speed Down
- Architecture: 20 → 10 → 4 neurons
- Size: ~1.1k parameters → 4.4 kB (TFLite, FP16)
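A sketch of how the 42-D input vector and the static MLP could be reconstructed; the layer sizes follow the spec above, while the MediaPipe call and the exact normalization ("wrist-normalized and scaled") are assumptions:

```python
# Sketch: landmark extraction, 42-D feature vector, and static-gesture MLP
# (assumed reconstruction, not the project's original training code).
import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)

def extract_feature(frame_bgr):
    """Run MediaPipe Hands and return a wrist-normalized, scaled 42-D vector (or None)."""
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    pts = np.array([(lm.x, lm.y) for lm in result.multi_hand_landmarks[0].landmark],
                   dtype=np.float32)             # 21 keypoints x 2D
    pts -= pts[0]                                # landmark 0 is the wrist
    scale = np.abs(pts).max() or 1.0             # scale into [-1, 1]
    return (pts / scale).flatten()               # 42-D feature vector

# ~1.1k parameters: 42 -> 20 -> 10 -> 4, matching the reported size.
static_mlp = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(42,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),  # Stop, Holding Wheel, Speed Up, Speed Down
])
static_mlp.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```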
Dynamic Steering Classifier (LSTM)
- Input: 128-D vector (16 frames × 4 MCP joints × 2D)
- Classes: Turn Left, Turn Right, Forward
- Architecture: LSTM(32) → Dense(32) → 3-class Softmax
- Size: ~6.4k parameters → 25 kB (TFLite, FP16)
Inference latency:
- GPU (RTX 4060 Ti): ~8.5 ms total (both branches)
- CPU (ThinkPad E14): ~20 ms total
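Under the same assumptions, the dynamic branch could be reconstructed as follows; the 16 × 8 input shape comes from 16 frames × (4 MCP joints × 2D), and the FP16 TFLite export is one plausible route to the reported file size:

```python
# Sketch: dynamic steering classifier (assumed Keras reconstruction of the LSTM branch).
import tensorflow as tf

# ~6.4k parameters: LSTM(32) on 8-D inputs + Dense(32) + 3-class softmax.
dynamic_lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16, 8)),               # 16 frames x (4 MCP joints x 2D)
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),     # Turn Left, Turn Right, Forward
])
dynamic_lstm.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# FP16 TFLite export along these lines would match the reported deployment format.
converter = tf.lite.TFLiteConverter.from_keras_model(dynamic_lstm)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
```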
Gesture Types
Branch | Gesture | Purpose |
---|---|---|
Static | Stop | Freeze all movement |
Static | Holding Wheel | Enable dynamic gestures |
Static | Speed Up | Increment linear velocity |
Static | Speed Down | Decrement linear velocity |
Dynamic | Turn Left | Adjust angular velocity (+) |
Dynamic | Turn Right | Adjust angular velocity (–) |
Dynamic | Forward | Maintain direction (no turn) |
Note: All dynamic outputs are gated by the Holding Wheel gesture.
Dataset Overview
Collected using a custom GUI that overlays landmarks and allows fast class labeling.
All data was captured using a RealSense D435 or webcam at 960 × 540 @ 30 FPS.
Gesture Type | Classes | Total Samples | Train / Val / Test Split |
---|---|---|---|
Static | 4 | ~4,600 | 60% / 15% / 25% |
Dynamic | 3 | ~1,700 | 60% / 15% / 25% |
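The 60% / 15% / 25% split can be reproduced with two stratified calls to scikit-learn's train_test_split, for example (a sketch; dummy arrays stand in for the recorded features):

```python
# Sketch: stratified 60/15/25 split (dummy data stands in for the recorded dataset).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(4600, 42)                    # e.g. static-gesture feature vectors
y = np.random.randint(0, 4, size=4600)          # 4 class labels

# First peel off 60% for training, then split the remaining 40% into 15% / 25%.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.60, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.625, stratify=y_rest, random_state=42)  # 0.25 / 0.40
```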

Figure – Real-time recording interface used for data collection.
4. Results & Performance
Hand-Steer Sim was benchmarked for gesture accuracy, latency, and real-time responsiveness under both CPU and GPU conditions. Both classifiers were tested on held-out test sets with no data leakage across splits.
Accuracy (Test Set)
Static MLP – 4-class classification
- Accuracy: 99.65%
- Macro-F1: 1.00
Dynamic LSTM – 3-class classification
- Accuracy: 99.77%
- Macro-F1: 1.00
Confusion Matrix — Static MLP
| True \ Pred | Stop | Hold | Up | Down |
|-------------|-----:|-----:|---:|-----:|
| Stop | 464 | 0 | 0 | 0 |
| Hold | 0 | 125 | 0 | 0 |
| Up | 0 | 0 | 279 | 0 |
| Down | 0 | 0 | 1 | 287 |
Confusion Matrix — Dynamic LSTM
| True \ Pred | Left | Right | Forward |
|-------------|-----:|------:|--------:|
| Left | 128 | 0 | 0 |
| Right | 0 | 126 | 0 |
| Forward | 0 | 1 | 174 |
Latency Benchmarks
All timing was measured on 960 × 540 input frames at 30 FPS.
Platform | MediaPipe + Inference | Full Loop (E2E) | FPS |
---|---|---|---|
GPU (RTX 4060 Ti) | 8.6 ms | 13.2 ms | 75 |
CPU (ThinkPad E14) | 20.1 ms | 25.0 ms | 39 |
CPU + Gazebo Sim | 94.1 ms | 109.6 ms | 9 |
- All gestures were recognized well within real-time limits.
- Temporal smoothing (majority voting over 16 frames) ensured stable behavior (see the sketch below).
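The smoothing step could be implemented as a sliding-window majority vote, e.g. (a sketch; only the 16-frame window size comes from the description above):

```python
# Sketch: temporal smoothing by majority vote over the last 16 predictions.
from collections import Counter, deque

class MajorityVote:
    def __init__(self, window=16):
        self.history = deque(maxlen=window)

    def update(self, label):
        """Add the newest per-frame prediction and return the smoothed label."""
        self.history.append(label)
        return Counter(self.history).most_common(1)[0][0]

smoother = MajorityVote()
for raw in ["Forward", "Turn Left", "Forward", "Forward"]:
    print(smoother.update(raw))   # single-frame flickers are suppressed
```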
Observations
- The Forward gesture was the hardest to hold continuously because of its symmetry; a dead-zone filter (sketched after this list) is proposed to improve robustness.
- Steering gestures were intuitive once the Holding Wheel condition was learned.
- Control was responsive even on CPU-only systems, especially without simulation load.
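One possible shape for the proposed dead-zone filter (a sketch; the threshold value is an assumption):

```python
# Sketch: dead-zone filter for near-symmetric steering poses (assumed approach).
def apply_dead_zone(angular_cmd, threshold=0.05):
    """Zero out small angular commands so an almost-straight hand pose reads as Forward."""
    return 0.0 if abs(angular_cmd) < threshold else angular_cmd

print(apply_dead_zone(0.02))   # -> 0.0 (treated as Forward)
print(apply_dead_zone(0.20))   # -> 0.2 (kept as a turn)
```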
5. Deployment & Future Work
Hand-Steer Sim is fully containerized and runs with one ROS launch or Docker command.
Pretrained models and configuration files are included, with reproducible training notebooks.
Run It
Native ROS (recommended):
```bash
roslaunch hand_steer_sim sign_control.launch \
    control_mode:=steering \
    use_gpu:=true \
    show_image:=false
```
Or via Docker:
```bash
docker run --rm -it --gpus all \
    -v $(pwd)/hand_steer_sim/model:/ws/src/hand_steer_sim/model \
    yunusdanabas/hand_steer_sim:gpu
```
Demo Videos
Live inference overlay with gesture predictions and FPS.
Driving demonstration in Gazebo using gestures only.
Future Work
- Ackermann steering – make wheel gestures feel more intuitive
- Two-handed control – enable lights, indicators, emergency stop
- User studies – evaluate learnability and gesture fatigue
- Attention models – explore finer finger/hand nuance detection
Hand-Steer Sim was built as a solo capstone project for EE417 — Computer Vision (Spring 2025, Sabancı University).
No license restrictions — feel free to fork, adapt, and improve.