Hand-Steer Sim — Real-Time Gesture Teleoperation for Mobile Robots
Hand-Steer Sim is a webcam-driven teleoperation stack that translates hand gestures into ROS /cmd_vel commands in real time. MediaPipe landmarks feed a 1k-parameter MLP for four static actions and a 6k-parameter LSTM for steering trajectories, achieving >99% test accuracy with ~13 ms end-to-end latency on GPU. One launch or Docker run controls a Gazebo (or real) differential-drive robot—no joystick required.
GitHub Repo · Final Report (PDF)
1. Overview & Motivation
Hand-Steer Sim is a gesture-based robot teleoperation system that uses a simple webcam (or RealSense) to control differential-drive robots in real time—no joystick required.
It combines MediaPipe hand tracking with lightweight neural models to recognize both static commands (e.g., Stop, Speed Up) and steering gestures (e.g., Turn Left). Outputs are published to /cmd_vel for use in Gazebo or on real robots.
This project was developed as a solo capstone for EE417: Computer Vision at Sabancı University (Spring 2025) with the goal of making robot teleop:
- Cheaper – no special hardware
- Smarter – hybrid gesture control
- Faster – 13 ms latency on GPU
- Reproducible – Docker & training notebooks included
2. System Design
Hand-Steer Sim is structured as a modular ROS Noetic pipeline, with each major component—vision input, gesture inference, command fusion, and velocity output—implemented as an independent ROS node. This modularity allows easy deployment, testing, and integration with both simulation and real robots.
Gesture-to-Velocity Pipeline
- Camera Input – Streams 960×540 RGB frames at 30 FPS
- Landmark Extraction – MediaPipe Hands detects 21 hand keypoints per frame
- Dual-Branch Inference:
  - Static MLP → detects one-shot gestures: Stop, Holding Wheel, Speed Up, Speed Down
  - Dynamic LSTM → processes 16-frame MCP trajectories for: Turn Left, Turn Right, Forward
- Gesture Fusion – Steering gestures only apply when Holding Wheel is detected (see the sketch after this list)
- ROS Mapping – Translates gestures into geometry_msgs/Twist velocity commands
- Actuation – Commands sent to Gazebo or a real robot via /cmd_vel
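A minimal sketch of how this gated fusion could map gesture labels to a geometry_msgs/Twist, assuming the gesture topics carry plain string labels and using illustrative velocity step sizes (the real hsim_wheel2twist node may differ):

```python
# Hypothetical sketch of the gated fusion logic (not the project's actual source).
import rospy
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class GatedFusion:
    def __init__(self):
        self.linear = 0.0          # current linear velocity (m/s)
        self.angular = 0.0         # current angular velocity (rad/s)
        self.wheel_held = False    # set by the "Holding Wheel" static gesture
        self.pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/gesture/static", String, self.on_static)
        rospy.Subscriber("/gesture/dynamic", String, self.on_dynamic)

    def on_static(self, msg):
        g = msg.data
        if g == "Stop":
            self.linear, self.angular, self.wheel_held = 0.0, 0.0, False
        elif g == "Holding Wheel":
            self.wheel_held = True
        elif g == "Speed Up":
            self.linear += 0.05    # illustrative step size
        elif g == "Speed Down":
            self.linear -= 0.05
        self.publish()

    def on_dynamic(self, msg):
        if not self.wheel_held:    # gate: steering only while the wheel is held
            return
        g = msg.data
        if g == "Turn Left":
            self.angular += 0.1    # illustrative step size
        elif g == "Turn Right":
            self.angular -= 0.1
        elif g == "Forward":
            self.angular = 0.0
        self.publish()

    def publish(self):
        cmd = Twist()
        cmd.linear.x = self.linear
        cmd.angular.z = self.angular
        self.pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("gated_fusion_sketch")
    GatedFusion()
    rospy.spin()
```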

Figure – Full architecture showing ROS nodes and topic flow.
ROS Node Breakdown
Node | Role | Publishes / Subscribes |
---|---|---|
hsim_camera_pub | Streams camera frames | /image_raw |
hsim_steer_sign | Landmark detection + MLP/LSTM inference | /gesture/static, /gesture/dynamic |
hsim_wheel2twist | Gated fusion + velocity output | /cmd_vel |
gazebo_ros_control | Simulated diff-drive robot controller | /cmd_vel input |
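For orientation, a camera publisher in the spirit of hsim_camera_pub could be as simple as the following (a sketch using OpenCV and cv_bridge; the topic name and 960×540 @ 30 FPS follow the table and pipeline above, everything else is assumed):

```python
# Minimal webcam publisher sketch (assumed implementation, not the project's source).
import cv2
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

def main():
    rospy.init_node("hsim_camera_pub_sketch")
    pub = rospy.Publisher("/image_raw", Image, queue_size=1)
    bridge = CvBridge()
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 960)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 540)
    rate = rospy.Rate(30)                       # 30 FPS, matching the pipeline spec
    while not rospy.is_shutdown():
        ok, frame = cap.read()
        if ok:
            pub.publish(bridge.cv2_to_imgmsg(frame, encoding="bgr8"))
        rate.sleep()

if __name__ == "__main__":
    main()
```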
Launching the System
```bash
roslaunch hand_steer_sim sign_control.launch \
    control_mode:=steering \
    show_image:=true \
    use_gpu:=true
```
This command starts the full loop:
- Webcam or RealSense input
- Gesture recognition
- Fusion and velocity mapping
- Gazebo simulation (optional)
You can toggle between static-only and steering control via `control_mode`.
3. Models & Data
Hand-Steer Sim uses a dual-branch neural architecture to recognize two gesture types:
Static gestures from a single frame, and
Dynamic gestures from short motion trajectories.
Each branch is optimized for speed and low memory, allowing real-time deployment on CPU or GPU—even in simulation.
Neural Models
Static Gesture Classifier (MLP)
- Input: 42-D vector (21 hand landmarks × 2D, wrist-normalized and scaled)
- Classes: Stop, Holding Wheel, Speed Up, Speed Down
- Architecture: 20 → 10 → 4 neurons
- Size: ~1.1k parameters → 4.4 kB (TFLite, FP16)
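A sketch of how the 42-D input vector and the static MLP could be reconstructed; the layer sizes follow the spec above, while the MediaPipe call and the exact normalization ("wrist-normalized and scaled") are assumptions:

```python
# Sketch: landmark extraction, 42-D feature vector, and static-gesture MLP
# (assumed reconstruction, not the project's original training code).
import cv2
import mediapipe as mp
import numpy as np
import tensorflow as tf

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)

def extract_feature(frame_bgr):
    """Run MediaPipe Hands and return a wrist-normalized, scaled 42-D vector (or None)."""
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    pts = np.array([(lm.x, lm.y) for lm in result.multi_hand_landmarks[0].landmark],
                   dtype=np.float32)             # 21 keypoints x 2D
    pts -= pts[0]                                # landmark 0 is the wrist
    scale = np.abs(pts).max() or 1.0             # scale into [-1, 1]
    return (pts / scale).flatten()               # 42-D feature vector

# ~1.1k parameters: 42 -> 20 -> 10 -> 4, matching the reported size.
static_mlp = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(42,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),  # Stop, Holding Wheel, Speed Up, Speed Down
])
static_mlp.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```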
Dynamic Steering Classifier (LSTM)
- Input: 128-D vector (16 frames × 4 MCP joints × 2D)
- Classes: Turn Left, Turn Right, Forward
- Architecture: LSTM(32) → Dense(32) → 3-class Softmax
- Size: ~6.4k parameters → 25 kB (TFLite, FP16)
Inference latency:
- GPU (RTX 4060 Ti): ~8.5 ms total (both branches)
- CPU (ThinkPad E14): ~20 ms total
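Under the same assumptions, the dynamic branch could be reconstructed as follows; the 16 × 8 input shape comes from 16 frames × (4 MCP joints × 2D), and the FP16 TFLite export is one plausible route to the reported file size:

```python
# Sketch: dynamic steering classifier (assumed Keras reconstruction of the LSTM branch).
import tensorflow as tf

# ~6.4k parameters: LSTM(32) on 8-D inputs + Dense(32) + 3-class softmax.
dynamic_lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(16, 8)),               # 16 frames x (4 MCP joints x 2D)
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),     # Turn Left, Turn Right, Forward
])
dynamic_lstm.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# FP16 TFLite export along these lines would match the reported deployment format.
converter = tf.lite.TFLiteConverter.from_keras_model(dynamic_lstm)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
```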
Gesture Types
Branch | Gesture | Purpose |
---|---|---|
Static | Stop | Freeze all movement |
Static | Holding Wheel | Enable dynamic gestures |
Static | Speed Up | Increment linear velocity |
Static | Speed Down | Decrement linear velocity |
Dynamic | Turn Left | Adjust angular velocity (+) |
Dynamic | Turn Right | Adjust angular velocity (–) |
Dynamic | Forward | Maintain direction (no turn) |
Note: All dynamic outputs are gated by the Holding Wheel gesture.
Dataset Overview
Collected using a custom GUI that overlays landmarks and allows fast class labeling.
All data was captured using a RealSense D435 or webcam at 960 × 540 @ 30 FPS.
Gesture Type | Classes | Total Samples | Train / Val / Test Split |
---|---|---|---|
Static | 4 | ~4,600 | 60% / 15% / 25% |
Dynamic | 3 | ~1,700 | 60% / 15% / 25% |
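The 60% / 15% / 25% split can be reproduced with two stratified calls to scikit-learn's train_test_split, for example (a sketch; dummy arrays stand in for the recorded features):

```python
# Sketch: stratified 60/15/25 split (dummy data stands in for the recorded dataset).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(4600, 42)                    # e.g. static-gesture feature vectors
y = np.random.randint(0, 4, size=4600)          # 4 class labels

# First peel off 60% for training, then split the remaining 40% into 15% / 25%.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.60, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.625, stratify=y_rest, random_state=42)  # 0.25 / 0.40
```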

Figure – Real-time recording interface used for data collection.
4. Results & Performance
Hand-Steer Sim was benchmarked for gesture accuracy, latency, and real-time responsiveness under both CPU and GPU conditions. Both classifiers were tested on held-out test sets with no data leakage across splits.
Accuracy (Test Set)
Static MLP – 4-class classification
- Accuracy: 99.65%
- Macro-F1: 1.00
Dynamic LSTM – 3-class classification
- Accuracy: 99.77%
- Macro-F1: 1.00
Confusion Matrix — Static MLP
| True \ Pred | Stop | Hold | Up | Down |
|-------------|-----:|-----:|---:|-----:|
| Stop | 464 | 0 | 0 | 0 |
| Hold | 0 | 125 | 0 | 0 |
| Up | 0 | 0 | 279 | 0 |
| Down | 0 | 0 | 1 | 287 |
Confusion Matrix — Dynamic LSTM
| True \ Pred | Left | Right | Forward |
|-------------|-----:|------:|--------:|
| Left | 128 | 0 | 0 |
| Right | 0 | 126 | 0 |
| Forward | 0 | 1 | 174 |
Latency Benchmarks
All timing was measured on 960 × 540 input frames at 30 FPS.
Platform | MediaPipe + Inference | Full Loop (E2E) | FPS |
---|---|---|---|
GPU (RTX 4060 Ti) | 8.6 ms | 13.2 ms | 75 |
CPU (ThinkPad E14) | 20.1 ms | 25.0 ms | 39 |
CPU + Gazebo Sim | 94.1 ms | 109.6 ms | 9 |
- All gestures were recognized well within real-time limits.
- Temporal smoothing (majority voting over 16 frames) ensured stable behavior (see the sketch below).
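The smoothing step could be implemented as a sliding-window majority vote, e.g. (a sketch; only the 16-frame window size comes from the description above):

```python
# Sketch: temporal smoothing by majority vote over the last 16 predictions.
from collections import Counter, deque

class MajorityVote:
    def __init__(self, window=16):
        self.history = deque(maxlen=window)

    def update(self, label):
        """Add the newest per-frame prediction and return the smoothed label."""
        self.history.append(label)
        return Counter(self.history).most_common(1)[0][0]

smoother = MajorityVote()
for raw in ["Forward", "Turn Left", "Forward", "Forward"]:
    print(smoother.update(raw))   # single-frame flickers are suppressed
```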
Observations
- The Forward gesture was the hardest to hold continuously because of its symmetry; a dead-zone filter (sketched after this list) is proposed to improve robustness.
- Steering gestures were intuitive once the Holding Wheel condition was learned.
- Control was responsive even on CPU-only systems, especially without simulation load.
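One possible shape for the proposed dead-zone filter (a sketch; the threshold value is an assumption):

```python
# Sketch: dead-zone filter for near-symmetric steering poses (assumed approach).
def apply_dead_zone(angular_cmd, threshold=0.05):
    """Zero out small angular commands so an almost-straight hand pose reads as Forward."""
    return 0.0 if abs(angular_cmd) < threshold else angular_cmd

print(apply_dead_zone(0.02))   # -> 0.0 (treated as Forward)
print(apply_dead_zone(0.20))   # -> 0.2 (kept as a turn)
```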
5. Deployment & Future Work
Hand-Steer Sim is fully containerized and runs with one ROS launch or Docker command.
Pretrained models and configuration files are included, with reproducible training notebooks.
Run It
Native ROS (recommended):
```bash
roslaunch hand_steer_sim sign_control.launch \
    control_mode:=steering \
    use_gpu:=true \
    show_image:=false
```
Or via Docker:
```bash
docker run --rm -it --gpus all \
    -v $(pwd)/hand_steer_sim/model:/ws/src/hand_steer_sim/model \
    yunusdanabas/hand_steer_sim:gpu
```
Demo Videos
Live inference overlay with gesture predictions and FPS.
Driving demonstration in Gazebo using gestures only.
Future Work
- Ackermann steering – make wheel gestures feel more intuitive
- Two-handed control – enable lights, indicators, emergency stop
- User studies – evaluate learnability and gesture fatigue
- Attention models – explore finer finger/hand nuance detection
Hand-Steer Sim was built as a solo capstone project for EE417 — Computer Vision (Spring 2025, Sabancı University).
No license restrictions — feel free to fork, adapt, and improve.