Hand Gesture Recognition with YOLOv8 on OAK-D in Near Real-Time

In this tutorial, you will learn to perform hand gesture recognition using YOLOv8 on the OAK-D platform.

In our previous OAK-102 series tutorial, we trained the YOLOv8 object detection model in the PyTorch framework using the Ultralytics repository. We will take it a step further by deploying the model on the OAK-D device. To perform hand gesture recognition on the OAK device, we will optimize the PyTorch model weights into the MyriadX blob file format using the Luxonis toolkit. We will then utilize the DepthAI API to run a near real-time hand gesture recognition application on the OAK-D device.

By the end of this tutorial, you will have a good understanding of the process involved in deploying an object detection model on the OAK-D platform, as well as the skills to recognize hand gestures using OAK-D’s camera with the help of the DepthAI API in Python.

This lesson is the 2nd in our 3-part series on OAK 102:

1. Training the YOLOv8 Object Detector for OAK-D
2. Hand Gesture Recognition with YOLOv8 on OAK-D in Near Real-Time (today’s tutorial)
3. OAK 102 (lesson 3)

To learn how to run a hand gesture recognition application on OAK-D, just keep reading.

Looking for the source code to this post?

Jump Right To The Downloads Section

Hand Gesture Recognition with YOLOv8 on OAK-D in Near Real-Time

Introduction

Hand gesture recognition is an emerging field in computer vision focusing on identifying and interpreting human hand gestures using computer vision and deep learning. It has many applications like sign language recognition, human-computer interaction, virtual reality, and gaming.

Computer vision techniques like object detection are used to analyze and interpret images or videos of human movements to recognize hand gestures. However, hand gesture recognition often requires temporal information, which means analyzing how the hand moves over time so that the algorithm can accurately predict the gesture being performed.

For instance, if you want to control the volume of your car’s infotainment system through hand gesture recognition, the algorithm must distinguish between volume up and volume down hand gestures. The deep learning model would require temporal information instead of just spatial information to understand the volume up and volume down hand gestures.

Figure 1 shows sample images from the hand gesture recognition dataset with ground-truth bounding boxes annotated in red, belonging to the classes four, five, two, and three. In today’s tutorial, we will learn to recognize these five hand gestures (one through five) using the OAK-D platform.

Figure 1: Sample images from the Hand Gesture Recognition Dataset with ground-truth annotations.

If you’re interested in the capabilities of an OAK device and the computer vision applications it can run, check out the Introduction to OpenCV AI Kit (OAK) tutorial on PyImageSearch. Specifically, you should look at the Applications on OAK section of the tutorial.

If you followed our previous tutorial in the OAK-102 series, you will be familiar with training a state-of-the-art YOLOv8 object detector in PyTorch with the Ultralytics repository, which can recognize hand gestures like one, two, and three. The YOLOv8 model was pretrained on the MS COCO dataset and fine-tuned on only 587 hand gesture images. Despite the small number of training images, the model performed surprisingly well, achieving 0.824 mAP@0.5 IoU with the Nano variant and 0.887 mAP@0.5 IoU with the Small variant.

This tutorial will take it further and show you how to deploy the hand gesture recognition model onto the OAK-D device. To accomplish this, we will first use the Luxonis toolkit to convert and optimize the PyTorch hand gesture recognition model into the MyriadX blob file format. Next, we will run the converted model on the OAK-D device using both still images and the camera stream.

How would you like immediate access to 3,457 images curated and labeled with hand gestures to train, explore, and experiment with … for free? Head over to Roboflow and get a free account to grab these hand gesture images.

Without further ado, let’s dive into today’s tutorial and learn to deploy and run a hand gesture recognition application on the OAK device.

Configuring Your Development Environment

To follow this guide, you need to have the depthai, opencv, and imutils libraries installed on your system.

Luckily, all these libraries are pip-installable:

$ pip install depthai
$ pip install opencv-python
$ pip install imutils

If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in minutes.
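Once the packages are installed, you can run a quick sanity check like the one below (a minimal sketch; it only confirms the imports succeed and prints the OpenCV and DepthAI versions):

# quick sanity check: confirm the libraries import and report their versions
import cv2
import depthai as dai
import imutils  # imported only to verify the installation

print("OpenCV version:", cv2.__version__)
print("DepthAI version:", dai.__version__)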

Need Help Configuring Your Development Environment?

Figure 2: Need help configuring your dev environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join PyImageSearch University — you’ll be up and running with this tutorial in minutes.

All that said, are you:

Short on time?
Learning on your employer’s administratively locked system?
Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
Ready to run the code on your Windows, macOS, or Linux system now?

Then join PyImageSearch University today!

Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.

And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!

Project Structure

We first need to review our project directory structure.

Start by accessing this tutorial’s “Downloads” section to retrieve the source code and example images.

From there, take a look at the directory structure:

$ tree .
.
├── gesture_recognition_model
│   ├── yolov8n
│   │   ├── yolov8ntrained-simplified.onnx
│   │   ├── yolov8ntrained.bin
│   │   ├── yolov8ntrained.json
│   │   ├── yolov8ntrained.xml
│   │   └── yolov8ntrained_gesture_recog.blob
│   └── yolov8s
│       ├── yolov8strained-simplified.onnx
│       ├── yolov8strained.bin
│       ├── yolov8strained.json
│       ├── yolov8strained.xml
│       └── yolov8strained_gesture_recog.blob
├── pyimagesearch
│   ├── __init__.py
│   ├── config.py
│   └── utils.py
├── recognize_camera.py
├── recognize_images.py
├── results
│   ├── gesture_camera_v8n.mp4
│   ├── gesture_camera_v8s.mp4
│   ├── gesture_pred_images_v8n
│   │   ├── five.jpg
│   │   ├── five_2.jpg
│   │   ├── four.jpg
│   │   ├── one.jpg
│   │   ├── three.jpg
│   │   └── two.jpg
│   └── gesture_pred_images_v8s
│       ├── five.jpg
│       ├── five_2.jpg
│       ├── four.jpg
│       ├── one.jpg
│       ├── three.jpg
│       └── two.jpg
├── test_data
│   ├── five.jpg
│   ├── five_2.jpg
│   ├── four.jpg
│   ├── one.jpg
│   ├── three.jpg
│   └── two.jpg
└── test_images

9 directories, 35 files

In the pyimagesearch directory, we have the following files:

config.py: The configuration file for the task
utils.py: The utilities for running the hand gesture recognition on OAK (e.g., creating image and camera pipelines and a few other helper functions)

In the core directory, we have the following:

gesture_recognition_model: Houses the trained hand gesture recognition model files converted to the blob format (.blob) required by the OAK hardware, for both YOLOv8n and YOLOv8s
test_data: Contains a few hand gesture images from the test set, which the recognize_images.py script will use
recognize_images.py: The inference script to leverage OAK’s neural accelerator for recognizing hand gestures in images
recognize_camera.py: The inference script to run hand gesture recognition with OAK’s color camera

In the results directory, we have:

gesture_pred_images_v8n: Hosts the prediction results on the test_data images with the YOLOv8n model
gesture_pred_images_v8s: Hosts the prediction results on the test_data images with the YOLOv8s model
gesture_camera_v8n.mp4: The prediction output file when inference is performed with OAK’s 4K color camera leveraging the YOLOv8n model
gesture_camera_v8s.mp4: The prediction output file when inference is performed with OAK’s 4K color camera leveraging the YOLOv8s model

YOLOv8 Model Export to OAK

In this section, we will discuss the steps required to convert the YOLOv8n and YOLOv8s hand gesture recognition models (in the PyTorch framework) to the MyriadX blob file format.

For a better understanding of this and the following sections of this tutorial, we highly recommend you check out our previous tutorial on Training the YOLOv8 Object Detector for OAK-D.

By now, you already know that the OAK device is not directly compatible with the popular deep learning frameworks, necessitating the conversion of the model into the MyriadX blob file format, which the device supports.

The underlying reason for this specific model format requirement, and the incompatibility of the widely used deep learning frameworks with the OAK device, is the vision processing unit (VPU) inside the hardware. This unit is built on Intel’s Myriad X processor, which mandates the blob file format for models.

To export a YOLOv8 model for deployment on an OAK device, you need to follow several steps to convert the model into the MyriadX blob file format. Here’s a step-by-step guide:

1. Model Conversion: First, you need to convert the YOLOv8 model from its native format (typically PyTorch or TensorFlow) to an intermediate format called ONNX (Open Neural Network Exchange).
2. Optimization: To ensure optimal performance on the OAK device, you need to optimize the ONNX model using OpenVINO’s Model Optimizer. This step will generate an Intermediate Representation (IR) consisting of two files: an XML file containing the model architecture and a BIN file containing the model weights.
3. Blob Conversion: Use OpenVINO’s compile_tool to convert the optimized IR model to the MyriadX blob file format.

Luckily, Luxonis offers a convenient tool that streamlines converting YOLOv5-v8 PyTorch models to the MyriadX blob file format. For example, in this tutorial, we aim to deploy a YOLOv8 hand gesture recognition model developed in the PyTorch framework on an OAK device for recognizing hand gestures. Hence, this tool is valuable in assisting us with the YOLOv8 PyTorch model conversion to the MyriadX blob file format.

The following steps will guide you through generating MyriadX blob files for YOLOv8n and YOLOv8s models:

1. Assuming you’ve completed the previous tutorial, you should have the fine-tuned YOLOv8n and YOLOv8s PyTorch model weights (trained on the hand gesture recognition dataset) saved on your local machine.
2. Navigate to the Automatic Yolo exports for OAKs page.
3. As demonstrated in the GIF, set the Yolo Version to “YoloV8 (detection only).”
4. In the File section, upload the yolov8ntrained.pt weights.
5. Set the Input shape to 416.
6. Repeat the above steps for the YOLOv8s model, but this time upload the yolov8strained.pt weights in the File parameter.

Please click the image below to follow the steps to generate the MyriadX blob files for both the YOLOv8n and YOLOv8s models.
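If you prefer to script the conversion rather than use the web tool, the sketch below shows one possible programmatic route. It assumes the ultralytics and blobconverter pip packages are installed and that the fine-tuned weights file is named yolov8ntrained.pt; note that, unlike the web tool, this route does not produce the .json configuration file our DepthAI pipeline reads, so treat it as an optional alternative rather than the exact workflow used in this tutorial.

# optional programmatic route (assumes the ultralytics and blobconverter packages)
import blobconverter
from ultralytics import YOLO

# 1. export the fine-tuned PyTorch weights to ONNX at the 416x416 input size
onnx_path = YOLO("yolov8ntrained.pt").export(format="onnx", imgsz=416, simplify=True)

# 2. convert the ONNX model to a MyriadX blob (OpenVINO optimization and
#    compilation are handled by the Luxonis conversion service)
blob_path = blobconverter.from_onnx(
    model=onnx_path,
    data_type="FP16",  # the MyriadX VPU runs FP16 inference
    shaves=6,          # number of SHAVE cores to compile the blob for
)
print(blob_path)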

Deploy on OAK device: Having generated the blob file, we can proceed to the following section. We will deploy the YOLOv8 model on an OAK device (either OAK-1 or OAK-D) utilizing the DepthAI library.

Configuring the Prerequisites

Before we start our implementation, let’s review our project’s configuration. For that, we will move on to the config.py script located in the pyimagesearch directory.

The config.py script sets up the necessary paths for the YOLOv8n and YOLOv8s models, their configurations, test data, and output directories for the resulting images and videos. It also defines the camera preview dimensions and label names for the hand gestures.

# import the necessary packages
import os
import glob

# define path to the model, test data directory and results
YOLOV8N_MODEL = os.path.join(
    "gesture_recognition_model", "yolov8n", "yolov8ntrained_gesture_recog.blob"
)
YOLOV8N_CONFIG = os.path.join(
    "gesture_recognition_model", "yolov8n", "yolov8ntrained.json"
)

YOLOV8S_MODEL = os.path.join(
    "gesture_recognition_model", "yolov8s", "yolov8strained_gesture_recog.blob"
)
YOLOV8S_CONFIG = os.path.join(
    "gesture_recognition_model", "yolov8s", "yolov8strained.json"
)

TEST_DATA = glob.glob("test_data/*.jpg")
OUTPUT_IMAGES_YOLOv8n = os.path.join("results", "gesture_pred_images_v8n")
OUTPUT_IMAGES_YOLOv8s = os.path.join("results", "gesture_pred_images_v8s")
OUTPUT_VIDEO_YOLOv8n = os.path.join("results", "gesture_camera_v8n.mp4")
OUTPUT_VIDEO_YOLOv8s = os.path.join("results", "gesture_camera_v8s.mp4")

# define camera preview dimensions same as YOLOv8 model input size
CAMERA_PREV_DIM = (416, 416)

# define the class label names list
LABELS = ["Five", "Four", "One", "Three", "Two"]

On Lines 2 and 3, we import the os and glob modules. Then, from Lines 6-19, we define the following:

path to the YOLOv8n and YOLOv8s hand gesture recognition models
test data directory
output locations for saving inference results of YOLOv8n and YOLOv8s on images and videos

From Lines 28-31, we also define the camera preview dimension and a list of class label names to help decode class predictions to human-readable class names.

Defining the Utilities

Now that the configuration has been defined, we can define the utilities for creating the image and camera DepthAI pipelines, along with a few helper functions. The utils.py script defines:

a pipeline for hand gesture recognition on OAK with images
a pipeline for hand gesture recognition on OAK with the color camera stream
a few helper functions for loading configuration files, annotating camera frames, and normalizing the predictions

Creating the Images Pipeline

# import the necessary packages
from pyimagesearch import config
import json
import numpy as np
import cv2
from pathlib import Path
import depthai as dai

def create_image_pipeline(config_path, model_path):
    # initialize a depthai pipeline
    pipeline = dai.Pipeline()

    # load model config file and fetch nn_config parameters
    print("[INFO] loading model config...")
    configPath = Path(config_path)
    model_config = load_config(configPath)
    nnConfig = model_config.get("nn_config", {})

On Lines 2-7, we import the necessary packages like config from the pyimagesearch module, json for loading model configuration files, cv2 for image-related operations, and most importantly, the depthai module along with numpy and pathlib’s Path class.

We define the function create_image_pipeline() that takes config_path and model_path as input parameters on Line 9. Then, a depthai pipeline is initialized on the host, which helps define the nodes, the flow of data, and communication between the nodes (Line 11).

On Lines 15 and 16, we load the model configuration file using the provided config_path and extract the nn_config object from the model configuration on Line 17.

    print("[INFO] extracting metadata from model config...")
    # using nnConfig extract metadata like classes,
    # iou and confidence threshold, number of coordinates
    metadata = nnConfig.get("NN_specific_metadata", {})
    classes = metadata.get("classes", {})
    coordinates = metadata.get("coordinates", {})
    anchors = metadata.get("anchors", {})
    anchorMasks = metadata.get("anchor_masks", {})
    iouThreshold = metadata.get("iou_threshold", {})
    confidenceThreshold = metadata.get("confidence_threshold", {})

On Line 22, we extract the relevant metadata NN_specific_metadata using the nnConfig variable. With the help of metadata, we extract information like classes, coordinates, anchors, iouThreshold, and confidenceThreshold. We will pass this information to the detection network node later.

    print("[INFO] configuring inputs and output...")
    # configure inputs for depthai pipeline
    # since this pipeline is dealing with images an XLinkIn node is created
    detectionIN = pipeline.createXLinkIn()
    # create a Yolo detection node
    detectionNetwork = pipeline.create(dai.node.YoloDetectionNetwork)
    # create a XLinkOut node for fetching the neural network outputs to host
    nnOut = pipeline.create(dai.node.XLinkOut)

    print("[INFO] setting stream names for queues...")
    # set stream names used in queue to fetch data when the pipeline is started
    nnOut.setStreamName("nn")
    detectionIN.setStreamName("detection_in")

Next, we create inputs for the depthai pipeline. First, we create an XLinkIn() node since this pipeline deals with images on Line 33. Then, a YoloDetectionNetwork node is created for performing detection and an XLinkOut node for sending the neural network output to the host on Lines 35-37.

On Lines 41 and 42, we set the output stream names, which would later be used in the queue to fetch image and neural network data when the pipeline is started.

    print("[INFO] setting YOLO network properties...")
    # network specific settings - parameters read from config file
    # confidence and iou threshold, classes, coordinates are set
    # most important the model .blob file is used to load weights
    detectionNetwork.setConfidenceThreshold(confidenceThreshold)
    detectionNetwork.setNumClasses(classes)
    detectionNetwork.setCoordinateSize(coordinates)
    detectionNetwork.setAnchors(anchors)
    detectionNetwork.setAnchorMasks(anchorMasks)
    detectionNetwork.setIouThreshold(iouThreshold)
    detectionNetwork.setBlobPath(model_path)
    detectionNetwork.setNumInferenceThreads(2)
    detectionNetwork.input.setBlocking(False)

With the detectionNetwork node defined, we set the YOLO network properties based on the metadata extracted from the model configuration file from Lines 48-56. One important thing to note is that we load the model weights (i.e., setBlobPath) on Line 54 by passing the model_path as a parameter for the YOLO model.

    print("[INFO] creating links...")
    # linking the nodes - image node output is linked to detection node
    # detection network node output is linked to XLinkOut input
    detectionIN.out.link(detectionNetwork.input)
    detectionNetwork.out.link(nnOut.input)

    # return the pipeline to the calling function
    return pipeline

On Lines 61 and 62, we link the detectionIN ( XLinkIn node) output to the YOLO detection network node (detectionNetwork) input and the YOLO detection network node output to the nnOut (XLinkOut node) input.

Finally, the pipeline object is returned to the calling function on Line 65.

Creating the Camera Pipeline

def create_camera_pipeline(config_path, model_path):
    # initialize a depthai pipeline
    pipeline = dai.Pipeline()

    # load model config file and fetch nn_config parameters
    print("[INFO] loading model config...")
    configPath = Path(config_path)
    model_config = load_config(configPath)
    nnConfig = model_config.get("nn_config", {})

    print("[INFO] extracting metadata from model config...")
    # using nnConfig extract metadata like classes,
    # iou and confidence threshold, number of coordinates
    metadata = nnConfig.get("NN_specific_metadata", {})
    classes = metadata.get("classes", {})
    coordinates = metadata.get("coordinates", {})
    anchors = metadata.get("anchors", {})
    anchorMasks = metadata.get("anchor_masks", {})
    iouThreshold = metadata.get("iou_threshold", {})
    confidenceThreshold = metadata.get("confidence_threshold", {})

    # output of metadata - feel free to tweak the threshold parameters
    # {'classes': 5, 'coordinates': 4, 'anchors': [], 'anchor_masks': {},
    #  'iou_threshold': 0.5, 'confidence_threshold': 0.5}
    print(metadata)

We define the function create_camera_pipeline() that takes config_path and model_path as input parameters on Line 67.

The rest of the code from Lines 69-86 is the same as we defined in the create_image_pipeline() function, so we skip the explanation here. But generally, we initialize a pipeline, load the configuration file, and extract metadata from the configuration file.

    print("[INFO] configuring source and outputs...")
    # define sources and outputs
    # since OAK's camera is used in this pipeline
    # a color camera node is defined
    camRgb = pipeline.create(dai.node.ColorCamera)
    # create a Yolo detection node
    detectionNetwork = pipeline.create(dai.node.YoloDetectionNetwork)
    xoutRgb = pipeline.create(dai.node.XLinkOut)
    # create a XLinkOut node for getting the detection results to host
    nnOut = pipeline.create(dai.node.XLinkOut)

    print("[INFO] setting stream names for queues...")
    # set stream names used in queue to fetch data when the pipeline is started
    xoutRgb.setStreamName("rgb")
    nnOut.setStreamName("nn")

We now establish input and output nodes for the camera pipeline, which differs from the previously defined image pipeline. The input source is an OAK device camera rather than images in this case.

On Lines 97-102, we create the following nodes:

ColorCamera: provides the color frames as input to the detection network
YoloDetectionNetwork: performs the inference
XLinkOut (two of them): xoutRgb for sending the RGB frames and nnOut for sending the detection results to the host computer

Then on Lines 106 and 107, we set the stream names of the respective xoutRgb and nnOut nodes to rgb and nn, which will later be passed as parameters to the OutputQueue for fetching the color frames and detections.

    print("[INFO] setting camera properties...")
    # setting camera properties like the output preview size,
    # camera resolution, color channel ordering and FPS
    camRgb.setPreviewSize(config.CAMERA_PREV_DIM)
    camRgb.setResolution(dai.ColorCameraProperties.SensorResolution.THE_1080_P)
    camRgb.setInterleaved(False)
    camRgb.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)
    camRgb.setFps(40)

On Lines 112-116, we configure the camRgb (ColorCamera node) properties, such as setting the preview window resolution, selecting 1080P as the camera resolution, defining the color channel order as BGR, and establishing a camera frame rate of 40 FPS.

    print("[INFO] setting YOLO network properties...")
    # network specific settings - parameters read from config file
    # confidence and iou threshold, classes, coordinates are set
    # most important the model .blob file is used to load weights
    detectionNetwork.setConfidenceThreshold(confidenceThreshold)
    detectionNetwork.setNumClasses(classes)
    detectionNetwork.setCoordinateSize(coordinates)
    detectionNetwork.setAnchors(anchors)
    detectionNetwork.setAnchorMasks(anchorMasks)
    detectionNetwork.setIouThreshold(iouThreshold)
    detectionNetwork.setBlobPath(model_path)
    detectionNetwork.setNumInferenceThreads(2)
    detectionNetwork.input.setBlocking(False)

    print("[INFO] creating links...")
    # linking the nodes - camera stream output is linked to detection node
    # RGB frame is passed through detection node linked with XLinkOut
    # used for annotating the frame with detection output
    # detection network node output is linked to XLinkOut input
    camRgb.preview.link(detectionNetwork.input)
    detectionNetwork.passthrough.link(xoutRgb.input)
    detectionNetwork.out.link(nnOut.input)

    # return the pipeline to the calling function
    return pipeline

We can omit the explanation of Lines 122-130 since they are identical to the steps we performed while constructing the image pipeline.

Lines 137-139 create connections between the nodes in the DepthAI pipeline:

camRgb.preview.link(detectionNetwork.input): This line connects the output of the camRgb.preview node (camera RGB preview) to the input of the detectionNetwork node (YoloDetectionNetwork). This connection provides the camera’s RGB preview data as input to the object detection network.
detectionNetwork.passthrough.link(xoutRgb.input): This line links the passthrough output of the detectionNetwork node to the input of the xoutRgb node (XLinkOut). The passthrough output provides the original input image (RGB preview) to be sent to the host along with the object detection results. This passthrough connection also helps maintain the synchronization between the RGB frames and the detection results.
detectionNetwork.out.link(nnOut.input): This line connects the output of the detectionNetwork node (YoloDetectionNetwork) to the input of the nnOut node (XLinkOut). This node forwards the object detection results to the host for further processing or display.

The above connections establish a pipeline where the camera RGB preview data is input for the YOLO object detection network. The original input image and the detection results are sent to the host.

Finally, the pipeline object is returned to the calling function on Line 142.

Defining a Few Helper Functions

def load_config(config_path):
    # open the config file and load using json module
    with config_path.open() as f:
        config = json.load(f)
    return config

def annotateFrame(frame, detections, model_name):
    # loops over all detections in a given frame
    # annotates the frame with model name, class label,
    # confidence score, and draw bounding box on the object
    color = (0, 0, 255)
    for detection in detections:
        bbox = frameNorm(frame, (detection.xmin, detection.ymin, detection.xmax, detection.ymax))
        cv2.putText(frame, model_name, (20, 40), cv2.FONT_HERSHEY_TRIPLEX, 1,
                    color)
        cv2.putText(frame, config.LABELS[detection.label], (bbox[0] + 10, bbox[1] + 25),
                    cv2.FONT_HERSHEY_TRIPLEX, 1, color)
        cv2.putText(frame, f"{int(detection.confidence * 100)}%", (bbox[0] + 10, bbox[1] + 60),
                    cv2.FONT_HERSHEY_TRIPLEX, 1, color)
        cv2.rectangle(frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), color, 2)
    return frame

On Lines 145-149, we define the load_config() function that takes config_path as an argument, opens the configuration file using context managers, and reads its contents. The file’s contents are loaded as a JSON object using the json module and returned to the calling function.

Next, on Lines 151-165, the annotateFrame() method is defined, which takes the following three arguments as an input:

frame: The input image/frame.
detections: A list of detected objects in the frame provided by the YOLOv8 object detection network.
model_name: The name of the model used for object detection (YOLOv8n/YOLOv8s).

Then the function iterates over each detection in the detections list on Line 156. For each detection, it does the following:

normalizes the bounding box coordinates using the frameNorm function (discussed later in this section)
annotates the frame with the model_name in the top-left corner
annotates the frame with the class label of the detected object using the corresponding label from the config.LABELS list
annotates the frame with the detected object’s confidence score (percentage)
draws a bounding box around the detected object using the cv2.rectangle function

Finally, on Line 165, the annotateFrame function returns the annotated frame after adding the model name, class labels, confidence scores, and bounding boxes for each detected object. This annotated frame can be displayed, saved, or used for further processing.

def to_planar(arr: np.ndarray, shape: tuple) -> np.ndarray:
    # resize the image array and modify the channel dimensions
    resized = cv2.resize(arr, shape)
    return resized.transpose(2, 0, 1)

def frameNorm(frame, bbox):
    # nn data, being the bounding box locations, are in <0..1> range
    # normalize them with frame width/height
    normVals = np.full(len(bbox), frame.shape[0])
    normVals[::2] = frame.shape[1]
    return (np.clip(np.array(bbox), 0, 1) * normVals).astype(int)

On Lines 168-171, we define the to_planar() method; it accepts two parameters as input:

arr: a numpy array representing the input image
shape: a tuple representing the desired shape for the output image

The purpose of the to_planar() method is to resize a given numpy array (image) to the desired shape and modify the channel dimensions using the transpose function.

On Line 173, the frameNorm() function accepts the following input parameters:

frame: a numpy array representing the input frame/image
bbox: a list or numpy array containing the bounding box locations in the normalized range from 0 to 1

The function normalizes the bounding box locations with respect to the frame dimensions. On Lines 176-178:

A normVals numpy array is initialized with the same length as the input bounding box array and is filled with the frame’s height. Then, every alternate value in normVals (the x-coordinates) is set to the frame’s width.
The input bounding box values are clipped to the range [0, 1] and multiplied by normVals to scale the bounding box coordinates to the frame dimensions.
Finally, the scaled bounding box coordinates are converted to integers using the astype(int) method.
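To make this concrete, here is a small standalone illustration (hypothetical values, not part of the project code) of what to_planar() and frameNorm() produce for a 640x416 (width x height) frame:

# standalone illustration of the two helpers (hypothetical values)
import cv2
import numpy as np

frame = np.zeros((416, 640, 3), dtype=np.uint8)  # dummy BGR frame, 640 wide x 416 tall

# to_planar: resize to the model input size and reorder HWC -> CHW
planar = cv2.resize(frame, (416, 416)).transpose(2, 0, 1)
print(planar.shape)  # (3, 416, 416)

# frameNorm: scale normalized (xmin, ymin, xmax, ymax) to pixel coordinates
bbox = (0.25, 0.1, 0.75, 0.9)
normVals = np.full(len(bbox), frame.shape[0])  # start with the height (416)
normVals[::2] = frame.shape[1]                 # x-coordinates use the width (640)
print((np.clip(np.array(bbox), 0, 1) * normVals).astype(int))  # [160  41 480 374]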

Hand Gesture Recognition in Images

With the configurations and utilities implemented, we can finally get into the code walkthrough of recognizing hand gestures in images with OAK-D.

# import the necessary packages
from pyimagesearch import config
from pyimagesearch import utils
import argparse
import cv2
import depthai as dai

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", help="Provide model name for inference",
                    default="yolov8n", type=str)
args = parser.parse_args()

We start by importing the necessary packages on Lines 6-10:

config module from the pyimagesearch package, which contains configuration settings or constants
utils from the pyimagesearch package, which provides utility functions for creating image and camera pipelines
argparse for parsing command line arguments
cv2 for performing image and video processing tasks
depthai module for creating DepthAI pipelines for OAK-D

Then, on Lines 13-16, we create a command line argument, -m or --model, with which users can provide the model name for hand gesture recognition (either YOLOv8n or YOLOv8s).

# initialize a depthai images pipeline
print("[INFO] initializing a depthai images pipeline...")
model_name = args.model
# if the model for inference is nano variant create image pipeline
# with nano variant config file and model weights
if model_name.lower() == "yolov8n":
    pipeline = utils.create_image_pipeline(config_path=config.YOLOV8N_CONFIG,
                                           model_path=config.YOLOV8N_MODEL)
    output_image_path = config.OUTPUT_IMAGES_YOLOv8n

# if the model for inference is small variant create image pipeline
# with small variant config file and model weights
else:
    pipeline = utils.create_image_pipeline(config_path=config.YOLOV8S_CONFIG,
                                           model_path=config.YOLOV8S_MODEL)
    output_image_path = config.OUTPUT_IMAGES_YOLOv8s

On Line 21, we assign the model name provided via command line arguments to the model_name variable.

Then from Lines 24-34:

If the model name equals yolov8n, we create an image pipeline using the utility function utils.create_image_pipeline with the YOLOv8 Nano variant configuration file and model weights (specified in the config module). Also, via the config module, we set the output_image_path variable to the appropriate output directory for the YOLOv8 Nano variant.
If the model name is not yolov8n, we perform the same steps as above, but this time we initialize the image pipeline with the YOLOv8 Small variant configuration and model weights file.

In summary, the code initializes an image pipeline using the DepthAI library. Then, it sets up the appropriate configuration, model weights, and output image path based on the model name provided by the user.

# pipeline defined, now the device is assigned and pipeline is started
with dai.Device(pipeline) as device:
    # define the queues that will be used in order to communicate with
    # depthai and then send our input image for predictions
    detectionIN = device.getInputQueue("detection_in")
    detectionNN = device.getOutputQueue("nn")

    print("[INFO] loading image from disk...")
    for img_path in config.TEST_DATA:
        # load the input image and then resize it
        image = cv2.imread(img_path)
        image_res = cv2.resize(image, config.CAMERA_PREV_DIM)
        # create a copy of image for inference
        image_copy = image.copy()
        # initialize depthai NNData() class which is fed with the
        # image data resized and transposed to model input shape
        nn_data = dai.NNData()
        nn_data.setLayer(
            "input",
            utils.to_planar(image_copy, config.CAMERA_PREV_DIM)
        )
        # send the image to detectionIN queue further passed
        # to the detection network for inference as defined in pipeline
        detectionIN.send(nn_data)

Next, on Lines 37-59, we set up the DepthAI device and process a list of test images for recognizing hand gestures using the previously initialized pipeline. The input images are resized and preprocessed before being fed to the YOLO network for inference. The processed images are then sent to the DepthAI device for hand gesture recognition.

On Line 37, we create a context for the DepthAI device using the with statement, which ensures that the device is properly closed after use. The device is initialized with the previously created pipeline.

Two queues for communicating with the OAK device are defined on Lines 40 and 41:

detectionIN: input queue for sending images to the device for detection
detectionNN: output queue for receiving the detection results from the device

On Lines 44-59, we iterate over the test images specified in the config.TEST_DATA variable. For each test image, we

load the image using the cv2.imread function
resize the loaded image to the dimensions specified in config.CAMERA_PREV_DIM using the cv2.resize method
create a copy of the loaded image for preprocessing
initialize a dai.NNData() object to store the preprocessed image data
preprocess the copied image to the model input shape and set the preprocessed image data as the input layer of the dai.NNData() object
send the preprocessed image data to the detectionIN queue, which is further passed to the YOLO detection network for inference as defined in the image pipeline in utils.py

        print("[INFO] fetching neural network output for {}".
              format(img_path.split('/')[1]))
        # fetch the neural network output
        inDet = detectionNN.get()
        # if detection is available for given image, fetch the detections
        if inDet is not None:
            detections = inDet.detections

            # if object detected, annotate the image
            image_res = utils.annotateFrame(image_res, detections, args.model.lower())

        # finally write the image to the output path
        cv2.imwrite(
            output_image_path + "/" + img_path.split('/')[1],
            image_res
        )

With the image loaded from the disk, preprocessed, and inferred with the YOLO detection network, as a final step, we now fetch the neural network predictions (hand gesture recognition) for each test image, annotate the image with the detections, and finally save the annotated image to the disk.

On Line 64, we fetch the neural network output for the current image using the detectionNN.get() method and store the result in the inDet variable.

Next, on Lines 66-70, we check if the fetched neural network output is not None, meaning detections are available for the given image:

If detections are available, extract them from the inDet object using the detections attribute and store them in the detections variable.
Annotate the resized image (image_res) with the detections using the utils.annotateFrame function. The function, as discussed before, draws bounding boxes and labels for each detected object on the image.

Finally, on Lines 73-76, the annotated image is written to the output path using the cv2.imwrite function.
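One small portability note: img_path.split('/')[1] assumes forward slashes and exactly one directory level in the path. If you run the script on Windows or restructure the test_data folder, a slightly more robust variant (an optional tweak, not part of the downloaded code) is to build the output path with os.path.basename and os.path.join:

# optional, more portable way to build the output file path
import os

filename = os.path.basename(img_path)  # e.g., "five.jpg" regardless of the path separator
cv2.imwrite(os.path.join(output_image_path, filename), image_res)

With everything in place, the script can be run from a terminal as python recognize_images.py --model yolov8n (the default) or python recognize_images.py --model yolov8s.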

Results

Great! Now that we have completed implementing the code for hand gesture recognition in images, it’s time to examine the results.

Figure 3 displays the hand gesture recognition outcomes on several test images using YOLOv8n when supplied to the OAK device. The figure shows that the converted (.blob) and optimized hand gesture recognition model identifies hand gestures in the test images well. However, among the six samples:

one False Positive (2nd row, 3rd image: classified one as two)
one False Negative (1st row, 2nd image: failed to recognize hand gesture five, though it was a tough example)

Figure 3: Recognizing hand gestures on test images with the optimized YOLOv8n model on the OAK device (source: image by the author).

Figure 4 showcases the hand gesture recognition results using the YOLOv8s variant on the same sample test images when executed on the OAK device. As observed in the figure, the YOLOv8s variant performed slightly better than the YOLOv8n variant, as it detected the hand gesture five (in the 1st row, 2nd image). However, this time the model failed to detect the hand gesture one (in the 2nd row, 3rd image), resulting in one False Negative.

Nonetheless, the overall performance was impressive. It would be intriguing to compare the performance of YOLOv8s and YOLOv8n when running hand gesture recognition on a camera stream.

Figure 4: Recognizing hand gestures on test images with the optimized YOLOv8s model on the OAK device (source: image by the author).

We can safely conclude that our DepthAI implementation for hand gesture recognition in images is effective. The minor inaccuracies observed on the test images come from the model itself, which could potentially be improved for even better performance.

Hand Gesture Recognition in Camera Stream

Fantastic! We have now acquired the skill of recognizing hand gestures in images using an OAK device. Next, it’s time to take things to the next level and utilize the OAK’s color camera to recognize gestures in the live frames. This transition to leveraging OAK’s camera stream is where we believe the OAK module truly shines in practical application.

# import the necessary packages
from pyimagesearch import config
from pyimagesearch import utils
from imutils.video import FPS
import argparse
import time
import cv2
import depthai as dai

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", help="Provide model name for inference",
                    default="yolov8n", type=str)
args = parser.parse_args()

We start by importing the necessary packages on Lines 6-12

config and utils modules from the pyimagesearch package, which contain configuration settings/constants and utility functions for creating OAK pipelines
FPS module from imutils.video, which helps us compute how long the hand gesture recognition takes to perform inference on each frame
argparse for parsing command line arguments
cv2 for performing image and video processing tasks
depthai module for creating DepthAI pipelines for OAK

Then, on Lines 15-18, we create a command line argument, -m or --model, with which users can provide the model name for hand gesture recognition (either YOLOv8n or YOLOv8s).

When the script is run, args will be a namespace object containing the values of any arguments passed to the script. If the -m argument is not provided, args.model will have the default value of yolov8n. If the -m argument is provided, its value will be stored in args.model.

# initialize a depthai camera pipeline
print("[INFO] initializing a depthai camera pipeline...")
model_name = args.model
# if the model for inference is nano variant create camera pipeline
# with nano variant config file and model weights
if model_name.lower() == "yolov8n":
    pipeline = utils.create_camera_pipeline(config_path=config.YOLOV8N_CONFIG,
                                            model_path=config.YOLOV8N_MODEL)
    output_video = config.OUTPUT_VIDEO_YOLOv8n

# if the model for inference is small variant create camera pipeline
# with small variant config file and model weights
else:
    pipeline = utils.create_camera_pipeline(config_path=config.YOLOV8S_CONFIG,
                                            model_path=config.YOLOV8S_MODEL)
    output_video = config.OUTPUT_VIDEO_YOLOv8s

# set the video codec to use with video writer
fourcc = cv2.VideoWriter_fourcc(*'MJPG')
# create video writer object with parameters: output video path,
# video codec, frame rate of output video, and dimensions of video frame
out = cv2.VideoWriter(
    output_video,
    fourcc,
    20.0,
    config.CAMERA_PREV_DIM
)

On Lines 22-36, we initialize a depthai camera pipeline for hand gesture recognition using the YOLOv8 Nano or Small model.

Based on the model variant (determined by args.model), we select the appropriate configuration file and model weights for the YOLOv8 Nano or Small variant on Lines 26-36:

If the model variant is YOLOv8 Nano (yolov8n), it creates a camera pipeline with the configuration file and model weights specified for the Nano variant.
If the model variant is not YOLOv8 Nano, it assumes the model is the Small variant (yolov8s) and creates a camera pipeline with the configuration file and model weights specified for the Small variant.

We also set the output_video path where the inference results will be stored as a video.

Next, on Line 39, we set the video codec to MJPG using the cv2.VideoWriter_fourcc function.

Finally, on Lines 42-47, we create an OpenCV VideoWriter object that will be used to save the processed video with hand gesture recognition annotations. The VideoWriter is initialized with the following:

The output video path, which differs for the Nano and Small variants.
The MJPG video codec.
The frame rate of the output video, set to 20 frames per second (FPS).
The dimensions of the video frame, specified by config.CAMERA_PREV_DIM.

# pipeline defined, now the device is assigned and pipeline is started
with dai.Device(pipeline) as device:

    # output queues will be used to get the rgb frames
    # and nn data from the outputs defined above
    qRgb = device.getOutputQueue(name="rgb", maxSize=4, blocking=False)
    qDet = device.getOutputQueue(name="nn", maxSize=4, blocking=False)

    # initialize variables like frame, start time for NN FPS
    # also start the FPS module timer, define color pattern for FPS text
    frame = None
    startTime = time.monotonic()
    fps = FPS().start()
    counter = 0
    color2 = (255, 255, 255)

Then, we set up and start the depthai camera pipeline using the previously created pipeline configuration. We also initialize variables for processing video frames and calculating FPS.

On Line 50, the dai.Device class is used as a context manager to ensure the device is properly initialized and cleaned up. The pipeline configuration is passed to the device when it is created.

Next, on Lines 54 and 55, two output queues are created using the device.getOutputQueue() method, which is used to retrieve the RGB video frames and neural network (NN) inference results. These queues are named “rgb” and “nn” (remember we named them while creating the camera pipeline using .setStreamName()), with a maximum size of 4, and are set to non-blocking mode. In non-blocking mode, the oldest data will be overwritten with new data if the queue is full.

On Lines 59-63, several variables are initialized, such as

frame: for storing the current video frame being processed
startTime: the starting time for calculating the neural network FPS, using the time.monotonic() function, which returns the current time in seconds
fps: an instance of the FPS utility class, created and started to measure the frames per second of the video processing
counter: a counter variable, initialized to 0, used to count the inference results received and feed the NN FPS calculation
color2: a color tuple (255, 255, 255) representing white in BGR color space, which will be used to annotate video frames with FPS information

    print("[INFO] starting inference with OAK camera...")
    while True:
        # fetch the RGB frames and YOLO detections for the frame
        inRgb = qRgb.get()
        inDet = qDet.get()

        if inRgb is not None:
            # convert inRgb output to a format OpenCV library can work
            frame = inRgb.getCvFrame()
            # annotate the frame with FPS information
            cv2.putText(frame, "NN fps: {:.2f}".format(counter / (time.monotonic() - startTime)),
                        (2, frame.shape[0] - 4), cv2.FONT_HERSHEY_TRIPLEX, 0.8, color2)
            # update the FPS counter
            fps.update()

        if inDet is not None:
            # if inDet is not none, fetch all the detections for a frame
            detections = inDet.detections
            counter += 1

        if frame is not None:
            # annotate frame with detection results
            frame = utils.annotateFrame(frame, detections, args.model.lower())
            # display the frame with gesture output on the screen
            cv2.imshow(args.model.lower(), frame)

            # write the annotated frame to the file
            out.write(frame)

        # break out of the while loop if `q` key is pressed
        if cv2.waitKey(1) == ord('q'):
            break

    # stop the timer and display FPS information
    fps.stop()
    print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
    print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
    # do a bit of cleanup
    out.release()
    cv2.destroyAllWindows()

With the pipeline started and queues defined, we start the infinite while loop, which continuously fetches RGB frames and YOLOv8 detections from two different input queues (qRgb and qDet) on Lines 66-69.

On Lines 71-78, if an RGB frame is available:

It is converted to a format that the OpenCV library can process.
It is then annotated with the FPS (Frames Per Second) information. The FPS is calculated using a counter and the time elapsed since the start of the inference process. The FPS counter is updated every time an RGB frame is processed.

Next, on Lines 80-83, if a YOLO detection is available, the detections for that frame are fetched, and the counter is incremented.

If a frame is available and the detections for that frame have been fetched, the frame is annotated with the detection results using the annotateFrame() function from the utils module on Lines 85-87. The annotated frame is then displayed on the screen and written to an output file on Lines 89 and 93.

Finally, on Lines 96 and 97, we check if the user pressed the ‘q’ key, at which point the while loop is broken, and the inference process is stopped. The FPS information is then printed to the console, the output file is released, and the OpenCV windows are destroyed on Lines 100-105.

Results

We have done an amazing job implementing hand gesture recognition on the camera stream using OAK! And we are thrilled to share the fantastic inference results we achieved with both the YOLOv8n and YOLOv8s variants. Can you believe that YOLOv8s performed exceptionally well in recognizing all five hand gestures with barely any false positives or negatives? It is undoubtedly the clear winner here!

In terms of FPS, YOLOv8n achieved a speed of approximately 25 FPS, which is impressive. But wait, YOLOv8s is no slouch either, with a commendable speed of around 13 FPS.

The video below displays the hand gesture recognition inference results with the YOLOv8 Nano variant:

The video below displays the hand gesture recognition inference results with the YOLOv8 Small variant:

What’s next? I recommend PyImageSearch University.

Course information:
76 total classes • 90 hours of on-demand code walkthrough videos • Last updated: May 2023
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled

I strongly believe that if you had the right teacher you could master computer vision and deep learning.

Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?

That’s not the case.

All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.

If you’re serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.

Inside PyImageSearch University you’ll find:

✓ 76 courses on essential computer vision, deep learning, and OpenCV topics
✓ 76 Certificates of Completion
✓ 90 hours of on-demand video
✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
✓ Pre-configured Jupyter Notebooks in Google Colab
✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
✓ Access to centralized code repos for all 500+ tutorials on PyImageSearch
✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
✓ Access on mobile, laptop, desktop, etc.

Click here to join PyImageSearch University

Summary

Congratulations! You have now completed this tutorial, and we hope it has provided valuable insights into creating a hand gesture recognition application on OAK-D using YOLOv8n and YOLOv8s object detectors.

We began by outlining the project structure and proceeded with a detailed walkthrough of the process, which included the following:

Exporting the YOLOv8 PyTorch model to OAK
Configuring prerequisites
Defining utilities
Creating image and camera pipelines
Defining helper functions

Following this, we implemented hand gesture recognition in images and camera streams using the previously defined pipelines and utilities.

To conclude the tutorial, we compared the results of the YOLOv8n and YOLOv8s variants, emphasizing the superior performance of the YOLOv8s variant in camera streams.

Additionally, we highlighted the near real-time performance achieved with both YOLOv8n and YOLOv8s variants for hand gesture recognition in camera streams, demonstrating their effectiveness in practical applications.

Citation Information

Sharma, A. “Hand Gesture Recognition with YOLOv8 on OAK-D in Near Real-Time,” PyImageSearch, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2023, https://pyimg.co/92by6

@incollection{Sharma_2023_Hand-Gesture-Recognition-YOLOv8-OAK-D,
author = {Aditya Sharma},
title = {Hand Gesture Recognition with {YOLOv8} on {OAK-D} in Near Real-Time},
booktitle = {PyImageSearch},
editor = {Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
year = {2023},
url = {https://pyimg.co/92by6},
}

Want free GPU credits to train models?

We used Jarvislabs.ai, a GPU cloud, for all the experiments.
We are proud to offer PyImageSearch University students $20 worth of Jarvislabs.ai GPU cloud credits. Join PyImageSearch University and claim your $20 credit here.

In Deep Learning, we need to train Neural Networks. These Neural Networks can be trained on a CPU but take a lot of time. Moreover, sometimes these networks do not even fit (run) on a CPU.

To overcome this problem, we use GPUs. The problem is these GPUs are expensive and become outdated quickly.

GPUs are great because they take your Neural Network and train it quickly. The problem is that GPUs are expensive, so you don’t want to buy one and use it only occasionally. Cloud GPUs let you use a GPU and only pay for the time you are running the GPU. It’s a brilliant idea that saves you money.

JarvisLabs provides the best-in-class GPUs, and PyImageSearch University students get between 10-50 hours on a world-class GPU (time depends on the specific GPU you select).

This gives you a chance to test-drive a monstrously powerful GPU on any of our tutorials in a jiffy. So join PyImageSearch University today and try it for yourself.

Click here to get Jarvislabs credits now

To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!

Download the Source Code and FREE 17-page Resource Guide

Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you’ll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!

