
Design of TinyML Image Classification Camera Based on ESP32

Source: Internet | Publisher: 无人共我 | Keywords: Camera, ESP32 | Updated: 2024/06/03

Background of the project

We are in the middle of a machine learning revolution that is increasingly moving onto embedded devices. And when we talk about machine learning (ML), the first thing that comes to mind is image classification, a kind of "Hello World" of ML!

One of the most popular and affordable development boards with an integrated camera is the ESP32-CAM, which combines the Espressif ESP32-S MCU chip with the ArduCam OV2640 camera.

The ESP32 chip is powerful enough to even process images on-device. It also includes I2C, SPI, and UART communication interfaces, as well as PWM and DAC outputs.

Parameters:

Working voltage: 4.75–5.25 V
Flash: 32 Mbit (default)
RAM: 520 KB internal + 8 MB external PSRAM
Wireless network: 802.11 b/g/n/e/i
Bluetooth: Bluetooth 4.2 BR/EDR and BLE
Supported interfaces (2 Mbps): UART, SPI, I2C, PWM
TF card: up to 4 GB supported
I/O pins: 9
Serial port rate: 115200 bps (default)
Spectrum range: 2400–2483.5 MHz
Antenna: onboard PCB antenna, 2 dBi gain
Image output formats: JPEG (OV2640 only), BMP, GRAYSCALE

Below is the general circuit board pinout:

Please note that this device does not have an integrated USB-to-TTL serial module, so to upload code to the ESP32-CAM you need a special adapter, as shown below:

Or a USB-to-TTL serial conversion adapter, as follows:

If you want to learn about the ESP32-CAM, I highly recommend Rui Santos's books and tutorials.

Installing the ESP32-CAM in the Arduino IDE

In the Arduino IDE, open the Preferences window: Arduino > Preferences.

Enter the following URL in the Additional Boards Manager URLs field:

https://dl.espressif.com/dl/package_esp32_index.json

Next, open the Boards Manager by going to Tools > Board > Boards Manager..., search for esp32, then select and install the latest package.

Select your ESP32 development board:

For example, AI-Thinker ESP32-CAM

Finally, don't forget to select the port to which the ESP32-Cam is connected.

That's it! The board should be ready. Let's run some tests.

Testing the board with BLINK
The ESP32-CAM has a built-in LED connected to GPIO33. So, change the Blink sketch accordingly:

#define LED_BUILT_IN 33

void setup() {
  pinMode(LED_BUILT_IN, OUTPUT); // Set the pin as output
}

// Remember that the pin works with inverted logic
// LOW to turn on and HIGH to turn off
void loop() {
  digitalWrite(LED_BUILT_IN, LOW);  // Turn on
  delay(1000);                      // Wait 1 sec
  digitalWrite(LED_BUILT_IN, HIGH); // Turn off
  delay(1000);                      // Wait 1 sec
}

As a reminder, this LED is located on the underside of the board.

Testing WiFi
One of the ESP32-S's slick features is its WiFi capability. So, let's test its radio by scanning the WiFi networks around it. You can do this by running one of the code examples that come with the board.

Go to Arduino IDE Examples and look for WiFi ==> WiFiScan

On the serial monitor you should see the wifi networks in range of the device (SSID and RSSI). This is what I get at home:
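
If you prefer to write the test yourself, the sketch below is a minimal version of the same scan, using the WiFi.h API that ships with the ESP32 Arduino core (the 5-second rescan interval is just an arbitrary choice):

#include "WiFi.h"

void setup() {
  Serial.begin(115200);
  WiFi.mode(WIFI_STA);   // station mode, not connected to any access point
  WiFi.disconnect();
  delay(100);
}

void loop() {
  int n = WiFi.scanNetworks();                 // blocking scan
  Serial.printf("%d network(s) found\n", n);
  for (int i = 0; i < n; i++) {
    // Print the SSID and signal strength (RSSI) of each network
    Serial.printf("%d: %s (%d dBm)\n", i + 1, WiFi.SSID(i).c_str(), WiFi.RSSI(i));
  }
  delay(5000);                                 // scan again every 5 seconds
}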

Testing the Camera
For camera testing you can use the following code:

Examples ==> ESP32 ==> Camera ==> CameraWebServer

Select the correct camera model:

#define CAMERA_MODEL_AI_THINKER

and enter your network credentials:

const char* ssid = "*********";
const char* password = "*********";

On the serial monitor you will get the correct address to run the server where you can control the camera:

Here I entered: http://172.16.42.26

Running your web server

So far we have tested all the ESP32-CAM hardware (MCU and camera) as well as the WiFi connection. Now, let's run a simpler sketch that captures a single image and presents it on a simple web page. This code is based on Rui Santos' great tutorial: ESP32-CAM Take Photo and Display in Web Server.

Download the ESP32_CAM_HTTP_Server_STA file from GitHub, change the WiFi credentials, and run the code. The result is as follows:

Take some time to review the code; it makes it easier to understand how the camera works.
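
The heart of any such sketch is the capture itself. Assuming the camera has already been initialized with the AI-Thinker pin map (as in the CameraWebServer example), grabbing one photo comes down to a few calls to the esp_camera driver. The helper below (capturePhoto is a name chosen here only for illustration) sketches that flow:

#include "esp_camera.h"

// Assumes esp_camera_init() was already called with the AI-Thinker
// pin configuration, as done in the CameraWebServer example.
bool capturePhoto() {
  camera_fb_t *fb = esp_camera_fb_get();   // grab one frame from the camera
  if (!fb) {
    Serial.println("Camera capture failed");
    return false;
  }
  Serial.printf("Captured %dx%d image, %u bytes\n",
                (int)fb->width, (int)fb->height, (unsigned)fb->len);
  // fb->buf and fb->len hold the (JPEG) image data: serve it to the web
  // client, save it to the SD card, or hand it to a classifier here.
  esp_camera_fb_return(fb);                // give the frame buffer back to the driver
  return true;
}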

Fruits and Vegetables - Image Classification

Now that we have our embedded camera running, it’s time to try image classification.

We will start by training the model and then run inference on the ESP32-CAM. For training, we first need to find a good amount of data.

TinyML is a set of technologies related to machine learning inference on embedded devices. Due to the constraints (mainly memory, in this case), we should limit the classification to three or four categories. We will distinguish apples from bananas and potatoes (you can try other categories).

So let's find a specific dataset that contains images of these categories. Kaggle is a good start:

https://www.kaggle.com/kritikseth/fruit-and-vegetable-image-recognition

The dataset contains images of the following food items:

Fruits - bananas, apples, pears, grapes, oranges, kiwis, watermelons, pomegranates, pineapples, mangoes.

Vegetables - cucumber, carrot, pepper, onion, potato, lemon, tomato, radish, beetroot, cabbage, lettuce, spinach, beans, cauliflower, bell pepper, capsicum, radish, corn, sweet corn, sweet potato, paprika, jalapeno, ginger, garlic, peas, eggplant.

Each class is divided into training (100 images), testing (10 images), and validation (10 images).

Download the dataset from the Kaggle website to your computer.

Training models using Edge Impulse Studio

We will be using Edge Impulse for training, the leading development platform for machine learning on edge devices.

Enter your account credentials at Edge Impulse (or create a free account). Next, create a new project:

Data Collection

Next, in the Upload Data section, upload files of the selected category from your computer:

You should end up with three classes of data, ready for training:

You can also upload additional data for further model testing or to split your training data.

Impulse Design

An impulse takes raw data (in this case, images), extracts features (resizing the images), and then uses a learning block to classify new data.

As mentioned before, classifying images is the most common use of deep learning, but a lot of data is needed to accomplish this task. We have only about 90 images per class. Is this enough? Not at all! We would need thousands of images to "teach our model" the difference between an apple and a banana. However, we can solve this problem by retraining a model that was previously trained on thousands of images. We call this technique "transfer learning" (TL).

Using TL, we can fine-tune a pre-trained image classification model on our data and achieve good performance even on relatively small image datasets (our case).

So, starting with our original images, we will resize them to (96x96) pixels and then feed them into our transfer learning block:

Preprocessing (feature generation)

In addition to resizing the images, we also convert them to grayscale instead of keeping the actual RGB color depth. Doing so, each of our data samples has 9,216 features (96x96x1). Keeping RGB, this dimension would be three times larger. Using grayscale helps reduce the final amount of memory required for inference.
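
The same preprocessing later has to happen on the device before inference: each of the 96x96 pixels is reduced to a single luminance value, producing the 9,216-element input vector. The sketch below only illustrates that idea (the RGB888 buffer, the luminance weights, and the function name are assumptions, not the exact code generated by Edge Impulse):

#define INPUT_W 96
#define INPUT_H 96
// 96 x 96 x 1 channel = 9,216 features per sample
static float features[INPUT_W * INPUT_H];

// Convert an already-resized 96x96 RGB888 buffer (3 bytes per pixel)
// to grayscale and flatten it into the feature vector.
void rgbToGrayFeatures(const uint8_t *rgb) {
  for (int i = 0; i < INPUT_W * INPUT_H; i++) {
    uint8_t r = rgb[3 * i + 0];
    uint8_t g = rgb[3 * i + 1];
    uint8_t b = rgb[3 * i + 2];
    features[i] = 0.299f * r + 0.587f * g + 0.114f * b;  // standard luminance weights
  }
}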

Don't forget to "Save Parameters". This will generate the features to be used in training.

Training (transfer learning and data augmentation)

In 2017, Google introduced MobileNetV1, a family of general-purpose computer vision neural networks designed for mobile devices that supports classification, detection, and more. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of various use cases.

Although the basic MobileNet architecture is already small and has low latency, there are many times when a specific use case or application may require a smaller and faster model. To build these smaller and less computationally expensive models, MobileNet introduces a very simple parameter α (alpha), called the width multiplier. The width multiplier α has the effect of uniformly thinning the network at each layer.
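
As a rough worked example (the exact channel counts depend on the specific architecture): a convolutional layer that has 64 filters in the base network keeps only 16 of them at α = 0.25, and only about 6 at α = 0.10, which is why the memory footprint shrinks so dramatically.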

Edge Impulse Studio provides MobileNet V1 (96x96 images) and V2 (96x96 and 160x160 images) with several different alpha values (from 0.05 to 1.0). For example, you will get the highest accuracy with V2, 160x160 images, and alpha = 1.0. Of course, there is a trade-off: the higher the accuracy, the more memory (about 1.3 MB of RAM and 2.6 MB of ROM) is required to run the model, which also means more latency.

At the other extreme, using MobileNet V1 with α = 0.10 (approximately 53.2 KB of RAM and 101 KB of ROM) results in a much smaller footprint.

To run this project on the ESP32-CAM, we should stay at the lower end of the possibilities, which guarantees that inference will run, but does not guarantee high accuracy.

Another necessary technique used with deep learning is data augmentation. Data augmentation is a method that can help improve the accuracy of machine learning models. Data augmentation systems make small, random changes to the training data during the training process (such as flipping, cropping, or rotating images).

Here you can see how Edge Impulse implements data augmentation strategies on your data:

# Implements the data augmentation policy
# (assumes the usual imports and the INPUT_SHAPE constant that Edge Impulse defines elsewhere)
import math
import random
import tensorflow as tf

INPUT_SHAPE = (96, 96, 1)  # height, width, channels of the model input

def augment_image(image, label):
    # Flips the image randomly
    image = tf.image.random_flip_left_right(image)
    # Increase the image size, then randomly crop it down to
    # the original dimensions
    resize_factor = random.uniform(1, 1.2)
    new_height = math.floor(resize_factor * INPUT_SHAPE[0])
    new_width = math.floor(resize_factor * INPUT_SHAPE[1])
    image = tf.image.resize_with_crop_or_pad(image, new_height, new_width)
    image = tf.image.random_crop(image, size=INPUT_SHAPE)
    # Vary the brightness of the image
    image = tf.image.random_brightness(image, max_delta=0.2)
    return image, label

Exposure to these changes during training can help prevent the model from taking shortcuts by “memorizing” surface cues from the training data, meaning it can better reflect deeper underlying patterns in the dataset.

The last layer of our model will have 16 neurons with 10% dropout to prevent overfitting. Here is the training output:

The results are not great: the model reaches only about 77% accuracy. However, the amount of RAM expected to be used during inference is very small (about 60 KB), which is very good.

Deployment

The trained model is deployed as an Arduino library (a .zip file) to be used with the specific ESP32-CAM code.

Open your Arduino IDE and, under Sketch, go to Include Library and Add .ZIP Library. Select the file you just downloaded from Edge Impulse Studio, and that's it!

Under the Examples tab in the Arduino IDE, you should find a sketch code under the project name.

Open the static_buffer example:

You can see that the first line of code calls the library that has everything needed to run inference on your device.

#include <ESP32-CAM-Fruit-vs-Veggies_inferencing.h>

Of course, this is generic code (a "template") that just takes a sample of raw data (stored in the variable features = {...}) and runs the classifier, doing inference. The results are displayed on the serial monitor.
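
Stripped to its essentials, the template does something along the lines of the sketch below (a simplified version of the Edge Impulse static_buffer example; the helper names and constants come from the generated library, and the features buffer is left as a placeholder):

#include <ESP32-CAM-Fruit-vs-Veggies_inferencing.h>

// One raw input sample (a flattened image), pasted here as a static buffer
static const float features[] = { 0 /* paste the raw features copied from Edge Impulse Studio */ };

// Callback that hands slices of the buffer to the classifier
int raw_feature_get_data(size_t offset, size_t length, float *out_ptr) {
  memcpy(out_ptr, features + offset, length * sizeof(float));
  return 0;
}

void setup() {
  Serial.begin(115200);
}

void loop() {
  signal_t signal;
  signal.total_length = sizeof(features) / sizeof(features[0]);
  signal.get_data = &raw_feature_get_data;

  ei_impulse_result_t result = { 0 };
  run_classifier(&signal, &result, false);   // run inference on the buffer

  // Print the probability of each class on the serial monitor
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    Serial.printf("%s: %.5f\n", result.classification[ix].label,
                  result.classification[ix].value);
  }
  delay(2000);
}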

What we should do is capture an image with the camera, preprocess it (resize to 96x96, convert to grayscale), and flatten it. This will be the input tensor of our model. The output will be a tensor with three values, showing the probability of each class.

On GitHub (https://github.com/edgeimpulse/example-esp32-cam), Edge Impulse adapted the camera test code (Examples ==> ESP32 ==> Camera ==> CameraWebServer) to include everything necessary to run inference on the ESP32-CAM. Download the Basic-Image-Classification code from that repository, include your project library, and select your camera model and your WiFi network credentials:

Upload the code to your ESP32-CAM and you should be able to start classifying fruits and vegetables! You can check the results on the serial monitor:

Testing the model (inference)

Take a picture with the camera, and the classification results will appear on the serial monitor:

The images captured by the camera can be verified on the web page:

Other tests:

Conclusion

The ESP32-CAM is a very flexible, inexpensive, and easy-to-program device. This project demonstrates the potential of TinyML, but I am not sure the overall result can be applied to real applications as it was developed here. Only the smallest transfer learning model (MobileNet V1, α = 0.10) worked properly, and any attempt to use a larger α to improve accuracy resulted in arena allocation errors. One likely reason is the amount of memory already consumed by the common code that runs the camera. Therefore, the next step of the project is to optimize the final code to free up more memory for running the model.
