Sunday, 14 June 2020

Implementing Convolutional Neural Network (CNN) with DIY Machine Vision Module

By F. Kung
 
Last updated: 31 Dec 2021

In this post I will share my journey of implementing a convolutional neural network (CNN) in my DIY machine vision module (MVM).  The MVM in question is described in a previous post, https://fkeng.blogspot.com/2016/01/machine-vision-module.html.  A picture of it is shown below.


The current version of the machine vision hardware comprises a low resolution VGA CMOS camera paired with an ARM Cortex M7 micro-controller (MCU), running at a frame rate of around 20 frames per second (fps).  More details of the machine vision hardware and software are described in that previous post.  The focus of this post is on the technical details of training a CNN using Google's TensorFlow framework, and porting the CNN model into custom C code (and a bit of Assembly) that runs on the ARM Cortex M7 MCU in the machine vision module.  This is what is typically called Edge Artificial Intelligence (AI) processing, i.e. the AI computation is performed locally on the machine.  Before I proceed, I wish to clarify a few things:
  • This post is not a tutorial on neural networks, nor an explanation of how to use the Google TensorFlow machine learning library, so it is assumed the reader is already familiar with these topics.
  • The MVM is intended for use in a mobile robot for navigation or obstacle avoidance purpose.  Hence the CNN in the MVM is used to perform image analysis of each image frame captured by the camera to estimate the position of obstacles.  
  • The Python code for the CNN is developed using the Spyder IDE; however, any Python IDE should be fine.
  • Here I am using the TensorFlow V2.0 library [Update 15 June 2022: also tested to work well with the V2.9.1 library].
  • If you are looking for a simple way to implement a powerful neural network or edge AI-based system, alternatives like the ESP32-CAM, Pixy CAM or Open MV would be better.  You can also opt for a commercial edge AI processor board, such as the Jetson Nano, the Google Coral development board or even a Raspberry Pi.

1. Introduction

The idea is to have a machine vision module (MVM) attached to a mobile robot, continuously scanning the floor or surface in front of the robot at 20 fps.  For each captured image, the image processor in the MVM runs a CNN forward propagation routine, trying to identify any object or obstacle in front of the robot.  To reduce the size of the CNN model, the processing time and the memory requirement, the following steps are taken:
  • Only the gray scale or luminance output of the image frame is used.
  • Resolution of the image is reduced to QQVGA, or 120x160 pixels.
  • Because the objective is to detect obstacles in front of the mobile robot, only a small portion of the gray scale image, 37x100 pixels in size, is subjected to the CNN analysis.
  • The CNN has 5 output classes: 'Left', 'Front', 'Right', 'Blocked' and 'No Object' (The corresponding label values are 1, 3, 2, 4 and 0).
With the steps above, it is possible for the CNN code in the image processor to complete the analysis of an image frame in less than 50 milliseconds (for a 20 fps frame rate) on my DIY Machine Vision Module (MVM). This idea is shown in Figure 1 below, with Figure 2 illustrating the 5 output classes of the CNN. The software for the MVM is still a work-in-progress; in the future, when the efficiency of the code is improved, more layers and output classes can be added (for instance a class for an object in both the left and right sub-regions, but not in the front).  For now I find these 5 output classes sufficient for the mobile robot to navigate its environment.
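For reference, the label-to-class mapping above can be summarized in a small Python dictionary (just a convenience sketch for this post, not part of the project code):

CLASS_NAMES = {0: 'No Object', 1: 'Left', 2: 'Right', 3: 'Front', 4: 'Blocked'}  #Label value to class name.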


Figure 1 - Screen capture of the MVM monitor software showing the 'perspective' of the MVM mounted on a small mobile robot.  In the figure the algorithm in the MVM image processor highlights that an object/obstacle is present in the Right sub-region.








Figure 2 - Examples for 'Blocked' (label = 4), 'Front' (label = 3), 'Left' (label = 1), 'Right' (label = 2) and 'No Object' (label = 0) classification outputs.

The following sections will discuss the various topics needed to train the CNN model, export the weights and biases, and implement the inference or forward propagation computation in C/Assembly code.  Here are two short videos to show how the system works:

Video 1 - Demo of the system mounted on a small mobile robot (static).


Video 2 - Demo of the system with the mobile robot moving in autonomous mode.




2. Saving Images from the Machine Vision Module (MVM) onto the Computer Hard Disk

In order to train the CNN using the back propagation method with the TensorFlow library, one needs to feed the image data and the class label into the CNN model created with TensorFlow.  Each image is stored as a 2D NumPy array, where each element of the array represents the value of a pixel normalized to between 0.0 and 1.0.  The element value is stored as a 32-bit floating point datatype by default; it is also possible to use a fixed point datatype since the value is between 0.0 and 1.0. In the case of a color image we would have three 2D NumPy arrays (one for each RGB channel) per image.  Normally we would not feed one image at a time to train the CNN, but a batch of images.  Suppose we have 500 gray scale images of 50x100 pixels to be fed into the CNN model; a multi-dimensional NumPy array of dimension (500,50,100) would need to be created.  The 1st index points to the image number or sequence, while the 2nd and 3rd indices refer to the pixel coordinates.
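As a minimal sketch of this arrangement (the array and variable names below are illustrative only, not part of the project code), a batch of 500 such gray scale images could be set up like this:

import numpy as np

num_images, height, width = 500, 50, 100

#3D array holding the whole batch: the 1st index selects the image,
#the 2nd and 3rd indices select the pixel row and column.
batch = np.zeros((num_images, height, width), dtype=np.float32)

#Example: store one image (pixel values 0-255) normalized to between 0.0 and 1.0.
one_image = np.random.randint(0, 256, size=(height, width))
batch[0] = one_image / 256.0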

A convenient method to fill up this multi-dimensional NumPy array is to read the image files one by one from the computer hard disk using Python's Matplotlib.pyplot plotting library.  The pyplot module contains an imread( ) method that can read a number of image formats into a multi-dimensional NumPy array.  For this project, as the resolution of the image is low, every single pixel is important; thus if we use image compression to reduce the file size on disk, it is important to use a lossless compression approach.  Here I simply store each image in bitmap (BMP) format.  The MVM can be linked to a monitor software, where the user can see the image captured by the MVM camera in real time.  The MVM monitor software has a function to export the raw gray scale image in BMP format and save it to the computer hard disk (see Figure 1).  Further information on the bitmap format can be obtained from [1].  The MVM monitor software is written in Visual Basic .NET; the source code in Listing 1 is an example of how to save a gray scale image in 24-bit BMP format.  Here we assume there is a button called ButtonSaveBMP, and a SaveFileDialog object called SaveFileDialog1 has already been instantiated in the MVM monitor software code.

Listing 1 - Visual Basic .NET subroutine to save bitmap file.
Private Sub ButtonSaveBMP_Click(sender As Object, e As EventArgs) Handles ButtonSaveBMP.Click
        Dim nYindex As Integer
        Dim nXindex As Integer
        Dim bytData(0 To (3 * mintImageWidth) - 1) As Byte   ' Size is 3x(no. of pixels per line)
        Dim nPixel As Integer
        Dim bytBITMAPFILEHEADER() As Byte = {&H42, &H4D, 0, 0, 0, 0, 0, 0, 0, 0, &H36, 0, 0, 0} 'Metadata, file header, 14 bytes.
        Dim bytBITMAPINFOHEADER() As Byte = {&H28, 0, 0, 0, &HA0, 0, 0, 0, &H78, 0, 0, 0, &H1, 0, &H18, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} 'Metadata, bitmap info header, 40 bytes.
        'There is also an optional color table metadata for BMP format which we did not use here.  The color table is only needed when BPP is less than 16 bits.


        Try
            mFilePath = TextBoxFileNum.Text

            If mFilePath <> "" Then              'Check if filename is valid.
                SaveFileDialog1.Title = "Save Bitmap File"
                SaveFileDialog1.CheckFileExists = False
                SaveFileDialog1.DefaultExt = "bmp"
                SaveFileDialog1.Filter = "bitmap files (*.bmp)|*.bmp"
                SaveFileDialog1.FileName = mFilePath
                If SaveFileDialog1.ShowDialog() = DialogResult.OK Then
                    mFilePath = SaveFileDialog1.FileName
                End If
            Else
                MessageBox.Show("Filename not valid", "ERROR", MessageBoxButtons.OK)
            End If

            If mFilePath <> "" Then ' Only proceed if filename is valid.
                My.Computer.FileSystem.WriteAllBytes(mFilePath, bytBITMAPFILEHEADER, False) 'False to overwrite the content.
                My.Computer.FileSystem.WriteAllBytes(mFilePath, bytBITMAPINFOHEADER, True) 'True to append to the existing content.
                For nYindex = 0 To mintImageHeight - 1
                    For nXindex = 0 To mintImageWidth - 1
                        nPixel = 2 * mbytPixel2(nXindex, mintImageHeight - 1 - nYindex) 'The original luminance value is between 0-127,
                        'here we multiply by 2 to scale it to
                        'between 0 and 254 (near the full 0-255 range).
                        bytData(3 * nXindex) = nPixel        'Construct a grayscale pixel.
                        bytData((3 * nXindex) + 1) = nPixel     'Format is BGR. Make sure total bytes per line is divisible by 4.
                        bytData((3 * nXindex) + 2) = nPixel
                    Next
                    My.Computer.FileSystem.WriteAllBytes(mFilePath, bytData, True) 'True to append to the existing content.
                Next
            End If


        Catch ex As Exception
            MessageBox.Show("Save file: " & ex.Message, "ERROR", MessageBoxButtons.OK)
        End Try
    End Sub
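To confirm that the exported bitmap can be read back correctly on the Python side, a quick check like the following can be used (the file path is just an example, assuming a file saved from the MVM monitor software into the folder structure described in the next section):

import matplotlib.pyplot as plt

#Read one exported bitmap back and inspect its shape and value range.
img = plt.imread('./TrainImage/Right/0.bmp', format = 'BMP')
print(img.shape)             #Expect (120, 160, 3) for a 160x120 pixel 24-bit BMP.
print(img.min(), img.max())  #Gray scale values should span roughly 0 to 254 (luminance x 2).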


3. File Organization

All the image files are sorted according to the output class and stored in separate folders.  Figure 3 shows the directory tree for the training and test images.  Each image file is named with a number; for instance, in Figure 3 we see that in the Right sub-folder under the TrainImage folder we have the bitmap files 0.bmp, 1.bmp, 2.bmp and so on. Subsequently, in the Python code for the CNN model, we just need to concatenate the path and the filename to create a valid path to each image and import it into a 2D NumPy array.  Listing 2 is an example of how this is done in Python.

Figure 3 - Directory structure for storing the images.

Listing 2 is a Python example of how to load all the bitmap image files in a sub-folder into a multi-dimensional NumPy array.  Here it is assumed that all the bitmap images in the sub-folder are the same size and that the folder TrainImage is at the same level as the Python code on the hard disk.  Each image is first read into a temporary 2D NumPy array, then cropped to the required size (as determined by the constants _roi_startx, _roi_starty, _roi_width, _roi_height), and finally stored in the multi-dimensional NumPy array for the training images.

Listing 2 - Python codes to load a series of bitmap images file from computer hard disk to numpy array.
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os 


#Set the width and height of the input image in pixels.
_imgwidth = 160
_imgheight = 120

#The start point, width and height of the region-of-interest (ROI) of each image that will be subject to
#analysis by the CNN.
_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37


train_dir = os.path.join('./TrainImage/Right')  #Create a path to the sub-folder with object on right in training image directory
train_names = os.listdir(train_dir)   #Create a list containing the filenames of all image files in the sub-folder.

train_num_files = len(os.listdir(train_dir))  #Count the number of files in the sub-folder.

#Create an empty 3D array to hold the sequence of 2D image data and 1D array to hold the labels
train_images = np.empty([train_num_files,_roi_height,_roi_width])
train_labels = np.empty([train_num_files])

i = 0
for train_image_file in train_names:  #Training images, object on the right.
    #Read original BMP image
    image = plt.imread(train_dir+'/'+train_image_file,format = 'BMP') #Can also use os.path.join() to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array.
    train_labels[i] = 2  #Label value for object on the right.
    i = i+1
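Since the same read-crop-label loop is repeated for every class folder (see Listing 3 in the next section), it can optionally be wrapped in a small helper function.  The sketch below is one possible refactoring and not the code actually used in this project; it assumes the constants and pre-allocated arrays from Listings 2 and 3 are available.

def load_class_folder(folder, label, images, labels, start_index):
    #Load every bitmap in 'folder' into 'images' starting at 'start_index',
    #cropping to the region-of-interest and assigning 'label' to each sample.
    #Returns the next free index.
    i = start_index
    for filename in os.listdir(folder):
        image = plt.imread(folder + '/' + filename, format = 'BMP')
        imgori = image[0:_imgheight, 0:_imgwidth, 0]    #Keep one channel only.
        images[i] = imgori[_roi_starty:_roi_starty+_roi_height, _roi_startx:_roi_startx+_roi_width]
        labels[i] = label
        i = i + 1
    return i

#Example usage (train_images and train_labels must be pre-allocated as before):
#i = load_class_folder('./TrainImage/NoObject', 0, train_images, train_labels, 0)
#i = load_class_folder('./TrainImage/Left', 1, train_images, train_labels, i)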


4. The Complete CNN Architecture

To keep things simple I opted for the 4-layer structure illustrated in Figure 4 (it could also be interpreted as a 3-layer structure, depending on how we count the 2D max-pooling function).  At the moment I find this structure adequate for my needs (it is also possible to add another dense neural network layer).  This is about all I can fit into the processing bandwidth of the MVM's processor for 20 fps operation, i.e. a 50 ms interval per frame.  If the frame rate is reduced to 10 fps, we have a 100 ms interval to process each frame, and it becomes possible to add a second 2D convolution layer after the first.  The complete Python code, from loading the bitmap images and creating the CNN model with the TensorFlow Keras API up to training the model, is shown in Listing 3.


Figure 4 - The CNN structure adopted for this project.
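To make the sizes in Figure 4 concrete, the output dimensions of each layer can be worked out from the layer parameters in Listing 3 (3x3 kernel, stride 2, 'valid' padding, 2x2 max-pooling).  A small sanity-check calculation in Python:

#Layer output sizes for the 37x100 pixel gray scale input used here.
roi_h, roi_w, channels = 37, 100, 16

conv_h = (roi_h - 3)//2 + 1                 #(37-3)//2 + 1 = 18
conv_w = (roi_w - 3)//2 + 1                 #(100-3)//2 + 1 = 49
pool_h, pool_w = conv_h//2, conv_w//2       #9 x 24 after 2x2 max-pooling
flatten_nodes = pool_h*pool_w*channels      #9 x 24 x 16 = 3456 inputs to the 1st dense layer

print(conv_h, conv_w, pool_h, pool_w, flatten_nodes)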


Listing 3 - Python codes for instantiating and training the CNN model with TensorFlow Keras API.

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os

tf.keras.backend.clear_session()  # For easy reset of notebook state.

#Set the width and height of the input image in pixels.
_imgwidth = 160
_imgheight = 120

#Set the region-of-interest (ROI) start point and size.
#Note: The coordinate (0,0) starts at top left hand corner of the image frame.
_roi_startx = 30
_roi_starty = 71
_roi_width = 100
_roi_height = 37
_layer0_channel = 16  #Number of convolution kernel/filters.

_DNN1_node = 35      #Number of nodes for dense NN layer.
_DNN2_node = 5        #Number of output nodes.

train_dir = os.path.join('./TrainImage/NoObject') #Create a path to the folder for no object in training image directory
train_names = os.listdir(train_dir)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'no object': ", train_names)
print("")
train_num_files = len(os.listdir(train_dir))

train_dir2 = os.path.join('./TrainImage/Left') #Create a path to the folder with object on left in training image directory
train_names2 = os.listdir(train_dir2)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object on left': ", train_names2)
print("")
train_num_files2 = len(os.listdir(train_dir2))

train_dir3 = os.path.join('./TrainImage/Right') #Create a path to the folder with object on right in training image directory
train_names3 = os.listdir(train_dir3)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object on right': ", train_names3)
print("")
train_num_files3 = len(os.listdir(train_dir3))

train_dir4 = os.path.join('./TrainImage/Front') #Create a path to the folder with object in front in training image directory
train_names4 = os.listdir(train_dir4)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object in front': ", train_names4)
print("")
train_num_files4 = len(os.listdir(train_dir4))

train_dir5 = os.path.join('./TrainImage/Blocked') #Create a path to the folder with object blocking the front in training image directory
train_names5 = os.listdir(train_dir5)   #Create a list containing the filenames of all image files in the directory.
print("Training file names, 'With object in blocking the front': ", train_names5)
print("")
train_num_files5 = len(os.listdir(train_dir5))

#--- Load training images and attach label ---

#Create an empty 3D array to hold the sequence of 2D image data and
#1D array to hold the labels
train_images = np.empty([train_num_files + train_num_files2 + train_num_files3
                         + train_num_files4 + train_num_files5,_roi_height,_roi_width])
train_labels = np.empty([train_num_files + train_num_files2 + train_num_files3
                         + train_num_files4 + train_num_files5])


#Read BMP file, extract grayscale value, crop and fill into train_images
#Note: This can also be done using keras.image class, specifically the
#keras.image.load_image() and keras.image.img_to_array() methods.
i = 0
for train_image_file in train_names:  #Training images, no object.
    #Read original BMP image
    image = plt.imread(train_dir+'/'+train_image_file,format = 'BMP') #Can also use os.path.join() to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 0  #Label value for no object.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
   
for train_image_file in train_names2:  #Training images, with object on left.
    #Read original BMP image
    image = plt.imread(train_dir2+'/'+train_image_file,format = 'BMP') #Can also use os.path.join() to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 1  #Label value for object on left.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
   

for train_image_file in train_names3:  #Training images, with object on right.
    #Read original BMP image
    image = plt.imread(train_dir3+'/'+train_image_file,format = 'BMP') #Can also use os.path.join() to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 2  #Label value for object on right.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
    
for train_image_file in train_names4:  #Training images, with object in front.
    #Read original BMP image
    image = plt.imread(train_dir4+'/'+train_image_file,format = 'BMP') #Can also use os.path.join() to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 3  #Label value for object in front.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')

for train_image_file in train_names5:  #Training images, with object blocking the front.
    #Read original BMP image
    image = plt.imread(train_dir5+'/'+train_image_file,format = 'BMP') #Can also use os.path.join() to create the complete filename.
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    train_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 5 samples)
    train_labels[i] = 4  #Label value for object blocking the front.
    i = i+1
    #Plot the training images
    #plt.figure(num=i) #1 image frame per figure.
    #plt.imshow(train_images[i-1],cmap='gray')
   
#--- Load test images and attach label ---
test_dir = os.path.join('./TestImage/NoObject')
test_names = os.listdir(test_dir)
print("Test file names, 'no object': ", test_names)
print("")
test_num_files = len(os.listdir(test_dir))

test_dir2 = os.path.join('./TestImage/Left')
test_names2 = os.listdir(test_dir2)
print("Test file names, 'with object on left': ", test_names2)
print("")
test_num_files2 = len(os.listdir(test_dir2))

test_dir3 = os.path.join('./TestImage/Right')
test_names3 = os.listdir(test_dir3)
print("Test file names, 'with object on right': ", test_names3)
print("")
test_num_files3 = len(os.listdir(test_dir3))

test_dir4 = os.path.join('./TestImage/Front')
test_names4 = os.listdir(test_dir4)
print("Test file names, 'with object in front': ", test_names4)
print("")
test_num_files4 = len(os.listdir(test_dir4))

test_dir5 = os.path.join('./TestImage/Blocked')
test_names5 = os.listdir(test_dir5)
print("Test file names, 'with object blocking the front': ", test_names5)
print("")
test_num_files5 = len(os.listdir(test_dir5))

#Read BMP file, extract grayscale value, crop and fill into train_images

#Create an empty 3D array to hold the sequence of 2D image data and labels

test_images = np.empty([test_num_files + test_num_files2 + test_num_files3 +
                        test_num_files4 + test_num_files5,_roi_height,_roi_width])
test_labels = np.empty([test_num_files + test_num_files2 + test_num_files3 +
                        test_num_files4 + test_num_files5])

i = 0
for test_image_file in test_names:  #Test images, no object.
    #Read original BMP image
    image = plt.imread(test_dir+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 0  #Label value for no object.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')
   
for test_image_file in test_names2:  #Test images, with object on left.
    #Read original BMP image
    image = plt.imread(test_dir2+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 1  #Label value for with object on left.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')

for test_image_file in test_names3:  #Test images, with object on right.
    #Read original BMP image
    image = plt.imread(test_dir3+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 2  #Label value for with object on right.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')

for test_image_file in test_names4:  #Test images, with object in front.
    #Read original BMP image
    image = plt.imread(test_dir4+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 3  #Label value for with object in front.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')
   
for test_image_file in test_names5:  #Test images, with object blocking the front.
    #Read original BMP image
    image = plt.imread(test_dir5+'/'+test_image_file,format = 'BMP')
    #Extract only 1 channel of the RGB data, assign to 2D array
    imgori = image[0:_imgheight,0:_imgwidth,0]
    #Crop the 2D array to only the region of interest
    test_images[i] = imgori[_roi_starty:_roi_starty+_roi_height,_roi_startx:_roi_startx+_roi_width]
    #Fill up the label array (here each class has 20 samples)
    test_labels[i] = 4  #Label value for object blocking the front.
    i = i+1   
    #Plot the test images
    #plt.figure(num=i) #1 frame per figure.
    #plt.imshow(test_images[i-1],cmap='gray')   

train_images = train_images/256.0 #Normalize the training image array, and
                                  #convert to floating point.
                                  #Alternatively we can use:
#train_images = train_images.astype('float32')/256                                     
train_images=train_images.reshape(train_num_files + train_num_files2 + train_num_files3 +
                                  train_num_files4 + train_num_files5, _roi_height, _roi_width, 1)

test_images = test_images/256.0 #Normalize the test image array.
test_images=test_images.reshape(test_num_files + test_num_files2 + test_num_files3 +
                                test_num_files4 + test_num_files5, _roi_height, _roi_width, 1)

# Model - CNN with single convolution layer and single max-pooling layer, 2 DNN layers.
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(_layer0_channel, (3,3), strides = 2, activation='relu', input_shape=(_roi_height, _roi_width, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(_DNN1_node, activation='relu'),
    tf.keras.layers.Dense(_DNN2_node, activation='softmax')
    ])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()


# Optional, generating a plot of the model. Requires pydot and graphviz to be
# installed. If the full path for the graphic file "CNN_model.png" is not given, it
# will be saved to the same folder as this source code.
 

tf.keras.utils.plot_model(model, 'CNN_model.png', show_shapes = True)
history = model.fit(train_images, train_labels, epochs=30)
model.evaluate(test_images, test_labels)

#Try classify a test image:
classifications = model.predict(test_images)
index1 = 3  #Index into the test samples.
print("Classify 1 test sample")
print(classifications[index1]) # This prints the values of the 5 output nodes for the test sample pointed to by index1.
print(test_labels[index1])     # This prints the true label of the test sample for comparison.
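Since the output layer uses SoftMax, the predicted class is simply the index of the largest output node.  A small optional addition (not in the original listing) to print it explicitly:

predicted_label = np.argmax(classifications[index1])  #Index of the largest of the 5 output nodes.
print("Predicted label: ", predicted_label)
print("True label: ", test_labels[index1])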


5. Exporting the CNN Weights

Once the neural network model is properly trained, we can use it to classify a test image as shown in the last section of Listing 3, which performs the forward propagation calculation, or inference, when the tensorflow.keras.Model.predict( ) method is called.  However, instead of using the computer to execute the neural network model, I wish to run the neural network on my own processor.  To do this, two pieces of information are needed:
  1. The configuration of the neural network, e.g. what each layer in the neural network contains, and how the layers are connected.
  2. A set of weight values (the "state of the model") for each layer, which specify the coefficient for each path from a node in the current layer to a node in the subsequent layer.
The first piece of information is easily obtained from the declaration of the CNN model. One way to get the second piece is to use the tensorflow.keras.Model.save_weights( ) method [2]. Calling this method saves the weights of all layers to a file, in HDF5 format or TensorFlow's native format, which can then be read by other software. Another, more direct approach is to invoke the get_weights( ) method of the tensorflow.keras.layers.Layer class, which outputs the weights of the layer in the form of NumPy arrays. Since each layer in the CNN model inherits from this class, the get_weights( ) method can be accessed directly from our CNN model as shown in Listing 4.  The return value of get_weights( ) is a list of NumPy arrays containing the weights or coefficients and the biases of each layer. In Listing 4 we also extract the size (e.g. the number of nodes) of each layer, so that the configuration of the CNN can be changed dynamically. All these values are then saved as a C/C++ header file in text format, so that they can be read into the C/C++ code of my micro-controller firmware.  The second part of Listing 4 achieves this by opening a text file and writing the elements of the NumPy arrays into it using for-loops.

Listing 4 - Python codes accessing the weights of all layers using Tensorflow Keras API and saving the weights into a C style header file.
# Get the weights of all layers.
wt = model.get_weights()
 

# The highest level index to wt points to the coefficients or weights of each layer.
wtConv2D1 = wt[0]        # Weights of 1st convolution layer.
wtConv2D1bias = wt[1]    # Bias of 1st convolution layer.
wtDNN1 = wt[2]          # Weights of 1st DNN layer.
wtDNN1bias = wt[3]      # Bias of 1st DNN layer.
wtDNN2 = wt[4]          # Weights of 2nd DNN layer.
wtDNN2bias = wt[5]          # Bias of 2nd DNN layer.

Conv2D1filter = wtConv2D1.shape[3]  # get no. of filters in 1st convolution layer.
Flattennode = wtDNN1.shape[0]         # get no. of nodes after flattening the convolutional layer.
DNN1node = wtDNN1.shape[1]          # get no. of nodes in 1st DNN layer.
DNN2node = wtDNN2.shape[1]          # get no. of nodes in 2nd DNN layer.


# Open a text file for writing.
f = open("C:\CNN.h","w+")     # Header file to store the coefficients.

# Set the parameters of the filter and other constants in the CNN.
f.write("#define  __ROI_STARTX  %d \n" % _roi_startx)
f.write("#define  __ROI_STARTY  %d \n" % _roi_starty)
f.write("#define  __ROI_WIDTH  %d \n" % _roi_width)
f.write("#define  __ROI_HEIGHT  %d \n" % _roi_height)
f.write("#define  __FILTER_SIZE  3 \n")
f.write("#define  __FILTER_STRIDE  2 \n")
f.write("#define  __LAYER0_CHANNEL  %d \n" % _layer0_channel)
f.write("#define  __LAYER0_X  %d \n" % ((_roi_width-3)/2 + 1))
f.write("#define  __LAYER0_Y  %d \n" % ((_roi_height-3)/2 + 1))
f.write("#define  __DNN1NODE %d \n" % _DNN1_node)
f.write("#define  __DNN2NODE %d" % _DNN2_node)
f.write("\n\n")

N = 3   # Filter size, 3x3

f.write("const  int  gnL1f[%d][%d][%d] = { \n" % (Conv2D1filter,N,N))  # Integer version
for nfilter in range(Conv2D1filter):
    f.write("{")
    for i in range(3):
        f.write("{")
        for j in range(3):
            f.write("%d" % (wtConv2D1[i,j,0,nfilter]*1000000))  # Scaled integer version.
            if j < (N-1):
                f.write(", ")       # Add a comma and space after every number, except last number.
        f.write("}")       
        if i < (N-1):
            f.write(", ")
    if nfilter < (Conv2D1filter - 1):
        f.write("}, \n")
    else:
        f.write("} \n")
f.write("}; \n\n")

# Bias for Conv2D1
f.write("const  int  gnL1fbias[%d] = {" % Conv2D1filter)  # Integer version
for nfilter in range(Conv2D1filter):
    f.write("%d" % (wtConv2D1bias[nfilter]*1000000))  # Scaled integer version
    if nfilter < (Conv2D1filter-1):
        f.write(", ")
f.write("}; \n\n")

# DNN layer 1
# Weights
f.write("const  int  gnDNN1w[%d][%d] = { \n" % (Flattennode,DNN1node))      # Integer version.
for i in range(Flattennode):
    f.write("{")
    for j in range(DNN1node):
        f.write("%d" % (wtDNN1[i,j]*1000000)) # Scaled integer version.
        if j < (DNN1node - 1):
            f.write(", ")           # Add a comma and space after every number, except last number.
    if i < (Flattennode - 1):
        f.write("},\n")             # Add a newline and '}' after every row.
    else:
        f.write("} \n")
f.write("}; \n\n")

# Bias
f.write("const int gnDNN1bias[%d] = {" % DNN1node)  # Scaled integer veresion.
for i in range(DNN1node):
    f.write("%d" % (wtDNN1bias[i]*1000000))  #Scaled to integer.
    if i < (DNN1node - 1):
        f.write(", ")
f.write("}; \n\n")

# DNN layer 2
# Weights
f.write("const  int  gnDNN2w[%d][%d] = { \n" % (DNN1node,DNN2node))  # Scaled integer veresion.
for i in range(DNN1node):
    f.write("{")
    for j in range(DNN2node):
        f.write("%d" % (wtDNN2[i,j]*1000000))  # Scaled integer veresion.
        if j < (DNN2node - 1):
            f.write(", ")           # Add a comma and space after every number, except last number.
    if i < (DNN1node - 1):
        f.write("},\n")             # Add a newline and '}' after every row.
    else:
        f.write("} \n")
f.write("}; \n\n")

# Bias
f.write("const int gnDNN2bias[%d] = {" % DNN2node)
for i in range(DNN2node):
    f.write("%d" % (wtDNN2bias[i]*1000000))  # Scaled to integer veresion.
    if i < (DNN2node - 1):
        f.write(", ")
f.write("}; \n\n")

f.close()



An example of the weights produced by wt = model.get_weights( ) and the resulting header file is shown in Figure 5.  Some salient points concerning the header file:
  • Notice that I convert the values of the weights and biases from 32-bit floating point to integers.  This is achieved by multiplying each floating point value by 1,000,000, preserving 6 decimal places of the original value.  The computation in the micro-controller will be performed using integer maths to increase the throughput (a short worked example follows this list).
  • All the weights and biases are declared with the const modifier; this lets the C/C++ compiler place the values in the non-volatile memory of the micro-controller, e.g. the Flash memory.  During the forward propagation calculation in the micro-controller, computation is performed layer-by-layer, so only the weight and bias values pertinent to the layer concerned are loaded from Flash memory into RAM, reducing the RAM demand on the micro-controller.
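As a small worked example of this scaled-integer scheme (the numbers are made up for illustration, not taken from an actual trained model): a weight of 0.123456 is exported as the integer 123456, i.e. scaled by 1,000,000.  One multiply-and-accumulate step in C then looks roughly like this, with the result still carrying the 1,000,000 scale factor:

// Illustrative values only.
int nWeightScaled = 123456;     // Represents 0.123456, scaled by 1000000.
int nLuminance = 100;           // Raw 7-bit luminance value, 0 to 127.
int nBiasScaled = -45000;       // Represents -0.045, scaled by 1000000.

// Multiply in integer maths, keep the intermediate result in 64-bit to be safe,
// then divide by 128 to normalize the pixel to 0...~1.0 as done during training.
long long lnAcc = (long long)nWeightScaled * nLuminance;    // 12345600
int nNodeValue = (int)(lnAcc / 128) + nBiasScaled;          // 96450 - 45000 = 51450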



Figure 5 - Comparing the weights generated by Tensorflow for the CNN model and the content of the C/C++ header file exported (The header file is named "CNN.h").


6. Performing Inference Operation in Micro-Controller

Once the header file is generated, we can incorporate it into the firmware source code for the micro-controller (MCU) in the machine vision module.  The integer weights or coefficients of the CNN will be stored in the Flash memory of the MCU due to the const modifier.  The firmware in the MCU first loads the required integer weights from Flash memory into RAM, and then uses for-loops to calculate the output of each 2D convolution filter or node.  The complete high-level flow for this system is illustrated in Figure 6.

Figure 6 - High level view of implementing the CNN routines in machine vision module.

There are many approaches to implementing the C-language neural network inference subroutine in the MCU of the machine vision module. The approach that I used is shown in Figure 7. The inference subroutine needs to complete its execution within 50 ms for a frame rate of 20 fps.  In the actual implementation there are other tasks that need to be executed periodically in parallel with the neural network inference subroutine, such as the tasks that compress and stream the pixels in the video buffer to an external display, the camera driver and the image pre-processing routines.  Hence, I break the flow into a series of smaller parts so that the inference routine does not hog the processor.  For example, in Figure 7, after calculating the output of each 2D convolution filter or node, control can be returned to the RTOS or scheduler so that other tasks can run.  All in all, from experiments, the other tasks require roughly 10 ms, or 20% of the MCU bandwidth, within one frame interval, leaving roughly 40 ms for the neural network inference subroutine.
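One possible way to structure such an interruptible inference routine is as a simple state machine that computes one slice of the work per call and then returns to the scheduler.  The sketch below is only an illustration of the idea; the task and helper function names are hypothetical and not the actual MVM firmware.

// Hypothetical incremental inference task: each call computes one convolution
// channel (or the dense layers), then returns so the RTOS/scheduler can run other tasks.
static int gnCurrentChannel = 0;
static int gnInferenceState = 0;    // 0 = convolution layer, 1 = dense layers, 2 = done.

void TaskCNNInference(void)
{
    switch (gnInferenceState)
    {
        case 0:                                         // Compute one 2D convolution channel.
            ComputeConv2DChannel(gnCurrentChannel);     // Hypothetical helper function.
            gnCurrentChannel++;
            if (gnCurrentChannel == __LAYER0_CHANNEL)   // All channels done, move on.
            {
                gnCurrentChannel = 0;
                gnInferenceState = 1;
            }
            break;
        case 1:                                         // Flatten + dense layers in one slice.
            ComputeDenseLayers();                       // Hypothetical helper function.
            gnInferenceState = 2;
            break;
        default:                                        // Result ready, wait for the next frame.
            break;
    }
}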


Figure 7 - Detailed flow of the C inference subroutine.

To speed things up, here are some of the key features used in the C-language neural network inference subroutine:
  • Weights and node values are stored as 32-bit signed integers in RAM.
  • All linear arithmetic operations use 32-bit signed integer operations.  The intermediate result of a multiply-and-accumulate operation should be stored in 64-bit signed integer format to prevent overflow; the final result is then normalized back to a 32-bit signed integer.
  • Some parts of the neural network inference subroutine are implemented at the assembly level, with the subroutine declared using the C inline directive.  For example in Listing 5, the function that calculates the 2D convolution filter output uses the __MLAD multiply-and-accumulate instruction [3] of the ARM Cortex-M7, which performs a multiply and sum operation in one instruction cycle.
  • Instead of using the SoftMax function at the output as in Figure 7, I simply search for the maximum value among the output nodes of the final layer, basically implementing a max( ) or argmax function, to get the classification result (see the short sketch after this list).  The SoftMax function is needed during training of the network because it is differentiable, allowing it to be used with the back-propagation algorithm.
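A minimal sketch of this max( )/argmax replacement (gnDNN2out[] is a hypothetical array holding the values of the __DNN2NODE output nodes after the last dense layer):

// Find the index of the largest output node. Since SoftMax is monotonic, skipping it
// does not change which class wins.
int nMaxIndex = 0;
int nIndex;

for (nIndex = 1; nIndex < __DNN2NODE; nIndex++)
{
    if (gnDNN2out[nIndex] > gnDNN2out[nMaxIndex])
    {
        nMaxIndex = nIndex;
    }
}
// nMaxIndex now holds the predicted class label (0 to 4).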

Listing 5 - C code implementing the 2D convolution operation and activation function on the ARM Cortex-M7.

// Function to compute the 3x3 convolution operation on a small region
// of the image buffer.
// ni, nj = (x,y) coordinate of start pixel in 3x3 image patch.
// nFilA = Address of array containing the 9 coefficients of the 2D convolution kernel or filter.
// nBias = Bias value.
 

__INLINE int    nConv2D(int ni, int nj, int * nFilA, int nBias)
{
    int    nLuminance[9];
    int    nTemp;
  
    if (gnValidFrameBuffer == 1)                    // Check frame buffer data valid flag.  If it equals 1, gunImgAtt2[] holds the valid frame data, otherwise use gunImgAtt[].
    {
        nLuminance[0] = gunImgAtt2[ni][nj] & _LUMINANCE_MASK; // Extract the 7-bit luminance value.
        nLuminance[1] = gunImgAtt2[ni+1][nj] & _LUMINANCE_MASK;
        nLuminance[2] = gunImgAtt2[ni+2][nj] & _LUMINANCE_MASK;
        nLuminance[3] = gunImgAtt2[ni][nj+1] & _LUMINANCE_MASK;
        nLuminance[4] = gunImgAtt2[ni+1][nj+1] & _LUMINANCE_MASK;
        nLuminance[5] = gunImgAtt2[ni+2][nj+1] & _LUMINANCE_MASK;
        nLuminance[6] = gunImgAtt2[ni][nj+2] & _LUMINANCE_MASK;
        nLuminance[7] = gunImgAtt2[ni+1][nj+2] & _LUMINANCE_MASK;
        nLuminance[8] = gunImgAtt2[ni+2][nj+2] & _LUMINANCE_MASK;
    }
    else
    {
        nLuminance[0] = gunImgAtt[ni][nj] & _LUMINANCE_MASK; // Extract the 7-bit luminance value.
        nLuminance[1] = gunImgAtt[ni+1][nj] & _LUMINANCE_MASK;
        nLuminance[2] = gunImgAtt[ni+2][nj] & _LUMINANCE_MASK;
        nLuminance[3] = gunImgAtt[ni][nj+1] & _LUMINANCE_MASK;
        nLuminance[4] = gunImgAtt[ni+1][nj+1] & _LUMINANCE_MASK;
        nLuminance[5] = gunImgAtt[ni+2][nj+1] & _LUMINANCE_MASK;
        nLuminance[6] = gunImgAtt[ni][nj+2] & _LUMINANCE_MASK;
        nLuminance[7] = gunImgAtt[ni+1][nj+2] & _LUMINANCE_MASK;
        nLuminance[8] = gunImgAtt[ni+2][nj+2] & _LUMINANCE_MASK;
    }
      
    // Convolution or cross-correlation operation with 3x3 filter. We try to avoid using for-loop to speed up the computation.
    // Note: 24 April 2020, I have tried a few approaches, using C codes without for-loop. Verified that this method is the
    // fastest, from a few tens of microseconds to a few microseconds! This method forces the compiler to use the 32-bits signed
    // integer multiply and accumulate assembly instruction of the Cortex M7 core, making it the most efficient.
              
    nTemp = (*nFilA)*(*nLuminance);                 // Correlation operation.
    nTemp = __MLAD(*(nFilA+1),*(nLuminance+1),nTemp);    // Using assembly multiply and accumulate instruction.
    nTemp = __MLAD(*(nFilA+2),*(nLuminance+2),nTemp);
    nTemp = __MLAD(*(nFilA+3),*(nLuminance+3),nTemp);
    nTemp = __MLAD(*(nFilA+4),*(nLuminance+4),nTemp);
    nTemp = __MLAD(*(nFilA+5),*(nLuminance+5),nTemp);
    nTemp = __MLAD(*(nFilA+6),*(nLuminance+6),nTemp);
    nTemp = __MLAD(*(nFilA+7),*(nLuminance+7),nTemp);
    nTemp = __MLAD(*(nFilA+8),*(nLuminance+8),nTemp);
    nTemp = (nTemp/128) + nBias;                    // Add the bias term and normalize by 128. The luminance value
                                                    // ranges from 0 to 127; during training of the CNN we normalize
                                                    // it by 128 to make it between 0 and 1.0, so in inference we
                                                    // should do the same.
  
    // ReLu activation function
    if (nTemp < 0)
    {
        nTemp = 0;
    }  
    return nTemp;
}
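As a usage illustration, the routine in Listing 5 would be called once for every stride-2 position in the region-of-interest to fill one output channel of the convolution layer.  The output buffer gnLayer0[] and the loop below are hypothetical, not taken from the actual firmware; the constants come from the exported CNN.h header.

// Sweep the 3x3 kernel of channel nChannel across the ROI with a stride of 2.
int nx, ny, nChannel = 0;

for (ny = 0; ny < __LAYER0_Y; ny++)
{
    for (nx = 0; nx < __LAYER0_X; nx++)
    {
        gnLayer0[nChannel][ny][nx] = nConv2D(__ROI_STARTX + 2*nx,
                                             __ROI_STARTY + 2*ny,
                                             (int *)gnL1f[nChannel],
                                             gnL1fbias[nChannel]);
    }
}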


7. Conclusion and Sample Files

The system is pretty versatile.  As of June 2020, I have collected around 200 images for the 5 classes and can achieve an accuracy of between 85% and 92% in actual usage.  I have also reduced the output classes to two and simply used the system to detect whether or not an obstacle is present in front of the machine vision module.  On another interesting note, with the two-output-class source code I have used this system to check for the presence or absence of a human face by just replacing the training images.  Unfortunately, due to the low resolution of the camera and the shallow neural network architecture, the system is not able to differentiate between human faces, merely detecting their presence or absence.  The Python code and sample training and test images can be obtained from the MVM V1.5C GitHub project repository here.

References

1. U. Hiwarale, "Bits to bitmaps: A simple walkthrough of bitmap image format", 2019. https://itnext.io/bits-to-bitmaps-a-simple-walkthrough-of-bmp-image-format-765dc6857393
2. TensorFlow online documentation, March 2020 version. https://www.tensorflow.org/guide
3. ARM Cortex-M7 devices generic user guide, 2015. https://developer.arm.com/documentation/dui0646/b/