Fetch Part 2: Training the Model

In this portion of the tutorial, you will:

Install TensorFlow.
Download pre-trained model weights (for transfer learning).
Train a dog toy finding model.
Visualize our model's performance live on Spot.

We'll follow this tutorial pretty closely, with slight modifications for data from Spot. We're following this example to demonstrate that most out-of-the-box models are easy to use with Spot. We have included the key Spot-specific tips you'll need along the way.

Installing TensorFlow, CUDA, and cuDNN

There are many, many excellent guides on how to do this. We will cover the basic steps here, but details for different systems are best found elsewhere.

Install NVIDIA drivers, CUDA, and cuDNN

We will not cover system-specific NVIDIA driver, CUDA, or cuDNN installation here. Use the links above for installation guides.
You must install CUDA version 10.1 and cuDNN version 7.6.x

CUDA and cuDNN versions other than this will not work

Ensure your NVIDIA driver is >= 418.39 to be compatible with CUDA 10.1, as specified in NVIDIA's compatibility tables here

Install TensorFlow

Enter your Spot API virtualenv

Replace my_spot_env with the name of the virtualenv that you created in the [Spot Quickstart Guide](../quickstart.md):

source my_spot_env/bin/activate

If that worked, your prompt should now have (my_spot_env) at the front.

Install TensorFlow

python3 -m pip install --upgrade pip
python3 -m pip install tensorflow-gpu==2.3.1 tensorflow==2.3.1 tensorboard==2.3.0 tf-models-official==2.3.0 pycocotools lvis
python3 -m pip uninstall opencv-python-headless

Tensorflow likes to install a non-GUI version of OpenCV, which will cause us problems later
We can safely uninstall it because we already installed OpenCV.

Test TensorFlow and CUDA installation

(my_spot_env) $ python3
>>> import tensorflow as tf
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Num GPUs Available:  1

If this doesn't work for you, you'll need to figure out what is wrong with your installation. Common issues are:

NVIDIA drivers not working.
Wrong CUDA version installed (we are using 10.1).
Didn't install cuDNN or installed wrong cuDNN version.

Install TensorFlow Object Detection API

To make things easy, we're going to use a specific revision of the tensorflow models. More advanced users can download and compile protobufs themselves.

Download our package with the precompiled files and save it in the fetch folder
Unzip it:

unzip models-with-protos.zip

Install the object detection API:

cd models-with-protos/research
python3 -m pip install .

How did you make this package? »

Checked out the models repository at revision 9815ea67e2122dfd3eb2003716add29987e7daa1
Compiled protobufs with:

cd models/research
protoc object_detection/protos/*.proto --python_out=.

Copied and installed the object detection tf setup file:

cp object_detection/packages/tf2/setup.py .
python3 -m pip install .

Prepare Training Data

We'll now split our data into a training set and a test set, and then convert our XML labels into a format that TensorFlow accepts.

Partition the Data

We want to hold back some of our data from the training set so that we can test to ensure our model isn't grossly over-fitting the data.

The goal is to organize our data as follows:

dogtoy/
├── images
│   ├── left_fisheye_image_0000.jpg
│   ├── left_fisheye_image_0001.jpg
│   └── ...
└── annotations
    ├── test
    │   ├── right_fisheye_image_0027.xml
    │   └── ...
    └── train
        ├── right_fisheye_image_0009.xml
        └── ...

You could do this manually, but we'll use a script.

Download the script and put it in the ~/fetch folder.
Run the script:

cd ~/fetch
python3 split_dataset.py --labels-dir dogtoy/annotations/ --output-dir dogtoy/annotations/ --ratio 0.9

This copies your XML label files into the train and test folders, splitting them up randomly. The script copies just to be extra safe and not delete your painstakingly labeled files!

Create a Label Map

Create a file called label_map.pbtxt and put it in dogtoy/annotations with the following contents:

item {
    id: 1
    name: 'dogtoy'
}

Convert Labels to `.record` Format

TensorFlow takes a different format than labelImg produces. We'll convert using this script.

Download the script and place it in the fetch directory.
Run it:

python3 generate_tfrecord.py --xml_dir dogtoy/annotations/train --image_dir dogtoy/images --labels_path dogtoy/annotations/label_map.pbtxt --output_path dogtoy/annotations/train.record

Run it again for the test set:

python3 generate_tfrecord.py --xml_dir dogtoy/annotations/test --image_dir dogtoy/images --labels_path dogtoy/annotations/label_map.pbtxt --output_path dogtoy/annotations/test.record

Download Model for Transfer Learning

We don't want to train a model from scratch since that would take a long time and require huge amounts of input data. Instead, we'll get a model that is trained to detect lots of things and guide it to detect our dog-toy specifically.

Make a pre-trained-models folder:

mkdir dogtoy/pre-trained-models

We'll use the SSD ResNet50 V1 FPN 640x640 model. Download it into the ~/fetch/dogtoy/pre-trained-models folder.

The TensorFlow Model Zoo has lots of other models you could use.

Extract it:

cd dogtoy/pre-trained-models
tar -zxvf ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz

Train the Model

We're almost ready for training!

Set up model parameters

Make a folder for our new model:

cd ~/fetch/dogtoy
mkdir -p models/my_ssd_resnet50_v1_fpn

Copy the pre-trained model parameters:

cp pre-trained-models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/pipeline.config models/my_ssd_resnet50_v1_fpn/

Open models/my_ssd_resnet50_v1_fpn/pipeline.config and change:

num_classes to 1

batch_size to 4

fine_tune_checkpoint to "dogtoy/pre-trained-models/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0"

fine_tune_checkpoint_type to "detection"

Under train_input_reader:
    label_map_path: "dogtoy/annotations/label_map.pbtxt"
    input_path: "dogtoy/annotations/train.record"

Under eval_input_reader:
    label_map_path: "dogtoy/annotations/label_map.pbtxt"
    input_path: "dogtoy/annotations/test.record"

To help keep you on track, here is a checklist:

Train the Model

Copy the training script into a more convenient location:

cd ~/fetch
cp models-with-protos/research/object_detection/model_main_tf2.py .

Start training:

python3 model_main_tf2.py --model_dir=dogtoy/models/my_ssd_resnet50_v1_fpn --pipeline_config_path=dogtoy/models/my_ssd_resnet50_v1_fpn/pipeline.config --num_train_steps=20000

If all goes well after a while, you'll start seeing output like this:

INFO:tensorflow:Step 100 per-step time 0.340s loss=6.323
INFO:tensorflow:Step 200 per-step time 0.352s loss=6.028
INFO:tensorflow:Step 300 per-step time 0.384s loss=5.854

If it doesn't go well, see the troubleshooting section below.
We can monitor things using nvidia-smi. In a new terminal:

watch -n 0.5 nvidia-smi

Finally, in a new terminal, we can use tensorboard to see outputs:

source my_spot_env/bin/activate
cd ~/fetch
tensorboard --logdir=dogtoy/models --bind_all

and point your browser to: http://localhost:6006

Evaluation During Training

During training, we can evaluate our network’s current state against our test set.

In a new terminal:

source my_spot_env/bin/activate
cd ~/fetch
CUDA_VISIBLE_DEVICES="-1" python3 model_main_tf2.py --model_dir=dogtoy/models/my_ssd_resnet50_v1_fpn --pipeline_config_path=dogtoy/models/my_ssd_resnet50_v1_fpn/pipeline.config --checkpoint_dir=dogtoy/models/my_ssd_resnet50_v1_fpn

Let's break down what this is doing.

CUDA_VISIBLE_DEVICES="-1"

Set an environment variable that tells CUDA that no GPUs are available.

We do this because the training is using all of our GPU memory.
This forces the evaluation to run on the CPU, where we still have memory available.
CPU will be slow, but that's okay since it will only evaluate the model once every 5 minutes or so.

model_main_tf2.py --model_dir=dogtoy/models/my_ssd_resnet50_v1_fpn --pipeline_config_path=dogtoy/models/my_ssd_resnet50_v1_fpn/pipeline.config

Same as above.

--checkpoint_dir=dogtoy/models/my_ssd_resnet50_v1_fpn

Tell the script to run in evaluation mode and point it to our training output.

Once this is running, you'll see new charts in TensorBoard that show results from evaluating on our training set, once per 1,000 steps of training.

These results are more trustworthy because they are from images the network has not seen during training.

What do these numbers mean? »

mAP (mean average-precision) is a metric that describes (a) how much the model's predicted bounding boxes overlap with your labeled bounding boxes and (b) how often the class labels are wrong.
Take a look at the section How do I measure the accuracy of a deep learning object detector? of this post for a good explanation.

Why do I care? »

You can use these graphs as a way to tell if your model is done training.
Feel free to ^C training if it looks like you're done. Checkpoints are automatically saved.

TensorBoard provides an Images tab that will graphically show you your model's results. Take a look.

Troubleshooting

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

You may be out of GPU memory. Often in that case, you'll see Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Run with the environment variable: TF_FORCE_GPU_ALLOW_GROWTH="true"

You can set the environment variable right before the training line:

TF_FORCE_GPU_ALLOW_GROWTH="true" python3 model_main_tf2.py --model_dir=dogtoy/models/my_ssd_resnet50_v1_fpn --pipeline_config_path=dogtoy/models/my_ssd_resnet50_v1_fpn/pipeline.config --num_train_steps=20000

Try closing items shown in nvidia-smi to free GPU memory.
See the tensorflow documentation for other options.

A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used

We are failing to load the pre-trained model. Check that fine_tune_checkpoint_type is set to "detection"

Last message was: Use tf.cast instead and then nothing happens.

Double check that your train.record file isn't empty:

ls -lah ~/fetch/dogtoy/annotations/train.record


-rw-r--r-- 1 user users 9.3M Mar  2 14:17 /home/user/fetch/dogtoy/annotations/train.record
                        ^^^^<---- not zero!

Once training is complete, head to Part 3 to take the latest checkpoint, convert it to an online model, and visualize the results.

Head over to Part 3: Evaluate the Model >>

<< Previous Page | Next Page >>

Fetch Part 2: Training the Model

Installing TensorFlow, CUDA, and cuDNN

Install NVIDIA drivers, CUDA, and cuDNN

Install TensorFlow

Enter your Spot API virtualenv

Install TensorFlow

Test TensorFlow and CUDA installation

Install TensorFlow Object Detection API

Prepare Training Data

Partition the Data

Create a Label Map

Convert Labels to .record Format

Download Model for Transfer Learning

Train the Model

Set up model parameters

Train the Model

Evaluation During Training

Troubleshooting

Head over to Part 3: Evaluate the Model >>

Convert Labels to `.record` Format