Explainable image classification using Faster R-CNN and Grad-CAM

Neural networks are black-box models, so the predictions they generate are hard to interpret. Most deep learning models are built on neural networks, which makes their inner workings a black box as well. Various techniques exist to make the results of such models more interpretable. Grad-CAM is one such algorithm: applied to CNN models, it makes computer-vision predictions explainable. In this article, we will discuss how to apply Grad-CAM methods to Faster R-CNN in PyTorch and make its predictions explainable. The major points to be discussed in this article are listed below.

Table of contents

What is Grad-CAM?
Explaining image classification using Grad-CAM
Applying Grad-CAM
Final words

Let’s begin by understanding the Grad-CAM algorithm.

What is Grad-CAM?

In one of our earlier articles, we discussed the Grad-CAM algorithm, which makes many computer vision models explainable. It is a way to make CNN models interpretable. The name Grad-CAM expands to Gradient-weighted Class Activation Mapping. To make newly built or pre-trained CNNs interpretable, Grad-CAM overlays a heat map on the image, and this heat map shows which pixels of the image the model relied on to classify the objects in it.
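The core computation can be sketched in a few lines of NumPy. This is only a minimal illustration of the idea, not the library's implementation: activations and gradients are hypothetical arrays standing in for a convolutional layer's feature maps and for the gradient of a class score with respect to them, and the toy values are made up.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM sketch for one image.

    activations, gradients: arrays of shape (channels, height, width).
    Returns a heat map of shape (height, width) scaled to [0, 1].
    """
    # One weight per channel: the spatial average of that channel's gradient.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
    # Scale to [0, 1] so it can be rendered as a heat map.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy input (made-up values): 2 channels on a 2x2 grid.
activations = np.array([[[1.0, 0.0], [0.0, 2.0]],
                        [[0.0, 3.0], [1.0, 0.0]]])
gradients = np.array([[[1.0, 1.0], [1.0, 1.0]],       # channel 0 supports the class
                      [[-1.0, -1.0], [-1.0, -1.0]]])  # channel 1 works against it
heatmap = grad_cam(activations, gradients)
```

In this toy case the heat map is brightest where channel 0 is active and zero where only channel 1 fires, which is the behaviour the ReLU step is meant to produce.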

In this article, we are going to discuss how we can apply Grad-CAM methods to any pre-trained model using the PyTorch library. An implementation of Grad-CAM for PyTorch is available as the grad-cam package, which is imported as pytorch_grad_cam.

This implementation includes various pixel attribution methods and works with classification, object detection, and semantic segmentation. It can be used with many CNN architectures and with vision transformers (ViT). It also ships helper modules for defining the objects that the Grad-CAM methods require. In this article, we will discuss how to apply Grad-CAM to object detection with a Faster R-CNN model.

We can install this implementation in our environment using the following command:

!pip install grad-cam

After installation, we are ready to work with Grad-CAM methods to make the CNN models interpretable.

Explaining image classification using Grad-CAM

Let’s start the process by importing the libraries that we need.

import cv2
import numpy as np
import torch
import torchvision

Prediction function

The function below runs the model on the input tensor and returns the bounding boxes, class names, labels, and indices of the detections whose score reaches the detection threshold.


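def predict(input_tensor, model, device, detection_threshold):
    outputs = model(input_tensor)
    # Convert the raw model outputs for the first image to NumPy
    pred_classes = [coco_names[i] for i in outputs[0]['labels'].cpu().numpy()]
    pred_labels = outputs[0]['labels'].cpu().numpy()
    pred_scores = outputs[0]['scores'].detach().cpu().numpy()
    pred_bboxes = outputs[0]['boxes'].detach().cpu().numpy()

    boxes, classes, labels, indices = [], [], [], []
    # Keep only the detections that reach the score threshold
    for index in range(len(pred_scores)):
        if pred_scores[index] >= detection_threshold:
            boxes.append(pred_bboxes[index].astype(np.int32))
            classes.append(pred_classes[index])
            # labels and indices are collected the same way so that
            # all four lists can be returned to the caller
            labels.append(pred_labels[index])
            indices.append(index)
    return boxes, classes, labels, indices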
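To see what the thresholding does without loading a model, here is a small sketch on a hand-made detection dictionary. The values are invented for illustration; only the keys boxes, labels, and scores mirror what the function above reads from the model output, and filter_detections is a hypothetical helper that does the same filtering with a boolean mask.

```python
import numpy as np

# Made-up stand-in for outputs[0] of a detection model,
# already converted to NumPy arrays.
fake_output = {
    "boxes": np.array([[10.2, 20.7, 110.0, 220.9],
                       [30.0, 40.0, 60.0, 80.0],
                       [5.0, 5.0, 50.0, 50.0]]),
    "labels": np.array([17, 18, 1]),   # 'cat', 'dog', 'person' in coco_names
    "scores": np.array([0.98, 0.95, 0.40]),
}

def filter_detections(output, detection_threshold):
    """Keep only detections whose score reaches the threshold."""
    keep = output["scores"] >= detection_threshold
    # Casting to int32 truncates the box coordinates to whole pixels.
    boxes = output["boxes"][keep].astype(np.int32)
    labels = output["labels"][keep]
    indices = np.flatnonzero(keep)
    return boxes, labels, indices

boxes, labels, indices = filter_detections(fake_output, 0.9)
```

With a threshold of 0.9, the low-confidence third detection is dropped and the two confident ones are kept with integer pixel coordinates.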

Drawing box

The function below draws a labelled bounding box on the image for each prediction that our model makes.

def draw_boxes(boxes, labels, classes, image):
    for i, box in enumerate(boxes):
        color = COLORS[labels[i]]
        cv2.rectangle(
            image,
            (int(box[0]), int(box[1])),
            (int(box[2]), int(box[3])),
            color, 2
        )
        cv2.putText(image, classes[i], (int(box[0]), int(box[3] - 5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2,
                    lineType=cv2.LINE_AA)
    return image

Defining class names

coco_names = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
    'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
    'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella',
    'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard',
    'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork',
    'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
    'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
    'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet',
    'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave',
    'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase',
    'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

Defining box colours

COLORS = np.random.uniform(0, 255, size=(len(coco_names), 3))
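Because COLORS is drawn at random, the box colours change on every run. If stable colours are preferred, one option is to draw them from a seeded generator. This is a suggestion on our side rather than part of the original code; the name STABLE_COLORS and the seed value are arbitrary choices.

```python
import numpy as np

num_classes = 91  # number of entries in coco_names above
rng = np.random.default_rng(seed=42)  # arbitrary seed, any fixed value works
# One RGB triple per class, identical on every run with the same seed.
STABLE_COLORS = rng.uniform(0, 255, size=(num_classes, 3))
```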

With all of the above set up, we are ready to make predictions with the Faster R-CNN model.

Importing image

We will use the image below and make the model predict the classes of the objects in it.

from PIL import Image

image = np.array(Image.open("/content/download (2).jfif"))
Image.fromarray(image)


Here we can see that the image contains a cat and a dog. Next, we need to prepare the image and the model for making predictions.

Using the lines of code below, we can define the transform for the image.

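import torchvision

# Float copy of the image in [0, 1], used later for the heat-map overlay
image_float_np = np.float32(image) / 255
# Transform that converts the image array to a tensor
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
])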

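In the above, we have created a float copy of the image scaled to the range [0, 1] and defined a transform that converts the image to a tensor. Together with the prediction function defined earlier, which returns the boxes, classes, labels, and indices, this completes the setup.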
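For a uint8 image array of shape (height, width, channels), ToTensor moves the channel axis to the front and rescales the values to [0, 1]. The NumPy sketch below mimics that behaviour on a tiny made-up image. It is only an illustration and does not call torchvision; to_chw_float is a hypothetical helper name.

```python
import numpy as np

def to_chw_float(image_uint8):
    """Mimic ToTensor on a uint8 HWC array: CHW layout, float32 in [0, 1]."""
    chw = np.transpose(image_uint8, (2, 0, 1))
    return chw.astype(np.float32) / 255.0

# Made-up 2x3 RGB image in which every pixel is (255, 128, 0).
tiny_image = np.zeros((2, 3, 3), dtype=np.uint8)
tiny_image[..., 0] = 255
tiny_image[..., 1] = 128

tensor_like = to_chw_float(tiny_image)
# Adding a leading batch axis plays the same role as unsqueeze(0) on a tensor.
batched = tensor_like[np.newaxis, ...]
```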

Modelling and predicting

We are going to load a pre-trained Faster R-CNN model from torchvision to predict which classes are present in the image, along with their bounding boxes.

input_tensor = transform(image)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
input_tensor = input_tensor.to(device)
# Add a batch dimension, since the model expects a batch of images
input_tensor = input_tensor.unsqueeze(0)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# Switch to inference mode and move the model to the chosen device
model.eval().to(device)

Let’s visualize the results.

boxes, classes, labels, indices = predict(input_tensor, model, device, 0.9)
image1 = draw_boxes(boxes, labels, classes, image)
Image.fromarray(image)


Here we can see that the model gives good results: the predicted classes are correct.

The results in the output above are good, but what happened behind the scenes is unknown; the whole process is a black box. During development, a model may perform well on some samples and poorly on others, which can cause a large loss of accuracy. To improve the model in such cases, we need better interpretability.

Applying Grad-CAM

For CNN and ViT models, Grad-CAM comes to the rescue. Using the methods in the Grad-CAM package, we can check which pixels are responsible for a model's predictions. In this article, we are going to use one of these methods, named EigenCAM. We can apply it with the following lines of code.

from pytorch_grad_cam import AblationCAM, EigenCAM
from pytorch_grad_cam.ablation_layer import AblationLayerFasterRCNN
from pytorch_grad_cam.utils.model_targets import FasterRCNNBoxScoreTarget
from pytorch_grad_cam.utils.reshape_transforms import fasterrcnn_reshape_transform
from pytorch_grad_cam.utils.image import show_cam_on_image, scale_accross_batch_and_channels, scale_cam_image

# Compute the CAM on the feature maps of the backbone
target_layers = [model.backbone]
targets = [FasterRCNNBoxScoreTarget(labels=labels, bounding_boxes=boxes)]
# use_cuda matches the grad-cam release used here; if your installed
# release rejects the argument, drop it
cam = EigenCAM(model,
               target_layers,
               use_cuda=torch.cuda.is_available(),
               reshape_transform=fasterrcnn_reshape_transform)

grayscale_cam = cam(input_tensor, targets=targets)
# Take the heat map of the first (and only) image in the batch
grayscale_cam = grayscale_cam[0, :]
cam_image = show_cam_on_image(image_float_np, grayscale_cam, use_rgb=True)
# Draw the detected boxes on top of the heat-map image
image_with_bounding_boxes = draw_boxes(boxes, labels, classes, cam_image)
Image.fromarray(image_with_bounding_boxes)


In the output above, we can see that the model relied mainly on the pixels of the faces to make its predictions: the intensity of the colours in the heat map indicates which pixels contributed most.
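Unlike the gradient-based CAM methods, EigenCAM needs no gradients: it takes the first principal component of the layer's activations as the heat map. The sketch below is a minimal NumPy illustration of that idea on random stand-in activations. It is not the library's implementation, and eigen_cam and fake_activations are hypothetical names.

```python
import numpy as np

def eigen_cam(activations):
    """Project (channels, height, width) activations onto their first principal component."""
    channels, height, width = activations.shape
    # One row per spatial location, one column per channel.
    flat = activations.reshape(channels, -1).T
    # Centre each channel before the SVD.
    flat = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    # Coordinate of every location along the first principal direction.
    # The sign of a singular vector is arbitrary, so this is only a sketch.
    projection = (flat @ vt[0]).reshape(height, width)
    # Keep the positive part and scale to [0, 1] for display.
    projection = np.maximum(projection, 0)
    if projection.max() > 0:
        projection = projection / projection.max()
    return projection

rng = np.random.default_rng(0)
fake_activations = rng.normal(size=(8, 4, 5))  # made-up feature maps
heatmap = eigen_cam(fake_activations)
```

Since no class score enters the computation, a map produced this way highlights whatever dominates the activations rather than one specific class.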

Also note that in the code above we applied Grad-CAM to model.backbone. Since the backbone computes the meaningful activations, it is the suggested part of the Faster R-CNN model on which to apply Grad-CAM methods.

Final words

In this article, we introduced the Grad-CAM method, which makes CNNs and CNN-like models interpretable by showing which parts of the input drive their predictions. We also discussed how to apply it to Faster R-CNN using the PyTorch library.