Implementing Object Detectors: Input Pipeline


I am currently in the process of implementing some of the most cutting-edge object detection algorithms, and I am going to blog about my experience implementing them.

This is a great opportunity not only to learn how deep learning algorithms are implemented, but also to see what it takes to get something research-based like this production ready.

We are going to talk about our input pipeline today: more specifically, how we load our dataset in PyTorch and generate targets for our machine learning model.

Loading the Dataset

We are going to use the COCO dataset to train the object detection model.

The implementation points are:

  1. Load images and bounding boxes.
  2. Get images in batches.
  3. Generate Anchors for each Image

PyTorch provides an easy API to implement all of these.

Prerequisites

Install these libraries:

pytorch
pycocotools
torchvision
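
One way to install them, assuming a pip-based setup (note that the PyTorch pip package is called torch; conda users should follow the official install instructions instead):

pip install torch torchvision
pip install pycocotools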

Loading Image ids from COCO

The COCO team provides a Python API called pycocotools. It exposes an object called COCO through which the dataset is accessed.

from pycocotools.coco import COCO

# path to the COCO annotations file
annFile='/annotations/instances_val_2014.json'

# Loading the dataset
coco=COCO(annFile)

The following function gets the image IDs. It returns either the whole dataset or only the first n ids, where n is given by the sample argument.

def get_image_ids(sample=None):
    # get all category ids
    catIds = coco.getCatIds()
    imgIds = []
    # appending image ids of each category
    for id in catIds:
        imgIds += coco.getImgIds(catIds=id)

    # an image can contain objects from several categories,
    # so drop duplicate ids
    imgIds = list(set(imgIds))

    if sample is None:
        # Returning the entire dataset
        return imgIds
    else:
        # Returning a subset of the dataset
        return imgIds[:sample]
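
For example, pulling a small subset of ids for quick experiments looks like this:

# all image ids in the dataset
all_ids = get_image_ids()

# only the first 100 image ids
some_ids = get_image_ids(sample=100)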

Once we have our image ids, we need to load the images and bounding box coordinates.

Note: in COCO, one image can have multiple bounding boxes.

Loading Image and Bounding Boxes from COCO

To load the image metadata for a specific id, we do the following.

def loadImageNode(id):
    # loadImgs takes an image id (or a list of ids) and returns the matching image nodes
    image_node = coco.loadImgs(id)[0]

    return image_node

The image node will have the following format:

{
    "date_captured": "2013-11-20 05:50:03", 
    "height": 512,
    "width": 640,
    "license": 1,
    "file_name": "COCO_val2014_000000262148.jpg",
    "coco_url": "http://images.cocodataset.org/val2014/COCO_val2014_000000262148.jpg", 
    "id": 262148, 
    "flickr_url": "http://farm5.staticflickr.com/4028/4549977479_547e6b22ae_z.jpg"
}
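
The file_name (or coco_url) field is what points at the actual pixels. Below is a minimal sketch of loading the image with PIL; the images/val2014/ directory is an assumption, adjust it to wherever you downloaded the val2014 images:

from PIL import Image

# hypothetical local folder containing the downloaded val2014 images
IMAGE_DIR = 'images/val2014/'

def load_image(image_node):
    # open the file referenced by the image node and force 3 channels
    return Image.open(IMAGE_DIR + image_node['file_name']).convert('RGB')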

Now, given the image id, we get its bounding boxes with the following code.


def getBoundingBoxCoords(id):
    annIds = coco.getAnnIds(imgIds=id, iscrowd=None)
    anns = coco.loadAnns(annIds)
    coordsList = []
    for a in anns:
        coordsList.append(a['bbox'])
    
    return coordsList

The annotations come back in the following form. We extract the bbox array from each annotation and return the list of bounding box coordinates.

[
    {
        "bbox": [132.99, 21.03, 315.59, 110.4], 
        "segmentation": "[<segmentation-mask>]",
        "iscrowd": 0, 
        "area": 15019.191850000001, 
        "id": 134709, 
        "category_id": 3, 
        "image_id": 262161
    }
]
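
Note that COCO stores a box as [x, y, width, height], where (x, y) is the top-left corner. If your model expects [x1, y1, x2, y2] corner coordinates instead (an assumption about the target format, not something COCO mandates), a tiny conversion helper does the job:

def xywh_to_xyxy(box):
    # COCO box [x, y, width, height] -> corner box [x1, y1, x2, y2]
    x, y, w, h = box
    return [x, y, x + w, y + h]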

Loading Dataset in Batches

PyTorch provides an easy-to-follow Dataset module. The entire code for loading the dataset in batches is given below; it is quite self-explanatory.


# Importing the PyTorch Dataset module
import logging

import torch
from torch.utils.data import Dataset

logger = logging.getLogger(__name__)


class CocoDatasetObjectDetection(Dataset):
    """Object detection dataset."""

    def __init__(self, imgIds, coco, transform=None):
        """
        Args:
            imgIds (string): image ids of the images in COCO.
            coco (object): Coco image data helper object.
            transform (callable, optional): Optional transform to be applied on a sample.
        """
        self.coco = coco
        self.imageIds = imgIds
        self.transform = transform

    def write_to_log(self, sentence, filename):
        logger.error(str(sentence) + ' writing to log')
        # append the offending sample to a log file for later inspection
        with open(filename, "a+") as f:
            f.write(str(sentence) + '\n')

    def __len__(self):
        return len(self.imageIds)

    def __getitem__(self, idx):
        # logger.warning("Generating sample of annotation id: " + str(self.annIds[idx]))

        img_id = self.imageIds[idx]
        image_node = self.coco.loadImgs(img_id)[0]

        return image_node


# Load the Dataset here
def main():
    batch_size = 1
    imgIds = get_image_ids()
    train_ids, val_ids = get_test_train_split(imgIds,percentage=0.05)
    train_ids = train_ids[:10]
  
    train_dataset = CocoDatasetObjectDetection(
                    imgIds=train_ids,
                    coco=coco
                )
    train_dataloader = torch.utils.data.DataLoader(
                    train_dataset,
                    batch_size=batch_size,
                    shuffle=True,
                    num_workers=1,
                    collate_fn=collate_fn
                )

    for i, img_nodes in enumerate(train_dataloader):
        # target generation and training logic goes here
        pass
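
The snippet above relies on two helpers that are not shown: get_test_train_split and collate_fn. Minimal sketches of what they might look like are given below; the 95/5 split semantics and the list-based collate are assumptions (a custom collate_fn is needed because images carry different numbers of bounding boxes, so the default batching cannot stack them):

import random

def get_test_train_split(imgIds, percentage=0.05):
    # hold out `percentage` of the ids for validation
    ids = list(imgIds)
    random.shuffle(ids)
    split = int(len(ids) * (1 - percentage))
    return ids[:split], ids[split:]

def collate_fn(batch):
    # keep the samples as a plain list instead of stacking them into a tensor
    return list(batch)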

Anchors

What are anchors?

Anchors denote a set of candidate regions in an image; each anchor is simply a rectangular crop of the image. These anchors can have different sizes and aspect ratios. In practice, at a single position in the image, 3 scales with 3 aspect ratios each are cropped out. A common choice is scales of 32px, 64px and 128px with aspect ratios of 1:1, 1:2 and 2:1, so 9 anchors are cropped out at each position. Anchors are not cropped for every pixel in the image; instead, a new set of anchors is cropped out every few pixels in a sliding-window fashion. The number of pixels between consecutive anchor positions is determined by the compression factor of the feature map, which is 16 for VGG Net. As a result, the anchors cover each and every region of the image quite well, and Faster RCNN uses this property to cover a high percentage of regions in an image.
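
As a quick sanity check on those numbers, take the 640x512 example image from earlier with a stride (compression factor) of 16 and 9 anchors per position:

# anchor positions along each axis with a stride of 16
positions_x = 640 // 16   # 40
positions_y = 512 // 16   # 32

# 9 anchors per position -> 11520 anchors for this one image
total_anchors = positions_x * positions_y * 9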

Anchor Generation

To generate anchors, we take a pixel position together with the image width and height; the compression factor decides at which positions this function is called.

The code for this is given below:

def getAnchorsForPixelPoint(i,j,width,height):
    # anchors for a single (i, j) position, stored as [x, y, w, h];
    # width and height are available for clipping anchors that fall outside the image
    anchors = []
    scales = [32,64,128,256]
    # only square (1:1) anchors here; ratios such as [1,2] and [2,1] can be added
    aspect_ratios = [ [1,1] ]
    for ratio in aspect_ratios:
        x = ratio[0]
        y = ratio[1]

        x1 = i
        y1 = j
        for scale in scales:
            w = x*(scale)
            h = y*(scale)
            anchors.append([x1,y1,w,h])

    return anchors
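
To cover the whole image, the per-point function above is called at every stride-th pixel. A minimal sketch of that outer loop, using a stride of 16 as in the VGG case:

def getAnchorsForImage(width, height, stride=16):
    # slide over the image and collect anchors at every stride-th pixel
    all_anchors = []
    for i in range(0, width, stride):
        for j in range(0, height, stride):
            all_anchors += getAnchorsForPixelPoint(i, j, width, height)
    return all_anchors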

One tip while building the input pipeline is to always visualise what you are doing. Let's visualise our anchors for an image. The compression factor has been bumped up so that the anchors do not crowd each other too much.

The image below shows anchors generated around some pixel positions.

[Image: anchors visualised on a sample image]
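
For reference, here is a rough sketch of how such a plot can be produced with matplotlib; the image path and the enlarged stride of 128 are assumptions made purely to keep the figure readable:

import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

def plot_anchors(image_path, stride=128):
    img = Image.open(image_path)
    fig, ax = plt.subplots(1)
    ax.imshow(img)
    for i in range(0, img.width, stride):
        for j in range(0, img.height, stride):
            for x, y, w, h in getAnchorsForPixelPoint(i, j, img.width, img.height):
                # draw each anchor as an unfilled rectangle anchored at the pixel point
                ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, edgecolor='r', linewidth=0.5))
    plt.show()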

Anchor Centers

We can also visualise the anchor centers:

[Image: anchor centers visualised on a sample image]

Conclusion

That’s all for today. We loaded the COCO dataset and generated anchors. We shall go into the actual algorithmic details in future posts. If you want to read more about object detection algorithms, go through the papers linked in the README of Facebook's Detectron repository.

Cheers