Self-Supervised Learning in Computer Vision: Image Classification

By: Johan Fernandes, Statistics Canada

Introduction

Computer Vision (CV) comprises tasks such as Image Classification, Object Detection and Image SegmentationFootnote 1. Image Classification involves assigning an entire image to one of several finite classes. For example, if a "Dog" occupies 90% of the space in an image, then the image is labeled as a "Dog". Multiple Deep Learning (DL) models using Neural Networks (NN) have been developed to classify images with high accuracy. The state-of-the-art models for this task utilize NNs of various depths and widths.

These DL models are trained on multiple images of various classes to develop their classification capabilities. Like training a human child to distinguish between images of a "Car" and a "Bike", these models need to be shown multiple images of classes such as "Car" and "Bike" to generate this knowledge. However, humans have the additional advantage of developing context through observing our surroundings. Our minds can pick up sensory signals (audio and visual) that help us develop this knowledge for all types of objectsFootnote 2. For instance, when we observe a car on the road our minds can generate contextual knowledge about the object (car) through visual features such as location, color, shape, lighting surrounding the object, and the shadow it creates.

On the other hand, a DL model for CV must be trained to develop such knowledge, which is stored in the form of the weights and biases it utilizes in its architecture. These weights and biases are updated with this knowledge by training the model. The most popular training process, called Supervised Learning, involves training the model with each image and its corresponding label to improve its classification capability. However, generating labels for all images is time consuming and costly, as it involves human annotators manually labeling each image. Self-Supervised Learning (SSL), in contrast, is a newer training paradigm that can be used to train DL models to classify images without the bottleneck of having well-defined labels for each image during training. In this work, I will describe the current state of SSL and its impact on image classification.

Significance of Self-Supervised Learning (SSL)

SSL aims to set up an environment to train the DL model to extract maximum features or signals from the image. Recent studies have shown that the feature extraction capability of DL models is restricted when they are trained with labels, as they must pick signals that help them develop a pattern to associate similar images with a given labelFootnote 2Footnote 3. With SSL, the model is trained to understand the sensory signals (e.g., the shape and outline of objects) from the input images without being shown the associated labels.

Additionally, since SSL does not limit the model to develop a discrete representation (label) of an image, it can learn to extract much richer features from an image than its supervised counterpart. It has more freedom to improve how it represents an image, as it no longer needs to be trained to associate a label with an imageFootnote 3. Instead, the model can focus on developing a representation of the images through the enhanced features it extracts and identifying a pattern so that images from the same class can be grouped together.

SSL uses more feedback signals to improve its knowledge of an image than supervised learning doesFootnote 2. As a result, the term self-supervised is increasingly used in place of unsupervised learning, since an argument can be made that these models receive supervisory signals from the data itself rather than from labels. However, they do have some form of supervision and are not completely unsupervised during training.

These signals are enhanced through a technique known as data augmentation, in which the image is cropped, certain sections of the image are hidden, or the color scheme of the image is modified. With each augmentation, the DL model receives a different image of the same class or category as the original image. By exposing the model to such augmented images, it can be trained to extract rich features based on the visible sections of the imageFootnote 4. Furthermore, this training method removes the overhead of generating labels for all images, opening up the possibility of adapting image classification in fields where labels are not readily available. In the next section, I describe the components needed for self-supervised learning.

Components of self-supervised learning methods: 

Encoder / Feature Extractor:

As humans, when we look at an image, we can automatically identify features such as the outline and colour of objects to determine the type of object in the image. For a machine to perform such a task, we utilize a DL model, which we refer to as an encoder or a feature extractor since it can automatically encode and extract features of an image. The encoder consists of sequentially ordered NN layers, as shown in Fig 1.

Figure 1: Components of a Deep Learning encoder / feature extractor


The image describes the structure of an encoder or feature extractor, along with an example of the input it receives and the output it provides. The input to the encoder is an image, shown here as an image of a dog, and the output is a vector that represents that image in a higher-dimensional space. The encoder consists of multiple neural network layers stacked one after another, as shown in this image. Each layer consists of multiple convolutional neurons. These layers pick out essential features that help the encoder represent the image as a vector, which is the encoder's final output. The resulting vector has n dimensions, where each dimension is reserved for a feature. This vector can be projected into n-dimensional space and used for clustering vectors of the same class, such as a dog or a cat.

An image contains multiple features. The encoder's job is to extract only the essential features, ignore the noise, and convert these features into a vector representation. This encoded representation of the image can be projected into n-dimensional or latent space, depending on the size of the vector. As a result, for each image, the encoder generates a vector to represent the image in that latent space. The underlying principle is to ensure that vectors of images from the same class can be grouped together in that latent space. Consequently, vectors of "Cats" will be clustered together while vectors of "Dogs" will form a separate group, with both groups of vectors distinctly separated from each other.
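
To make this idea concrete, the following is a minimal sketch of an encoder that maps a batch of images to one vector per image, assuming PyTorch and torchvision are available. The ResNet-50 backbone and the random input tensors are illustrative choices for this sketch, not a prescribed setup.

    # Minimal sketch: turn a batch of images into one feature vector per image.
    import torch
    import torch.nn as nn
    from torchvision import models

    # Load a ResNet-50 backbone and drop its final classification layer so the
    # model outputs a 2048-dimensional feature vector instead of class scores.
    backbone = models.resnet50(weights=None)
    encoder = nn.Sequential(*list(backbone.children())[:-1])  # remove the fc head
    encoder.eval()

    # A batch of 4 RGB images of size 224 x 224 (random tensors stand in for real data).
    images = torch.randn(4, 3, 224, 224)

    with torch.no_grad():
        features = encoder(images)      # shape: (4, 2048, 1, 1)
        features = features.flatten(1)  # shape: (4, 2048) -- one vector per image

    print(features.shape)  # torch.Size([4, 2048])

In a real pipeline, these vectors would come from images of classes such as "Cat" and "Dog", and the goal of training is for the vectors of each class to form well-separated clusters in latent space.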

The encoders are trained to improve their representation of images so that they can encode richer features of the images into vectors that will help distinguish these vectors in latent space. The vectors generated by encoders can be used to address multiple CV tasks, such as image classification and object detection. The NN layers in the encoder would traditionally be convolutional neural network (CNN) layers as shown in Fig 1; however, the latest DL models utilize Attention Network (AN) layers in their architecture. These encoders are called Transformers, and recent works have begun to use them for image classification due to the impact they have had in the field of natural language processing. The vectors can be fed to classification models, which can be a series of NN layers or a neighborhood-based model such as a K-Nearest Neighbor (KNN) classifier. Current literature on self-supervised learning often evaluates encoders with KNN classifiers, as they only require the number of neighbors as a parameter and need no additional training of the encoder.
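
As a rough illustration of the KNN evaluation mentioned above, the following sketch classifies frozen feature vectors with scikit-learn. The random vectors and the two-class labels are placeholders for real encoder outputs and a small labelled reference set.

    # Toy illustration: classify frozen feature vectors with a KNN classifier.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    train_vectors = rng.normal(size=(100, 2048))  # vectors from the frozen encoder
    train_labels = rng.integers(0, 2, size=100)   # e.g., 0 = "Cat", 1 = "Dog"

    knn = KNeighborsClassifier(n_neighbors=5)     # only the number of neighbors is chosen
    knn.fit(train_vectors, train_labels)          # no further training of the encoder

    test_vectors = rng.normal(size=(10, 2048))
    print(knn.predict(test_vectors))              # predicted class per image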

Data Augmentation:

Labels of images are not provided to encoders trained in a self-supervised format. Consequently, the representation capability of encoders has to be improved solely from the images they receive. As humans, we can look at objects from different angles and perspectives to understand their shape and outline. Similarly, augmented images assist encoders by providing different perspectives of the original training images. These perspectives can be created by applying strategies such as Resized Crop and Color Jitter to the image, as shown in Fig 2. Augmented images enhance the encoder's ability to extract rich features: the encoder learns from one section or patch of the image and applies that knowledge to predict other sections of the imageFootnote 4.

Figure 2: Augmentation strategies that can be used to train encoders in a self-supervised format. These augmentation strategies are randomly applied to the image when the encoders are trained.


The image contains four ways to represent an image for SSL training. An image of a Corgi dog is used as a sample in this case. The first way is the original image by itself, with no additional filters applied. The second way is to horizontally flip the image; hence the Corgi dog, which was originally looking to its left, is now looking to its right. The third way is to resize the image and crop a section that contains the object of interest. In this case, the Corgi dog is in the center of the image, so a crop of the dog's head and part of its body is used as an augmented image. The last way is to change the color scale of the image through color jitter augmentation. The dog, which was golden in the original image, appears blue under this augmentation strategy.
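
An augmentation pipeline along the lines of Figure 2 can be sketched with torchvision transforms. The crop scale and jitter strengths below are illustrative assumptions, and "corgi.jpg" is a hypothetical input file.

    # Sketch of the augmentation strategies shown in Figure 2.
    from PIL import Image
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),  # resized crop
        transforms.RandomHorizontalFlip(p=0.5),               # horizontal flip
        transforms.ColorJitter(brightness=0.4, contrast=0.4,
                               saturation=0.4, hue=0.1),      # color jitter
        transforms.ToTensor(),
    ])

    image = Image.open("corgi.jpg").convert("RGB")  # hypothetical input file

    # Applying the same random pipeline twice yields two different "views" of the
    # same image, which is exactly what a Siamese network consumes during training.
    view_1 = augment(image)
    view_2 = augment(image)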

Siamese Network architecture:

Many self-supervised learning methods use the Siamese Network architecture to train encoders. As shown in Fig 3, a Siamese Network consists of two encoders that can share the same architecture (for example, ResNet-50 for both encoders)Footnote 3. Both encoders receive batches of images during training (training batches). From each batch, both encoders receive the same image, but with different augmentation strategies applied. As shown in Fig 3, we consider the two encoders E1 and E2. In this network, an image (x) is augmented by two different strategies to generate x1 and x2, which are fed to E1 and E2, respectively. Each encoder then provides a vector representation of the image, which can be used to measure similarity and calculate loss.

During the training phase, the weights of the two encoders are updated through a process known as knowledge distillation, which follows a student-teacher training format. The student encoder is trained in an online format where it undergoes forward and backward propagation, while the weights of the teacher encoder are updated at regular intervals from the student's stable weights using techniques such as the Exponential Moving Average (EMA)Footnote 3.
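
The following simplified sketch shows these student-teacher mechanics: the student is updated by backpropagation, while the teacher only follows the student through an EMA update and is never backpropagated. The tiny linear encoder, the random "views" and the momentum value are illustrative assumptions, not the exact recipe of any published method.

    # Simplified sketch of a student-teacher update with an EMA teacher.
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def ema_update(student, teacher, momentum=0.996):
        """Move each teacher weight a small step toward the student weight."""
        with torch.no_grad():
            for p_s, p_t in zip(student.parameters(), teacher.parameters()):
                p_t.mul_(momentum).add_((1.0 - momentum) * p_s)

    # Toy encoder standing in for a ResNet or Vision Transformer backbone.
    student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
    teacher = copy.deepcopy(student)          # same architecture as the student
    for p in teacher.parameters():
        p.requires_grad = False               # teacher is never backpropagated

    optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

    # Two differently augmented views of the same batch (random tensors here).
    view_1 = torch.randn(8, 3, 32, 32)
    view_2 = torch.randn(8, 3, 32, 32)

    z_student = student(view_1)               # forward and backward pass
    with torch.no_grad():
        z_teacher = teacher(view_2)           # forward pass only

    # Simple similarity objective: pull the two representations together.
    loss = -F.cosine_similarity(z_student, z_teacher, dim=-1).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    ema_update(student, teacher)              # teacher slowly tracks the student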


Figure 3: A Siamese Network consisting of two encoders trained in parallel to generate representations of images, ensuring that representations of images from the same class are similar to each other.

The image describes the layout of a Siamese network, which is a popular technique for training self-supervised encoders. The Siamese network consists of two encoders that have the same neural network architecture. Both encoders are trained in parallel. The image shows that an image of a Corgi dog is sent to both encoders. One encoder behaves as a student, called E1, while the other behaves as a teacher, called E2. E1 receives an image of the Corgi dog with resized crop and horizontal flip augmentation. E2 receives an image of the same Corgi dog with resized crop and color jitter augmentation. The image also shows that both encoders share their knowledge through their weights at regular intervals during the training phase. Both encoders provide vector representations as their final output. A similarity score is calculated to measure whether E1 was able to learn from the stable weights of E2 and improve its representational knowledge.

Contrastive vs Non-contrastive SSL methods:

All available SSL methods utilize these components, with additional changes that differentiate their performance. These learning methods can be grouped into two categories:

Figure 4: Positive and negative pair of image patches.


The image shows how to create positive and negative pairs of images. The image is split into two parts. In the first part, there are two different images of Corgi dogs. Resized crop augmentation is used to extract the important sections, such as the face and body of the dogs, and to create two new images. The new augmented cropped images from both Corgi dog images can now be considered a positive pair, as shown in the image. In the second part, an example of a negative pair of images is shown. Unlike the first part, there is one original image of a Corgi dog and another of a cat. After resized crop augmentation is performed on these images, we get two new cropped images: one has the cat's face while the other has the Corgi dog's face. These new images are considered a negative pair.

Contrastive learning methods

These methods require positive and negative pairs of each image to train and improve the representation capability of encoders. They utilize contrastive loss to train the encoders in a Siamese network with knowledge distillation. As shown in Fig 4, a positive pair would be an augmented image or patch from the same class as the original image. A negative pair would be an image or patch from another image that belongs to a different class. The underlying function of all contrastive learning methods is to help encoders generate vectors so that vectors of positive pairs are closer to each other, while those of negative pairs are further away from each other in latent space.

Many popular methods, such as SimCLRFootnote 4 and MoCoFootnote 5, are based on this principle and work efficiently on large natural object datasets like ImageNet. Positive and negative pairs of images are provided in each training batch to prevent the encoders from collapsing into a state where they produce vectors of only a single class. However, to train the encoders with negative pairs of images, these methods rely on large batch sizes (upwards of 4096 images in a training batch). Furthermore, many datasets, unlike ImageNet, do not have multiple images per class, which makes generating negative pairs in each batch difficult, if not impossible. Consequently, recent research is leaning towards non-contrastive methods.
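
For intuition, here is a minimal sketch of a SimCLR-style contrastive (NT-Xent) loss, in which each image's other augmented view acts as its positive and every other image in the batch acts as a negative. The embedding size, batch size and temperature are illustrative assumptions; real methods add projection heads and far larger batches.

    # Minimal sketch of an NT-Xent (normalized temperature-scaled cross entropy) loss.
    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.5):
        """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
        n = z1.size(0)
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D), unit length
        sim = z @ z.t() / temperature                        # (2N, 2N) similarity matrix
        sim.fill_diagonal_(float("-inf"))                    # ignore self-similarity
        # For row i, the positive is the other view of the same image.
        targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
        return F.cross_entropy(sim, targets)

    z1 = torch.randn(4, 128)  # embeddings of view 1; row i pairs with row i of z2
    z2 = torch.randn(4, 128)  # embeddings of view 2; all other rows act as negatives
    print(nt_xent_loss(z1, z2).item())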

Non-Contrastive learning methods

Methods such as DINOFootnote 3, BYOLFootnote 6 and BarlowTwinsFootnote 7 train encoders in a self-supervised format without the need to distinguish images as positive and negative pairs in their training batches. Methods like DINO continue to use the Siamese Network in a student-teacher format and rely on heavy data augmentation. However, they improve on contrastive methods with a few enhancements:

  • Patches of images provide a local view of the image to the student and a global view of the image to the teacher encoderFootnote 3.
  • A prediction layer is added to the student encoder to generate a probability-based outputFootnote 3. This layer is only used during training.
  • Instead of calculating contrastive loss between pairs of images, the output from the encoders is used to calculate a classification type of loss, such as cross entropy or L2 loss, to determine if the output vectors from the student and teacher encoders are similar or notFootnote 3, Footnote 6, Footnote 7, Footnote 8.
  • Employing EMA or any moving average method to update the teacher network's weights from the online weights of the student network, while avoiding backpropagation on the teacher networkFootnote 3, Footnote 6, Footnote 7, Footnote 8.

Unlike contrastive methods, these methods do not require a large batch size for training and do not need additional overhead to ensure negative pairs in each training batch. Additionally, DL models such as the Vision Transformer, which can learn from a local view of an image and predict other similar local views while also considering the global view, have replaced conventional CNN encoders. These models have enhanced non-contrastive methods to surpass the image classification accuracies of supervised learning techniques.
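
To illustrate the loss described in the list above, the following sketch matches the student's output distribution to the teacher's softened, detached output with a cross-entropy term, in the spirit of DINO-style methods. The temperatures, the centering term and the output dimension are simplified assumptions.

    # Sketch of a distribution-matching (cross-entropy) loss between student and teacher.
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_out, teacher_out,
                          student_temp=0.1, teacher_temp=0.04, center=0.0):
        """Cross entropy between the teacher's and the student's output distributions."""
        teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
        student_log_probs = F.log_softmax(student_out / student_temp, dim=-1)
        return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

    # Student sees a local crop, teacher sees a global crop of the same image;
    # both outputs are K-dimensional scores (random tensors here for the demo).
    student_out = torch.randn(8, 256)
    teacher_out = torch.randn(8, 256)
    print(distillation_loss(student_out, teacher_out).item())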

Conclusion

Self-supervised learning is a training process that can help DL models train more efficiently than popular supervised learning methods without the use of labels. This efficiency is evident in the higher accuracy that DL models have achieved on popular datasets such as ImageNet when trained in a self-supervised setup compared to a supervised setup. Furthermore, self-supervised learning eliminates the need for labeling images before training, providing an additional advantage. The future looks bright for solutions that adopt this type of learning for image classification tasks as more research is being conducted on its applications in fields that do not involve natural objects, such as medical and document images.

Meet the Data Scientist


If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

Thursday, June 15
1:00 to 4:00 p.m. ET
MS Teams – link will be provided to the registrants by email

Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!

Subscribe to the Data Science Network for the Federal Public Service newsletter to keep up with the latest data science news.
