Computer vision is one of the most exciting technologies for researchers and developers today. It can be seen as a combination of several subjects, including computer science, mathematics, engineering, physics, biology, and psychology. Because computer vision involves understanding visual imagery and its context, many scientists believe the field paves the way towards artificial intelligence.
Computer vision aims to produce explicit, meaningful descriptions of physical objects from images. Some of its major applications are:
Face Recognition: Facebook and Snapchat use face detection algorithms to apply filters, tag you, and recognize you in images.
Image Retrieval: Google Images uses content-based queries to search for relevant images. The algorithm analyzes the content of the query and returns results based on the best-matched content.
Gaming: Computer vision has found applications in gaming consoles and stereo vision.
Surveillance: Computer vision is also used in surveillance cameras to detect suspicious behavior.
Smart Cars: Computer vision is also used in smart cars to detect traffic and obstacles in the path.
In this article, we will discuss the core computer vision techniques that underpin applications such as face detection:
For image classification, we have to write an algorithm that can classify images into distinct categories. Computer vision researchers take a data-driven approach to this problem: instead of specifying in code what each image category looks like, they provide the computer with many examples of each image class. In other words, they first accumulate a training dataset of labeled images and then feed it to the computer to learn from.
The most popular architecture for image classification is the Convolutional Neural Network (CNN). In a typical case, you feed the network images and it classifies them. A CNN starts with an input "scanner" that is not intended to parse all the training data at once; the input is fed through convolutional layers instead of ordinary fully connected layers. The convolutional layers also tend to shrink as the network becomes deeper.
The complete image classification pipeline can be formed as follows: first, the input is a training set of N images, each labeled with one of K different classes; next, this training set is used to train a classifier that learns what each class looks like; finally, the classifier's quality is evaluated by asking it to predict labels for a new set of images it has never seen before.
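The building blocks described above can be illustrated with a minimal sketch of a CNN forward pass in NumPy. The filter and weights here are random stand-ins for learned parameters, and the shapes are toy-sized; the point is only to show the convolution → nonlinearity → pooling → classification flow.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most deep learning libraries)."""
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling: this is where the feature map shrinks."""
    H2, W2 = x.shape[0] // size, x.shape[1] // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((8, 8))             # toy grayscale "image"
kernel = rng.standard_normal((3, 3))   # one convolutional filter (random, not learned)
W_fc = rng.standard_normal((2, 9))     # dense layer mapping the 3x3 pooled map to 2 classes

features = max_pool(relu(conv2d(image, kernel)))   # 8x8 -> 6x6 -> 3x3
probs = softmax(W_fc @ features.ravel())           # class probabilities
print("class probabilities:", probs)
```

In a real classifier the kernel and weights would be learned by backpropagation over the labeled training set, and there would be many filters per layer.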
Detecting objects within images involves outputting bounding boxes and labels for the individual objects. This differs from classification and localization in that it applies to many objects, not just a single dominant one. Here there are only two classes: object bounding boxes and non-object bounding boxes. If we use the sliding-window technique, we have to apply a CNN to a huge number of locations and scales, classifying each crop as either an object or background, which is not computationally feasible.
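To see why the sliding-window approach blows up, we can simply count the crops a detector would need to classify. The image size, window sizes, and stride below are hypothetical but typical values, chosen only to illustrate the scale of the problem.

```python
# Count how many crops a sliding-window detector must classify on one image.
def count_windows(img_w, img_h, win, stride):
    """Number of win x win windows at the given stride that fit in the image."""
    nx = (img_w - win) // stride + 1
    ny = (img_h - win) // stride + 1
    return max(nx, 0) * max(ny, 0)

scales = [64, 96, 128, 192, 256]   # window sizes to try (hypothetical)
total = sum(count_windows(640, 480, s, stride=8) for s in scales)
print(total, "CNN forward passes for a single 640x480 image")
```

Even at a fairly coarse 8-pixel stride and only five scales, this is well over ten thousand CNN evaluations per image, which motivates the region-proposal methods discussed next.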
To deal with this, neural network researchers have proposed working with regions instead: blobby image regions that are likely to contain objects. This is relatively fast to run. R-CNN was the first model used for this purpose.
Object tracking refers to the process of following a specific object, or multiple objects, in a given scene. It traditionally has applications in real-world interactions where observations are made following an initial object detection. Object tracking methods can be divided into two major categories on the basis of the observation model: generative methods and discriminative methods. A generative method uses a generative model to describe the apparent characteristics of the target and searches for the candidate region that minimizes the reconstruction error.
A discriminative method instead distinguishes between the object and the background. Its performance is more robust, and it has gradually become the main approach to tracking. The discriminative method is also known as tracking-by-detection.
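A minimal sketch of the tracking-by-detection idea: run a detector on each frame, then associate the new detections with existing tracks. The greedy IoU (intersection-over-union) matcher below is a simplified stand-in for real data-association methods such as the Hungarian algorithm; the boxes are made-up example values.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_tracks(tracks, detections, threshold=0.3):
    """Greedily assign each track the unused detection with highest IoU."""
    assignments, used = {}, set()
    for tid, tbox in tracks.items():
        best, best_iou = None, threshold
        for i, dbox in enumerate(detections):
            if i in used:
                continue
            score = iou(tbox, dbox)
            if score > best_iou:
                best, best_iou = i, score
        if best is not None:
            assignments[tid] = best
            used.add(best)
    return assignments

# Frame t: two tracked objects; frame t+1: the detector fires again nearby.
tracks = {1: (10, 10, 50, 50), 2: (100, 100, 140, 140)}
detections = [(102, 98, 142, 138), (12, 11, 52, 51)]
print(match_tracks(tracks, detections))  # track 1 -> detection 1, track 2 -> detection 0
```

Tracks with no matched detection would be treated as lost, and unmatched detections would start new tracks.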
Semantic segmentation is a central process in computer vision. It tries to understand the role of each pixel in the image semantically: apart from recognizing objects, we also have to trace the boundaries of each object or person. Therefore, unlike classification, we need dense, pixel-wise predictions from our models.
As with other computer vision tasks, CNNs have had enormous success on segmentation. One popular initial approach was patch classification through a sliding window, where each pixel was separately classified using the patch of the image around it. This is inefficient because the shared features between overlapping patches are not reused.
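The patch-classification approach can be sketched as follows. The "classifier" here is a trivial brightness threshold standing in for a CNN, and the toy image is a bright square on a dark background; the structure of the loop is the point, since every pixel triggers its own patch evaluation and neighbouring patches overlap almost entirely.

```python
import numpy as np

# Toy "image": a bright square (the object) on a dark background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0

def classify_patch(patch):
    """Stand-in for a CNN: call the centre pixel 'object' if the patch is mostly bright."""
    return 1 if patch.mean() > 0.5 else 0

PATCH = 5
pad = PATCH // 2
padded = np.pad(img, pad, mode="edge")   # pad so edge pixels get full patches
mask = np.zeros(img.shape, dtype=int)
for i in range(img.shape[0]):
    for j in range(img.shape[1]):
        # One classifier call per pixel; adjacent patches share 20 of 25 pixels,
        # and none of that shared computation is reused.
        mask[i, j] = classify_patch(padded[i:i + PATCH, j:j + PATCH])

print(mask.sum(), "pixels labelled as object")
```

Fully convolutional networks fix exactly this waste by computing the shared features once for the whole image.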
Beyond semantic segmentation, instance segmentation also separates different instances of the same class: for example, labeling five cars with five different colors. In classification, there is generally a single object at the focus of the image, and the task is to identify that object. To segment instances, we have to carry out far more complex tasks.
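The difference between the two tasks can be shown on a toy binary mask: semantic segmentation stops at "these pixels are car", while instance segmentation gives each blob its own id. The flood-fill labeller below is a classical connected-components sketch, not how modern instance segmentation networks work, but it makes the "same class, different instances" distinction concrete.

```python
from collections import deque

# Toy semantic mask: two separate blobs of the same class (two "cars").
grid = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]

def label_instances(grid):
    """4-connected flood fill: assign each connected blob its own instance id."""
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if grid[sy][sx] == 1 and labels[sy][sx] == 0:
                next_id += 1                  # found a new, unlabelled instance
                q = deque([(sy, sx)])
                labels[sy][sx] = next_id
                while q:
                    y, x = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                           and grid[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = next_id
                            q.append((ny, nx))
    return labels, next_id

labels, n = label_instances(grid)
print(n, "instances found")  # 2
```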
So far we have seen how CNNs can be used in many interesting ways to locate objects in an image with bounding boxes. Can such techniques be extended to locate the exact pixels of each object instead of just bounding boxes? Much like Fast R-CNN and Faster R-CNN, Mask R-CNN's intention is straightforward: it outputs a binary mask that indicates whether a given pixel is part of an object or not.
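A minimal sketch of what that per-box binary mask looks like once pasted back onto the image. The box coordinates and the hand-written mask are hypothetical values, and real Mask R-CNN predicts a fixed-size mask (e.g. 28x28) that is resized to the box before pasting; here the mask already matches the box for simplicity.

```python
import numpy as np

# Hypothetical output for one detection: a bounding box plus a binary mask
# predicted inside that box.
box = (2, 3, 6, 8)            # (y1, x1, y2, x2) on a 10x10 image
mask_in_box = np.array([      # 4x5 binary mask, one value per pixel in the box
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
])

# Paste the per-box mask into a full-image mask, as a post-processing step would.
full = np.zeros((10, 10), dtype=int)
y1, x1, y2, x2 = box
full[y1:y2, x1:x2] = mask_in_box
print(int(full.sum()), "pixels belong to this object instance")
```

The bounding box says roughly where the object is; the mask says, pixel by pixel, which parts of that box are actually the object.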
These are the computer vision techniques that help a computer extract and analyze useful information from a single image or a sequence of images. There are many other advanced techniques we haven't discussed, such as style transfer, colorization, action recognition, 3D object recognition, human pose estimation, and more. Webtunix AI has developed face recognition models using computer vision and machine learning techniques that are able to solve real-life problems.