Starter’s pack for Computer Vision

Nanubala Gnana Sai
Published in Analytics Vidhya
4 min read · Aug 3, 2020

Among the many disciplines in the field of machine learning, computer vision has arguably seen unprecedented growth. In its current form, it offers a plethora of models to choose from, each with its own shine. It is quite easy, then, to lose your way in this abyss. Fret not, for this foe can be defeated, like many others, with the power of mathematics and a touch of intuition.

Before we venture forth, it is important to have a basic knowledge of machine learning under your belt. To start with, we should understand the concept of convolution in general; then we can narrow it down to its application in machine learning. In essence, by convolution we mean that one function strides over the domain of another, leaving its imprint (Fig 1.).

Fig 1: The function g(x) slides across the domain of f(x), leaving an imprint. Source: Wikipedia
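For reference, this is the standard definition of convolution, in its continuous and discrete forms:

```
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau,
\qquad
(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]
```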

A computer can’t really “see” an image; all it can perceive are bits, 0/1. To speak this language, we represent an image as a matrix of numbers, where each number corresponds to a pixel intensity (0–255). We can then perform convolution by taking a filter window which strides over the image progressively (Fig 2.). Each filter is associated with a set of numbers which is multiplied with a patch of the image to extract a specific piece of information, and we employ multiple kernels to gather various aspects of the image. The end goal is to learn the kernel weights that best encode the data for our use case. This information-capture process is what grants a computer the ability to “see”.

Fig 2: Convolution at work: the input image is convolved with the kernel weights. Source: Wikimedia
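To make this concrete, here is a minimal NumPy sketch of a single kernel striding over a grayscale image. The function and the example kernel are my own illustration, not from any library (strictly speaking, deep-learning frameworks compute cross-correlation and still call it convolution):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, recording a weighted sum at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the kernel with the patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.randint(0, 256, size=(6, 6))   # 6x6 grayscale image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])                  # vertical-edge detector
print(convolve2d(image, kernel).shape)           # (4, 4): the feature map shrinks
```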

You might have noticed that as we go down the neural net, the feature map tends to shrink; in fact, it is entirely possible for it to vanish entirely down the line. In addition, the edges of the image get almost no say in the result, since the filter passes over them only once. This is where the concept of “padding” debuts. Padding means shielding our original image with an additional border of zero-valued pixels. A smart choice of padding solves the first problem, feature-map shrinkage: the output feature map can be made to have exactly the same dimensions as the input, which is called “same” padding. This also ensures that the kernel overlaps the edges of our image more than once. The case where we don’t employ this feature is called “valid” padding.
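A quick sketch of the arithmetic behind this. The helper name output_size is hypothetical, but the size formula it implements is the standard one:

```python
import numpy as np

def output_size(n, f, p=0, s=1):
    # Standard formula for an n x n input, f x f filter,
    # padding p and stride s: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

n, f = 6, 3
print(output_size(n, f, p=0))             # 4 -> "valid" padding: the map shrinks
print(output_size(n, f, p=(f - 1) // 2))  # 6 -> "same" padding: size preserved

image = np.arange(36).reshape(6, 6)
padded = np.pad(image, pad_width=1)       # shield the borders with zeros
print(padded.shape)                       # (8, 8)
```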

Any significant stride in technology must surpass its predecessor in every way. In this light, the question arises: where exactly does the classic framework fall behind? This becomes clear once we examine the computational cost. A dense layer consists of a tight-knit connection between each and every neuron, and “each connection is associated with a weight” (Ardakani et al.). On the contrary, since a convolution only considers a portion of the image at a time, we can interpret it as a “sparsely connected neural network”. In this architecture, “each neuron is only connected to a few neurons based on a pattern and a set of weights is shared among all neurons” (Ardakani et al.). This tight-knit nature of dense layers is the reason they have vastly more learnable parameters than convolutional layers.
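A back-of-the-envelope comparison makes the gap concrete; the layer sizes below are arbitrary illustrative choices, not from any particular model:

```python
# Parameter count: dense vs. convolutional layer on a 32x32x3 input.
h, w, c = 32, 32, 3                 # input height, width, channels
n_in = h * w * c                    # 3072 input values

# Dense layer: every output neuron connects to every input (+ one bias each).
n_out = 1000
dense_params = n_in * n_out + n_out
print(dense_params)                 # 3,073,000 parameters

# Conv layer: one small kernel shared across all positions (+ one bias per filter).
f, n_filters = 3, 64
conv_params = (f * f * c + 1) * n_filters
print(conv_params)                  # 1,792 parameters
```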

I think there are two main advantages of convolutional layers over just using fully connected layers. And the advantages are parameter sharing and sparsity of connections.

- Andrew Ng

It is often observed that a convolutional layer appears in conjunction with a “pooling” layer. A pooling layer, as the name suggests, down-samples the feature map of the previous layer. This matters because plain convolutions latch tightly onto the input feature map, meaning even the finest distortion in the image may lead to entirely different results. By down-sampling, we get a summary statistic of the input, thereby making the model translation invariant.

Imagine an image of a cat comes in.

Imagine the same image comes in, but rotated.

If you have the same response for the two, that’s invariance.

-source: Tapa Ghosh, www.quora.com

By “translation invariant” we mean the output does not change under a linear shift of the target image; the quote above illustrates the same idea with a different transformation, rotation.

Fig 3: The original image (left) and its shifted version (right) produce the same output. Source: Vishal Sharma, www.quora.com

Pooling layers cut through this noise and let the dominant features shine brighter, providing immunity against such distortions and making our model robust to change. Two methods of sampling are often employed; a short sketch of both follows the list.

  1. Average Pool: this takes a filter and averages the pixel values under its window, giving a fair say even to the subtler features of the image. It is used less often in practice.
  2. Max Pool: used far more prevalently, it takes the maximum of the pixel values under its window, so only the most dominant feature is carried forward.
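Here is a minimal NumPy sketch of both modes; pool2d is an illustrative helper of my own, not a library function:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Down-sample with a non-overlapping `size` x `size` window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(h // size):
        for j in range(w // size):
            patch = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = reduce_fn(patch)  # keep only the max (or the average)
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 8, 6, 5],
                 [3, 1, 1, 0],
                 [2, 2, 4, 2]], dtype=float)
print(pool2d(fmap, mode="max"))  # [[8. 6.] [3. 4.]]
print(pool2d(fmap, mode="avg"))  # [[4. 3.5 ] [2. 1.75]]
```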

It is important to note that a pooling layer in itself has no learnable parameters: it is a fixed-size operation whose settings are chosen before training, a.k.a. “hyper-parameters”. Some models, e.g. MobileNet, don’t rely on pooling layers at all; instead, “down sampling is handled with strided convolution” (Howard et al.).
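A quick sketch of why a strided convolution can stand in for pooling as a down-sampler, reusing the standard output-size formula from earlier (the helper name is again hypothetical):

```python
# Down-sampling without pooling: a stride-2 convolution halves the feature
# map on its own, which is how MobileNet-style architectures handle it.
def output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

n, f = 32, 3
print(output_size(n, f, p=1, s=1))  # 32 -> "same" conv keeps the size
print(output_size(n, f, p=1, s=2))  # 16 -> stride 2 halves it, no pooling layer needed
```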

The convolutional neural network is now the go-to method for computer-vision problems, and its introduction to this field has been a true game changer. It continues to grow more precise, faster, and more robust by the day, yet its roots are nevertheless humble.

Bibliography

  • Ardakani, Arash, Carlo Condo, and Warren J. Gross. “Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks.” arXiv preprint arXiv:1611.01427 (2016).
  • Ng, Andrew Y. “Why Convolutions?” www.coursera.org
  • Howard, Andrew G., et al. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” arXiv preprint arXiv:1704.04861 (2017).
