What is Computer Vision and how does it work?

Anh Khoa Nguyen Huynh
Aug 11, 2018

Here is everything you need to know about Computer Vision.

Our world is made up of millions of objects: animals, trees, buildings, humans, and more. We recognize and process them every day. But what if a machine were also capable of classifying and processing objects from images? That is what we call Computer Vision.

A brief definition

First, let's start with a brief definition of computer vision.

Computer vision is a field of computer science that works on enabling computers to see, identify and process images in the same way that human vision does, and then provide appropriate output.

A similar visual system

Most life on the planet shares a similar visual system, which includes eyes for capturing light, receptors in the brain for accessing it, and a visual cortex for processing it. These mechanisms help us "capture" images of the world around us and understand what they mean. Over the centuries, we have been striving to extend our visual ability to a whole new level by creating tools like cameras and telescopes. With larger, more optically perfect lenses, the precision and sensitivity of modern cameras are nothing short of incredible. Cameras can also record thousands of images per second and detect distances with great precision. But these tools can only "capture" images, not "understand" them. Understanding what is in an image is a whole new challenge.

Understanding the pixels behind the images using machine learning

Let's take a look at the picture below:

Our brains easily recognize that this is a flower. But we have millions of years of evolution behind us, which lets us understand immediately what we are looking at; from a computer's point of view, that is cheating. To a computer, an image looks like a matrix of integer values, where each value represents the color of a region of the image. So how can we make an algorithm understand that "this image is a flower"? The answer is Machine Learning.
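To make the "matrix of integer values" idea concrete, here is a minimal sketch in plain Python. The tiny 3x3 image and its pixel values are made up for illustration, and the simple averaging used for grayscale is just one possible choice; real images are far larger and are usually loaded with an imaging library.

```python
# A tiny, made-up 3x3 RGB "image": each pixel is a (red, green, blue)
# triple of integers in the range 0-255.
image = [
    [(255, 0, 0), (255, 0, 0), (0, 128, 0)],
    [(255, 0, 0), (255, 255, 0), (0, 128, 0)],
    [(0, 128, 0), (0, 128, 0), (0, 128, 0)],
]


def to_grayscale(rgb_image):
    """Collapse each (r, g, b) pixel into one brightness value (simple average)."""
    return [[sum(pixel) // 3 for pixel in row] for row in rgb_image]


gray = to_grayscale(image)
height, width = len(image), len(image[0])

# This grid of numbers is all the computer "sees".
print(f"{height}x{width} pixels")
for row in gray:
    print(row)
```

Nothing in these numbers says "flower" on its own; that meaning has to be learned from examples.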

Machine learning allows us to effectively train an algorithm on a dataset, giving it the context it needs to understand what all those numbers, in a specific arrangement, represent.

Where Is Computer Vision Being Used Today?

Major players in the computer vision and AI world include platforms like Google, Facebook, and Instagram. Such companies use object recognition too, but rather than using it to determine risks on the road, as a self-driving car would, they use it to categorize posts and make advanced editing options available to end users.

Computer Vision and Machine Learning are friends

What if we have images that are difficult even for humans to classify? Can computer vision achieve better accuracy?

Let's take the example of sheepdogs vs. mops.

In general, they are pretty hard to distinguish, right?

But with the help of machine learning, we can take a bunch of images of sheepdogs and mops and label them correctly. As long as we feed the algorithm enough data, it will eventually learn to tell the two apart.
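As a rough illustration of learning from labeled examples, here is a minimal sketch that uses a nearest-neighbour rule, chosen only because it is one of the simplest possible classifiers. It assumes each image has already been reduced to two numeric features; the feature values and the `classify` helper are invented for this example and are not taken from any real dataset or library.

```python
import math

# Hypothetical labeled examples: each image is reduced to two made-up
# numeric features, paired with the label a human assigned to it.
labeled_examples = [
    ((0.90, 0.20), "sheepdog"),
    ((0.80, 0.30), "sheepdog"),
    ((0.85, 0.25), "sheepdog"),
    ((0.30, 0.80), "mop"),
    ((0.20, 0.90), "mop"),
    ((0.25, 0.70), "mop"),
]


def classify(features, examples):
    """Return the label of the labeled example closest to `features`."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(features, example[0])
    )
    return nearest_label


# Two new, unlabeled feature vectors.
print(classify((0.75, 0.35), labeled_examples))
print(classify((0.35, 0.75), labeled_examples))
```

The more correctly labeled examples we add, the more likely a new image is to land near an example of the right kind.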

The challenge with this approach, however, is that the amount of data needed to truly mimic human vision is enormous. To train a model to recognize a picture of a dog the way the human brain does, we would need properly labeled images of millions of objects across thousands of angles and lighting conditions. It's hard for anyone to gather that amount of data.

That is why free computer vision datasets are so valuable. They are available from MIT, Caltech, and especially Google, whose Open Images Dataset contains more than 9 million images annotated with image-level labels and object bounding boxes.

With these datasets, it's easier to get started in learning and creating our own Computer Vision Model.

General methods

In Computer Vision, much of the work comes down to creating an image classification model. So how do we create one?

There are five important steps that are required to create an image classification model:

Step 1: Gather your Dataset:

The first component of building a deep learning network is to gather our initial dataset. We need the images themselves as well as the labels associated with each image. These labels should come from a finite set of categories, such as: categories = dog, cat, panda

Step 2: Split your Dataset:

Now that we have our initial dataset, we need to split it into two parts:

A training set
A testing set

A training set is used by our classifier to “learn” what each category looks like by making predictions on the input data and then correcting itself when predictions are wrong. After the classifier has been trained, we can evaluate its performance on a testing set.
Common split sizes for training and testing sets include 66.6/33.3, 75/25, and 90/10, respectively.
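Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the file names, the `split_dataset` helper, and the fixed random seed are all assumptions made for the example.

```python
import random


def split_dataset(samples, train_fraction=0.75, seed=42):
    """Shuffle (image, label) pairs and split them into training and testing sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset: file names paired with one of the three categories.
categories = ["dog", "cat", "panda"]
dataset = [(f"{label}_{i:02d}.jpg", label) for label in categories for i in range(20)]

train_set, test_set = split_dataset(dataset, train_fraction=0.75)
print(len(train_set), len(test_set))
```

Shuffling before cutting matters here because the made-up dataset is ordered by category; without it, the testing set would contain only pandas. Note that a plain shuffle does not guarantee that every category appears in the same proportion in both sets.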

Step 3: Feature Extraction:

During this phase, we apply hand-engineered algorithms such as Histogram of Oriented Gradients (HOG) [32], Local Binary Patterns (LBP) [21], etc. to quantify the contents of an image based on a particular component of the image we want to encode (i.e., shape, color, texture). Given these features, we then proceed to train our classifier and evaluate it.
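To give a feel for what a hand-engineered feature looks like, here is a minimal sketch of the basic 3x3 Local Binary Pattern in plain Python. It is a simplified illustration: the bit ordering of the neighbours is an arbitrary choice (implementations differ), border pixels are skipped, and the 4x4 grayscale image is made up.

```python
def lbp_code(image, row, col):
    """Basic 3x3 Local Binary Pattern code for the pixel at (row, col)."""
    center = image[row][col]
    # The 8 neighbours, clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the center sets its bit to 1.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Count LBP codes over all interior pixels to get a 256-bin feature vector."""
    histogram = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            histogram[lbp_code(image, row, col)] += 1
    return histogram


# Made-up 4x4 grayscale image (values 0-255): a bright square on a dark border.
gray = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]

features = lbp_histogram(gray)
print(len(features), sum(features))
```

The resulting histogram describes local texture and has the same length for every image, which is what lets a classifier consume it regardless of image size.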

Step 4: Train your Network:

Given our training set of images, we can now train our network. The goal here is for our network to learn how to recognize each of the categories in our labeled data. When the model makes a mistake, it learns from this mistake and improves itself.
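The idea of learning from mistakes can be illustrated with a perceptron, one of the simplest trainable classifiers. This is a stand-in for a real network, not a deep learning model: it handles only two categories, and the feature vectors, labels, and hyperparameters below are made up for the sketch.

```python
def train_perceptron(samples, epochs=20, learning_rate=0.1):
    """Train a two-class perceptron; labels must be +1 or -1."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            prediction = 1 if score >= 0 else -1
            if prediction != label:
                # Wrong prediction: nudge the weights toward the correct answer.
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * label
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: nothing left to correct
            break
    return weights, bias


def predict(weights, bias, features):
    """Return +1 or -1 depending on which side of the boundary `features` falls."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# Made-up, linearly separable feature vectors: +1 = "dog", -1 = "cat".
training_set = [
    ((2.0, 1.0), 1), ((3.0, 1.5), 1), ((2.5, 0.5), 1),
    ((0.5, 2.0), -1), ((1.0, 3.0), -1), ((0.2, 2.5), -1),
]

weights, bias = train_perceptron(training_set)
print([predict(weights, bias, features) for features, _ in training_set])
```

The weights change only when a prediction is wrong, which is the "learns from this mistake" behaviour in its most basic form.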

Step 5: Evaluate:

Last, we need to evaluate our trained network. For each of the images in our testing set, we present them to the network and ask it to predict what it thinks the label of the image is. We then tabulate the predictions of the model for each image in the testing set.
Finally, these model predictions are compared to the ground-truth labels from our testing set. The ground-truth labels represent what the image category actually is. From there, we can compute the number of predictions our classifier got correct and compute aggregate reports such as accuracy, precision, recall, and F-measure, which are used to quantify the performance of our network as a whole.
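As a minimal sketch of this comparison, the function below computes the four metrics for a single category using their standard definitions. The `evaluate` helper and the label lists are invented for the example; the predictions do not come from any real model.

```python
def evaluate(ground_truth, predictions, positive_label):
    """Compute accuracy, precision, recall and F-measure for one category."""
    if not ground_truth or len(ground_truth) != len(predictions):
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(ground_truth, predictions))
    # True positives, false positives and false negatives for the chosen category.
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Made-up ground-truth labels and model predictions for eight test images.
ground_truth = ["dog", "dog", "dog", "dog", "cat", "cat", "panda", "panda"]
predictions = ["dog", "dog", "dog", "cat", "cat", "dog", "panda", "panda"]

report = evaluate(ground_truth, predictions, positive_label="dog")
print(report)
```

Accuracy is computed over all categories, while precision, recall, and F-measure here describe how well the model handles the one category passed as `positive_label`.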

What's next?

Next time, we will learn more about accuracy, precision, recall, and f-measure, which are very common in Machine Learning and Computer Vision. We will learn how they work to quantify the performance of our network as a whole.
