The rapid rise of modern social media led to the development of tracking tools whose sole purpose is to monitor the written content published on social networking sites. These software products become ineffective when applied to content that is rich in visuals and carries little or inconsequential written information.
Object recognition is one of the central problems of the computer vision field, and logo recognition is considered a subset of it. Even though the problem has been researched extensively, with numerous published papers, it is not yet considered solved. Improving on the current state of the art is therefore no trivial task.
Solving this problem can have a major impact on protection against intellectual property theft, on ad placement, on validation and on the online brand management industry.
The Content Marketing Institute states that 63% of the posts found on social networking sites contain or are themselves images. It was also found that posts that include images receive 94% more views and clicks than posts without any. Hence, we can conclude that images have taken over the social networks.
The sudden development of modern social media from the early 2000s encouraged companies to introduce tracking tools meant to monitor the written content published on sites like Facebook and Twitter. Those tools become useless when it comes to interpreting visual content. Without visual interpretation capabilities, images lose their value, and company owners lose precious information too. This phenomenon is encountered on networks like Instagram and Pinterest, because of their predominantly visual content. In recent years, companies like Facebook and Twitter have started to adopt similar strategies.
If we analyze the problem in the context of Instagram, where classical software can extract information only from hashtags and the little written text available, we realize that wrong associations between text and images are quite possible because of human error. Inconsistencies can therefore appear easily.
Every company has a few users who are truly passionate about the brand and, because of that, some become its promoters. These are the users who create and spread new material meant to promote the brand's products.
The marketing industry can benefit considerably from software products that are also able to process visual content. With such tools, a company can be aware of how its brand is perceived by users, allowing it to avoid possible controversies before they even start.
The end goal of this problem is to recognize the logos of well-known companies. We can split this into two tasks: detection and recognition. The detection task identifies the regions of an image or video that depict a logo. The recognition task assigns the input to one of the n categories on which the algorithm was trained.
A few of the results published in the 1990s use neural networks. These results were obtained using simple artificial neural networks with classic architectures. Neural networks played an important role in solving the detection task. Classification of black and white logos was done using a recurrent neural network². For watermark detection, convolutional neural networks (CNNs) were tried out².
The problem of logo classification was first approached with simple artificial neural networks having only fully connected layers. Because their performance was poor, researchers turned to other methods, and convolutional neural networks came into use. The CNN architecture treats input pixels that are close to each other differently from pixels that are far apart, which is not the case for classic neural networks. This property is the reason CNNs give such impressive results on image classification tasks.
CNNs are based on 3 main ideas: local receptive fields, shared weights and pooling. Local receptive fields are square windows that cover a small region of pixels of the input image. These windows move to the right or downwards by a predefined number of columns or rows. The number of pixels by which the window moves is called the stride; it is often 1 but can take other values. All the pixels covered by one window are connected to a single neuron in the next hidden layer. Each of these connections is assigned a weight, and the window as a whole is assigned one bias. The weighted pixel values are summed up, and the resulting map from the input layer to the hidden layer is called a feature map. Because the same weights and the same bias are used at every position of the window, they are called shared weights and shared bias. A neuron from the hidden layer gets its value by using the following formula:
$$\sigma\left(b + \sum_{l=0}^{n-1} \sum_{m=0}^{n-1} w_{l,m}\, a_{j+l,\,k+m}\right)$$
where \sigma is an activation function such as the sigmoid or the rectified linear unit (ReLU), n is the window size, b is the shared bias, w_{l,m} are the shared weights and a_{j+l,k+m} are the input values covered by the window of the hidden neuron at position (j, k), assuming a stride of 1. ReLU is the function used by the algorithm described later on. All the neurons of the first hidden layer detect the same feature, but in different parts of the input image. For example, some of the weights and biases might end up selecting a feature such as a vertical edge in a specific receptive field. Once learned, this feature can be detected in any other region of the image as well.
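The computation described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `feature_map`, the 4x4 toy input and the 2x2 vertical-edge kernel are all hypothetical choices made for the example.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def feature_map(image, weights, bias, stride=1):
    """Slide an n x n window of shared weights over a 2D image.

    Every hidden neuron uses the same weights and the same bias;
    only the position of its local receptive field changes.
    """
    n = len(weights)  # window size
    rows = (len(image) - n) // stride + 1
    cols = (len(image[0]) - n) // stride + 1
    out = []
    for j in range(rows):
        row = []
        for k in range(cols):
            z = bias
            for l in range(n):
                for m in range(n):
                    z += weights[l][m] * image[j * stride + l][k * stride + m]
            row.append(relu(z))
        out.append(row)
    return out


# Toy input (hypothetical values): dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
fmap = feature_map(image, kernel, bias=0.0)
for row in fmap:
    print(row)
```

With a stride of 1, the 4x4 input and the 2x2 window give a 3x3 feature map. Only the middle column is non-zero, because that is the only position where the window straddles the edge.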
The last element specific to these networks is the pooling layer. Its purpose is to simplify the information received from the previous layer. Pooling usually follows a convolutional layer and works by sliding a window similar to the one used for local receptive fields. This window is commonly square, but any dimensions can be chosen, and it does not have to be the same size as the local receptive field. The window moves over the feature map with a stride that indicates by how many columns or rows it shifts. At each position, the operation reduces the values covered by the window to a single value, computed with one of several existing rules. The most widely used rule is max pooling, which selects the largest value in the window.
Figure 1: Example of max pooling in which the window size is 2x2 and the maximum element is extracted.
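Max pooling with a 2x2 window can be illustrated with a few lines of plain Python. The function name `max_pool`, the stride of 2 and the input values are assumptions chosen for the example, not details taken from the experiments.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value of every size x size window."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for j in range(rows):
        row = []
        for k in range(cols):
            window = [
                feature_map[j * stride + l][k * stride + m]
                for l in range(size)
                for m in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [7, 1, 3, 8],
]
pooled = max_pool(fmap)
print(pooled)  # [[6, 2], [7, 9]]
```

Each non-overlapping 2x2 block is reduced to one number, so the 4x4 map shrinks to 2x2 while the strongest response in each block is kept.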
The last layer of these networks is a fully connected one, which means that each neuron of the last pooling layer is connected to each neuron of the final layer. Just like in a regular neural network, this is the decision layer, where the classification takes place. For a CNN to be effective at recognition, multiple convolutional and pooling layers must be stacked. Moreover, a single feature map is not enough for image recognition, so several are used. An important note: if we work with color images, each input pixel has three values, corresponding to the red, green and blue channels, and all of them contribute to the feature maps.
Figure 2: Example of a very simple convolutional neural network. The input is a colored image (3 channels: RGB), therefore having 3 feature maps followed by a layer of max-pooling having the same number of maps as the last one. The last layer is the fully connected one, which for simplicity is represented only by a horizontal line³.
To start experimenting with the type of network described above, we need a suitable data set on which to train the algorithm. This is a supervised learning task, so we need an annotated data set such as FlickrLogos-32, which contains labeled images of logos from 32 different brand categories. Each label gives the name of the brand depicted in the photograph. The data set has a total of 8240 color images, split into 4280 images for training and 3960 for testing.
We used a VGG19 network pre-trained on the ImageNet data set as the starting point for the training process. In terms of framework, we picked TensorFlow with Python. For now, we only intend to solve the recognition task.
VGG19 is built from a series of convolutional layers interleaved with max pooling, followed by 3 fully connected layers. The last fully connected layer is the decision layer, on which the softmax function is applied. Softmax outputs as many probabilities as there are classes, which in our case means a tensor of 32 probabilities for each input image. Because the network was pre-trained on ImageNet, its softmax originally produced a tensor of 1000 probabilities. We therefore dropped the last layer and retrained the network for our 32 classes.
Figure 3: The VGG19 architecture, with convolutional layers followed by max-pooling operations and ending with the 3 fully connected layers (2 fully connected layers and a decision layer).
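To make the role of the softmax on the decision layer concrete, here is a minimal sketch in plain Python. It does not use the trained network: the raw scores (logits) are hypothetical, and only the number of classes, 32, comes from the data set.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


NUM_CLASSES = 32  # one output per FlickrLogos-32 brand category

# Hypothetical logits: class 5 receives a clearly higher score.
logits = [0.0] * NUM_CLASSES
logits[5] = 4.0

probs = softmax(logits)
predicted = probs.index(max(probs))
print(len(probs), predicted)
```

Replacing the 1000-way ImageNet decision layer with a 32-way one only changes the length of `logits`; the softmax itself stays the same.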
The only preprocessing applied to the images was rescaling them to 224x224 while keeping the aspect ratio intact. Before that, we cropped the logo using the bounding boxes provided with the data set, thereby reducing the noise in the images.
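The geometry of this preprocessing step can be sketched as follows. The text does not say how a non-square crop is brought to 224x224 without distortion, so the sketch assumes the crop is scaled along its longer side and centered on a padded square canvas. The function name `fit_to_square` and the bounding box values are hypothetical.

```python
TARGET = 224  # input size expected by the network


def fit_to_square(width, height, target=TARGET):
    """Scale (width, height) to fit a target x target canvas.

    Returns the resized (w, h) and the (left, top) padding offsets
    that center the resized crop. Padding is an assumption here.
    """
    longest = max(width, height)
    new_w = max(1, round(width * target / longest))
    new_h = max(1, round(height * target / longest))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return (new_w, new_h), (pad_left, pad_top)


# Hypothetical bounding box in (x1, y1, x2, y2) form.
box = (50, 80, 450, 280)
crop_w = box[2] - box[0]  # 400
crop_h = box[3] - box[1]  # 200

size, offset = fit_to_square(crop_w, crop_h)
print(size, offset)  # (224, 112) (0, 56)
```

The 400x200 crop keeps its 2:1 aspect ratio: it becomes 224x112 and is placed 56 pixels from the top of the square canvas.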
For training we used batches of size 10 and a learning rate of 0.001. To reduce overfitting, the network uses dropout with a probability of 0.5.
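As an illustration of what dropout with a probability of 0.5 does during training, here is a minimal sketch of the common "inverted dropout" variant. Whether the framework applies exactly this variant is an assumption, and the function name `dropout` and the input values are hypothetical.

```python
import random


def dropout(activations, drop_prob=0.5, rng=random):
    """Zero each activation with probability drop_prob.

    Surviving activations are scaled by 1 / (1 - drop_prob) so that the
    expected value of each output stays the same (inverted dropout).
    """
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, drop_prob=0.5, rng=rng)
print(dropped.count(0.0), "of", len(dropped), "activations were dropped")
```

Roughly half of the activations are set to zero on every training step, which prevents the network from relying on any single neuron. At test time dropout is switched off and all activations are used.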
We monitored the loss during training and measured the accuracy on the test data at several loss values. We started these measurements when the loss reached the value of 0.01, at which point we obtained an accuracy of ≈81%. The results of these measurements can be found in Table 1. When fully trained, the network reached an accuracy of 91.2%.
Table 1: Accuracy measured at certain values of the loss.
In Figure 4 we present the results returned for the input image shown on the left of the figure. The results list the probabilities in ascending order, so the last element is the highest probability, and its position determines the predicted label. The input image was recognized correctly, with a probability of 99.99%.
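Reading the prediction off such a sorted list can be sketched as follows. The three labels and their probabilities are hypothetical stand-ins for the 32 real outputs, and `rank_predictions` is a name introduced only for this example.

```python
def rank_predictions(probs, labels):
    """Pair labels with probabilities, sorted so the highest comes last."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    return [(labels[i], probs[i]) for i in order]


# Hypothetical subset of classes and softmax outputs.
labels = ["adidas", "cocacola", "starbucks"]
probs = [0.00004, 0.00006, 0.9999]

ranking = rank_predictions(probs, labels)
predicted_label, confidence = ranking[-1]
print(predicted_label, confidence)  # starbucks 0.9999
```

The value reported next to the predicted label is the probability the network assigns to that class for this one image, which is different from the accuracy measured over the whole test set.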
We noticed that the quality and the variety of the data set are crucial. Brands tend to have multiple logos, which makes the problem much more difficult to solve. Disney, for example, changed its logo more than 30 times over a period of 27 years. It is easy to imagine how hard it would be to build a relevant data set that includes this brand.
Because we only solved the recognition part of the problem, and not the detection part, we evaluate a new image by simply cropping out its center. Naturally, if the logo does not fall inside the central crop, the algorithm has nothing to evaluate and makes a wrong prediction. Another thing to bear in mind is that the logo in the input image should have the same design as the one on which the algorithm was trained. If we included different logo designs under the same class (label) in the training data, we would end up with an algorithm unable to generalize well for that class.
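The center crop used at evaluation time can be sketched as a small coordinate computation. The crop size is not stated in the text, so 224 (the network input size) is assumed here. The function name `center_crop_box` and the 640x480 image size are hypothetical.

```python
def center_crop_box(width, height, crop_size=224):
    """Return (left, top, right, bottom) of a centered square crop.

    The crop is shrunk if the image is smaller than crop_size.
    """
    crop_size = min(crop_size, width, height)
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


# Hypothetical 640x480 input image.
box = center_crop_box(640, 480)
print(box)  # (208, 128, 432, 352)
```

Any logo lying outside this box is never seen by the classifier, which is why an off-center logo leads to a wrong prediction until a detection step is added.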
¹ Neil Patel. Visual Content Strategy: The New 'Black' for Content Marketers.
² F. Iandola, A. Shen, P. Gao, K. Keutzer. DeepLogo: Hitting Logo Recognition with the Deep Neural Network Hammer. arXiv, October 2015.
³ Ian Goodfellow, Yoshua Bengio, Aaron Courville. Deep Learning, 2017. http://neuralnetworksanddeeplearning.com/chap6.html