18/08/2020

An AI that Works Like the Human Eye: 

a novel fully convolutional network for visual saliency prediction

The Human Visual System – the eye and the parts of the brain that process and interpret what we see – naturally focuses on a part of the scene in view rather than the whole scene. This tendency is known as visual attention or saliency, and it is how we "spot" something. Our brains tell us where to focus attention based on prominent or conspicuous visual features such as brightness, colour, edges and motion, as well as priority-related features that distinguish something we are "looking for".

Understanding human visual saliency, or how and why human sight focuses on particular parts of a scene, is an important topic in research on computer vision, or visual AI. Visual AI that can approximate human visual saliency can help advance a broad range of technologies, from detecting particular targets in satellite imagery to safe navigation in driverless vehicles.

Many models have been developed to replicate or predict visual saliency, the most popular being the saliency map. A saliency map assigns each image pixel the probability that it will attract human attention, based on either conspicuous features or priority-related features, and is typically displayed as a heat map or gray-scale image. This is useful because it shows at a glance which parts of an image are salient and which are not.
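To make the idea concrete, the short sketch below normalizes a raw per-pixel score array into a [0, 1] saliency map and renders it as a heat map. It is a minimal illustration using NumPy and Matplotlib, not code from the published model; the array name, image size and output filename are placeholders.

import numpy as np
import matplotlib.pyplot as plt

def to_saliency_map(raw_scores: np.ndarray) -> np.ndarray:
    """Rescale raw per-pixel attention scores to [0, 1] so they can be
    read as probabilities and rendered as a gray-scale image or heat map."""
    s = raw_scores.astype(np.float64)
    s -= s.min()
    if s.max() > 0:
        s /= s.max()
    return s

# 'raw_scores' stands in for a model's per-pixel output on one image
# (hypothetical H x W array, here filled with random values).
raw_scores = np.random.rand(240, 320)
saliency = to_saliency_map(raw_scores)

plt.imshow(saliency, cmap="hot")   # heat-map view of the saliency map
plt.colorbar(label="probability of attracting attention")
plt.axis("off")
plt.savefig("saliency_heatmap.png", bbox_inches="tight")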

The Convolutional Neural Network (CNN) is the most widely used deep learning method for image processing. Specifically, a CNN extracts distinguishing visual features (e.g., 2-D spatial features) by applying a hierarchy of convolutional filters interleaved with multiple nonlinear transformations. Deep CNN models achieve high accuracy in tasks such as scene classification, object detection, image classification and semantic segmentation, but require a large volume of training data for superior performance.
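As a rough illustration of that hierarchy, the PyTorch sketch below stacks convolutional filters, each followed by a nonlinear transformation (ReLU) and downsampling, so that deeper layers summarize progressively larger image regions. It is a generic toy feature extractor, not any specific published network; the channel sizes are illustrative assumptions.

import torch
import torch.nn as nn

class TinyFeatureExtractor(nn.Module):
    """Stacked convolutional filters + nonlinearities: a minimal example of
    hierarchical 2-D spatial feature extraction."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.features(x)

x = torch.randn(1, 3, 224, 224)     # one RGB image, batch size 1
feats = TinyFeatureExtractor()(x)   # -> (1, 128, 56, 56) feature maps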

Although several deep learning models have been proposed to date, pixel-based classification of visual attention remains challenging. C-CORE proposes an encoder-decoder Fully Convolutional Network (FCN) model designed for training from scratch, with three inception modules and residual modules to improve performance.
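The sketch below is only a schematic of that general idea in PyTorch: an encoder-decoder FCN in which an inception-style block (parallel 1x1/3x3/5x5 convolutions) carries a residual connection, ending in a single-channel sigmoid output that can be read as a saliency map. It is not the architecture published in the paper; the layer counts, channel widths and block layout are assumptions made purely for illustration.

import torch
import torch.nn as nn

class InceptionResidualBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions concatenated, with a residual
    (skip) connection adding the block input back in."""
    def __init__(self, channels: int):
        super().__init__()
        branch = channels // 4
        self.b1 = nn.Conv2d(channels, branch, kernel_size=1)
        self.b3 = nn.Conv2d(channels, branch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(channels, branch * 2, kernel_size=5, padding=2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.act(out + x)   # residual connection

class SaliencyFCN(nn.Module):
    """Encoder downsamples to compact feature maps; decoder upsamples back to
    the input resolution and ends in a sigmoid, giving a per-pixel map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            InceptionResidualBlock(64),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            InceptionResidualBlock(128),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),   # one-channel saliency map
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

pred = SaliencyFCN()(torch.randn(1, 3, 256, 256))   # -> (1, 1, 256, 256)

Because the model is fully convolutional, the same network can in principle be applied to images of different sizes without changing its weights, which is part of what makes the encoder-decoder FCN design attractive for dense, pixel-wise prediction.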

A technical paper describing this FCN model has been published here: https://peerj.com/articles/cs-280/#p-1