From Human Vision to Computer Vision: Towards Spike-Based Visual Intelligence and Neuromorphic Computing (4/4)

Published in

Becoming Human: Artificial Intelligence Magazine

10 min read · Apr 8, 2020

In the last story, I talked about one of the most important breakthroughs in computer vision, the Convolutional Neural Network (CNN). Today, CNNs are widely used in systems that process visual and spatial information, and they can be viewed as image feature extractors and universal non-linear function approximators. They have achieved satisfying accuracy on complex tasks such as object recognition, semantic segmentation, depth and motion estimation, and visual odometry, which are crucial for the development of many visually automated systems. However, human vision remains unrivaled in performance and robustness compared with artificial vision. As a result, prototypes of many systems today, e.g. automated driving systems, still rely heavily on a diverse range of sensors to make up for this limitation.

Moreover, as we attempt to solve more advanced problems, the increasing demand for computing and power resources has become one of the most prominent issues for CNNs, as they require extensive use of energy-intensive high-end graphics cards. This issue has led to a shift in attention towards spiking neural networks (SNNs), a type of ANN inspired by the neural dynamics of the brain. Aside from being known for their biological plausibility, they have also been shown to be less computationally expensive than conventional DNNs and more compatible with dynamic event-based data, which could be applied to real-time visual systems. They are considered the third generation of neural networks due to their event-driven, fast-inference, and power-efficient nature.

So in this article, I wish to introduce you guys to novel Spiking Neural Networks (SNNs) and discuss how they, in combination with neuromorphic computing, could lead to a new breakthrough in achieving more dynamic and intelligent behavior in machine vision. It will also be the final article of the series, so I hope you guys enjoy it!

Visual Attention (VA) and Intelligence

One key element that differentiates human vision from current artificial vision is the ability to swiftly shift focus and attention. In humans, divided attention allows a person to perform two or more tasks simultaneously, and attentional shifting allows a person to quickly change focus and gain access to new information. It is therefore expected that a high degree of divided attention and attentional shifting is associated with a high degree of intelligence. VA has long been studied extensively in computer vision and has been recognised as one of the crucial keys to the next breakthrough in artificial intelligence.

So what makes visual attention so special and worth paying attention to? In both natural and artificial vision systems, raw sensory inputs are captured for further processing. Because the amount of raw input is immense, restricting the data flow is crucial; otherwise it could exceed what the processing system can compute in real time. Different mechanisms of data reduction are used, such as a motor shift of the eyes (or camera) toward a certain object or entity (stimulus) localised in the visual field, which is known as overt attention. Covert attention, on the other hand, relies on selecting information without any eye movement, shifting attention only mentally. This selectivity allows the visual system to attend only to the object(s) considered important at that specific moment while ignoring the less significant stimuli. This reduces the required computation immensely, as the system no longer has to process every single stimulus presented in the visual field. Therefore, it can be said that the core objective of visual attention is to minimise the amount of visual information that must be processed to solve complex high-level tasks, e.g. object recognition.

The general problems of VA in machine vision:

How can the system know what information is significant enough?
How does the visual system know when and how to direct attention and select significant information rather than doing so randomly?
Where is (are) the next potential target(s) of visual attention shifts? That is, how does it know where to actually focus its attention?

Before anyone attempts to answer the questions above, it is important to first understand the pipeline of the natural visual system and how such a mechanism could be used to develop a selective attention model in artificial vision.

Towards Event-Based Vision

While computer vision tasks primarily involve processing static images (or sequences of them, such as frames in a video), biological vision has been shown to process and emit fewer signals, mainly about changes occurring in the environment at a certain point in time. In simple words, cells in your eye only convey information to the brain when they detect a change in the scene (an event), and report nothing at all when no changes are detected. This key characteristic of biological vision systems allows attention to be focused selectively on the salient portions of the scene, drastically reducing the amount of information that needs to be processed. Take the frames captured from a video below as an example.

https://www.prophesee.ai/2019/07/28/event-based-vision-2/

In conventional sensors, data is conveyed in frames, which means that everything in the image is processed, including the sky, trees, and grass, while the only important information is actually the movement of the person, the swing of the golf club, and the movement of the ball. To avoid this overprocessing of irrelevant information, event-based sensors were introduced. Event-based sensors send out data packets, or events, from each pixel asynchronously whenever a local brightness change is detected at that pixel, rather than reading every single pixel and sending out frames at a constant rate. Such event-based sensing allows us to perform some vision tasks extremely efficiently, reducing the amount of required computation, transmitted data, and power consumption. Researchers have also shown that collecting statistics on event-based sensors could pave the way to full visual reconstruction. This is also where spiking neural networks step in.
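To make the idea of "only report what changed" more concrete, here is a minimal Python sketch. It is a simplification and not how any particular sensor is implemented: a real event-based sensor works asynchronously per pixel, while this toy version just compares two frames. The function name generate_events, the log-brightness comparison, and the threshold value of 0.2 are all assumptions made for illustration.

```python
import math

def generate_events(prev_frame, next_frame, timestamp, threshold=0.2):
    """Return (x, y, timestamp, polarity) for every pixel whose
    log-brightness change is at least `threshold` (an assumed value)."""
    events = []
    for y, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for x, (before, after) in enumerate(zip(prev_row, next_row)):
            # Compare log intensities; the +1 avoids log(0) for dark pixels.
            delta = math.log(after + 1.0) - math.log(before + 1.0)
            if abs(delta) >= threshold:
                polarity = 1 if delta > 0 else -1  # brighter or darker
                events.append((x, y, timestamp, polarity))
    return events

# Two tiny 3x4 "frames": only two pixels change between them.
frame_a = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
frame_b = [
    [10, 10, 10, 10],
    [10, 50, 10, 10],
    [10, 10, 2, 10],
]

events = generate_events(frame_a, frame_b, timestamp=0.001)
print(events)
print(len(events), "events instead of", len(frame_a) * len(frame_a[0]), "pixel values")
```

In this toy example only two events are produced, one "brighter" event at pixel (1, 1) and one "darker" event at pixel (2, 2), instead of all twelve pixel values being sent again.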

Spiking Neural Network and Neuromorphic Computing

In accordance with what was mentioned in the last article, the picture above depicts biological neurons and how they communicate with one another via action potentials (which produce what are known as ‘spikes’). A collection of spikes through time is known as a spike train, as shown in the image below. A spike train can be thought of as a collection of data (which, in this case, is a function of time).
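As a rough illustration of "a spike train is data as a function of time", the sketch below stores a spike train in two common ways: as a list of spike times and as a binary sequence with one slot per time step. The helper names, the 1 ms time step, and the example spike times are assumptions chosen for this example.

```python
def spike_times_to_train(spike_times_ms, duration_ms, dt_ms=1):
    """Turn a list of spike times (in ms) into a binary spike train
    with one entry per time step of length dt_ms."""
    n_steps = duration_ms // dt_ms
    train = [0] * n_steps
    for t in spike_times_ms:
        index = t // dt_ms
        if 0 <= index < n_steps:
            train[index] = 1  # a spike happened in this time step
    return train

def mean_firing_rate_hz(train, dt_ms=1):
    """Average number of spikes per second over the whole train."""
    return sum(train) * 1000.0 / (len(train) * dt_ms)

spike_times_ms = [2, 3, 7, 12, 18]  # made-up spike times
train = spike_times_to_train(spike_times_ms, duration_ms=20)
print(train)
print(mean_firing_rate_hz(train), "Hz")
```

Both forms carry the same information. The binary form is convenient for step-by-step simulation, while the list of spike times is more compact when spikes are rare.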

In traditional ANNs, the non-spiking neurons (see Fig 1.1) use differentiable, non-linear activation functions to propagate information between units, which allow units to be stacked into multiple layers.

Fig 1.1 A graphical illustration of a perceptron, a single layer neural network in conventional ANN

The differentiability of these neurons is also what makes learning through backpropagation via gradient-based optimization possible. The main difference between traditional ANNs and SNNs is that SNNs adopt “spiking neurons”, which use pulses, or “spikes”, as the means of communication, propagating information between units over time in a brain-like manner instead of using continuous activation values (see Fig 1.2 and 1.3). This spatio-temporal property (def: involving space and time) of spiking neurons is also what makes SNNs one of the most promising candidates for processing temporally dynamic visual data captured as a function of time by event-based sensors, as well as for classical frame-based machine vision applications such as object recognition or detection, where they have been shown to be accurate, fast, and efficient, especially when run on neuromorphic hardware.
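To give a feel for how a spiking neuron differs from a unit with a continuous activation, here is a minimal sketch of a leaky integrate-and-fire (LIF) neuron. LIF is one commonly used spiking neuron model and is my choice for illustration; this article does not commit to a specific model. All parameter values (time constant, threshold, refractory length) are assumptions.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0, v_rest=0.0,
                 v_threshold=1.0, v_reset=0.0, refractory_steps=2):
    """Simulate one leaky integrate-and-fire neuron.
    Returns (spikes, potentials), one entry per time step."""
    v = v_rest
    refractory_left = 0
    spikes, potentials = [], []
    for current in input_current:
        if refractory_left > 0:
            # Right after a spike the neuron ignores its input for a while.
            refractory_left -= 1
            v = v_reset
            spikes.append(0)
        else:
            # Euler step: the potential leaks toward rest and integrates input.
            v += dt / tau * (-(v - v_rest) + current)
            if v >= v_threshold:
                spikes.append(1)          # emit a spike
                v = v_reset               # reset the membrane potential
                refractory_left = refractory_steps
            else:
                spikes.append(0)
        potentials.append(v)
    return spikes, potentials

weak_spikes, _ = simulate_lif([0.8] * 100)    # input stays below threshold
strong_spikes, _ = simulate_lif([2.0] * 100)  # input can cross threshold
print("weak input spikes:", sum(weak_spikes))
print("strong input spikes:", sum(strong_spikes))
```

The output of the neuron is a sequence of 0s and 1s over time rather than a single continuous number. With these assumed parameters, the weak input never drives the potential over the threshold, so the neuron stays silent, while the strong input makes it fire repeatedly.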

Fig 1.3: Representation of Spiking neuron network

The network was initially developed to shed some light on the computing dynamics of the brain. Interestingly, in terms of engineering motivation, SNNs also hold apparent advantages over traditional neural networks in speed and power consumption when implemented on neuromorphic hardware platforms, which could resolve the power-consumption issue faced by CNNs. This is due to the unique nature of these networks, in which output spike trains can be made sparse in time. Since each spike consumes energy, using few spikes that each carry high information content can effectively lower the total energy consumption. Neuromorphic systems and hardware designs are also based on this spiking property, and together with the implementation of SNNs, neuromorphic systems could play a key role in the progression of next-generation artificial intelligence.
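A back-of-the-envelope way to see why sparse spikes can save work is to count operations. The sketch below is not an energy measurement. It only counts how many synaptic updates a fully connected layer would perform if it were evaluated densely at every time step, versus only when an input actually spikes. The layer size and the spike pattern are made up for illustration.

```python
def dense_ops(n_inputs, n_outputs, n_steps):
    """Multiply-accumulate count if every input is processed at every step."""
    return n_inputs * n_outputs * n_steps

def event_driven_ops(spike_steps, n_outputs):
    """Accumulate count if only inputs that spiked trigger an update."""
    total_spikes = sum(sum(step) for step in spike_steps)
    return total_spikes * n_outputs

# 5 time steps, 8 input neurons, sparse activity (4 spikes in total).
spike_steps = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
n_outputs = 4

dense_count = dense_ops(n_inputs=8, n_outputs=n_outputs, n_steps=len(spike_steps))
event_count = event_driven_ops(spike_steps, n_outputs)
print("dense updates:", dense_count)
print("event-driven updates:", event_count)
```

The saving depends entirely on how sparse the spikes are. If most inputs spiked at most time steps, the event-driven count would approach the dense one.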

Since SNNs have been shown to be effective at processing sensor information in real time, they could become extremely beneficial in dynamic visual systems such as autonomous vehicles. For example, they could improve emergency brake assistants, for which challenging weather conditions and suddenly appearing vehicles or pedestrians are the main risk factors during high-speed maneuvering.

Visual Processing in CSNN, mimicking the visual system in human brain

Aside from selective attention models, various SNN models have been proposed in recent years to solve object recognition tasks, including hybrid types such as the Convolutional Spiking Neural Network (CSNN), which applies conversion algorithms to a conventional CNN, mapping its weights onto a network that operates on spike signals with leaks and refractory periods. The main idea behind this hybrid architecture is to replace each CNN classifier unit with a spiking neuron whose firing rate is correlated with the output of that unit (shown in image above).
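The "firing rate correlated with the output of that unit" idea can be sketched for a single unit. Below, a ReLU output is fed as a constant input into a simple integrate-and-fire neuron, and the fraction of time steps in which it spikes is compared with the original activation. This is a simplified assumption of how such a conversion behaves: it leaves out leaks and refractory periods, uses a "reset by subtraction" rule that I picked for the example, and assumes activations are scaled to lie between 0 and 1.

```python
def relu(x):
    return max(0.0, x)

def if_neuron_rate(activation, n_steps=1000, threshold=1.0):
    """Drive an integrate-and-fire neuron with a constant input and
    return the fraction of time steps in which it spiked."""
    membrane = 0.0
    spike_count = 0
    for _ in range(n_steps):
        membrane += activation
        if membrane >= threshold:
            spike_count += 1
            membrane -= threshold  # reset by subtraction keeps the remainder
    return spike_count / n_steps

pre_activations = [-0.5, 0.1, 0.25, 0.7]  # made-up unit inputs
results = []
for pre in pre_activations:
    activation = relu(pre)
    rate = if_neuron_rate(activation)
    results.append((activation, rate))
    print("ReLU output:", activation, "firing rate:", rate)
```

Under these assumptions the firing rate tracks the ReLU output closely, and a negative pre-activation gives a silent neuron. The match only holds for activations up to the threshold, since the neuron cannot spike more than once per time step.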

The Current Limitation

Despite their promising potential, in practice SNNs have a very challenging drawback: they have proven difficult to train, especially when the network has multiple layers. One reason is the lack of effective training and learning algorithms, as the spike function adopted by the neurons is non-differentiable, whereas backpropagation, which relies on the differentiability of the neurons to train ANNs in a supervised manner, is what made the CNN one of the most powerful, if not the most powerful, object classification/recognition tools to date. Many researchers believe that the performance of SNNs can be improved to catch up with that of ANNs by embedding deep architectures into the network (Machado, Cosma, and McGinnity, 2019; Tavanaei et al., 2019; Xu, 2019). To close this gap between continuous-valued ANNs and SNNs, there is a crucial need to develop learning methods that can support deep (multi-layer) SNNs with error rates as low as those of their conventional counterparts. Successful approaches have been demonstrated, including direct training of SNNs using backpropagation and applying stochastic gradient descent to the SNN classifier layers (Stromatias et al., 2017). Spike-Timing-Dependent Plasticity (STDP), a learning rule inspired by synaptic plasticity in the brain that can be applied in both supervised and unsupervised manners, has also been extensively studied due to its biologically plausible nature and its potential for low-power, on-chip local learning.
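To show what a local learning rule like STDP looks like in code, here is a minimal sketch of the commonly used pair-based form: if the input (pre-synaptic) neuron fires shortly before the output (post-synaptic) neuron, the connection is strengthened, and if it fires shortly after, the connection is weakened. The exponential window, the parameter values, and the function names are assumptions for illustration and are not taken from the papers cited above.

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:
        # Pre fired before post: strengthen the synapse.
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        # Pre fired after post: weaken the synapse.
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0

def apply_stdp(weight, pre_spikes, post_spikes, w_min=0.0, w_max=1.0):
    """Sum the changes over all spike pairs and keep the weight in bounds."""
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            weight += stdp_delta_w(t_pre, t_post)
    return min(w_max, max(w_min, weight))

print(stdp_delta_w(t_pre=10.0, t_post=15.0))  # positive change
print(stdp_delta_w(t_pre=15.0, t_post=10.0))  # negative change
print(apply_stdp(0.5, pre_spikes=[10.0, 30.0], post_spikes=[15.0, 35.0]))
```

The update only needs the spike times of the two neurons on either side of the connection, with no gradient passed back through the network. That locality is what makes this kind of rule attractive for on-chip learning.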

Conclusion and Remarks

As one can see, we are now one step closer to achieving biological-like vision. We have come a long way from a simple neural network to CNNs and, eventually, to SNNs. While CNNs perform very well at object recognition in static images, they lack the dynamic nature needed to process real-time, time-dependent data from newly developed event-based sensors, making SNNs a more promising candidate for real-time object recognition and processing tasks. Moreover, many studies have shown that SNNs have the potential to replace power-hungry CNNs, as spiking algorithms can be implemented on neuromorphic systems.

I hope everyone who read through the series now has a better idea of the progression of computer vision and how neuroscience has greatly contributed to breakthroughs in such a fascinating field (and will continue to do so). In the next article, I will dive deeper into the technical properties of Spiking Neural Networks, including the encoding and learning rules. I have left a list of references that I used in this article, which could serve as additional reading for those who are interested. Do share my article if you find it useful. See you next time!
