Farewell to 2019: a decade of deep learning and the classics we should know


This article summarizes the most influential deep learning papers of the past decade, from ReLU, AlexNet, and GAN to Transformer and BERT. Each year also comes with honorable mentions, which include many other well-known results.

2011: the ReLU activation function


Deep Sparse Rectifier Neural Networks

Paper link: http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf (4071 citations)

The sigmoid activation function was widely used in early neural networks. Although it works well in shallow networks, it makes gradients decay easily as the number of layers grows. This 2011 paper formally proposed ReLU, which helps with the vanishing gradient problem and paved the way for deeper neural networks.

Sigmoid and its derivative.

Of course, ReLU has its own drawbacks. It is not differentiable at 0, and a neuron can "die" when its input stays negative, because its output and gradient are then always 0. Many improvements to ReLU have been proposed since 2011.
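A minimal sketch of why depth hurts sigmoid more than ReLU. It assumes an idealized chain in which the gradient reaching the first layer is just the product of each layer's local activation derivative (weights are ignored); the function names are illustrative and do not come from the paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    # ReLU is not differentiable at 0; using 0 there is a common convention.
    return 1.0 if x > 0 else 0.0


def chained_grad(grad_fn, x: float, depth: int) -> float:
    # Product of `depth` identical local derivatives (idealized, no weights).
    return grad_fn(x) ** depth


print(chained_grad(sigmoid_grad, 0.0, 10))  # best case for sigmoid: 0.25 ** 10
print(chained_grad(relu_grad, 1.0, 10))     # active ReLU units keep the gradient
print(chained_grad(relu_grad, -1.0, 10))    # a "dead" unit blocks it entirely
```

Because the sigmoid derivative never exceeds 0.25, ten such layers shrink the gradient by a factor of roughly a million, while active ReLU units pass it through unchanged.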

Honorable mentions of the year (most of this research focuses on improving activation functions):


Leaky ReLU: an improved activation function based on ReLU whose output is not 0 when x is negative.

Paper link: https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf
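As a small illustration, a ReLU variant that keeps a nonzero output for negative inputs can be written as below. The slope `alpha = 0.01` is an assumed default for this sketch, not a value taken from the text above.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # Positive inputs pass through; negative inputs are scaled by a small
    # slope instead of being clamped to 0, so the gradient never vanishes.
    return x if x > 0 else alpha * x


print(leaky_relu(2.0))
print(leaky_relu(-2.0))
```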


ELU (exponential linear unit)

Paper link: https://arxiv.org/abs/1511.07289


SELU (scaled exponential linear unit)

Paper link: https://arxiv.org/abs/1706.02515

GELU (Gaussian error linear unit): this activation function has been reported to outperform ReLU and is used in models such as BERT.

Paper link: https://arxiv.org/abs/1606.08415

2012: AlexNet sets off the deep learning wave


ImageNet Classification with Deep Convolutional Neural Networks

Paper link: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks (52025 citations)

AlexNet architecture.

AlexNet is often regarded as the starting point of the current wave of artificial intelligence. Its error rate in the ImageNet challenge was more than 10 percentage points lower than the previous champion's and 10.8 percentage points lower than the runner-up's. AlexNet was designed by the SuperVision group at the University of Toronto, which consisted of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever.

AlexNet is an 8-layer convolutional neural network that uses the ReLU activation function and has 60 million parameters in total. AlexNet's greatest contribution was demonstrating the power of deep learning. It was also among the first networks to use GPUs and parallel computing for acceleration.

AlexNet performed very well on ImageNet, cutting the recognition error rate from 26.2% to 15.3%. This remarkable improvement drew the industry's attention to deep learning and made the AlexNet paper the most cited in the field.

Honorable mentions of the year:


ImageNet is an image recognition dataset created under the leadership of Fei-Fei Li of Stanford University. It is a benchmark dataset for evaluating model performance in computer vision.

Paper link: http://www.image-net.org/papers/imagenet_cvpr09.pdf


Paper link: http://people.idsia.ch/~juergen/ijcai2011.pdf


Paper link: http://vision.stanford.edu/cs598_spring07/papers/lecun98.pdf

2013: word2vec, an NLP classic; the era of reinforcement learning begins


Word2vec is a model proposed by Tomas Mikolov and others on the Google research team. It learns continuous vector representations of words from very large datasets and has become a standard text encoding method in NLP. It is based on the idea that words appearing in similar contexts have similar meanings, so text can be embedded as vectors and used for downstream tasks.
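To make the context idea concrete, here is a hedged sketch of how (center word, context word) training pairs could be extracted for a skip-gram style model. The `skipgram_pairs` helper and the window size are illustrative assumptions; actual word2vec training involves much more than this step.

```python
def skipgram_pairs(tokens, window=2):
    # For every position, pair the center word with each neighbor that lies
    # within `window` positions on either side.
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the cat sat on the mat".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

Words that keep appearing with the same neighbors produce similar training pairs, which is what pushes their learned vectors close together.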

Honorable mentions of the year:


Paper link: https://nlp.stanford.edu/pubs/glove.pdf


Paper link: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf (3251 citations)

DeepMind's DQN model playing Atari games.

This year DeepMind proposed using DQN to play Atari games, which opened the door to deep reinforcement learning research. Reinforcement learning had mostly been used in low-dimensional environments and was hard to apply in more complex ones. Atari games were the first application of reinforcement learning in a high-dimensional environment. The study proposed the deep Q-learning algorithm, a value-based method that learns an action-value function from rewards.
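The value-based idea can be sketched with the classic tabular Q-learning update. This is a simplification under stated assumptions: DQN replaces the table with a neural network and adds further machinery, and the state names, `alpha`, and `gamma` below are hypothetical values chosen for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    # Target: immediate reward plus the discounted value of the best next
    # action (no bootstrap term when the episode has ended).
    best_next = 0.0 if done else max(q[next_state].values())
    target = reward + gamma * best_next
    # Move the current estimate a fraction `alpha` toward the target.
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Toy table with two states and two actions (hypothetical values).
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
print(q_learning_update(q_table, "s0", "right", reward=1.0,
                        next_state="s1", done=False, alpha=0.5, gamma=0.9))
```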

Honorable mentions of the year:


Paper link: http://www.cs.rhul.ac.uk/~Chris/new_thesis.pdf


2014: GAN, the attention mechanism, and Adam

Generative Adversarial Nets

Paper link: https://papers.nips.cc/paper/5423-generative-adversarial-nets (13917 citations)

The generative adversarial network (GAN) is an unsupervised learning method proposed by Ian Goodfellow and others, in which two neural networks compete against each other. Since GAN was proposed in 2014, it has attracted wide attention in computer vision.

GAN succeeded because it can generate realistic images. Through a minimax game between a generator and a discriminator, GAN can model complex, high-dimensional data distributions. The generator produces fake samples, while the discriminator judges whether a sample is real or generated.
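The minimax game described above is usually written as the following objective, where $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the data distribution, and $p_z$ the noise prior (symbols introduced here for illustration):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to maximize $V$ by scoring real samples near 1 and generated samples near 0, while the generator tries to minimize it by making $D(G(z))$ large.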

Honorable mentions of the year:


WGAN is an improved version of GAN that achieves better results.

Paper link: https://arxiv.org/abs/1701.07875


Images generated by StyleGAN.

Paper link: https://arxiv.org/abs/1812.04948

Neural Machine Translation by Jointly Learning to Align and Translate (attention mechanism)

Paper link: https://arxiv.org/abs/1409.0473 (9882 citations)

This paper introduced the idea of the attention mechanism. Instead of compressing all information into a single RNN hidden state, it keeps the whole context in memory, so every output can attend to all of the inputs. Beyond machine translation, attention is also used in models such as GAN.
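A hedged sketch of the core step: score every encoder state against the current decoder query, turn the scores into weights with a softmax, and take the weighted sum as the context. The paper learns its alignment scores with a small network; this sketch swaps in plain dot products to stay short, and all names and values are illustrative.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    # Dot-product score between the query and every encoder state.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of all encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


weights, context = attend([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
print(weights)
print(context)
```

Nothing is compressed away: every encoder state stays available, and the weights decide how much each one contributes to a given output.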


Adam: A Method for Stochastic Optimization

Paper link: https://arxiv.org/abs/1412.6980 (34082 citations)

Adam is widely used because it is easy to tune. It is based on the idea of adapting a separate learning rate for each parameter. Although recent papers have questioned Adam's performance, it is still the most popular optimizer in deep learning.
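A minimal sketch of one Adam update on plain Python lists, to show what per-parameter adaptation means. The hyperparameter defaults are the commonly used ones and are assumptions here, as is the `adam_step` helper name.

```python
import math


def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m and v hold running averages of the gradient and the squared gradient;
    # they are updated in place. t is the 1-based step counter.
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


# Two parameters whose gradients differ by a factor of 100.
params = [1.0, 1.0]
m, v = [0.0, 0.0], [0.0, 0.0]
params = adam_step(params, [2.0, 200.0], m, v, t=1, lr=0.1)
print(params)
```

Both parameters move by about the same amount on the first step even though their gradients differ by two orders of magnitude, because each update is divided by that parameter's own gradient scale.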

Honorable mentions of the year:


AdamW (Decoupled Weight Decay Regularization)

Paper link: https://arxiv.org/abs/1711.05101


RMSProp: an optimizer as well known as Adam.

Paper link: https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

2015: ResNet surpasses humans; the magic of batch normalization

Paper link: https://arxiv.org/abs/1512.03385 (34635 citations)

Residual block structure.

With the famous ResNet, neural networks surpassed human performance on a visual classification task for the first time. The method won the ImageNet 2015 and COCO competitions, as well as the CVPR 2016 best paper award. The authors are Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.

ResNet was originally designed to deal with vanishing and exploding gradients in deep CNNs. Today the residual block is a basic building block of almost every CNN architecture.

The idea is simple: add the input of each convolutional block to its output. The intuition behind residual networks is that adding layers should not make a network worse, because in the worst case the extra layers can simply be set to identity mappings. In practice, however, deeper networks are often harder to train. Residual connections make it easier for each layer to learn an identity mapping and reduce the vanishing gradient problem.
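The skip connection can be sketched in a few lines. This is an illustration only: `transform` stands in for the convolutional layers inside a block, and the toy transforms below are hypothetical.

```python
def residual_block(x, transform):
    # Output = input + learned residual. If `transform` outputs zeros,
    # the block reduces exactly to the identity mapping.
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_transform(x):
    # Stand-in for layers that have learned "do nothing".
    return [0.0 for _ in x]


def small_transform(x):
    # Stand-in for layers that apply a small correction.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
out = x
for _ in range(50):  # stacking 50 "do nothing" blocks leaves x unchanged
    out = residual_block(out, zero_transform)
print(out)
print(residual_block(x, small_transform))
```

Learning an all-zero residual is much easier than learning an exact identity through a stack of nonlinear layers, which is why extra residual blocks rarely hurt.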

Although the idea is not complicated, residual networks perform much better than conventional CNN architectures, especially in deeper networks.

A comparison between several CNN networks.

Many CNN architectures are competing for the top position. Here are some representative samples:

Inception v1 structure.

Honorable mentions of the year:



Paper link: https://arxiv.org/abs/1409.1556


Paper link: https://arxiv.org/abs/1806.07366 (nips2018 Best Paper Award)

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Paper link: https://arxiv.org/abs/1502.03167 (14384 citations)

Batch normalization is now a standard component of almost all neural networks. It is based on another simple but great idea: keep mean and variance statistics during training, and use them to rescale activations to zero mean and unit variance.
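A minimal sketch of the training-time computation for a single feature across a batch. The learnable scale `gamma`, shift `beta`, and the small constant `eps` are standard parts of the technique but are not mentioned in the text above; the running statistics used at inference time are omitted to keep the example short.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics are computed over the batch dimension.
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Normalize to zero mean and (approximately) unit variance,
    # then apply the scale and shift.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
print(normalized)
```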

The exact reason why batch normalization works is still not settled, but it is effective in practice.

Visualization of different normalization techniques.

1. Layer normalization

Paper link: https://arxiv.org/abs/1607.06450

2. Instance normalization

Paper link: https://arxiv.org/abs/1607.08022

3. Group normalization

Paper link: https://arxiv.org/abs/1803.08494

2016: conquering the most complex game with AlphaGo

AlphaGo's Nature paper: Mastering the Game of Go with Deep Neural Networks and Tree Search

Paper link: https://www.nature.com/articles/nature16961 (6310 citations)

For many people, modern AI began with AlphaGo, DeepMind's Go program. The AlphaGo research project started in 2014 to test how well a neural network using deep learning could compete at Go.

AlphaGo was a significant improvement over previous Go programs. In 500 games against other available Go programs (including Crazy Stone and Zen), AlphaGo running on a single computer won all but one. AlphaGo running on multiple computers won all 500 games against other Go programs and 77% of its games against the single-computer version. In October 2015, the distributed version, which used 1202 CPUs and 176 GPUs, beat the European Go champion Fan Hui (a 2-dan professional) 5-0, causing a sensation.

This was the first time a computer Go program defeated a human professional player on a full-size board (19 × 19) without a handicap. In March 2016, an enhanced AlphaGo that had improved through self-play beat world Go champion Lee Sedol 4-1, becoming the first computer program to beat a 9-dan professional without a handicap and earning a place in history. After the match, the Korea Baduk Association awarded AlphaGo an honorary professional 9-dan title.

Honorable mentions of the year:

Paper link: https://www.nature.com/articles/nature24270

As a follow-up to AlphaGo, DeepMind released the enhanced AlphaGo Zero in October 2017. This version needs no human professional game records and is stronger than its predecessors. Through self-play, AlphaGo Zero surpassed the level of AlphaGo Lee after three days of learning, reached the strength of AlphaGo Master after 21 days, and surpassed all previous versions within 40 days.

2017: the Transformer that almost everyone uses


The famous Transformer architecture appeared. In June 2017, Google announced a further step forward in machine translation: a machine translation architecture based entirely on attention, the Transformer. It surpassed Facebook's earlier results on the WMT 2014 translation tasks in multiple languages and reached a new state of the art.

The dominant sequence transduction models are based on complex RNNs or CNNs in an encoder-decoder configuration. The best-performing models also connect the encoder and decoder through an attention mechanism.

Google proposed a new, simple network architecture, the Transformer, which is based entirely on attention and dispenses with recurrence and convolution. Experiments on two machine translation tasks show that these models produce better translations while requiring much less training time. The new model achieved a BLEU score of 28.4 on the WMT 2014 English-to-German translation task, more than 2 BLEU ahead of the best previous results, including ensembles. On the WMT 2014 English-to-French translation task, after training on 8 GPUs for 3.5 days, it set a new single-model best BLEU score of 41.0, at a small fraction of the training cost of the best models in the literature.
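The attention operation at the heart of the Transformer is commonly written as scaled dot-product attention. The symbols are not defined in the text above: $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Every position is compared with every other position in a single matrix product, which is what lets the model drop recurrence and train in parallel across the sequence.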

The Transformer also generalizes well to other tasks: it was successfully applied to English constituency parsing with both large and limited amounts of training data.

Paper link: https://openreview.net/forum?id=r1ue8hcxg (1186 citations)

Neural architecture search (NAS) is the process of automatically designing artificial neural networks (ANNs), which are widely used in machine learning. Networks designed by NAS methods perform on par with or even better than hand-designed ones. NAS methods can be classified by search space, search strategy, and performance estimation strategy. Other methods, such as Regularized Evolution for Image Classifier Architecture Search (AmoebaNet), use evolutionary algorithms.

2018: the pre-trained model boom

The highlight is, of course, Google's NLP pre-trained model from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, which has 3025 citations.

Paper link: https://arxiv.org/abs/1810.04805

This paper introduces a new language representation model, BERT. Unlike other recent language representation models, BERT pre-trains deep bidirectional representations conditioned on both left and right context in all layers. BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large number of sentence-level and token-level tasks. It outperforms many systems with task-specific architectures and sets new state-of-the-art results on 11 NLP tasks.

Honorable mentions of the year:

Since the birth of BERT, Transformer-based language models have exploded in number. It's hard to say which of these papers is the most influential.

1. Deep Contextualized Word Representations (ELMo)

Paper link: https://arxiv.org/abs/1802.05365

2. Improving Language Understanding by Generative Pre-Training (GPT)

Paper link: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

3. Language Models are Unsupervised Multitask Learners: GPT-2, a 1.5-billion-parameter pre-trained model released by OpenAI in February.

Paper link: https://d4mufpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf


Previously, a fixed context length limited the potential for learning long-range dependencies. The new neural architecture proposed in this paper, Transformer-XL, can learn dependencies beyond a fixed length without disrupting temporal coherence, and it also solves the context fragmentation problem.

Paper link: https://arxiv.org/abs/1901.02860


Before the impact of BERT had subsided, XLNet, proposed by CMU and Google Brain in June, outperformed BERT on 20 tasks and achieved state-of-the-art results on 18 of them.

Paper link: https://arxiv.org/abs/1906.08237

Paper link: https://arxiv.org/abs/1508.07909

2019: improving the principles of deep learning

The double descent phenomenon discussed in the paper Deep Double Descent: Where Bigger Models and More Data Hurt runs contrary to popular views in both classical machine learning and modern deep learning.

In this paper, the researchers show that a variety of modern deep learning tasks exhibit double descent: as model size increases, performance first gets worse and then gets better. They also show that double descent occurs not only as a function of model size but also as a function of the number of training epochs. The researchers define a new complexity measure, called effective model complexity, to unify these phenomena, and conjecture a generalized double descent with respect to this measure. In addition, their notion of model complexity lets them identify certain regimes in which increasing (or even quadrupling) the number of training samples actually hurts test performance.

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, a paper from MIT CSAIL researchers, won a best paper award at ICLR 2019.

Paper link: https://arxiv.org/abs/1803.03635

The researchers found that standard neural network pruning techniques naturally uncover subnetworks whose initializations make them able to train effectively. Based on these results, they propose the lottery ticket hypothesis: a dense, randomly initialized feedforward network contains subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations.
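A hedged sketch of the pruning step behind this idea: drop the smallest-magnitude trained weights, then reset the surviving weights to their original initial values before retraining. The helper names and toy weights are hypothetical, and real experiments prune layer by layer over several rounds.

```python
def magnitude_mask(weights, sparsity):
    # Mark the fraction `sparsity` of weights with the smallest absolute
    # value as pruned (0) and keep the rest (1).
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def winning_ticket(initial_weights, trained_weights, sparsity):
    # The mask comes from the TRAINED weights, but the surviving weights
    # are rewound to their INITIAL values before the subnetwork is retrained.
    mask = magnitude_mask(trained_weights, sparsity)
    ticket = [w0 * m for w0, m in zip(initial_weights, mask)]
    return ticket, mask


initial = [0.5, -0.3, 0.8, 0.1]
trained = [0.05, -1.2, 0.9, -0.01]
ticket, mask = winning_ticket(initial, trained, sparsity=0.5)
print(mask)
print(ticket)
```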


Thanks to breakthroughs in deep learning and gradient-based neural network techniques, the past decade was a period of rapid development for artificial intelligence. This is largely due to significant improvements in chip computing power, which let neural networks grow larger and more capable. From computer vision to natural language processing, new methods have largely replaced traditional AI techniques.

However, neural networks also have drawbacks: they need large amounts of labeled data for training, cannot explain their own inference mechanisms, and are difficult to extend beyond a single task. Still, as deep learning drives the rapid development of the AI field, more and more researchers are working on these challenges.

Over the next few years, our understanding of neural networks will keep growing. The future of artificial intelligence remains bright: deep learning is the most powerful tool in AI, and it will bring us closer to real intelligence.

Let's look forward to new achievements in 2020.