Neural network interpretability, new deep learning methods, what are the unstoppable research trends in AI in 2020?



The last academic conference of 2019 tells us what to research in 2020!

Text | MrBear

As the last heavyweight international AI conference of 2019, NeurIPS 2019 reflected several research trends, such as the interpretability of neural networks, new deep learning methods, and neuroscience, that should be a useful reference for anyone planning research work in the new year.
NeurIPS 2019 featured 51 workshops, accepted 1,428 papers, and drew more than 13,000 attendees, making it one of the year's most anticipated events.
Chip Huyen, an engineer from NVIDIA, gave a comprehensive summary of the key research trends reflected in NeurIPS 2019 based on her own experience attending the conference.
Let’s look at them one by one:

1


Deconstructing the black box of deep learning

Recently, researchers have reflected a lot on the limitations of deep learning. Here are a few examples:
  • Facebook's head of AI voiced concern about hitting a computational wall: AI companies should not count on making progress simply by building ever-larger deep learning systems, because "right now, an experiment may cost seven figures, but it is not going to grow to nine or ten figures, because nobody can afford that."
  • Yoshua Bengio criticized Gary Marcus and others for repeatedly pointing out the limitations of deep learning, summarizing Marcus's position as "Look, I told you deep learning doesn't work," a characterization Marcus disputed.
  • In response to this trend, Yann LeCun said: "I don't understand why suddenly we see a lot of news and tweets claiming that progress in AI is slowing down or that deep learning is hitting a wall. In the past five years, I have pointed out these two limitations and challenges in almost every speech. So, recognizing these limitations is nothing new. And, in fact, the development of AI has not slowed down."
In this environment, we are excited to see an explosion in the number of papers exploring the theory behind deep learning (why does deep learning work? How does it work?).
At this year's NeurIPS, 31 papers explored these questions from different angles. The conference's Outstanding New Directions Paper Award went to Vaishnavh Nagarajan and J. Zico Kolter's paper "Uniform convergence may be unable to explain generalization in deep learning".
They argue that uniform convergence theory by itself cannot explain the generalization ability of deep learning: as the size of the dataset grows, the theoretical bound on the generalization gap (the difference between a model's performance on seen and unseen data) grows, while the empirical generalization gap shrinks.
  • Paper link: https://arxiv.org/abs/1902.04742
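To make the quantity concrete: below is a toy sketch (a logistic-regression classifier on synthetic data, nothing like the paper's setting) of how the empirical generalization gap, accuracy on seen data minus accuracy on unseen data, can be measured as the training set grows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n):
    # synthetic binary classification data: the label depends on the first feature
    X = rng.normal(size=(n, 50))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_test, y_test = sample(10_000)
for n in (100, 1_000, 10_000):
    X_train, y_train = sample(n)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # empirical generalization gap: accuracy on seen data minus accuracy on unseen data
    gap = clf.score(X_train, y_train) - clf.score(X_test, y_test)
    print(f"n={n:6d}  empirical gap={gap:.3f}")
```

In this toy setup the empirical gap shrinks as n grows, which is the trend the paper contrasts with uniform-convergence bounds that can grow with n.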

Figure 1: How the generalization gap and the generalization bound change with the size of the training set

The Neural Tangent Kernel (NTK) is a research direction proposed in recent years for understanding the optimization and generalization of neural networks. It came up in many of this year's NeurIPS highlights, and in many of my conversations at the conference.
Arthur Jacot et al. built on the well-known result that, in the infinite-width limit, fully connected neural networks are equivalent to Gaussian processes, to study their training dynamics in function space rather than parameter space. They proved that during gradient descent on the parameters of an artificial neural network, the network function (which maps input vectors to output vectors) follows the kernel gradient of the functional cost with respect to a new kernel: the NTK.
They also showed that as width grows, the empirical NTK of a finite network converges to this infinite-width limit kernel, which then stays constant during training.
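For intuition, the empirical (finite-width) NTK between two inputs x1 and x2 is just the inner product of the parameter gradients of the network output, k(x1, x2) = ∇_θ f(x1) · ∇_θ f(x2). Here is a minimal PyTorch sketch (illustrative names and sizes, not anyone's released code):

```python
import torch

torch.manual_seed(0)
# a small fully connected network with a scalar output
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def param_grad(x):
    # gradient of the network output f(x) with respect to all parameters, flattened
    net.zero_grad()
    net(x).squeeze().backward()
    return torch.cat([p.grad.reshape(-1) for p in net.parameters()])

x1, x2 = torch.randn(1, 3), torch.randn(1, 3)
ntk_entry = torch.dot(param_grad(x1), param_grad(x2))  # empirical NTK value k(x1, x2)
print(ntk_entry.item())
```

In the infinite-width limit this kernel becomes deterministic and stays fixed throughout training, which is what makes the function-space analysis tractable.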
Below, we list some papers based on NTK at this year's NeurIPS:
  • Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers, paper link: https://arxiv.org/abs/1811.04918
  • On the Inductive Bias of Neural Tangent Kernels, paper link: http://papers.nips.cc/paper/9449-on-the-inductive-bias-of-neural-tangent-kernels
However, many people believe that NTK cannot fully explain deep learning. For a neural network to approach the NTK regime, it needs hyperparameter settings that are rarely used in real training: a small learning rate, very large width, and no weight decay.
The NTK argument also states that neural networks will only generalize as well as kernel methods, but in our experience they can generalize better.
The paper "Regularization Matters: Generalization and Optimization of Neural Nets vs their Induced Kernel" by Colin Wei et al. theoretically proves that neural networks with weight decay have better generalization ability than NTK, which shows that studying L2 regularized neural networks can provide better research ideas for generalization problems. Link to this paper:
  • https://nips.cc/Conferences/2019/Schedule?showEvent=14579
There are also several papers at this year's NeurIPS that show that traditional neural networks can have better performance than NTK:
  • What Can ResNet Learn Efficiently, Going Beyond Kernels? Paper link: http://papers.nips.cc/paper/9103-what-can-resnet-learn-efficiently-going-beyond-kernels
  • Limitations of Lazy Training of Two-layers Neural Network, paper link: http://papers.nips.cc/paper/9111-limitations-of-lazy-training-of-two-layers-neural-network
Many papers analyze the performance of different components of neural networks. For example, Chulhee Yun et al. proposed "Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity", which shows that "a 3-layer ReLU network with O(sqrt(N)) hidden nodes can perfectly memorize most datasets with N data points."
Shirin Jalali et al. open their paper "Efficient Deep Learning of Gaussian Mixture Models" with the following question: the universal approximation theorem tells us that a neural network with a single hidden layer can approximate any sufficiently regular function.
So does adding depth make networks more efficient? They show that, for the optimal Bayesian classification of Gaussian mixture models, such functions can be approximated to arbitrary accuracy using o(exp(n)) nodes in a single-hidden-layer network, but only o(n) nodes in a two-layer network.
In a more practical paper, "Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence", Fengxiang He and his team trained 1,600 ResNet-110 and VGG-19 models on the CIFAR datasets using stochastic gradient descent (SGD). They found that the generalization ability of these models correlates negatively with batch size, positively with learning rate, and negatively with the ratio of batch size to learning rate. A toy sketch of this kind of sweep follows the link below.
  • Paper link: https://papers.nips.cc/paper/8398-control-batch-size-and-learning-rate-to-generalize-well-theoretical-and-empirical-evidence
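The paper's experiments used ResNet-110 and VGG-19 on CIFAR; the following is only a toy sketch, on synthetic data, of the kind of sweep that relates the batch size / learning rate ratio to the train-test gap (all names and sizes here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X[:, :2].sum(dim=1) > 0).long()           # simple synthetic labels
train, test = TensorDataset(X[:1500], y[:1500]), TensorDataset(X[1500:], y[1500:])

def accuracy(model, ds):
    with torch.no_grad():
        Xs, ys = ds.tensors
        return (model(Xs).argmax(dim=1) == ys).float().mean().item()

for batch_size in (16, 256):
    for lr in (0.01, 0.1):
        model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loader = DataLoader(train, batch_size=batch_size, shuffle=True)
        for _ in range(5):                      # a few epochs of plain SGD
            for xb, yb in loader:
                opt.zero_grad()
                torch.nn.functional.cross_entropy(model(xb), yb).backward()
                opt.step()
        gap = accuracy(model, train) - accuracy(model, test)
        print(f"batch={batch_size:4d}  lr={lr:.2f}  batch/lr={batch_size / lr:8.1f}  gap={gap:.3f}")
```

With only a few epochs on toy data the trend will be noisy; the point is the shape of the sweep, not the numbers.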

Figure 2: Test accuracy as a function of batch size and learning rate. The four rows are: (1) ResNet-110 trained on CIFAR-10, (2) ResNet-110 trained on CIFAR-100, (3) VGG-19 trained on CIFAR-10, (4) VGG-19 trained on CIFAR-100. Each curve aggregates 20 networks.
Meanwhile, Yuanzhi Li et al.'s paper "Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks" states: "A two-layer network trained with a large initial learning rate and annealing generalizes better than the same network trained with a small initial learning rate throughout. This is because the small-learning-rate model first memorizes low-noise, hard-to-fit patterns, and then generalizes worse on higher-noise, easier-to-fit patterns than the large-learning-rate model."
  • Paper address: https://arxiv.org/abs/1907.04595
Although these theoretical analyses are very attractive and important, it is difficult to aggregate them into a large research system because each of them focuses on a relatively narrow aspect of the entire system.

2


New deep learning methods

At this year's NeurIPS, researchers proposed a series of novel methods that did not just add new network layers to other people's work. The three directions I am interested in are: Bayesian learning, graph neural networks, and convex optimization.
1. Using Bayesian principles for deep learning
As Emtiyaz Khan emphasized in his talk “Deep Learning with Bayesian Principles”, there is a big difference between Bayesian learning and deep learning.
According to Khan, deep learning uses a "trial and error" approach, where we experiment to see what results we get, whereas the Bayesian principle forces you to consider a hypothesis (a prior) in advance.

Figure 3: Comparison between Bayesian learning and deep learning
Bayesian deep learning has two main advantages over conventional deep learning: uncertainty estimation and better generalization on small datasets.
In real-world applications, it is not enough for a system to make predictions; it also matters how confident the system is in each prediction. For example, in cancer prediction, the treatment plan for a prediction made with 50.1% confidence differs from one made with 99.9% confidence. In Bayesian learning, uncertainty estimates come built in.
Traditional neural networks give single point estimates—they output a prediction for a data point using a set of weights. Bayesian neural networks, on the other hand, use a probability distribution over the network’s weights and output the average prediction over all the combinations of weights in that distribution, which is the same as averaging over many neural networks.
Bayesian neural networks are therefore a natural ensemble that acts like a regularizer and prevents overfitting.
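A minimal sketch of that averaging step, assuming a toy factorized Gaussian "posterior" over the weights of a one-layer classifier (in practice the posterior would come from variational inference or a similar approximation; every name below is illustrative):

```python
import torch

torch.manual_seed(0)
# toy posterior: independent Gaussians over the weights and bias of a 1-layer classifier
w_mean, w_log_std = torch.zeros(3), torch.full((3,), -2.0)
b_mean, b_log_std = torch.zeros(1), torch.full((1,), -2.0)

def predict(x, n_samples=200):
    preds = []
    for _ in range(n_samples):
        w = w_mean + w_log_std.exp() * torch.randn(3)   # sample one weight configuration
        b = b_mean + b_log_std.exp() * torch.randn(1)
        preds.append(torch.sigmoid(x @ w + b))
    preds = torch.stack(preds)
    # predictive mean (the "average over many networks") and its spread as an uncertainty proxy
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(5, 3)
mean, std = predict(x)
print(mean, std)
```

The spread of the sampled predictions is what gives each prediction an uncertainty estimate alongside its mean.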
Training Bayesian neural networks with millions of parameters still requires a very high computational overhead. It may take weeks for the network to converge to a posterior, so approximate methods such as variational inference are becoming more and more popular. A total of 10 papers in the "Probabilistic Methods-Variational Inference" section of this year's NeurIPS are related to this type of variational Bayesian method.
Here are three papers on Bayesian deep learning recommended to you at this year's NeurIPS:
  • Importance Weighted Hierarchical Variational Inference (https://arxiv.org/abs/1905.03290)
  • A Simple Baseline for Bayesian Uncertainty in Deep Learning (https://arxiv.org/abs/1902.02476)
  • Practical Deep Learning with Bayesian Principles (https://arxiv.org/abs/1906.02506)
2. Graph Neural Network (GNN)
Over the years, I have often said that graph theory is one of the most underrated topics in machine learning, and I am very happy that work on graphs has been highlighted at NeurIPS this year.

"Learning Graph Representations" was the most popular workshop at this year's NeurIPS. It's amazing how much progress has been made in this field. Back in 2015, when I started working on graph neural networks during my internship, I didn't expect that so many researchers would be involved in this field.
Graphs are an elegant and natural representation for many kinds of data (e.g., social networks, knowledge bases, states of games). User-item data for recommender systems can be represented as a bipartite graph, where one disjoint set consists of users and the other consists of items.
Graphs can also represent the output of neural networks. As Yoshua Bengio reminded us in his talk: any joint distribution can be represented by a factor graph.
This makes graph neural networks perfectly suited for tasks such as combinatorial optimization (e.g., the traveling salesman problem, task scheduling problem), identity matching (are Twitter users and Facebook users the same?), and recommender systems.
The most popular graph neural network is the graph convolutional neural network (GCNN), which is unsurprising, since both convolutions and graphs encode local information: convolutions encode an inductive bias toward relationships between neighboring parts of the input, while graphs encode, through their edges, which parts of the input are most closely related.
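For intuition, here is a minimal NumPy sketch of a single graph convolution layer in the Kipf-Welling style (symmetric normalization with self-loops); the graph, features, and weights are illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    # H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W )
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # inverse-sqrt degrees
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # 3-node path graph
H = np.random.randn(3, 4)                # node features
W = np.random.randn(4, 2)                # learnable weights
print(gcn_layer(A, H, W).shape)          # (3, 2): new 2-dimensional feature per node
```

Each layer mixes every node's features with those of its neighbors, which is the graph analogue of a convolution aggregating a local patch.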

Figure 4: (Left) The bipartite graph s_t = (G, C, E, V) with n = 3 variables and m = 2 constraints. (Right) The bipartite-graph GCNN architecture used to parameterize the policy π_θ(a|s_t).

Here are some GNN papers I recommend:
  • Exact Combinatorial Optimization with Graph Convolutional Neural Networks, paper address: https://arxiv.org/abs/1906.01629
  • Yes, there is a paper this year that combines the two hottest research trends of NTK and graph neural networks: Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels, paper address: https://arxiv.org/abs/1905.13192
  • My favorite poster presentation at this year’s NeurIPS: (Nearly) Efficient Algorithms for the Graph Matching Problem on Correlated Random Graphs, paper address: https://arxiv.org/abs/1805.02349

    Figure 5: (Nearly) Efficient Algorithms for the Graph Matching Problem on Correlated Random Graphs
Recommended reading (besides the NeurIPS paper):
  • Thomas N. Kipf’s blog post on graph convolutional networks
  • A Gentle Introduction to Graph Neural Networks (Basics, DeepWalk, and GraphSage) by Kung-Hsiang Huang
3. Convex Optimization
I have been a quiet admirer of Stephen Boyd’s work on convex optimization, so I was happy to see it gaining popularity at NeurIPS. There were 32 papers on the topic at NeurIPS this year.
Stephen Boyd and J. Zico Kolter's labs also presented their paper "Differentiable Convex Optimization Layers", which shows how to differentiate through the solutions of convex optimization problems, allowing them to be embedded inside differentiable programs (such as neural networks) and learned from data. A short sketch of the accompanying library appears after the link below.
  • Paper link: http://papers.nips.cc/paper/9152-differentiable-convex-optimization-layers
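The authors released the cvxpylayers library alongside the paper; the snippet below follows the pattern of its introductory example (treat the problem data as placeholders) and differentiates the solution of a small non-negative least-absolute-deviations problem with respect to the problem parameters:

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

m, n = 3, 2
x = cp.Variable(n)
A = cp.Parameter((m, n))
b = cp.Parameter(m)
problem = cp.Problem(cp.Minimize(cp.pnorm(A @ x - b, p=1)), [x >= 0])

layer = CvxpyLayer(problem, parameters=[A, b], variables=[x])  # a differentiable "layer"

A_t = torch.randn(m, n, requires_grad=True)
b_t = torch.randn(m, requires_grad=True)
solution, = layer(A_t, b_t)      # solve the convex problem in the forward pass
solution.sum().backward()        # gradients w.r.t. A_t and b_t flow through the solver
print(A_t.grad)
```

The convex problem is specified declaratively with domain knowledge, yet its parameters can be trained end to end like any other layer.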
Convex optimization problems are attractive because they can be solved exactly (with an error tolerance of 1e-10) and very quickly. They also do not produce strange or unexpected outputs, which is critical for real-world applications. Although many problems encountered in real-world scenarios are non-convex, decomposing them into a series of convex problems can achieve good results.
Neural networks are typically trained with optimizers that grew out of convex optimization, but they emphasize learning everything from scratch in an end-to-end manner, whereas applications of convex optimization explicitly model the system using domain-specific knowledge. When the system can be modeled explicitly in a convex way, much less data is usually required. Work on differentiable convex optimization layers is one way to combine the strengths of end-to-end learning and explicit modeling.
Convex optimization is particularly useful when you want to control the output of a system. For example, SpaceX uses convex optimization to launch rockets and BlackRock uses it for trading algorithms. It's really cool to see convex optimization used in deep learning, just like it's now used in Bayesian learning.
Here are some NeurIPS papers on convex optimization recommended by Akshay Agrawal:
  • Acceleration via Symplectic Discretization of High-Resolution Differential Equations, paper link: https://papers.nips.cc/paper/8811-acceleration-via-symplectic-discretization-of-high-resolution-differential-equations
  • Hamiltonian descent for composite objectives, paper link: http://papers.nips.cc/paper/9590-hamiltonian-descent-for-composite-objectives

Figure 6: Comparison of the Hamiltonian descent (HD) and gradient descent algorithms


3


Neuroscience x Machine Learning

According to the analysis of Hugo Larochelle, chair of the NeurIPS 2019 program committee, the category with the highest acceptance rate is neuroscience. In Yoshua Bengio's speech "From System 1 Deep Learning to System 2 Deep Learning" and Blaise Aguera y Arcas's speech "Social Intelligence", they both urged the machine learning research community to think more about the biological roots of natural intelligence.

Figure 7: Neuroscience is the category with the highest acceptance rate

Bengio's speech introduced "consciousness" into the mainstream machine learning vocabulary. The core of Bengio's concept of "consciousness" is attention. He compared the machine attention mechanism with the way our brain chooses to allocate attention: "Machine learning can be used to help brain scientists better understand consciousness, but our understanding of consciousness can also help machine learning develop better capabilities."
According to Bengio, if we want machine learning algorithms to generalize to out-of-distribution samples, then consciousness-inspired approaches could be a solution.

Figure 8: Applying machine learning to consciousness & applying consciousness to machine learning - (1) formally define and test specific hypotheses about consciousness (2) demystify consciousness (3) understand the advantages of the evolution of consciousness from a computational and statistical perspective (e.g., systematic generalization) (4) apply these advantages to learning agents.

My favorite talk of the conference was by Aguera y Arcas. His talk was very theoretically rigorous, but also actionable. He argued that optimization methods are not enough to achieve human-like intelligence: "Optimization is not the way living things work. Brains don't just evaluate a function. They evolve. They self-correct. They learn from experience. Just a function doesn't capture these things."
He called for research into “a more general, biologically inspired synaptic update rule that allows for the use of loss functions and gradient descent but does not require it.”
This trend at NeurIPS coincides with what I have observed: many researchers in the AI community are turning to neuroscience. They are bringing neuroscience back into the field of machine learning.

Some of the smartest people I know have left AI research to go into industry or neuroscience. Why?
1. We need to understand how humans learn so we can teach machines to learn.
2. Scientific research should proceed from hypothesis to experiment, but much of today's AI research runs the experiment first and then justifies the result afterwards.

4


Keyword analysis

Let's take a more macro view and see what topics the papers at this year's NeurIPS conference are related to. First, I used Vennclods to visualize the titles of 1,011 NeurIPS 2018 papers and 1,428 NeurIPS 2019 papers. The black part in the middle is a list of keywords that are very common in these two years.

Figure 9: NeurIPS keyword cloud

Next, I calculated the percentage change of these keywords from 2018 to 2019. For example, if 1% of all accepted papers contained the keyword "X" in 2018 and 2% did in 2019, then the change is (2 - 1) / 1 = 100%. The figure below plots the keywords whose absolute change exceeds 20%; a rough sketch of the calculation follows.
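A rough sketch of that calculation (the title lists below are placeholders, not the actual NeurIPS title data):

```python
import re
from collections import Counter

def keyword_fraction(titles):
    # fraction of titles that contain each keyword, counting a keyword once per title
    counts = Counter()
    for title in titles:
        counts.update(set(re.findall(r"[a-z]+", title.lower())))
    return {word: c / len(titles) for word, c in counts.items()}

titles_2018 = ["Bayesian Deep Learning", "Recurrent Attention Models"]      # placeholder lists
titles_2019 = ["Graph Neural Tangent Kernel", "Meta-Learning with Graphs"]

f18, f19 = keyword_fraction(titles_2018), keyword_fraction(titles_2019)
change = {w: (f19.get(w, 0.0) - p) / p for w, p in f18.items()}             # e.g. (2% - 1%) / 1% = 100%
big_moves = {w: c for w, c in change.items() if abs(c) >= 0.2}              # |change| >= 20%
print(big_moves)
```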

Figure 10: Percentage changes in NeurIPS keywords

Key points:
  • Reinforcement learning continued to gain ground, even outside robotics. Keywords with significant positive change: multi-armed bandit, feedback, regret, control.
  • Generative models are still popular. GANs still capture our imagination, but the hype is less.
  • Recurrent neural networks and convolutional neural networks continued the downward trend of last year.
  • Keywords related to hardware are also increasing, which indicates that more algorithms are born with hardware in mind. This is a solution to the problem of "hardware becoming a bottleneck in machine learning."
  • Unfortunately, the percentage of the keyword “data” is on a downward trend. I went to the “Algorithms–Missing Data” poster exhibition with great excitement, but found that there was only one poster posted: “Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption”!
  • The keyword "meta" has grown the most this year. For more details, please refer to Jesse Mu's "Meta-meme" (https://twitter.com/jayelmnop/status/1206637800537362432)

  • While the keyword "Bayesian" declined, "uncertainty" increased. Last year, many papers used Bayesian principles, but not in the context of deep learning.

5


NeurIPS Key Statistics at a Glance

  • Of the 6,743 papers submitted to the conference, 1,428 were accepted, an acceptance rate of 21%.
  • I estimate that at least half of the 13,000+ attendees did not present a paper during the conference.
  • 57 workshops, including several focused on inclusion and on newcomers: Black in AI, Women in Machine Learning, LatinX in AI, Queer in AI, New In ML, and Machine Learning Competitions for All.
  • More than 16,000 pages of conference proceedings.
  • Of all accepted papers, 12% included at least one author from Google or DeepMind.
  • There are 87 papers from Stanford, which is the academic institution with the most papers accepted for this year's NeurIPS.
  • There are 250 papers on applications, accounting for 16.7% of the total number of papers.
  • 648 is the citation count of Lin Xiao's "Dual Averaging Method for Regularized Stochastic Learning and Online Optimization", winner of this year's Test of Time Award, a reminder that citation counts do not necessarily reflect a paper's contribution.
  • 75% of papers provide code links in the "camera-ready" version, compared to only 50% last year.
  • 2,255 review comments mentioned reviewing submitted code.
  • 173 papers claimed to have completed the OpenReview reproducibility challenge.
  • 31 posters were presented at the "Machine Learning for Creativity and Design" workshop at NeurIPS this year. Several people told me that this was their favorite session of the conference.
  • Shout out to Good Kid for their performance at the closing party! If you haven't heard their music yet, check them out on Spotify.

Sometimes they're machine learning researchers. Sometimes they're rock stars. Tonight, they're both!
  • The "Retrospectives: A Venue for Self-Reflection in ML Research" workshop had 11 talks and was one of everyone's favorite sessions.
Beyond the research itself, the sheer buzz around this year's NeurIPS was striking; for more on that, see the earlier article "An academic conference with 13,000 participants: should we celebrate or reflect?"

6


Conclusion

NeurIPS was overwhelming both intellectually and socially. I don’t think anyone could read 16,000 pages of conference proceedings. The poster sessions were packed, which made it hard to talk to authors. I definitely missed a lot.
However, the large scale of the conference also means that many research directions and related researchers are gathered together. It is a good feeling to be able to understand the work outside my own research subfield and to learn from researchers whose research backgrounds and interests are different from mine.
It’s great to see the research community moving away from the misconception that “bigger is better.” My impression from visiting the poster session was that many papers only experimented on small datasets, such as MNIST and CIFAR. The Best Paper Award winner, “Distribution-Independent PAC Learning of Halfspaces with Massart Noise” by Ilias Diakonikolas et al., did not have any experiments.
I often hear young researchers worry that they have to join a large research lab to get access to computing resources, but NeurIPS proves that you can make important contributions without having to worry about data and computing issues.
At a NewInML roundtable I attended, someone said he had no idea how most papers at NeurIPS would ever be used in production, and Neil Lawrence suggested that perhaps he should consider attending other conferences.
NeurIPS is more theoretical than many other machine learning conferences - it's important to do basic research.
Overall, I had a great time at NeurIPS and definitely plan to attend next year. However, for those who are new to the machine learning research community, I would recommend ICLR as their first academic conference: it is smaller, shorter, and more application-oriented. Next year, ICLR will be held in Ethiopia, which is an amazing country!
Via https://huyenchip.com/2019/12/18/key-trends-neurips-2019.html
