Links of the week

Morning mist rolling through beech forest in Monte Amiata, Val d’Orcia, Tuscany, Italy.

Conscious exotica: From algorithms to aliens, could humans ever understand minds that are radically unlike our own? – Aeon
A philosophical attempt to map minds other than the human one, with implications for what it means to be conscious. Is consciousness an intrinsic, inscrutable subjective phenomenon, or a fact about matter that can be known? Read on.

Crash Space – Scott Bakker
What would happen if we engineered our brains to be able to tweak our personality and emotional responses as we experience life? What would life look like? Scott Bakker gives us a glimpse in this short story.

AlphaGo, in context – Andrej Karpathy
A short but comprehensive explanation of why the recent AlphaGo victories do not represent a major breakthrough in artificial intelligence, and of how real-world problems differ, from an algorithmic point of view, from the game of Go.

Multiply or Add? – Scott Young
In many business and personal projects, factors multiply, meaning that the performance you get is heavily influenced by the weakest factor. In other cases, e.g., learning a language, factors add. The strategy for developing factors/skills depends on which context, additive or multiplicative, you’re in. For more insights, read the original article.

Human Resources Isn’t About Humans – BackChannel
Often, HR is not there to help us or solve people’s problems; it is just another corporate division with its own strict rules. But it can be changed for the better. Read on.

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Benjamin Recht and co-authors, after their revealing paper on generalization in deep learning, have delved into the failures of adaptive gradient methods.

First, they construct a linearly separable classification example where adaptive methods fail miserably, achieving a classification accuracy arbitrarily close to random guessing. Conversely, standard gradient descent methods, which converge to the minimum-norm solution, succeed in finding the correct solution with zero prediction error.

Despite its artificiality, this simple example clearly shows that adaptive and non-adaptive gradient methods can converge to very different solutions.
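To see the mechanism at work, here is a minimal numpy sketch (not the paper’s construction: the toy least-squares problem, the dimensions and the hyperparameters are my own choices) showing how plain gradient descent stays in the row space of the data, reaching the minimum-norm solution, while Adam’s per-coordinate rescaling pushes it toward a different solution of the same overparameterized problem:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))   # 3 samples, 6 features: overparameterized
y = rng.normal(size=3)

def loss_grad(w):
    r = X @ w - y
    return 0.5 * (r @ r), X.T @ r

# Plain gradient descent: every update X.T @ r lies in the row space of X,
# so starting from zero it converges to the minimum-norm interpolating solution.
w_gd = np.zeros(6)
for _ in range(20000):
    _, g = loss_grad(w_gd)
    w_gd -= 0.05 * g

# Adam: the elementwise division by sqrt(v) breaks row-space membership
# from the very first step, so the iterate drifts out of span(X^T).
w_ad = np.zeros(6)
m, v = np.zeros(6), np.zeros(6)
b1, b2, lr, eps = 0.9, 0.999, 0.01, 1e-8
for t in range(1, 5001):
    _, g = loss_grad(w_ad)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w_ad -= lr * mhat / (np.sqrt(vhat) + eps)

# Distance from the row space of X: ~0 for gradient descent, nonzero for Adam.
P = X.T @ np.linalg.solve(X @ X.T, X)   # projector onto span(X^T)
def out_of_span(w):
    return np.linalg.norm(w - P @ w)
```

Both methods drive the training loss down, but only gradient descent ends at the minimum-norm solution; the Adam iterate carries a component outside the row space, i.e., a genuinely different solution.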

Then, the authors provide substantial experimental evidence that adaptive methods do not generalize as well as non-adaptive ones, given the same amount of tuning, on four machine learning tasks addressed with deep learning architectures:

  1. Image classification (C1) on the CIFAR-10 dataset with a deep convolutional network;
  2. Character-level language modeling (L1) on the War and Peace novel with a 2-layer LSTM;
  3. Discriminative (L2) and
  4. Generative (L3) parsing on the Penn Treebank dataset with LSTM.


The experiments show the following findings:

  1. “Adaptive methods find solutions that generalize worse than those found by non-adaptive methods.”
  2. “Even when the adaptive methods achieve the same training loss or lower than non-adaptive methods, the development or test performance is worse.”
  3. “Adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the development set.”
  4. “Though conventional wisdom suggests that Adam does not require tuning, we find that tuning the initial learning rate and decay scheme for Adam yields significant improvements over its default settings in all cases.”

The paper includes plots illustrating these findings for the image classification task.


The paper can be found on arXiv.

Living Together: Mind and Machine Intelligence


Neil Lawrence wrote a nifty paper on the current differences between human and machine intelligence, titled Living Together: Mind and Machine Intelligence. The paper initially appeared on his blog on Sunday, but was then removed. It can now be found on arXiv.

The paper comes up with a quantitative metric to use as a lens for understanding the differences between the human mind and pervasive machine intelligence. The embodiment factor is defined as the ratio between computational power and communication bandwidth. If we take the computational power of the brain to be the estimate of what it would take to simulate it, we are talking on the order of exaflops. However, human communication is limited by the speed at which we can talk, read or listen, and can be estimated at around 100 bits per second. The human embodiment factor is therefore around 10^16. The situation is almost reversed for machines: a current computational power of approximately 10 gigaflops is matched to a bandwidth of one gigabit per second, yielding an embodiment factor of 10.
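The arithmetic behind the two embodiment factors is worth making explicit (the numbers below are the rough order-of-magnitude estimates quoted above, not measurements):

```python
# Embodiment factor = computational power (flop/s) / communication bandwidth (bit/s).
brain_compute = 1e18        # ~1 exaflop: rough estimate to simulate the brain
brain_bandwidth = 100.0     # ~100 bit/s: talking, reading, listening
machine_compute = 1e10      # ~10 gigaflops for a current machine
machine_bandwidth = 1e9     # ~1 gigabit/s network link

human_embodiment = brain_compute / brain_bandwidth       # ~1e16
machine_embodiment = machine_compute / machine_bandwidth  # ~10
```

The fifteen orders of magnitude between the two ratios are the whole point: humans compute vastly more than they can communicate, machines the other way around.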

Neil then argues that the human mind is locked in, and needs accurate models of the world and its actors in order to best utilize the little information it can ingest and spit out. From this need, all sorts of theories of mind emerge that allow us to understand each other even without communication. Furthermore, it seems that humans operate via two systems, one and two, the fast and the slow, the quick unconscious and the deliberate self, the it and the I. System One is the reflexive, basic, biased process that allows us to survive and make rapid decisions, life-saving and otherwise. System Two creates a sense of self to explain its own actions and interpret those of others.

Machines do not need such sophisticated mind models, as they can directly and fully share their inner states. They therefore operate in a very different way than we humans do, which makes them quite alien. Neil argues that the current algorithms that recommend what to buy, what to click, what to read and so on operate on a level he calls System Zero, in the sense that it bypasses and influences the human System One, exploiting its basic needs and biases in order to achieve its own goal: to give us “what we want, but not what we aspire to.” This is creating undesirable consequences, like the polarization of information that led to the fake-news phenomenon, which may have had a significant impact on the last US elections.

What can we do? Neil offers us three lines of action:

  1. “Encourage a wider societal understanding of how closely our privacy is interconnected with our personal freedom.”
  2. “Develop a much better understanding of our own cognitive biases and characterise our own intelligence better.”
  3. “Develop a sentient aspect to our machine intelligences which allows them to explain actions and justify decision making.”

I really encourage you to read the paper to get a more in-depth understanding of these definitions, issues and recommendations.

Understanding deep learning requires rethinking generalization


Zhang et al. have written a splendidly concise paper showing that neural networks, even of depth 2, can easily fit random labels on random data.

Furthermore, from their experiments with Inception-like architectures they observe that:

  1. The effective capacity of neural networks is large enough for a brute force memorization of the entire dataset.
  2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
  3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

The authors also show that standard generalization theories, such as VC dimension, Rademacher complexity and uniform stability, cannot explain why networks that have the capacity to memorize the entire dataset can still generalize well.

“Explicit regularization may improve performance, but is neither necessary nor by itself sufficient for controlling generalization error.”

This paper is one of those rare ones that, in a crystalline way, exposes our ignorance.


Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.

Failures of Gradient-Based Deep Learning


A very informative article by Shalev-Shwartz, Shamir and Shammah about critical problems faced when solving some simple problems via neural networks trained with gradient-based methods. Find the article here.

In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four types of simple problems, for which the gradient-based algorithms commonly used in deep learning either fail or suffer from significant difficulties. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.

Links of the week

Close-up of a gall on an oak leaf.

The Attention Paradox: Winning By Slowing Down – Unlimited
Time and attention are limited resources that most cognitive workers waste in unnecessary behaviour. Some useful advice on how to think about cognitive resources and plan your working day accordingly.

The Problem of Happiness – Scott Young
Have we evolved to be unhappy? What are the pros and cons of some of the proposed solutions to be happier? Read this concise summary to know more.

The Dark Secret at the Heart of AI – MIT Technology Review
Machine learning and, in particular, deep learning are notoriously inscrutable. This may be an issue when deploying them in mission-critical applications, such as health care and the military. But are humans much more transparent? Or are we just capable of providing ad hoc, a posteriori explanations?

Academia to Data Science – Airbnb
Some insights on how to shift from academia to industry from the perspective of Airbnb.

Scaling Knowledge at Airbnb – Airbnb
How does a company effectively disseminate new knowledge across its teams? Airbnb proposes and open-sources the Knowledge Repository to facilitate this process across its data teams.


Agent-Based Model Calibration using Machine Learning Surrogates

My friend Amir just sent me his latest paper on combining machine learning surrogates, specifically gradient boosted trees (XGBoost), with active sampling to explore the parameter space and calibrate agent-based models. This new approach allows for a much faster exploration of the parameter space to identify regions of good calibration against real-world data. It also provides a measure of the relative importance of each parameter.
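The surrogate/active-sampling loop can be sketched schematically. In this toy numpy version, a cheap quadratic “misfit surface” stands in for the expensive ABM runs, and a least-squares polynomial fit stands in for the XGBoost surrogate; the function names, the optimum at (0.3, -0.5) and all hyperparameters are invented for illustration, not taken from the paper:

```python
import numpy as np

# Toy stand-in for an expensive ABM run: returns a calibration distance
# between simulated and real-world data (lower is better) for 2 parameters.
def abm_distance(theta):
    return (theta[0] - 0.3) ** 2 + 2 * (theta[1] + 0.5) ** 2

def quad_features(T):
    t1, t2 = T[:, 0], T[:, 1]
    return np.column_stack([np.ones(len(T)), t1, t2, t1 * t1, t1 * t2, t2 * t2])

rng = np.random.default_rng(0)
# Round 0: coarse random sample of the parameter space.
T = rng.uniform(-1, 1, size=(30, 2))
z = np.array([abm_distance(t) for t in T])

for _ in range(3):  # active-sampling rounds
    # Fit the surrogate to all (parameters, distance) pairs seen so far.
    beta, *_ = np.linalg.lstsq(quad_features(T), z, rcond=None)
    # Screen many candidate parameter vectors cheaply through the surrogate...
    cand = rng.uniform(-1, 1, size=(500, 2))
    pred = quad_features(cand) @ beta
    # ...and spend the expensive "ABM" budget only where calibration looks good.
    best = cand[np.argsort(pred)[:10]]
    T = np.vstack([T, best])
    z = np.concatenate([z, [abm_distance(t) for t in best]])

theta_hat = T[np.argmin(z)]   # best calibration found
```

The design point is that the surrogate is queried hundreds of times per round while the “ABM” is run only a handful of times, which is where the speed-up comes from.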



Taking agent-based models (ABM) closer to the data is an open challenge. This paper explicitly tackles parameter space exploration and calibration of ABMs combining supervised machine-learning and intelligent sampling to build a surrogate meta-model. The proposed approach provides a fast and accurate approximation of model behaviour, dramatically reducing computation time. In that, our machine-learning surrogate facilitates large scale explorations of the parameter-space, while providing a powerful filter to gain insights into the complex functioning of agent-based models. The algorithm introduced in this paper merges model simulation and output analysis into a surrogate meta-model, which substantially ease ABM calibration. We successfully apply our approach to the Brock and Hommes (1998) asset pricing model and to the “Island” endogenous growth model (Fagiolo and Dosi, 2003). Performance is evaluated against a relatively large out-of-sample set of parameter combinations, while employing different user-defined statistical tests for output analysis. The results demonstrate the capacity of machine learning surrogates to facilitate fast and precise exploration of agent-based models’ behaviour over their often rugged parameter spaces.