Understanding deep learning requires rethinking generalization


Zhang et al. have written a splendid, concise paper that shows how neural networks, even of depth 2, can easily fit random labels, even on random data.

Furthermore, from their experiments with Inception-like architectures they observe that:

  1. The effective capacity of neural networks is large enough for a brute force memorization of the entire dataset.
  2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
  3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

The authors also show that standard generalization theories, such as VC dimension, Rademacher complexity and uniform stability, cannot explain why networks that have the capacity to memorize the entire dataset can still generalize well.

“Explicit regularization may improve performance, but is neither necessary nor by itself sufficient for controlling generalization error.”

This paper is one of those rare ones that, in a crystalline way, lays bare our ignorance.


Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
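The expressivity claim in the abstract can be made concrete in one dimension: a depth-two ReLU network with roughly as many hidden units as data points can exactly interpolate any labels, random ones included. The sketch below is my own minimal construction (not necessarily the paper's exact one), using n−1 hidden units to hit n points with distinct inputs:

```python
import numpy as np

def fit_two_layer_relu(x, y):
    """Exactly interpolate n 1-D points (distinct x) with a depth-2 ReLU net:
    f(t) = y[0] + sum_j a[j] * relu(t - x[j]), one hidden unit per breakpoint."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    slopes = np.diff(y) / np.diff(x)      # slope of each linear piece
    a = np.diff(slopes, prepend=0.0)      # a[0] = s[0], a[j] = s[j] - s[j-1]

    def f(t):
        t = np.atleast_1d(t)
        # hidden ReLU layer: one unit per breakpoint x[0..n-2]
        h = np.maximum(t[:, None] - x[:-1][None, :], 0.0)
        return y[0] + h @ a

    return f

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = rng.normal(size=8)   # arbitrary "random labels"
f = fit_two_layer_relu(x, y)
print(np.max(np.abs(f(x) - y)))  # interpolation error is ~0 (floating point)
```

The point mirrors the paper's message: once parameters outnumber data points, perfect memorization of arbitrary labels is not just empirically easy but trivially achievable by construction.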

Links of the week

Arches on a high cliff over the Mediterranean. Portovenere, Italy.

Deep Habits: The Importance of Planning Every Minute of Your Work Day – Study Hacks
How to increase your productivity by taking control of your time via time blocking.

Chaos, Ignorance and Newton’s Great Puzzle – Scott Young
Luck, chaos or ignorance? Understanding this mixture in your projects may help you better allocate resources.

Garry Kasparov on AI, Chess, and the Future of Creativity – Mercatus Center
A very interesting conversation with Garry Kasparov on chess, AI, Russian politics, education and creativity.

If everything is measured, can we still see one another as equals? – Justice Everywhere
The dangers of measuring everything and ranking ourselves on different scales, neglecting those human skills and experiences that cannot and should not be quantified.

Failures of Gradient-Based Deep Learning


A very informative article by Shalev-Shwartz, Shamir and Shammah about critical difficulties that arise when solving some simple problems with neural networks trained via gradient-based methods. Find the article here.

In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four types of simple problems, for which the gradient-based algorithms commonly used in deep learning either fail or suffer from significant difficulties. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.

Links of the week

Close-up of a gall on an oak leaf.

The Attention Paradox: Winning By Slowing Down – Unlimited
Time and attention are limited resources that most cognitive workers waste in unnecessary behaviour. Some useful advice on how to think about cognitive resources and plan your working day accordingly.

The Problem of Happiness – Scott Young
Have we evolved to be unhappy? What are the pros and cons of some of the proposed solutions to be happier? Read this concise summary to know more.

The Dark Secret at the Heart of AI – MIT Technology Review
Machine learning models and, in particular, deep learning models are notoriously inscrutable. This may be an issue when deploying them in mission-critical applications, such as health care and the military. But are humans much more transparent? Or are they just capable of providing ad hoc, a posteriori explanations?

Academia to Data Science – Airbnb
Some insights on how to shift from academia to industry from the perspective of Airbnb.

Scaling Knowledge at Airbnb – Airbnb
How does a company effectively disseminate new knowledge across its teams? Airbnb proposes and open-sources the Knowledge Repository to facilitate this process across its data teams.


Book review: The Trails Less Travelled by Avay Shukla

I’ve always dreamed of hiking the great Himalayas, but never took a concrete step in that direction. A year and a half ago, in between jobs, I was seriously thinking of going there, but then a good job offer got in the way. However, I’d been talking about it so much that my partner gave me The Trails Less Travelled by Avay Shukla as a Christmas gift. It sat on the bookshelf for a bit more than a year before I finally decided to open it…

The book describes several treks in the Himachal Himalayas, in the northwestern Indian state of Himachal Pradesh. This mountain range also includes the Great Himalayan National Park, established in 1984, which covers an area of more than 1,100 square kilometres at altitudes between 1,500 m and 6,000 m. In June 2014, the park was added to the UNESCO list of World Heritage Sites.

The author belongs to the Indian Administrative Service and has served in Himachal Pradesh for 30 years. His reports from the remote valleys of Himachal contain awe-inspired descriptions of nature, but also poignant reminders of how encroaching economic development may soon destroy these natural beauties. He does not refrain from criticizing his own employer, the government, both for its lack of action to better preserve these unique valleys and for failing to offer the local communities support for an increasingly difficult way of life.

The region is full of culture, natural diversity, rich ecosystems and varied landscapes, from the jungle forests of the lower altitudes to the high pastures to the barren glacial terrains. The treks described in the book require strength, endurance, perseverance and some technical skills, as they often have to negotiate deep gorges, boulder-strewn river beds and glacier crossings. But they also offer plenty of rewards, from crystalline lakes to rare wildlife sightings to small temples found in the most remote of passes.

On the one hand, I would like to go and venture into Himachal immediately; on the other, I’m afraid that some of these treks would be unrecognizable ten years after the author walked them. It’s yet another reminder that if we want to preserve these natural wonders for future generations, we have little time to act and a lot to do.


Links of the week

Ski-mountaineers climbing the last steep meters to the summit of the Bishorn (4153m) in Valais, Switzerland

Time And Tide Wait For No Economist – UNLIMITED
The changing market of time and how the leisure time gap is widening between skilled and unskilled labour.

The Simple Economics of Machine Intelligence – Harvard Business Review
AI-based prediction tasks will get cheaper and cheaper, but the value of complementary tasks still to be automated, such as judgement, will increase. A simple but effective economic perspective on the impact of AI.

Do you need a Data Engineer before you need a Data Scientist? – Michael Young
How Data Engineers and Data Architects can make your Data Science team more effective and satisfied.

The Art of the Finish: How to Go From Busy to Accomplished – Cal Newport
How task-based planning makes you productive, but not accomplished. A simple strategy to change that.

Data Science jargon buster – for Data Scientists – Guerrilla Analytics
Do your data scientists confuse your customers? Here’s a useful translation table.


Book review: So Good They Can’t Ignore You by Cal Newport


After having read Deep Work, been a follower of Study Hacks, and checked the Top Performers course (yet to take it though), I was curious to read Cal Newport’s book about career advice: So Good They Can’t Ignore You.

It punches like its title, delivering immediately actionable advice on how best to steer, improve and leverage your career to land your dream job.

How does it all play out then? By following four simple rules (and corollary laws).

Rule #1: Don’t follow your passion.
First of all, we very rarely know what our passions truly are. More often, we become passionate about something we do really well. Secondly, passion is dangerous, since it can lead you to jump at options for which you do not have the necessary skills. Thirdly, by trying to follow your passion, you end up assessing each job opportunity according to what it offers you, instead of what value you are producing.

Rule #2: Be so good they can’t ignore you (or, the importance of skill)
One needs to develop rare and valuable skills, so-called career capital, in order to trade them for better and better jobs. These skills are best acquired via the craftsman mindset, “a focus on what value you’re producing in the job”, and through deliberate practice, “an approach to work where you deliberately stretch your abilities beyond where you’re comfortable and then receive ruthless feedback on your performance” (more on this in Deep Work).

Rule #3: Turn down a promotion (or, the importance of control)
So now that you have built up your career capital, what do you trade it for? One of the most powerful traits to acquire is control over what you do, and how you do it: deciding how much and where to work. Control has its traps, though.
The first control trap states that “control that is acquired without career capital is not sustainable.”
The second control trap is that “the point at which you have acquired enough career capital to get meaningful control is exactly the point when you’ve become valuable enough to your current employer that they will try to prevent you from making the change.”
In order to avoid these traps, one should follow the Law of financial viability, which briefly states that you should always check your desired changes against people’s willingness to pay for them.

Rule #4: Think Small, Act Big (or, the importance of mission)
Another fundamental source of satisfaction in your work is having a mission, but finding such a mission is not an easy task. Like control, mission also requires career capital: having a clearly defined mission but no skills to carry it out will only leave you unsatisfied and looking for another job to pay your bills. Ok. You’ve got the necessary skills, but still lack a driving mission. How do you find it? Cal argues that great missions are found in the adjacent possible of your field, meaning you first need to become an expert to spot new fruitful directions. Exactly like in science: great discoveries are found at the edges of current knowledge. Good. You’ve found a possible direction. Do you jump headlong into it? No, you take small bets in many of these directions, in order to probe what’s truly feasible, and also remarkable. A small bet is transformed into a compelling mission, and then into a great success, if it satisfies the law of remarkability, “which requires that an idea inspires people to remark about it, and is launched in a venue where such remarking is made easy.” Examples? Intriguing scientific discoveries in peer-reviewed journals and innovative software in open-source GitHub repositories.

That’s quite a concise summary of the book. In order to dig deeper into the arguments behind these rules and laws, and to read many people’s stories, successful and not, you ought to read the whole book. At 230 pages in large font it’s a fast read, but you’ll come back to some chapters multiple times, adjusting your understanding to your current career situation.

Personally, I found the advice clear, which is not always the case, sound, which is even less so, and immediately applicable. Overall, what’s best about the book is that it frames career development and finding the dream job in very practical and no-nonsense terms.

Buy it here.