Engineering is the bottleneck in (Deep Learning) Research

Warning: This is a rant post containing a bunch of unorganized thoughts.

When I was in graduate school working on NLP and information extraction, I spent most of my time coding up research ideas. That's what grad students with advisors who don't like to touch code (probably 95% of all advisors) tend to do. When I raised concerns about problems I would often hear the phrase "that's just an engineering problem; let's move on". I later realized that's code for "I don't think a paper mentioning this would get through the peer review process". This mindset seems pervasive among people in academia. But as an engineer I can't help but notice how the lack of engineering practices is holding us back.

I will use the Deep Learning community as an example, because that's what I'm familiar with, but this probably applies to other communities as well. As a community of researchers we all share a common goal: Move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don't mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what others have done in the past. And that's how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit of novelty and show improvement over well-established baselines.

So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. There are two issues that stand out to me, both of which can be solved with "just engineering": 1. waste of research time, and 2. lack of rigor and reproducibility. Let's look at each of them.

Waste of research time (difficulty of building on others' work)

Researchers are highly trained professionals. Many have spent years or decades getting PhDs and becoming experts in their respective fields. It only makes sense that those people should spend the majority of their time doing what they’re good at – innovating by coming up with novel techniques. Just like you wouldn’t want a highly trained surgeon spending several hours a day inputting patient data from paper forms. But that’s pretty much what’s happening.

In an ideal world, a researcher with an idea could easily build on top of what has already been done (the 90% I mention above) and have 10% of the work left to do in order to test his or her hypothesis. (I realize there are exceptions if you're doing something truly novel, but the majority of published research falls into this category.) In practice, quite the opposite is happening. Researchers spend weeks re-doing data pre- and post-processing and re-implementing and debugging baseline models. This often includes tracking down authors of related papers to figure out what tricks were used to make things work at all. Papers tend not to mention the fine print because that would make the results look less impressive. In the process, researchers introduce dozens of confounding variables, which essentially make the comparisons meaningless. But more on that later.

What I realized is that the difficulty of building upon others' work is a major factor in determining what research is being done. The majority of researchers build on top of their own research, over and over again. Of course, one may argue that this is a result of becoming an expert in some very specific subfield, so it only makes sense to continue focusing on similar problems. While not completely untrue, I don't think that's what's happening (especially in Deep Learning, where many subfields are so closely related that knowledge transfers over pretty nicely). I believe the main reason is that it's easiest, from an experimental perspective, to build upon one's own work. It leads to more publications and faster turnaround time. Baselines are already implemented in familiar code, evaluation is set up, related work is written up, and so on. It also leads to less competition: nobody else has access to your experimental setup, so nobody can easily compete with you. If it were just as easy to build upon somebody else's work, we would probably see more diversity in published research.

It's not all bad news though. There certainly are a few trends going in the right direction. Publishing code is becoming more common. Software packages like OpenAI's gym (and Universe) ensure that at least evaluation and datasets are streamlined. TensorFlow and other Deep Learning frameworks remove a lot of potential confounding variables by implementing low-level primitives. With that being said, we're still a far cry from where we could be. Just imagine how efficient research could be if we had standardized frameworks, standard repositories of data, well-documented standard code bases and coding styles to build upon, and strict automated evaluation frameworks and entities operating on exactly the same datasets. From an engineering perspective all of these are simple things – but they could have a huge impact.
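To make "streamlined evaluation" a bit more concrete, here is a minimal sketch of the kind of interface gym standardizes, written against gym's classic API (exact names and signatures have changed across versions): every agent is scored against the same environment dynamics and reward definition, instead of each lab re-implementing its own.

```python
import gym

# Everyone evaluating on CartPole-v0 shares the same dynamics, observation
# space, and reward definition, which removes one class of confounding variables.
env = gym.make("CartPole-v0")
observation = env.reset()
episode_return = 0.0

for _ in range(1000):
    action = env.action_space.sample()  # a real agent would choose actions here
    observation, reward, done, info = env.step(action)
    episode_return += reward
    if done:
        print("episode return:", episode_return)
        episode_return = 0.0
        observation = env.reset()
```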

I think we're under-appreciating the fact that we're dealing with pure software. That sounds obvious, but it's actually a big deal. Setting up tightly controlled experiments in fields like medicine or psychology is almost impossible and involves an extraordinary amount of work. With software it's essentially free. It's more unique than most of us realize. But we're just not doing it. I believe one reason why these changes (and many others) are not happening is a misalignment of incentives. Truth be told, most researchers care more about their publications, citations, and tenure tracks than about actually driving the field forward. They are happy with a status quo that favors them.

Lack of rigor

The second problem is closely related to the first. I hinted at it above. It's a lack of rigor and reproducibility. In an ideal world, a researcher could hold constant all irrelevant variables, implement a new technique, and then show some improvement over a range of baselines within some margin of significance. Sounds obvious? Well, if you happen to read a lot of Deep Learning papers this sounds like it's coming straight from a sci-fi movie.

In practice, as everyone re-implements techniques using different frameworks and pipelines, comparisons become meaningless. In almost every Deep Learning model implementation there exists a huge number of "hidden variables" that can affect results. These include non-obvious model hyperparameters baked into the code, data shuffle seeds, variable initializers, and other things that are typically not mentioned in papers, but clearly affect final measurements. As you re-implement your LSTM, use a different framework, pre-process your own data, and write thousands of lines of code, how many confounding variables will you have created? My guess is that it's in the hundreds to thousands. If you then show a 0.5% marginal improvement over some baseline models (with numbers usually taken from past papers and not even averaged across multiple runs), how can you ever prove causality? How do you know it's not a result of some combination of confounding variables?
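As a toy illustration of what those hidden knobs look like in practice (hypothetical code, not taken from any particular paper), here is the same model trained on the same data, where only the unreported choices (shuffle seed, initializer scale, number of SGD passes) differ between runs:

```python
import numpy as np

def headline_number(shuffle_seed, init_scale=0.1, lr=0.1, epochs=3):
    """Train the same toy model on the same data; only the unreported
    knobs (shuffle seed, initializer scale, epoch count) differ."""
    data_rng = np.random.RandomState(0)            # the dataset itself is fixed
    X = data_rng.randn(300, 20)
    y = (X[:, 0] + 0.5 * data_rng.randn(300) > 0).astype(float)
    X_train, y_train = X[:200], y[:200]
    X_test, y_test = X[200:], y[200:]

    run_rng = np.random.RandomState(shuffle_seed)
    w = run_rng.randn(20) * init_scale             # initializer baked into the code
    for _ in range(epochs):
        order = run_rng.permutation(len(X_train))  # data shuffle seed
        for i in order:                            # plain SGD, one example at a time
            p = 1.0 / (1.0 + np.exp(-X_train[i] @ w))
            w -= lr * (p - y_train[i]) * X_train[i]
    return ((X_test @ w > 0) == y_test).mean()

# Same "method", same data, different hidden knobs -> different reported accuracy.
for seed in (0, 1, 2):
    print("seed", seed, "->", headline_number(seed))
```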

Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on Github and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.

Again, "just engineering" has the potential to solve this. The solution is pretty much the same as for problem #1 (standard code, datasets, evaluation entities, etc.), and so are the obstacles. In fact, it may not even be in the best interest of researchers to publish readable code. What if people found bugs in it and you needed to retract your paper? Publishing code is risky, without a clear upside other than PR for whatever entity you work for.


Deep Learning Startups, Applications and Acquisitions – A Summary

Most major tech companies use Deep Learning techniques in one way or another, and many have new initiatives on the way. Self-driving cars use Deep Learning to model their environment. Siri, Cortana and Google Now use it for speech recognition, Facebook for facial recognition, and Skype for real-time translation.

Naturally there are a lot of startups doing cool things in the space. I've done my best to categorize the companies below based on where their main focus seems to be. If you're a Deep Learning company and I forgot you, please do let me know!

General / Infrastructure

Because Deep Learning is such a generic approach, some companies are focusing on creating infrastructure, algorithms, and tools that can be applied across a variety of domains.

DeepMind, which was acquired by Google for more than $500M in 2014, is working on general-purpose AI algorithms using a combination of Deep Learning and Reinforcement Learning. DeepMind is the company behind an algorithm that learns to play Atari games better than humans. It is largely a research company and does not provide products for use by businesses or consumers.

MetaMind focuses on providing cutting-edge performance for image and natural language classification tasks. Richard Socher, the founder of MetaMind, is very active in the academic community and teaches Stanford's Deep Learning for Natural Language Processing class. The company offers a cloud service to train Deep Learning classifiers.

Nervana is the company behind the open source Python-based neon framework, a GPU-optimized library to build Deep Learning architectures. Nervana also provides a cloud service where it runs algorithms on proprietary hardware specifically designed for Deep Learning. Nervana raised $20.5M in a June 2015 round led by Data Collective.

Skymind is the company behind the Deeplearning4j framework. Deeplearning4j makes efficient use of GPUs and integrates with distributed systems such as Hadoop and Spark to scale to large data sets. Skymind sells an enterprise edition of its software together with training and support.

Ersatz Labs offers a cloud service to manage data and train Deep Learning models through a web interface (video) or an API. Pricing is based on minutes of GPU time used.

Computer Vision

It would be fair to say that Deep Learning gained most of its popularity through excellent performance on a variety of computer vision tasks: recognizing objects in images, understanding scenes, and finding semantically similar images. Convolutional Neural Networks (CNNs), a popular type of Deep Learning architecture, are now considered the standard for most of the above. The rapid success of Deep Learning in Computer Vision has spurred a lot of startup activity.

Madbits was acquired by Twitter in 2014 before it got a chance to launch publicly. In its own words, it "built visual intelligence technology that automatically understands, organizes and extracts relevant information from raw media (images)".

Perceptio was acquired by Apple in October 2015 while still in stealth mode. The website was shut down after the acquisition, but Perceptio seems to have been developing technology to run image-classification algorithms on smartphones.

Lookflow was acquired by Yahoo/Flickr in October 2013. It's unclear what exactly Lookflow was offering, but it was using Deep Learning algorithms for image classification to help organize photos.

HyperVerge builds technology for a range of visual recognition tasks, including facial recognition, scene recognition, and image similarity search. HyperVerge is also working on a smart photo organization app called Silver. The company came out of IIT and raised a $1M seed round from NEA in August 2015.

Deepomatic builds object recognition technology to identify products (e.g. shoes) in images, which can then be monetized through e-commerce links. It focuses on the fashion vertical and has raised $1.4M from Alven Capital (a French VC) and angel investors in September 2015.

Descartes Labs focuses on understanding large datasets of images, such as satellite images. An example use case is tracking agriculture development across the country. Descartes Labs came out of the Los Alamos National Laboratory and has raised $3.3M of funding to date.

Clarifai uses CNNs to provide an API for image and video tagging. In April 2015, Clarifai raised a $10M Series A led by USV.

Tractable trains image classifiers to automate inspection tasks currently done by humans, for example detecting cracks on industrial pipes or inspecting cars.

Affectiva classifies emotional reactions based on facial images. It raised $12 million in Series C funding from Horizon Ventures and Mary Meeker & Kleiner Perkins in 2012.

Alpaca is the company behind Labellio, a cloud service to build your own deep learning image classifier using a graphical interface.

Orbital Insight uses Deep Learning to analyze satellite imagery and understand global and national trends.

Natural Language

After the rapid success in Computer Vision, researchers were quick to adopt Deep Learning techniques for Natural Language Processing (NLP) tasks. In fact, the exact same algorithm that categorizes images can be used to analyze text. Since then, new Deep Learning techniques specifically for NLP have been developed, and are being applied to tasks such as categorizing text, finding content themes, analyzing sentiment, recognizing entities, or answering free-form questions.
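As a rough illustration of that claim, here is a minimal sketch (using the classic Keras API, with hypothetical vocabulary and layer sizes) of the same convolution-plus-pooling recipe used for image classification applied to a sequence of word embeddings for sentence classification:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

# A convolutional classifier over word embeddings: the convolution and pooling
# operations are the same ones used on images, just applied along the word
# dimension instead of the pixel dimensions.
model = Sequential()
model.add(Embedding(input_dim=20000, output_dim=128, input_length=100))  # 20k-word vocabulary, 100-token inputs
model.add(Conv1D(filters=64, kernel_size=5, activation="relu"))          # n-gram feature detectors
model.add(GlobalMaxPooling1D())                                          # keep the strongest response per filter
model.add(Dense(1, activation="sigmoid"))                                # e.g. positive vs. negative sentiment
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```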

AlchemyAPI was acquired by IBM (Watson group) in March 2015. It provides a range of Natural Language Processing APIs, including Sentiment Analysis, Entity Extraction and Concept Tagging. (AlchemyAPI also provides computer vision APIs, but their primary product seems to be language-related so I decided to put them in this category).

VocalIQ was working on a conversational voice-dialog system before being acquired by Apple in October 2015.

Idibon develops general-purpose NLP algorithms that can be applied to any language. Idibon's public API does Sentiment Analysis for English, but more languages and support for Named Entity Recognition are coming soon. Idibon raised a $5.5M Series A led by Altpoin, Khosla, and Morningside Ventures in October 2014.

Indico provides a variety of Natural Language APIs based on Deep Learning models. APIs include Text Tagging, Sentiment Analysis, Language Prediction, and Political Alignment Prediction.

Semantria provides APIs and Excel plugins to perform various NLP tasks in 10+ languages. Pricing starts at $1,000/month for both Excel plugins and API access. Lexalytics, an on-premise NLP platform, acquired Semantria in 2014.

ParallelDots provides APIs for Semantic Proximity, Entity Extraction, Taxonomy Classification and Sentiment Analysis, as well as tools for social media analytics and automated timeline construction. 

Xyggy is a search engine for all data types (text and non-text) represented by deep-learning vectors. With text, for example, a search can be performed with keywords, snippets, or entire documents to find other documents with similar meaning.


Instead of focusing on general-purpose vision or language applications, some companies are applying Deep Learning techniques to specific verticals. My research surfaced mostly healthcare companies, but it's likely that many others are using Deep Learning without explicitly mentioning it on their website.

Enlitic applies deep learning techniques to medical diagnostics. By classifying x-rays, MRIs and CT scans, Enlitic can recognize early signs of cancer more accurately than humans. The company raised $3M from undisclosed investors in February 2015.

Quantified Skin uses selfies to track and analyze a person’s skin and recommends beneficial products and activities. The company raised a total of $280k in 3 rounds.

Deep Genomics uses Deep Learning to classify and interpret genetic variants. Its first product is SPIDEX, a dataset of genetic variants and their predicted effects.

StocksNeural uses Recurrent Neural Networks to predict stock prices based on historical time-series data.

Analytical Flavor Systems uses Deep Learning to understand what people taste and to optimize food and beverage production.

Artelnics builds open source libraries and graphical user interfaces to train Deep Learning models for a variety of industries.


Are there any Deep Learning startups I missed? I’d love to hear about them in the comments.