Engineering is the bottleneck in (Deep Learning) Research

Warning: This is a rant post containing a bunch of unorganized thoughts.

When I was in graduate school working on NLP and information extraction I spent most of my time coding up research ideas. That’s what grad students with advisors who don’t like to touch code, which is probably 95% of all advisors, tend to do. When I raised concerns about problems I would often hear the phrase “that’s just an engineering problem; let’s move on”. I later realized that’s code for “I don’t think a paper mentioning this would get through the peer review process”. This mindset seems pervasive among people in academia. But as an engineer I can’t help but notice how the lack of engineering practices is holding us back.

I will use the Deep Learning community as an example, because that’s what I’m familiar with, but this probably applies to other communities as well. As a community of researchers we all share a common goal: Move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don’t mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what others have done in the past. And that’s how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit of novelty and show improvement over well-established baselines.

So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. There are two issues that stand out to me, both of which can be solved with “just engineering.” 1. Waste of research time and 2. Lack of rigor and reproducibility. Let’s look at each of them.

Waste of research time (difficulty of building on others’ work)

Researchers are highly trained professionals. Many have spent years or decades getting PhDs and becoming experts in their respective fields. It only makes sense that those people should spend the majority of their time doing what they’re good at – innovating by coming up with novel techniques. Just like you wouldn’t want a highly trained surgeon spending several hours a day inputting patient data from paper forms. But that’s pretty much what’s happening.

In an ideal world, a researcher with an idea could easily build on top of what has already been done (the 90% I mention above) and have 10% of work left to do in order to test his or her hypothesis. (I realize there are exceptions to this if you’re doing something truly novel, but the majority of published research falls into this category). In practice, quite the opposite is happening. Researchers spend weeks re-doing data pre- and post-processing and re-implementing and debugging baseline models. This often includes tracking down authors of related papers to figure out what tricks were used to make it work at all. Papers tend to not mention the fine print because that would make the results look less impressive. In the process of doing this, researchers introduce dozens of confounding variables, which essentially make the comparisons meaningless. But more on that later.

What I realized is that the difficulty of building upon others’ work is a major factor in determining what research is being done. The majority of researchers build on top of their own research, over and over again. Of course, one may argue that this is a result of becoming an expert in some very specific subfield, so it only makes sense to continue focusing on similar problems. While not completely untrue, I don’t think that’s what’s happening (especially in Deep Learning, where many subfields are so closely related that knowledge transfers over pretty nicely). I believe the main reason for this is that it’s easiest, from an experimental perspective, to build upon one’s own work. It leads to more publications and faster turnaround time. Baselines are already implemented in familiar code, evaluation is set up, related work is written up, and so on. It also leads to less competition – nobody else has access to your experimental setup and can easily compete with you. If it were just as easy to build upon somebody else’s work we would probably see more diversity in published research.

It’s not all bad news though. There certainly are a few trends going in the right direction. Publishing code is becoming more common. Software packages like OpenAI’s gym (and Universe) ensure that at least evaluation and datasets are streamlined. TensorFlow and other Deep Learning frameworks remove a lot of potential confounding variables by implementing low-level primitives. With that being said, we’re still a far cry from where we could be. Just imagine how efficient research could be if we had standardized frameworks, standard repositories of data, well-documented standard code bases and coding styles to build upon, and strict automated evaluation frameworks and entities operating on exactly the same datasets. From an engineering perspective all of these are simple things – but they could have a huge impact.

I think we’re under-appreciating the fact that we’re dealing with pure software. That sounds obvious, but it’s actually a big deal. Setting up tightly controlled experiments in fields like medicine or psychology is almost impossible and involves an extraordinary amount of work. With software it’s essentially free. It’s more unique than most of us realize. But we’re just not doing it. I believe one reason why these changes (and many others) are not happening is a misalignment of incentives. Truth be told, most researchers care more about their publications, citations, and tenure tracks than about actually driving the field forward. They are happy with a status quo that favors them.

Lack of rigor

The second problem is closely related to the first. I hinted at it above. It’s a lack of rigor and reproducibility. In an ideal world, a researcher could hold constant all irrelevant variables, implement a new technique, and then show some improvement over a range of baselines within some margin of significance. Sounds obvious? Well, if you happen to read a lot of Deep Learning papers this sounds like it’s coming straight from a sci-fi movie.

In practice, as everyone re-implements techniques using different frameworks and pipelines, comparisons become meaningless. In almost every Deep Learning model implementation there exist a huge number of “hidden variables” that can affect results. These include non-obvious model hyperparameters baked into the code, data shuffle seeds, variable initializers, and other things that are typically not mentioned in papers, but clearly affect final measurements. As you re-implement your LSTM, use a different framework, pre-process your own data, and write thousands of lines of code, how many confounding variables will you have created? My guess is that it’s in the hundreds to thousands. If you then show a 0.5% marginal improvement over some baseline models (with numbers usually taken from past papers and not even averaged across multiple runs) how can you ever prove causality? How do you know it’s not a result of some combination of confounding variables?
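To make the point about seeds and averaging concrete, here is a minimal Python sketch. The `run_experiment` function is a hypothetical stand-in for a full training and evaluation pipeline, and the accuracies and noise level are made-up numbers chosen purely for illustration. The idea is simply to hold the seeds fixed across both models and report a mean and spread over several runs instead of a single number.

```python
import random
import statistics


def run_experiment(true_accuracy, seed, noise=0.01):
    """Hypothetical stand-in for training and evaluating a model.

    The result depends on the seed, mimicking how data shuffling and
    weight initialization move the final metric around.
    """
    rng = random.Random(seed)
    return true_accuracy + rng.gauss(0.0, noise)


def summarize(true_accuracy, seeds):
    """Run the experiment once per seed and return (mean, stdev)."""
    scores = [run_experiment(true_accuracy, seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


# Use the same seeds for both models so the seed is held constant.
seeds = range(10)
baseline_mean, baseline_std = summarize(0.800, seeds)
proposed_mean, proposed_std = summarize(0.805, seeds)

print(f"baseline: {baseline_mean:.3f} +/- {baseline_std:.3f}")
print(f"proposed: {proposed_mean:.3f} +/- {proposed_std:.3f}")
# If the gap between the means is smaller than the run-to-run spread,
# a single-run comparison cannot tell the two models apart.
```

In a real pipeline the seed would also have to be passed to the framework and the data loader, and every other hidden variable would need the same treatment.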

Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on GitHub and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.

Again, “just engineering” has the potential to solve this. The solution is pretty much equivalent to problem #1 (standard code, datasets, evaluation entities, etc), but so are the problems. In fact, it may not even be in the best interest of researchers to publish readable code. What if people found bugs in it and you need to retract your paper? Publishing code is risky, without a clear upside other than PR for whatever entity you work for.


Talking to Machines – The Rise of Conversational Interfaces and NLP

Depending on where you’ve lived you’re probably using at least one of WhatsApp, WeChat, LINE, or KakaoTalk. At first glance, these apps look like communication tools. They allow you to talk to your friends, family, coworkers and business partners. Sometimes all at once. But messenger apps are just the beginning of a much larger trend. Pretty soon, you’ll use such apps not to talk to your friends, but to talk to machines. Chat applications will become a new interface used to consume information and services. In fact, many companies have started to move in this direction. We’ll get to that later, but bear with me for a little while.

Let me clarify what I mean by interface first. For the purpose of this post, an interface is a layer of technology that facilitates the exchange of information and services between humans and machines. Perhaps the most ubiquitous interface today is the web browser. We’ve invented several layers of complex technologies: HTTP, HTML, CSS, JavaScript, and many others, to make the web browser interface work well. The same is true for mobile apps, which share a lot of the same principles. Some people argue that browsers and mobile apps are fundamentally different, but I disagree. There are subtle differences, but you’re essentially looking at a screen and clicking or touching stuff. They’re quite similar. In both cases, we use them to communicate with other people (chat apps), find information (Google), and consume services (Amazon).

By design, all interfaces impose constraints on their users. Some constraints are good – they force us to focus on the essential, making us more efficient. But some are bottlenecks to our productivity, preventing us from doing what feels natural to us. Take search as an example. Humans are quite good at asking questions. We’ve been doing that for thousands of years. But can you have a conversation with Google to get the information you need? No, you can’t. (Okay, except for a tiny fraction of very simple and commonly asked questions). Instead, we’ve adjusted to the interface that Google is providing us. We’re entering keywords that we think Google will be able to efficiently utilize to give us what we want. Most of us are doing this unconsciously already because we’ve been conditioned to do so. But if you watch your mom, who didn’t grow up with Google, you’ll see that she has a hard time wrapping her head around what “good” keywords are, and why she can’t find what she needs. She also can’t tell spam from non-spam. Keyword search doesn’t come naturally to her.

Now, there’s an alternative interface that we use to communicate, get information and consume services almost every day: Natural Language. Natural Language is nothing more than the transmission of information, or meaning, through a medium such as sound or script. That’s also why I won’t make a distinction between speech and text-based solutions in this post. They both have their place, but they embody the same thing. Language is an interface that, right now, mostly facilitates communication between humans. But you could imagine using this same interface to facilitate communication between humans and machines. Instead of putting keywords into Google you’d ask a question, potentially have a quick conversation, and get a response. Instead of clicking through a dozen app screens you’d just say what food you want to order, or tell your car where you want to go.

Just like other interfaces, the natural language interface has constraints. It isn’t suited for everything. Obviously it’s bad for visual tasks – shopping for clothes, for example. It also doesn’t seem so great for discovery, like browsing items on Amazon without a clear goal. But you can certainly imagine use cases where natural language excels: asking questions to get information, or conveying tasks with a clear intention – ordering a specific food item or a cab, or giving directions, for example. Natural Language places a low cognitive load on our brains because we’re so used to it. Hence, it feels more natural and effortless than using an app or browser.

Of course, Facebook (M), Google (Now), Apple (Siri), Microsoft (Cortana) and Amazon (Echo) have been working on natural language interfaces for a while now. They’re calling them personal assistants (PAs). However, there’s another group of companies who are attacking the same problem from a different angle, but we’ve been ignoring that part of their business: messenger apps. These platforms are uniquely positioned to enable communication with machines. The leader of the pack is probably WeChat, which allows you to order everything from food to taxis from within the app, using messaging as the interface. Slack is moving in the same direction by encouraging the creation of various bots. LINE also has a variety of robotic accounts, ranging from translation services to erotic entertainment.

Why do I think these two groups of companies are working on the same problem? PAs follow what I would call a top-down approach. They’re trying to solve a challenging problem with a long history in AI: Understanding your intentions and acting on them. They are trying to build general purpose software that is smart enough to also fulfill specific tasks. It’s a research area with lots of stuff left to figure out. Messenger apps follow a bottom-up approach. They are starting with simple bots (“Hi, translate this sentence for me”), solving very specific problems, and are gradually moving towards more sophisticated AI. Over time, these two approaches will converge.

When it comes to the adoption of natural language interfaces, messenger apps may actually have a leg up. Here’s why. Talking to Siri on a crowded train still feels awkward, even in SF. It’s also annoying because you have to open yet another app. However, many of us are spending our time in chat apps anyway, so it feels completely natural to just add a conversation with a bot. Ditto for Slack. The transition for consumers seems so much smoother. As conversational interfaces penetrate our world, voice interfaces will start feeling just as natural as taking selfies (and hopefully more than selfie sticks). But another reason why messenger apps are well positioned is data. These companies have collected huge amounts of conversational data that can be used to train better Natural Language models. Facebook has that too, both with their messenger and with WhatsApp. Microsoft doesn’t. Amazon doesn’t. Apple may or may not, depending on how secure iMessage really is.

If you’ve actually used any of the PAs you may be skeptical. Siri still barely understands what you want, and Facebook has put hordes of human workers behind M to get it to do anything useful. How will these things ever replace all the complex tasks we’re doing in apps and browsers? That’s another reason why the bottom-up approach seems promising. It yields immediate benefits and practical applications.

But I also believe that over the coming years we will see rapid improvements in conversational interfaces. There are clear enabling technologies for this trend: Natural Language Processing (NLP) and Deep Learning. Deep Learning techniques have led to breakthroughs in Computer Vision, and they are now penetrating natural language research. Whether or not Deep Learning itself will lead to breakthroughs in NLP is hard to say, but one thing is clearly happening: Many smart people who didn’t previously focus on NLP are now seeing the potential and are starting to work on NLP problems, and we can surely expect something to come out of this.


Deep Learning Startups, Applications and Acquisitions – A Summary

Most major tech companies are using Deep Learning techniques in one way or another, and many have new initiatives on the way. Self-driving cars use Deep Learning to model their environment. Siri, Cortana and Google Now use it for speech recognition, Facebook for facial recognition, and Skype for real-time translation.

Naturally there are a lot of startups doing cool things in the space. I tried to do my best to categorize the companies below based on where their main focus seems to be. If you’re a Deep Learning company and I forgot you, please do let me know!

General / Infrastructure

Because Deep Learning is such a generic approach, some companies are focusing on creating infrastructure, algorithms, and tools that can be applied across a variety of domains.

DeepMind, which was acquired by Google for more than $500M in 2014, is working on general-purpose AI algorithms using a combination of Deep Learning and Reinforcement Learning. DeepMind is the company behind an algorithm that learns to play Atari games better than humans. It is largely a research company and does not provide products for use by businesses or consumers.

MetaMind focuses on providing cutting-edge performance for image and natural language classification tasks. Richard Socher, the founder of MetaMind, is very active in the academic community and teaches Stanford’s Deep Learning for Natural Language Processing class. The company offers a cloud service to train Deep Learning classifiers.

Nervana is the company behind the open source Python-based neon framework, a GPU-optimized library to build Deep Learning architectures. Nervana also provides a cloud service where it runs algorithms on proprietary hardware specifically designed for Deep Learning. Nervana raised $20.5M in a June 2015 round led by Data Collective.

Skymind is the company behind the Deeplearning4j framework. Deeplearning4j makes efficient use of GPUs and integrates with distributed systems such as Hadoop and Spark to scale to large data sets. Skymind sells an enterprise edition of its software together with training and support.

Ersatz Labs offers a cloud service to manage data and train Deep Learning models through a web interface or an API. Pricing is based on minutes of GPU time used.

Computer Vision

It would be fair to say that Deep Learning gained most of its popularity through excellent performance on a variety of computer vision tasks: Recognizing objects in images, understanding scenes, and finding semantically similar images. Convolutional Neural Networks (CNNs), a popular type of Deep Learning architecture, are now considered the standard for most of the above. The rapid success of Deep Learning in Computer Vision has spurred a lot of startup activity.

Madbits was acquired by Twitter in 2014 before it got a chance to launch publicly. In its own words, it “built visual intelligence technology that automatically understands, organizes and extracts relevant information from raw media (images)”.

Perceptio was acquired by Apple in October 2015 while still in stealth mode. The website was shut down after the acquisition, but Perceptio seems to have been developing technology to run image-classification algorithms on smartphones.

Lookflow was acquired by Yahoo/Flickr in October 2013. It’s unclear what exactly Lookflow was offering, but it was using Deep Learning algorithms for image classification to help organize photos.

HyperVerge builds technology for a range of visual recognition tasks, including facial recognition, scene recognition, and image similarity search. HyperVerge is also working on a smart photo organization app called Silver. The company came out of IIT and raised a $1M seed round from NEA in August 2015.

Deepomatic builds object recognition technology to identify products (e.g. shoes) in images, which can then be monetized through e-commerce links. It focuses on the fashion vertical and raised $1.4M from Alven Capital (a French VC) and angels in September 2015.

Descartes Labs focuses on understanding large datasets of images, such as satellite images. An example use case is tracking agriculture development across the country. Descartes Labs came out of the Los Alamos National Laboratory and has raised $3.3M of funding to date.

Clarifai uses CNNs to provide an API for image and video tagging. In April 2015, Clarifai raised a $10M Series A led by USV.

Tractable trains image classifiers to automate inspection tasks currently done by humans, for example detecting cracks on industrial pipes or inspecting cars.

Affectiva classifies emotional reactions based on facial images. It raised $12 million in Series C funding from Horizon Ventures, Mary Meeker, and Kleiner Perkins in 2012.

Alpaca is the company behind Labellio, a cloud service to build your own deep learning image classifier using a graphical interface.

Orbital Insight uses Deep Learning to analyze satellite imagery and understand global and national trends.

Natural Language

After the rapid success in Computer Vision, researchers were quick to adopt Deep Learning techniques for Natural Language Processing (NLP) tasks. In fact, the exact same algorithm that categorizes images can be used to analyze text. Since then, new Deep Learning techniques specifically for NLP have been developed, and are being applied to tasks such as categorizing text, finding content themes, analyzing sentiment, recognizing entities, or answering free-form questions.

AlchemyAPI was acquired by IBM (Watson group) in March 2015. It provides a range of Natural Language Processing APIs, including Sentiment Analysis, Entity Extraction and Concept Tagging. (AlchemyAPI also provides computer vision APIs, but their primary product seems to be language-related so I decided to put them in this category).

VocalIQ was working on a conversational voice-dialog system before being acquired by Apple in October 2015.

Idibon develops general-purpose NLP algorithms that can be applied to any language. Idibon’s public API does Sentiment Analysis for English, but more languages and support for Named Entity Recognition are coming soon. Idibon raised a $5.5M Series A led by Altpoint, Khosla, and Morningside Ventures in October 2014.

Indico provides a variety of Natural Language APIs based on Deep Learning models. APIs include Text Tagging, Sentiment Analysis, Language Prediction, and Political Alignment Prediction.

Semantria provides APIs and Excel plugins to perform various NLP tasks in 10+ languages. Pricing starts at $1,000/month for both Excel plugins and API access. Lexalytics, an on-premise NLP platform, acquired Semantria in 2014.

ParallelDots provides APIs for Semantic Proximity, Entity Extraction, Taxonomy Classification and Sentiment Analysis, as well as tools for social media analytics and automated timeline construction. 

Xyggy is a search engine for all data types (text and non-text) represented by deep-learning vectors. With text for example, a search can be with keywords, snippets or entire documents to find documents with similar meaning.

Application Specific

Instead of focusing on general-purpose vision or language applications, some companies are applying Deep Learning techniques to specific verticals. My research surfaced mostly Healthcare companies, but it’s likely that many others are using Deep Learning without explicitly mentioning it on their website.

Enlitic applies deep learning techniques to medical diagnostics. By classifying x-rays, MRIs and CT scans, Enlitic can recognize early signs of cancer more accurately than humans. The company raised $3M from undisclosed investors in February 2015.

Quantified Skin uses selfies to track and analyze a person’s skin and recommends beneficial products and activities. The company raised a total of $280k in 3 rounds.

Deep Genomics uses Deep Learning to classify and interpret genetic variants. Its first product is SPIDEX, a dataset of genetic variants and their predicted effects.

StocksNeural uses Recurrent Neural Networks to predict stock prices based on historical time-series data.

Analytical Flavor Systems uses Deep Learning to understand what people taste and optimize food and beverage production.

Artelnics builds open source libraries and graphical user interfaces to train Deep Learning models for a variety of industries.


Are there any Deep Learning startups I missed? I’d love to hear about them in the comments.


Reimagining Language Learning with NLP and Reinforcement Learning

The way we learn natural languages hasn’t really changed for decades. We now have beautiful apps like Duolingo and Spaced Repetition software like Anki, but I’m talking about our fundamental approach. We still follow pre-defined curricula, and do essentially random exercises. Learning isn’t personalized, and learning isn’t driven by data. And I think there’s a big opportunity to change that. With the unlimited supply of natural language data online, and with the advances in Natural Language Processing (NLP) techniques, shouldn’t we be able to do something smarter? Here’s what I’m thinking.

The foundation: Modeling Knowledge

At the heart of making learning more efficient is the ability to model a learner’s knowledge. Once you understand what a learner knows you can present her with material that’s most beneficial. Modeling knowledge in general is a difficult problem. How would you quantify your knowledge about ancient Rome, English literature or mechanics? Knowledge in most disciplines is based on connecting disparate facts and then reasoning about them in one way or another. Language learning is different, and it’s unique in that it’s quite simple. Comprehending a sentence doesn’t require higher level reasoning, and we can actually measure a learner’s knowledge by presenting her with the right challenges, such as sentence comprehension or completion.

We also need to model language itself. In order to present a learner with a sentence she can comprehend we must know which knowledge (vocabulary, grammar, etc) that sentence depends on. In a way that’s what courses do “manually”. They present you with a predefined sequence of material that builds on top of each other. I believe we can do this automatically. NLP techniques are sufficiently sophisticated that we should be able to figure out the knowledge dependencies of a text. And that would open up a whole new world of possibilities.

A mathematical formulation

To make things concrete, let’s actually try to define the above mathematically. What follows is inevitably an oversimplification of language learning, but I think it’s a useful enough model to do something interesting with. Let’s assume a learner’s language knowledge can be quantified by how well she knows vocabulary and grammar items. I’m not saying this is the right, or the only definition, but it’s something that has worked quite well in practice. It’s what most courses and textbooks do.

A learner’s knowledge is defined by a state s, which captures our belief about what the learner knows. For example, s could be a sparse vector of real numbers where each element (e.g. 0.73) is a score quantifying how well a learner knows a word or grammar rule. The score could be calculated based on the learner’s performance on reading/listening comprehension and writing/speaking production tasks. Note that s models our belief about the learner’s knowledge, not necessarily the actual state of the world. Thus it would probably be a good idea to also include uncertainty (in the form of confidence bounds or distributions) in the representation above. But it’s easier to think about s as just a vector of scores.

We can perform actions a \in A to modify a learner’s knowledge. Actions could include vocabulary reviews, sentence comprehension tasks, or grammar exercises. Just think about what textbooks do. All of these actions have an effect on s. They could increase or decrease the scores based on how well the learner did (or could change the uncertainty about our beliefs). In other words, if a learner is in state s_t at time t, then an action a \in A will transition her to a new state s_{t+1}. The number of possible states is obviously huge, or infinite.
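As a rough sketch of the state and transition described above, the following Python snippet stores s as a sparse mapping from items to scores and applies one action to get s_{t+1}. The class name, the update rule (move each score a fixed fraction toward 1 on success and toward 0 on failure), the rate of 0.3, and the example words are all assumptions made for illustration; uncertainty is left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class LearnerState:
    """Belief about a learner: item -> score in [0, 1].

    The mapping is sparse: items that are missing count as score 0.
    """
    scores: dict = field(default_factory=dict)

    def score(self, item):
        return self.scores.get(item, 0.0)


def apply_action(state, items, correct, rate=0.3):
    """Return the next state s_{t+1} after an exercise covering `items`.

    Assumed update rule (not from the post): move each score a fraction
    `rate` of the way toward 1.0 on success, or toward 0.0 on failure.
    The input state is left unchanged.
    """
    target = 1.0 if correct else 0.0
    new_scores = dict(state.scores)
    for item in items:
        old = state.score(item)
        new_scores[item] = old + rate * (target - old)
    return LearnerState(new_scores)


# Hypothetical example: one known word (score 0.73) and one new word.
s_t = LearnerState({"gato": 0.73})
s_next = apply_action(s_t, ["gato", "perro"], correct=True)
print(s_next.scores)
```

Returning a new state instead of mutating the old one keeps the s_t to s_{t+1} transition explicit, which is convenient if past states need to be logged or replayed.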

This now starts to look a bit like a Markov Decision Process (MDP), except that we don’t have uncertainty in our state transition, and that we haven’t defined a reward function.

Learning towards a specific goal

Most approaches ignore the fact that students have different motivations for learning a language. That’s clearly a mistake. The knowledge required to understand your favorite TV drama is different from the knowledge required to comprehend scientific journals. Obviously there is a lot of overlap, but taking a class focused on daily conversation probably isn’t the fastest way towards reading academic literature. With the ability to model knowledge on a fine-grained level we can have truly personalized learning.

Let’s assume the learner’s goal is to understand a certain text, an online article or YouTube video for example. Because we know the knowledge dependencies of that text we know which target states s^t would allow the learner to comprehend it (with high probability at least). Our goal is to find a policy \pi(s_t) that tells us which actions to take at any given point in time in order to reach some target state as quickly as possible. The policy tells us the stochastically optimal path towards a learner’s goal. In an MDP, the policy is defined as maximizing the sum of rewards from some reward function R_a(s, s'), and by defining that function in the right way we can solve the problem of finding an optimal policy using Reinforcement Learning techniques.
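One possible way to write this down, as a sketch under stated assumptions rather than the definitive formulation: let G be the set of target states (a symbol introduced only for this sketch), let \gamma \in (0, 1] be a discount factor, and let T be the first time step at which s_T \in G. The step cost of -1 and the discount factor are both assumptions. With this reward, maximizing the expected sum of rewards amounts to reaching a target state in as few actions as possible.

```latex
R_a(s, s') =
\begin{cases}
  0  & \text{if } s' \in G \\
  -1 & \text{otherwise}
\end{cases}
\qquad
\pi^* = \arg\max_{\pi} \;
\mathbb{E}\left[ \sum_{t=0}^{T-1} \gamma^{t} \, R_{a_t}(s_t, s_{t+1})
\;\middle|\; a_t = \pi(s_t) \right]
```

For \gamma = 1 the sum equals -(T - 1), so a policy that maximizes it is one that minimizes the expected number of actions needed to reach G.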

This task is challenging for several reasons. The state space is infinite and actions have stochastic results. We can never explicitly model the whole space. We may also need to trade immediate rewards for long term rewards. For example, instead of learning a complicated term that frequently appears in the target text it may be better to learn a common word that has a low frequency in the target text but makes more actions available to the learner in the future. Luckily, all these are well-known problems that have been solved in one way or another.

Picking the right actions

A key challenge in language learning is to present the learner with material that is neither too difficult nor too easy, or the learner will become frustrated or bored, respectively. In text comprehension there is research that shows that one unknown word for about 50 known words is a good ratio to encourage learning. Learning vocabulary from context is generally more effective than rote memorization because it forces the brain to make connections to things you already know. If we can accurately model the knowledge of a learner and the knowledge dependencies of text, then this task becomes trivial. We could find articles, social media posts or other content that are just right for the learner’s current level and create actions based on them. And of course, by presenting such material to the learner we would refine our model of what the learner actually knows. In other words, the set of actions available at a state s_t should be limited to those actions that are appropriate for a learner at that stage.
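The selection step can be sketched in a few lines of Python. This is a simplified illustration that looks at vocabulary only and ignores grammar; the function names, the crude tokenizer, and the `low`/`high` bounds around the 1-in-51 ratio are assumptions picked for this example, not values from any study.

```python
import re


def unknown_ratio(text, known_words):
    """Fraction of word tokens in `text` that the learner does not know."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in known_words)
    return unknown / len(tokens)


def pick_texts(texts, known_words, low=0.01, high=0.04):
    """Keep texts that are neither too easy nor too hard.

    One unknown word per 50 known words is a ratio of 1/51 (about 0.02);
    the bounds `low` and `high` around it are arbitrary for this sketch.
    """
    return [
        text for text in texts
        if low <= unknown_ratio(text, known_words) <= high
    ]


known = {"the", "cat", "sat", "on", "mat"}
easy = "the cat sat on the mat " * 8 + "the cat"   # 50 tokens, all known
just_right = easy + " purrs"                       # 51 tokens, 1 unknown
hard = "the cat purrs loudly"                      # 2 of 4 tokens unknown

selected = pick_texts([easy, just_right, hard], known)
print(len(selected))
```

In a fuller version, `known_words` would be derived from the learner state s_t, for example by thresholding the scores, so that the available actions change as the state changes.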

Data Network Effects

The more actions a learner performs the more accurately we will be able to model her knowledge, and the more confident we can be in presenting her with the right actions. But that’s not all. As more learners are performing actions we can become more certain about how actions affect a learner’s state, essentially answering the question: Which material is most effective for a learner with a certain background knowledge and goal? This not only allows for making optimal recommendations about what a learner should do next, but may even provide insights about how people learn in general.

These are just some examples of things we can do with a more analytical approach to language learning, but it’s already pretty exciting.