Engineering is the bottleneck in (Deep Learning) Research

Warning: This is a rant post containing a bunch of unorganized thoughts.

When I was in graduate school working on NLP and information extraction, I spent most of my time coding up research ideas. That’s what grad students tend to do when their advisors don’t like to touch code, which is probably 95% of all advisors. When I raised concerns about problems, I would often hear the phrase “that’s just an engineering problem; let’s move on.” I later realized that’s code for “I don’t think a paper mentioning this would get through the peer review process.” This mindset seems pervasive among people in academia. But as an engineer, I can’t help but notice how the lack of engineering practices is holding us back.

I will use the Deep Learning community as an example, because that’s what I’m familiar with, but this probably applies to other communities as well. As a community of researchers, we all share a common goal: move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don’t mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what others have done in the past. And that’s how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit of novelty and show improvement over well-established baselines.

So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. There are two issues that stand out to me, both of which can be solved with “just engineering”: 1. a waste of research time, and 2. a lack of rigor and reproducibility. Let’s look at each of them.

Waste of research time (difficulty of building on others’ work)

Researchers are highly trained professionals. Many have spent years or decades getting PhDs and becoming experts in their respective fields. It only makes sense that those people should spend the majority of their time doing what they’re good at – innovating by coming up with novel techniques. Just like you wouldn’t want a highly trained surgeon spending several hours a day inputting patient data from paper forms. But that’s pretty much what’s happening.

In an ideal world, a researcher with an idea could easily build on top of what has already been done (the 90% I mention above) and have 10% of work left to do in order to test his or her hypothesis. (I realize there are exceptions to this if you’re doing something truly novel, but the majority of published research falls into this category). In practice, quite the opposite is happening. Researchers spend weeks re-doing data pre- and post-processing and re-implementing and debugging baseline models. This often includes tracking down authors of related papers to figure out what tricks were used to make it work at all. Papers tend to not mention the fine print because that would make the results look less impressive. In the process of doing this, researchers introduce dozens of confounding variables, which essentially make the comparisons meaningless. But more on that later.

What I realized is that the difficulty of building upon others’ work is a major factor in determining what research is being done. The majority of researchers build on top of their own research, over and over again. Of course, one may argue that this is a result of becoming an expert in some very specific subfield, so it only makes sense to continue focusing on similar problems. While not completely untrue, I don’t think that’s what’s happening (especially in Deep Learning, where many subfields are so closely related that knowledge transfers over pretty nicely). I believe the main reason for this is that it’s easiest, from an experimental perspective, to build upon one’s own work. It leads to more publications and faster turnaround time. Baselines are already implemented in familiar code, evaluation is set up, related work is written up, and so on. It also leads to less competition – nobody else has access to your experimental setup and can easily compete with you. If it were just as easy to build upon somebody else’s work, we would probably see more diversity in published research.

It’s not all bad news though. There certainly are a few trends going in the right direction. Publishing code is becoming more common. Software packages like OpenAI’s gym (and Universe) ensure that at least evaluation and datasets are streamlined. TensorFlow and other Deep Learning frameworks remove a lot of potential confounding variables by implementing low-level primitives. With that being said, we’re still a far cry from where we could be. Just imagine how efficient research could be if we had standardized frameworks, standard repositories of data, well-documented standard code bases and coding styles to build upon, and strict automated evaluation frameworks and entities operating on exactly the same datasets. From an engineering perspective all of these are simple things – but they could have a huge impact.

I think we’re under-appreciating the fact that we’re dealing with pure software. That sounds obvious, but it’s actually a big deal. Setting up tightly controlled experiments in fields like medicine or psychology is almost impossible and involves an extraordinary amount of work. With software, it’s essentially free. This is a unique advantage that most of us fail to appreciate. But we’re just not exploiting it. I believe one reason why these changes (and many others) are not happening is a misalignment of incentives. Truth be told, most researchers care more about their publications, citations, and tenure tracks than about actually driving the field forward. They are happy with a status quo that favors them.

Lack of rigor

The second problem is closely related to the first, and I hinted at it above: a lack of rigor and reproducibility. In an ideal world, a researcher could hold constant all irrelevant variables, implement a new technique, and then show some improvement over a range of baselines within some margin of significance. Sounds obvious? Well, if you happen to read a lot of Deep Learning papers, this sounds like it’s coming straight from a sci-fi movie.

In practice, as everyone re-implements techniques using different frameworks and pipelines, comparisons become meaningless. In almost every Deep Learning model implementation, there exists a huge number of “hidden variables” that can affect results. These include non-obvious model hyperparameters baked into the code, data shuffle seeds, variable initializers, and other things that are typically not mentioned in papers but clearly affect final measurements. As you re-implement your LSTM, use a different framework, pre-process your own data, and write thousands of lines of code, how many confounding variables will you have created? My guess is that it’s in the hundreds to thousands. If you then show a 0.5% marginal improvement over some baseline models (with numbers usually taken from past papers and not even averaged across multiple runs), how can you ever prove causality? How do you know it’s not a result of some combination of confounding variables?
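To make the seed problem concrete, here is a toy sketch (everything in it – the perceptron, the synthetic dataset, the numbers – is invented for illustration, not taken from any real paper). The dataset is held fixed; the only things that vary between runs are the weight initialization and the shuffle order, both controlled by the seed:

```python
import random

def make_data(n=300, seed=0):
    # Fixed synthetic binary-classification data, held constant across runs.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Noisy linear decision boundary.
        label = 1 if x1 + 0.5 * x2 + rng.gauss(0, 0.8) > 0 else 0
        data.append(((x1, x2), label))
    return data

def train(data, seed, epochs=5, lr=0.1):
    # The only "hidden variables" that differ between runs: weight
    # initialization and the order in which training examples are shuffled.
    rng = random.Random(seed)
    w1, w2, b = (rng.uniform(-0.5, 0.5) for _ in range(3))
    train_set, test_set = data[:200], data[200:]
    for _ in range(epochs):
        rng.shuffle(train_set)
        for (x1, x2), y in train_set:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            err = y - pred  # perceptron update rule
            w1 += lr * err * x1
            w2 += lr * err * x2
            b += lr * err
    correct = sum(
        1 for (x1, x2), y in test_set
        if (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == y
    )
    return correct / len(test_set)

data = make_data()
accs = [train(data, seed) for seed in range(20)]
spread = max(accs) - min(accs)
print(f"test accuracy across 20 seeds: min={min(accs):.3f} "
      f"max={max(accs):.3f} spread={spread:.3f}")
```

Even in this tiny setting, the seed alone often moves test accuracy by a few percentage points – larger than the 0.5% margins some papers report from a single run. Averaging over many seeds and reporting the variance is the bare minimum for a meaningful comparison.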

Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on Github and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.

Again, “just engineering” has the potential to solve this. The solutions are pretty much the same as for problem #1 (standard code, datasets, evaluation entities, etc.), but then, so are the problems. In fact, it may not even be in the best interest of researchers to publish readable code. What if people found bugs in it and you needed to retract your paper? Publishing code is risky, without a clear upside other than PR for whatever entity you work for.


28 thoughts on “Engineering is the bottleneck in (Deep Learning) Research”

  1. heathraftery says:

    I can add many anecdotes to support your prediction that “this probably applies to other communities as well”. I’ve heard many similar sentiments, particularly in biology and psychology, where researchers lament the difficulty in reproducing existing results in order to get to the good bit of improving on them. As you suggest, it’s a trickier problem in fields with physical experiments, but I believe the incentive part of the problem is common nonetheless.

    I’ve been contemplating an incentive scheme where result replication is a major metric. So instead of rewarding citations, as is most commonly done now, you reward successful replications. The idea is that straight out of the gate, a published paper has an asterisk attached. It’s still an achievement to get a paper published, but it comes with a big caveat of “not yet replicated”. The author is incentivised to get that caveat removed, so they do what they can to make their results repeatable. And parties that reproduce the results are incentivised because they get to directly contribute to the standing of the original paper.

    When you’re reviewing published literature, you can rank by number of replications and the more reliable papers will bubble to the top. In the current scheme you rank by number of citations, which leads to the awfully embarrassing scientific scandals where social popularity pressures lead to widely disseminated results that turn out to have been fabricated.

    Standards only work if they’re popular. Once you have the incentive scheme running in the right direction, then you can worry about building standards. Which, I’d have to admit, is an almighty challenge – how do you standardise something as dynamic and exploratory as research?

    • Anh Nguyen says:

      I like that scheme, a feasible solution! In addition, conferences should award the papers whose open-source code has been reproduced the most since the time they were submitted / made available on arXiv.

    • Etiene says:

      This is a simple and brilliant scheme that could resolve or at least improve the current state of affairs substantially! What’s the next step? Getting publishers to adopt this?

        • Diego Esteves says:

          Hi guys. Nice article! That’s exactly what we’ve been trying to help achieve (truly reproducible research) with an open-source project called MEX, i.e., by creating a simple and really flexible platform to manage metadata (experiment configurations and outcomes) :-)

    • Denny Britz says:

      I think this is a great idea. What about the incentives for people to reproduce the paper? That’s typically a lot of work, and “contributing to the standing of the original paper” may not be a large enough incentive. I wonder if we can have separate indices/citations for people who are doing this?

      • heathraftery says:

        Yeah I had in mind that replications would somehow have significant prestige. Since a paper carries the “not yet replicated” caveat when first published, being the first replicator would make a significant impact. The paper would then be marked “first replicated by “.

        • Diego Esteves says:

          In particular, this idea is quite interesting (flagging a paper with such info). It’s similar to what GitHub has adopted for code that is compilable. I’d really LOVE to see such a (new) indicator :-)

      • Clark Zinzow says:

        One incentive scheme that comes to mind is the untapped advanced undergraduate/beginning graduate workforce (assuming that they are not yet heavily involved in research.) If each advanced undergraduate/beginning graduate machine learning course had 2-3 paper implementation projects per semester, with students working in groups of 2-4, a lot of results could be reproduced and new baselines would exist on much firmer ground.

        Moreover, the students gain valuable machine learning experience via studying the papers and implementing the systems therein, and the community benefits from open source implementations of these machine learning systems. The latter would spur on faster research iterations and a faster incorporation of cutting edge systems into open source machine learning libraries.

        Obviously, there are details to discuss and a couple of issues to work out. Will students also need to write a report/do a presentation on the research contained in each paper? Will these paper implementation projects replace or supplement the typical semester-long research project? In order to prevent a duplication of efforts, should professors/universities coordinate to prevent paper implementation overlap? If the choice for which papers are to be implemented is left up to the instructor, will that introduce an unwanted bias? Should the professors refrain from asking their students to implement papers from his/her research group? Should there be some central repository/archive of paper implementations? What about a list of papers that have yet to be implemented? Perhaps people should be able to upvote papers that they wish to see implemented, and the un-implemented list is ordered by the number of votes received?

        A lot of stuff to consider.

    • Chintak Sheth says:

      Well, maybe the onus of reproducing the result lies with the people who cite the paper itself. Two possible options:
      – a citation with verification adds to the “weight” of the citing paper or to the “impact factor” of its author. “This author ensures correctness of the cited content,” so to speak.

      – As an extension to the above option, in a more extreme setting, you can only cite papers which you have verified yourself or which have been independently verified by a sufficient number of people. This would clearly be quite cumbersome, but the citations would then indicate, to a decent extent, the number of verifications. After ten-odd citations this constraint could be relaxed. The first option can still be applied to improve the impact factor of an author who verifies his citations.

    • twoElectric says:

      One angle is to accept papers before the research is done, on the basis of the experimental methods. After publication, the research is conducted, and only then are the findings reported. The journal thus publishes experiments, not results, and the urgency to produce novel results is shifted towards producing interesting research. This is a new innovation called “registered reports” (one of several ideas for reducing the replication epidemic).

  2. Pedro Marcal says:

    I suspect it is because your regime is ‘model free’, so there are few explicit results that can be verified. I got my PhD in Applied Mechanics. When I had developed a major computational result, I was told that our department did not grant degrees unless theory was verified by experiment. That cost me another two years. Rather than publishing code (and perhaps allowing others to duplicate one’s errors), I suggest testing the results via Design of Experiments to establish the most important parameters. If I were to apply the research in real life, I would want to know the influence of the major parameters affecting the results.

  4. Anton Lokhmotov says:

    Hi Denny,

    I’m going to share your post with as many people as I can! We’ve been saying similar things for years for our own research field of computer systems design and optimization. We started looking into optimizing deep learning fairly recently, but it didn’t take us long to notice very familiar problems!

    As we are now working at the intersection of computer systems and deep learning research, we feel compelled to improve the state-of-the-art. Our humble contribution to this is a growing set of tools based on our Collective Knowledge framework, which includes means for benchmarking, crowdsourcing, reproducing and sharing knowledge between interdisciplinary communities.

    Our initiative will of course not fix all the problems instantly, but we believe it’s a step in the right direction.

  5. Scott Stephenson says:

    This resonates with me and others at Deepgram a ton. It’s super timely for us too.

    A lot of the reasons you stated in this piece brought us to release Kur and Kurhub. These two pieces of software are 1) real DL software that’s easy to manipulate and 2) a ‘GitHub for data, models, and weights’ described in the Kur format. Who knows if this is the absolute optimal solution (we like it a lot), but it’s a start and makes a dent in a lot of the problems.

    I think that great workflows are coming. The necessary pressure is there and resources will flow to make it happen.

  6. Chintak Sheth says:

    I agree with the point you are making about the need for reproducible code. I do think that at least the deep learning community seems to become more and more conscious about the value of reproducible research.

  7. David Burton says:

    I’m currently working on a final project for an MSc in Intelligent Systems and researching the published work relating to my chosen topic before doing my own implementation. There are basic errors in many of the papers – some even count as a success a model that gets close to predicting the correct absolute value for the next day, when simply predicting the same value as the current day would achieve that and would be both trivial and useless. The confidence with which different techniques are each claimed to be better than any other at making predictions largely fails to explore suitable baselines, to fully optimise the alternatives, or to establish the criteria for such a system to be valuable. A few papers consider things like trading fees, which would render any marginal results worthless, but most seem happy to present their latest idea without any such real-world considerations. Much of this seems to be papers from conference proceedings rather than peer-reviewed journals, but it means that researching a topic in Machine Learning is as much an exercise in filtering through dozens of rehashed and invalidly explored ideas for the few that actually do something useful.
    There’s clearly some massively useful work being done in Machine Learning – the progress in a diverse array of fields including image recognition, machine translation and autonomous vehicles demonstrates this is not just theoretical but practical, applicable and commercially viable progress, but I suspect that most of those really pushing the state of the art in the field are not writing papers on the subject…

  8. Piggy says:

    Welcome to the world of academia, Denny. Fortunately, the deep learning community is indeed changing, slowly, very very slowly, the trend of micro-improvements upon previous work. What you are noticing today is very well established in academia, as researchers indeed focus more on their status than on the content of their papers.
    I must say, however, that there is a substantial difference between the deep learning community and the rest of academia: in deep learning, many aspects have not been explained and have no mathematical proof, which makes the hidden variables you mention more of an intrinsic feature than a real issue. One more reason why deep learning is not a pure academic subject.

    • AndreiCh says:

      Hi there.
      We’ll be able to get past this obstacle of papers not using standards by training a deep learning model that can classify papers into a number of similarity classes.

  9. Claude Coulombe says:

    I agree with many of your arguments, but Deep Learning is too young for this kind of standardization now. Each day there is yet another Deep Learning framework. It could be worse. Hopefully, there is a lot of sharing, for example standard datasets and open source code. There would maybe be no deep learning today without the MNIST and ImageNet datasets. Data sharing is critical and could become a big problem in many domains. Data-rich companies have a big advantage. In the long term, there will be more room for good engineering practices. Be patient.

  10. Nicola Bernini says:

    This post probably states in a very clear way what a huge number of people working in the field are thinking (including me).
    It’s a (bad) cognitive bias to underestimate the “engineering part” of research, as it ultimately leads to the inefficient “reinvent the wheel” practice mentioned in the post.
    I think one possible strategy could consist of having the community put some effort into “standardization”: of tools (datasets, frameworks, metrics, …) and of communication (formal exposition of ideas and results).

  11. Lukasz Kaiser says:

    Nice post Denny, very well put. These are the reasons that motivate me to work on Tensor2Tensor and I believe we should talk about it more as a community.

  12. Amir Bar says:

    @dennybritz I fully agree. I have to add that in the medical imaging field the situation is even worse. The datasets and competitions are not well established, previous work is usually not taken into account in publications, and the review process is unfortunately not of high quality.

    I wonder, though, how you think we can improve – for instance, we could push to enforce such conditions during the review process.
