Engineering is the bottleneck in (Deep Learning) Research

Warning: This is a rant post containing a bunch of unorganized thoughts.

When I was in graduate school working on NLP and information extraction I spent most of my time coding up research ideas. That’s what grad students with advisors who don’t like to touch code, which is probably 95% of all advisors, tend to do. When I raised concerns about problems I would often hear the phrase “that’s just an engineering problem; let’s move on.” I later realized that’s code for “I don’t think a paper mentioning this would get through the peer review process.” This mindset seems pervasive among people in academia. But as an engineer I can’t help but notice how the lack of engineering practices is holding us back.

I will use the Deep Learning community as an example, because that’s what I’m familiar with, but this probably applies to other communities as well. As a community of researchers we all share a common goal: Move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don’t mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what others have done in the past. And that’s how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit of novelty and show an improvement over well-established baselines.

So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. Two issues stand out to me, both of which can be solved with “just engineering”: (1) wasted research time and (2) a lack of rigor and reproducibility. Let’s look at each of them.

Waste of research time (difficulty of building on others’ work)

Researchers are highly trained professionals. Many have spent years or decades getting PhDs and becoming experts in their respective fields. It only makes sense that those people should spend the majority of their time doing what they’re good at – innovating by coming up with novel techniques. Just like you wouldn’t want a highly trained surgeon spending several hours a day inputting patient data from paper forms. But that’s pretty much what’s happening.

In an ideal world, a researcher with an idea could easily build on top of what has already been done (the 90% I mention above) and have 10% of the work left to do in order to test his or her hypothesis. (I realize there are exceptions to this if you’re doing something truly novel, but the majority of published research falls into the incremental category). In practice, quite the opposite is happening. Researchers spend weeks re-doing data pre- and post-processing and re-implementing and debugging baseline models. This often includes tracking down authors of related papers to figure out what tricks were used to make it work at all. Papers tend not to mention the fine print because that would make the results look less impressive. In the process of doing this, researchers introduce dozens of confounding variables, which essentially make the comparisons meaningless. But more on that later.

What I realized is that the difficulty of building upon others’ work is a major factor in determining what research is being done. The majority of researchers build on top of their own research, over and over again. Of course, one may argue that this is a result of becoming an expert in some very specific subfield, so it only makes sense to continue focusing on similar problems. While that’s not completely untrue, I don’t think it’s what’s happening (especially in Deep Learning, where many subfields are so closely related that knowledge transfers over pretty nicely). I believe the main reason for this is that it’s easiest, from an experimental perspective, to build upon one’s own work. It leads to more publications and faster turnaround time. Baselines are already implemented in familiar code, evaluation is set up, related work is written up, and so on. It also leads to less competition – nobody else has access to your experimental setup and can easily compete with you. If it were just as easy to build upon somebody else’s work we would probably see more diversity in published research.

It’s not all bad news though. There certainly are a few trends going in the right direction. Publishing code is becoming more common. Software packages like OpenAI’s gym (and Universe) ensure that at least evaluation and datasets are streamlined. Tensorflow and other Deep Learning frameworks remove a lot of potential confounding variables by implementing low-level primitives. With that being said, we’re still a far cry from where we could be. Just imagine how efficient research could be if we had standardized frameworks, standard repositories of data, well-documented standard code bases and coding styles to build upon, and strict automated evaluation frameworks and entities operating on exactly the same datasets. From an engineering perspective all of these are simple things – but they could have a huge impact.
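To make the “strict automated evaluation frameworks operating on exactly the same datasets” idea concrete, here’s a minimal sketch of what such a harness could look like – all names are hypothetical and it uses only the Python standard library. The point is that the harness refuses to score a model unless it runs on the byte-identical canonical dataset, and it computes one fixed metric the same way for every entrant:

```python
import hashlib
import json

def evaluate(predict_fn, dataset, expected_sha256):
    """Score a model, but only on the byte-identical canonical dataset,
    using a single fixed metric (accuracy) computed identically for everyone."""
    blob = json.dumps(dataset, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(blob).hexdigest()
    if digest != expected_sha256:
        raise ValueError("dataset differs from the canonical version")
    correct = sum(predict_fn(x) == y for x, y in dataset)
    return correct / len(dataset)

# Toy "benchmark": the label is 1 whenever the input is positive.
dataset = [[-2, 0], [-1, 0], [1, 1], [3, 1]]
canonical = hashlib.sha256(
    json.dumps(dataset, sort_keys=True).encode("utf-8")
).hexdigest()

print(evaluate(lambda x: int(x > 0), dataset, canonical))
```

Nothing about this is hard engineering – it’s a hash check and a shared metric function – which is exactly the point: reported numbers would become directly comparable almost for free.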

I think we’re under-appreciating the fact that we’re dealing with pure software. That sounds obvious, but it’s actually a big deal. Setting up tightly controlled experiments in fields like medicine or psychology is almost impossible and involves an extraordinary amount of work. With software it’s essentially free. It’s more unique than most of us realize. But we’re just not doing it. I believe one reason why these changes (and many others) are not happening is a misalignment of incentives. Truth be told, most researchers care more about their publications, citations, and tenure tracks than about actually driving the field forward. They are happy with a status quo that favors them.

Lack of rigor

The second problem is closely related to the first. I hinted at it above. It’s a lack of rigor and reproducibility. In an ideal world, a researcher could hold constant all irrelevant variables, implement a new technique, and then show some improvement over a range of baselines within some margin of significance. Sounds obvious? Well, if you happen to read a lot of Deep Learning papers this sounds like it’s coming straight from a sci-fi movie.

In practice, as everyone re-implements techniques using different frameworks and pipelines, comparisons become meaningless. In almost every Deep Learning model implementation there exists a huge number of “hidden variables” that can affect results. These include non-obvious model hyperparameters baked into the code, data shuffle seeds, variable initializers, and other things that are typically not mentioned in papers, but clearly affect final measurements. As you re-implement your LSTM, use a different framework, pre-process your own data, and write thousands of lines of code, how many confounding variables will you have created? My guess is that it’s in the hundreds to thousands. If you then show a 0.5% marginal improvement over some baseline models (with numbers usually taken from past papers and not even averaged across multiple runs) how can you ever prove causality? How do you know it’s not a result of some combination of confounding variables?
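You can demonstrate the effect of these “hidden variables” in a few lines. The following toy sketch (hypothetical function names, numpy only) trains the same trivial model on the same data distribution five times, changing nothing but the random seed that controls data sampling, weight initialization, and mini-batch shuffling – exactly the kind of detail a paper would never mention:

```python
import numpy as np

def train_tiny_model(seed, init_scale=0.1, lr=0.5, steps=100):
    """Fit y = 2x with a single weight via mini-batch SGD.
    The seed controls data sampling, weight init, and batch shuffling --
    three "hidden variables" that rarely appear in papers."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-1, 1, size=200)
    y = 2.0 * x + rng.normal(scale=0.3, size=200)  # noisy targets
    w = rng.normal(scale=init_scale)               # random initialization
    for _ in range(steps):
        idx = rng.permutation(len(x))[:32]         # shuffled mini-batch
        grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])
        w -= lr * grad
    return w

# Identical model, identical data distribution -- only the seed differs.
runs = [train_tiny_model(seed=s) for s in range(5)]
print(max(runs) - min(runs))  # non-zero spread from "irrelevant" choices
```

Even this one-parameter model shows a spread across seeds. Now scale that up to millions of parameters, a different framework, and a re-implemented pipeline, and a 0.5% gap over a baseline tells you very little.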

Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on Github and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.

Again, “just engineering” has the potential to solve this. The solution is pretty much the same as for problem #1 (standard code, datasets, evaluation entities, etc.), and so are the obstacles. In fact, it may not even be in the best interest of researchers to publish readable code. What if people found bugs in it and you had to retract your paper? Publishing code is risky, without a clear upside other than PR for whatever entity you work for.

/ END OF RANT

Talking to Machines – The Rise of Conversational Interfaces and NLP

Depending on where you’ve lived you’re probably using at least one of WhatsApp, WeChat, LINE, or Kakaotalk. At first glance, these apps look like communication tools. They allow you to talk to your friends, family, coworkers and business partners. Sometimes all at once. But messenger apps are just the beginning of a much larger trend. Pretty soon, you’ll use such apps not to talk to your friends, but to talk to machines. Chat applications will become a new interface used to consume information and services. In fact, many companies have started to move in this direction. We’ll get to that later, but bear with me for a little while.

Let me clarify what I mean by interface first. For the purpose of this post, an interface is a layer of technology that facilitates the exchange of information and services between humans and machines. Perhaps the most ubiquitous interface today is the web browser. We’ve invented several layers of complex technologies: HTTP, HTML, CSS, Javascript, and many others, to make the web browser interface work well. The same is true for mobile apps, which share a lot of the same principles. Some people argue that browsers and mobile apps are fundamentally different, but I disagree. There are subtle differences, but you’re essentially looking at a screen and clicking or touching stuff. They’re quite similar. In both cases, we use them to communicate with other people (chat apps), find information (Google), and consume services (Amazon).

By design, all interfaces impose constraints on their users. Some constraints are good – they force us to focus on the essential, making us more efficient. But some are bottlenecks to our productivity, preventing us from doing what feels natural to us. Take search as an example. Humans are quite good at asking questions. We’ve been doing that for thousands of years. But can you have a conversation with Google to get the information you need? No, you can’t. (Okay, except for a tiny fraction of very simple and commonly asked questions). Instead, we’ve adjusted to the interface that Google is providing us. We’re entering keywords that we think Google will be able to efficiently utilize to give us what we want. Most of us are doing this unconsciously already because we’ve been conditioned to do so. But if you watch your mom, who didn’t grow up with Google, you’ll see that she has a hard time wrapping her head around what “good” keywords are, and why she can’t find what she needs. She also can’t tell spam from non-spam. Keyword search doesn’t come naturally to her.

Now, there’s an alternative interface that we use to communicate, get information and consume services almost every day: Natural Language. Natural Language is nothing more than the transmission of information, or meaning, through a medium such as sound or script. That’s also why I won’t make a distinction between speech- and text-based solutions in this post. They both have their place, but they embody the same thing. Language is an interface that, right now, mostly facilitates communication between humans. But you could imagine using this same interface to facilitate communication between humans and machines. Instead of putting keywords into Google you’d ask a question, potentially have a quick conversation, and get a response. Instead of clicking through a dozen app screens you’d just say what food you want to order, or tell your car where you want to go.

Just like other interfaces, the natural language interface has constraints. It isn’t suited for everything. Obviously it’s bad for visual tasks – shopping for clothes, for example. It also doesn’t seem so great for discovery, like browsing items on Amazon without a clear goal. But you can certainly imagine use cases where natural language excels. Asking questions to get information or conveying tasks with a clear intention – ordering a specific food item or a cab, or giving directions, for example. Natural Language places a low cognitive load on our brains because we’re so used to it. Hence, it feels more natural and effortless than using an app or browser.

Of course, Facebook (M), Google (Now), Apple (Siri), Microsoft (Cortana) and Amazon (Echo) have been working on natural language interfaces for a while now. They’re calling them personal assistants (PAs). However, there’s another group of companies attacking the same problem from a different angle, and we’ve been ignoring that part of their business: messenger apps. These platforms are uniquely positioned to enable communication with machines. The leader of the pack is probably WeChat, which allows you to order everything from food to taxis from within the app, using messaging as the interface. Slack is moving in the same direction by encouraging the creation of various bots. LINE also has a range of robotic accounts, ranging from translation services to erotic entertainment.

Why do I think these two groups of companies are working on the same problem? PAs follow what I would call a top-down approach. They’re trying to solve a challenging problem with a long history in AI: Understanding your intentions and acting on them. They are trying to build general purpose software that is smart enough to also fulfill specific tasks. It’s a research area with lots of stuff left to figure out. Messenger apps follow a bottom-up approach. They are starting with simple bots (“Hi, translate this sentence for me”), solving very specific problems, and are gradually moving towards more sophisticated AI. Over time, these two approaches will converge.

When it comes to the adoption of natural language interfaces, messenger apps may actually have a leg up. Here’s why. Talking to Siri on a crowded train still feels awkward, even in SF. It’s also annoying because you have to open yet another app. However, many of us are spending our time in chat apps anyway, so it feels completely natural to just add a conversation with a bot. Ditto for Slack. The transition for consumers seems so much smoother. As conversational interfaces penetrate our world, voice interfaces will start feeling just as natural as taking selfies (and hopefully more than selfie sticks). But another reason why messenger apps are well positioned is data. These companies have collected huge amounts of conversational data that can be used to train better Natural Language models. Facebook has that too, both with their messenger and with WhatsApp. Microsoft doesn’t. Amazon doesn’t. Apple may or may not, depending on how secure iMessage really is.

If you’ve actually used any of the PAs you may be skeptical. Siri still barely understands what you want, and Facebook has put hordes of human workers behind M to get it to do anything useful. How will these things ever replace all the complex tasks we’re doing in apps and browsers? That’s another reason why the bottom-up approach seems promising. It yields immediate benefits and practical applications.

But I also believe that over the coming years we will see rapid improvements in conversational interfaces. There are clear enabling technologies for this trend: Natural Language Processing (NLP) and Deep Learning. Deep Learning techniques have led to breakthroughs in Computer Vision, and they are now penetrating natural language research. Whether or not Deep Learning itself will lead to breakthroughs in NLP is hard to say, but one thing is clearly happening: Many smart people who didn’t previously focus on NLP are now seeing the potential and are starting to work on NLP problems, and we can surely expect something to come out of this.


Why I’ll never go to an office again

Over the past few years I’ve had the chance to work with many distributed teams. Most of them were startups, but I’ve also done remote projects for larger corporations. I honestly have never been happier with my work-life balance and I firmly believe that distributed teams are the future of engineering organizations. I can no longer imagine living and working any other way. In this post I want to discuss what about remote work had the biggest impact on my personal happiness as well as some of the things I’ve learned along the way.

Increased productivity

Working remotely has helped me to better understand my body. Previously I didn’t have much choice in setting my hours. By tracking my productivity over the course of the day I found that I am most productive in the morning (6-11am) and early evening (4-6pm). I essentially can’t get any intellectually demanding work done during noon or late at night. Being in the office from 11-3 is a waste of time for me and for the company I work for. I don’t produce any output and I feel horrible. Setting my own schedule means that I can take a break at noon, get lunch, work out or go for a run, take a nap, and run some errands. Afterwards I feel happy, re-energized and ready to get stuff done again. I know plenty of people who love to work late at night. So why not let them?

While some centralized teams offer flexible hours, they usually come with limitations. If you have a long commute (next point) what can you really do within 4 hours of free time? What if there’s no gym or track around your office? What if you want to buy groceries for your home? What about the peer pressure of eating and hanging out with co-workers? If you’re the only person in the office at midnight what’s the reason for being there at all?

No commute

Studies have found that a long commute has negative effects on life satisfaction. The longer the commute, the more pronounced these effects are. Some people succeed in setting up a home office but for many (including me) it’s rather difficult to get serious work done at home. I typically work from a co-working space pretty close to my house. So it isn’t quite true that there is no commute, but usually the commute is much shorter and doesn’t feel like a commute. For example, I walk to my favorite coffee shop in the morning and enjoy getting some fresh air. That’s pretty different from a 1-hour drive and being stuck in traffic.

Reducing the commute has been huge for me. You may have gotten used to it already, but don’t underestimate the stress that comes from commuting. I no longer worry about when exactly to leave home, whether or not there’s traffic, or when I will arrive at my office. As a result I feel less stressed and more productive throughout the day.

Flexibility and ability to travel

Centralized teams have set up a structure of synchronous communication where not being in the office at the usual times breaks processes like scheduled meetings. Distributed teams typically design their communication structure to be asynchronous and less reliant on people being available at the same time. This means that occasionally changing the times you work isn’t such a big deal. When I have an appointment, errands to run, or want to show a friend around town I simply change my hours to make room for it. I can also satisfy my urge to travel to other places or countries. My favorite places to live are currently Japan and Thailand, and I’ve been traveling back and forth between them.

Office politics and gossip

This point isn’t really a result of working remotely but rather of how distributed teams are typically set up and run. Humans are social animals and as the team grows it’s inevitable that office politics and social hierarchies develop. I am not talking about professional relations, which are usually pretty clear, but interpersonal ones. Things that are seemingly unrelated to work, such as forgetting someone’s birthday, deciding what to get for lunch, or not going out for beers with the others inevitably spill over to professional interactions (sometimes just subconsciously). While some people benefit from these things, they can become a huge factor of stress and work dissatisfaction for others.

Distributed teams don’t avoid office politics completely, but almost. Restricting in-person social interaction and focusing on trackable results and metrics leads to significantly less politics and stress. Separating work and personal life is much easier when working remotely.

What about the employer’s perspective?

To make distributed teams the norm we need to make it lucrative for both sides, the employee and the employer. This post is written from the employee’s perspective and that’s arguably the easier side to convince. The benefits for the employee are pretty well understood and there are enough people who want to work remotely.

However, making distributed teams more productive than centralized ones is hard. Asynchronous communication is probably the main reason for this. Most distributed teams I’ve worked with were probably less effective than if everyone had been in the same space. But I don’t think it has to be this way. With the right processes in place I believe that we can make distributed teams work for both employees and employers. I’ll discuss some of the things I’ve learned from the employer’s side in a future post.

I’d love to hear about your experiences with distributed teams.