Hi, this is DJ. I'm a machine learning engineer at True Theta. And in this video, we're going to ask the question: why is reinforcement learning applied so rarely in industry? It's a good question because reinforcement learning is really attractive, but except in truly rare cases, it's really not the right choice for industry applications. I mean, I understand why people love the topic. I'm one of them. The math is beautiful, and it tackles a problem statement so general it can describe almost any goal-oriented task. And that's why I spent such a long time putting together my series on reinforcement learning. But since then, I've gone to work with companies on applied RL, like for marketing, e-commerce, and quantitative finance, and I find myself recommending and building with much more traditional techniques. I'm making this video almost out of guilt; I feel like I have to share this practical reality. I'll start by giving more of the story. After my RL series, a handful of people reached out for help with applying RL. I was at Lyft at the time, so I wasn't available, but it was pretty exciting. For example, one inquiry came from someone at the Department of Defense, which was quite interesting. The others were from small or mid-size tech companies. So this had my attention. Also, as I mentioned in the series, Lyft had deployed into production a successful application of RL. So I was convinced this was the next wave of applied innovation. In fact, that was enough for me to take the leap and start an applied ML company to work on projects like this and others. My first step was to get a better understanding of tried-and-true best practices for real applications of RL. So I reached out to a bunch of people with applied RL experience. I ended up speaking with engineers and scientists from Expedia, Siemens Energy, Amazon, Pinterest, Microsoft, Meta, ThoughtWorks, and a handful of startups.
What I discovered is that in most of these successful applications, a pretty generous definition of reinforcement learning was used. This wasn't totally surprising, since it's good marketing to say you do RL and there's no clear boundary between RL and not-RL. In fact, Ben Recht has a good article on this problem, describing how the most general definition of RL is learning how to map situations to actions, which includes virtually everything, while the least general is specifically connected to the techniques favored by the RL literature. This means the label "reinforcement learning" is found on a lot of applications, and those applications aren't necessarily what you expect. For example, in one serious logistical application of RL, the team used a huge, carefully designed simulation of what next week's shipping and transportation outcomes will be. They feed in a bunch of parameters specifying how they'll operate next week, and the simulation spits back a huge collection of outcomes, which they can evaluate by looking at things like expected delays or costs. The goal is to pick the parameters that make the simulations look good. Now, is this RL? Well, since the parameters could be interpreted as a policy and the simulation is the environment, you could call it RL, or you could call it by its classical name, simulation optimization, and impress a lot fewer people. To me, this just feels like vanilla stochastic optimization because there's no critical notion of time, state, or sequence of dependent actions. I also learned that when something is squarely RL and is commercially successful, it's often in really exceptional circumstances. For example, contextual bandits should count as reinforcement learning. They have the explicit concepts of time, states, actions, and rewards, use the same theory and often the same notation as the RL literature, and work commercially, like for dynamic pricing and recommender systems. That's actually why I wrote some explanations for how they work.
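To make the contextual-bandit setup concrete, here's a minimal sketch of an epsilon-greedy contextual bandit: one observed context, one action, one immediately observed reward, then the episode is over. Everything here, including the two-context environment and the payoff probabilities, is a made-up toy, not any production system mentioned above.

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy contextual bandit: one action per round, immediate reward."""
    def __init__(self, n_contexts, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.n_arms = n_arms
        self.counts = [[0] * n_arms for _ in range(n_contexts)]
        self.means = [[0.0] * n_arms for _ in range(n_contexts)]

    def choose(self, context):
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)  # explore
        # exploit: pick the arm with the best running mean reward for this context
        return max(range(self.n_arms), key=lambda a: self.means[context][a])

    def update(self, context, arm, reward):
        # incremental running mean of observed rewards
        self.counts[context][arm] += 1
        n = self.counts[context][arm]
        self.means[context][arm] += (reward - self.means[context][arm]) / n

# Hypothetical environment: arm 0 pays off more in context 0, arm 1 in context 1.
random.seed(0)
bandit = EpsilonGreedyBandit(n_contexts=2, n_arms=2)
for _ in range(5000):
    ctx = random.randrange(2)
    arm = bandit.choose(ctx)
    p_reward = 0.7 if arm == ctx else 0.3  # matched arm wins 70% of the time
    reward = 1.0 if random.random() < p_reward else 0.0
    bandit.update(ctx, arm, reward)

print([[round(m, 2) for m in row] for row in bandit.means])
```

Note there's no credit assignment across time here: each reward is attributed to exactly one action, which is precisely the simplification discussed next.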
However, contextual bandits are a seriously truncated flavor of RL. They require that there be only one action and an immediately observed reward attributable to that action. After that, the episode is over. It's literally many single-action simulations. Now, in textbook RL, the goal is to optimize a whole sequence of actions, often under noisy and sparse rewards, which is exponentially harder. By assuming the real-world application has the benefit of single-step actions with immediate rewards, contextual bandits sidestep one of the most difficult challenges of RL. That challenge is the credit assignment problem, where actions need to be attributed to long-term consequences, which is normally totally ambiguous. So since they only apply to these especially simple real-world cases, bandit algorithms are far from the technologies dominating video games and AI demos. Okay, so we should ask: is there any full-fledged RL that is commercially successful? Well, I only found one convincing case, and that was from Siemens Energy. Now, for many years, they've made a strategic bet on RL for energy management, amongst other things. And as you'll see, their cases again really are exceptional. First of all, they are studying physical systems where precise laws of physics can be applied. This allows their simulations to be much more honest reflections of the real world than what you'd see in other commercial applications, like those modeling human behavior. Second, they deal with huge data collected over very short time scales, where an adjustment needs to be made every few milliseconds, like balancing turbine flow or managing voltages across a grid. In this case, long sequences of actions need to be determined in quick succession and in an
Segment 2 (05:00 - 10:00)
environment where you can't collect a fixed batch of data to know what to do. This combination of requirements, fast feedback, accurate physics, and a lot of repetition, makes RL especially fitting, but it also makes it unlike almost all other commercial applications. Now, I expect my audience to propose two potential counterexamples. The first is: what about reinforcement learning with human feedback for large language models? Isn't that full-fledged RL that works in industry? I'd argue it's not really true RL in the sense that people imagine. The human yes/no feedback gives an immediate reward signal that bypasses a very hard component of RL, namely sparse and delayed rewards. It's closer to supervised fine-tuning with a reward model bolted on than it is to training an agent in an open, dynamic environment. It's using RL algorithms as a final optimization layer, but it's not dealing with the classic challenges of RL. And also, you need massive pre-trained models and a huge amount of labeled human preference data before RLHF even makes sense. That's a very particular setup and not the kind of thing most companies can replicate. And the second counterexample I expect to hear from the audience is: hey, what about Lyft? After all, I did say their application performed well, and it did, but it also came at a cost. Earlier this year, the team that worked on the RL application at Lyft published an article where they admit that after some time, they realized that having this RL agent acting in the market created some blinding complexity. With the dynamic agent updating itself and mysteriously optimizing metrics, it became hard to reason about and test ways to improve other components of their operations. Yes, you can get RL to deliver in the real world, but it's not necessarily clear whether it'll be worth it. So now I'll say what I personally believe is the main reason RL is so hard in real-world applications. To me, it comes down to evaluation.
If you can evaluate your algorithm, meaning you can get an accurate estimate of its performance in the future, then you can optimize that evaluation and deploy a good model. This is fairly easy in classic supervised learning, and it's why supervised learning is applied everywhere. In real-world reinforcement learning, where you can't let an agent trial-and-error its way to good performance, you have to do offline policy evaluation, which is hard and data hungry. In the general case of offline evaluation, you assume the data is generated by one policy and then you use it to evaluate a proposed policy. As a reminder, the policy is the mathematical object that tells the agent how to act in any situation. Okay. Now, the data you get from the environment will necessarily change under the new policy. So you need to approximate this new, unseen data by reweighting the existing data. And the further away you move from the data, the worse your approximations will be. And with high-dimensional data, it's very easy to move away from your data. From one perspective, a big challenge like this is expected. RL handles data that the model itself generates, breaking the classic statistics assumption of independent and identically distributed (IID) data, meaning you don't get that assumption's great benefits. However, in practice, when you encounter non-IID data, the more reliable path is to use other tools like time series forecasting or causal inference to model those dependencies directly rather than jumping to the full complexity of RL. Now, here's how I'd go about answering the question: should you use RL for a particular application? First, consider that the industry has been experimenting with RL for nearly a decade, and we see supervised learning virtually everywhere and RL almost nowhere. So you should start with a prior that there's a 99% chance you should just use supervised learning. From there, you'll need to gather enough reasons to override this prior. Okay.
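The reweighting idea above can be sketched with the simplest offline estimator, inverse propensity scoring: weight each logged reward by how much more (or less) likely the proposed policy is to take the logged action than the logging policy was. The environment, policies, and probabilities below are a made-up toy for illustration, and note how the estimator would blow up if the new policy took actions the logging policy rarely took, which is exactly the "moving away from your data" problem.

```python
import random

def ips_estimate(logged, new_policy):
    """Inverse propensity scoring: reweight each logged reward by
    pi_new(action | state) / pi_logging(action | state)."""
    total = 0.0
    for state, action, reward, logging_prob in logged:
        weight = new_policy(state, action) / logging_prob
        total += weight * reward
    return total / len(logged)

# Hypothetical setup: two states, two actions, matching action pays 1.
def true_reward(state, action):
    return 1.0 if action == state else 0.0

# Logged data collected under a uniform logging policy (prob 0.5 per action).
random.seed(1)
logged = []
for _ in range(10000):
    s = random.randrange(2)
    a = random.randrange(2)
    logged.append((s, a, true_reward(s, a), 0.5))

# Proposed policy: always take the matching action.
def new_policy(state, action):
    return 1.0 if action == state else 0.0

estimate = ips_estimate(logged, new_policy)
print(round(estimate, 2))  # should land near 1.0, the new policy's true value
```

Here the logging policy covers every action with probability 0.5, so the weights stay bounded; with a high-dimensional state or a logging policy that rarely explores, the weights get huge and the estimate's variance explodes.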
From here, the first question you should ask is: how much does the data depend on the model's outputs or actions? As an extreme negative example, astronomers on Earth should never use reinforcement learning, because nothing we see out there in the universe depends detectably on our actions on Earth. In other words, it is totally safe to assume the universe beyond our atmosphere has nothing to do with the code we're writing. A slightly more grounded example is the stock market. In most cases, you can assume that the market you observe isn't a result of your transactions. This isn't true if you're a Warren Buffett or a huge hedge fund, since your trades can move the market, but generally it's a safe assumption. If this is the case, be happy. You can use the wonderfully effective techniques of supervised learning. You may not necessarily be able to assume IID data, since there are other ways that assumption is violated, but you can safely ignore pure RL strategies. In fact, this reminds me of a famous quote from Vladimir Vapnik: "When solving a problem of interest, do not solve a more general problem as an intermediate step." And that makes for a good moment to tell you about my sponsor, Hudson River Trading. If you're the kind of person who hears about these deep challenges, like building reliable models in complex environments, and you get excited instead of intimidated, then you should really know about HRT. I'm very happy to have them as a sponsor, partly because I have a background in quantitative finance and have known about their
Segment 3 (10:00 - 13:00)
reputation for years. They're one of the top quantitative trading firms in the world and tackle some of the hardest problems in machine learning and computer science, and they don't do it in a lab. They test themselves against the ultimate unforgiving real-time environment that is the financial markets. This involves interesting and brain-bending challenges. Their projects include things like building distributed file systems for massive-scale data, forking Python to optimize for faster deployments, and carefully engineering model evaluation to approximate future performance. If these sound like projects you'd do well with, you should definitely check out their link in the description. All right, now let's get back to it. Okay, say you are in an environment where the data does depend on the model's outputs. The most commercially significant case in this category is recommender systems. For example, the movies people watch on Netflix are very heavily driven by Netflix recommendations. And this is true in many places, like in e-commerce and on social media platforms. But even in those cases, which are at large companies with huge R&D budgets, reinforcement learning still isn't the weapon of choice. Instead, there's a huge recommender system literature that directly addresses these issues. Their standard practice is to turn recommender problems into essentially multiple prediction problems, normally via some matrix factorization with some neural nets mixed in. As an example, it's generally good enough to just be able to predict whether a user will watch a movie. The training data forming that prediction will be biased due to its dependency on a previous recommender, but that bias may not be large, and it's certainly not big enough to start using RL. If they're worried about this issue, it's better to just run experimental deployments.
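As a rough illustration of that matrix-factorization approach, here's a minimal SGD sketch: represent each user and each item as a small latent vector, and predict watch/skip as their dot product. The toy data, dimensions, and hyperparameters are all invented for illustration; production recommenders are vastly more elaborate, but the core move of reducing recommendation to prediction looks like this.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500):
    """Plain SGD matrix factorization: predicted score = dot(user_vec, item_vec)."""
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                # gradient step on squared error with L2 regularization
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

# Hypothetical watch data: (user, item, 1.0 = watched / 0.0 = skipped).
ratings = [(0, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0),
           (1, 2, 1.0), (2, 2, 1.0), (2, 0, 0.0)]
U, V = factorize(ratings, n_users=3, n_items=3)

def predict(u, i):
    return sum(U[u][f] * V[i][f] for f in range(2))

# Score an unseen (user, item) pair to decide whether to recommend it.
print(round(predict(2, 1), 2))
```

Notice this is pure supervised prediction on logged interactions; the feedback-loop bias from the previous recommender is simply tolerated, or handled with experimental deployments, rather than by bringing in RL.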
But let's say we don't have the benefit of a well-funded, open research community dedicated to our specific problem of how the data depends on the model's outputs. You can make this case for dynamic pricing, where companies are a lot more cagey about their algorithms. In that case, full RL still isn't the answer. The closest thing that they use is contextual bandits, and the more standard approach is to use causal inference methods. Okay. So, are there cases where true, full-fledged RL is something to recommend? Yes, at places like Siemens Energy. But again, that's a company with huge data, a very experienced team, patience to get things wrong, and a problem so unique it can't be directly addressed with traditional methods. So, in summary, reinforcement learning is rarely the right tool for industry applications because the central challenge of reliably evaluating an agent's real-world performance is unsolved. In practice, most RL successes are actually simpler methods in disguise, or heavily simplified versions that avoid the core difficulties. The hard reality is that the vast majority of business problems are solved more effectively with traditional machine learning or statistics. And finally, let me pay the bills and plug my machine learning consultancy, True Theta. Think of us as an ML repair team. Companies call us when a critical machine learning system is underperforming or broken. We do an extensive audit of the system, put together a tech spec for the solution, get feedback on it from the team, and implement the fix alongside the engineers. And we've recently expanded our team to include senior MLEs who've worked at Amazon, Microsoft, Capital One, and Lyft. We know what we're doing. We love this work. And if you're at a company that wants to work with us, you can email me directly at dj@truetheta.io. Okay, I'll see you next time.