What are some alternative learning paradigms besides Markov decision processes?
Sridhar Mahadevan, Fellow of AAAI
Answered May 10, 2018
Great question! First, we need some history to put things into perspective. Markov decision processes (MDPs) are a probabilistic model of sequential decision making that originated in operations research (OR), growing out of a remarkable PhD dissertation at MIT by Ronald Howard in the late 1950s. Howard introduced the famous policy iteration algorithm, which is the basis of the currently popular “actor-critic” methods (which can be viewed as approximate policy iteration methods). MDPs remain an extremely widely studied model in OR. Richard Bellman, a highly influential applied mathematician, coined the term “dynamic programming”, and the name DP has stuck as well. So, MDPs were not invented recently by AI folks; they have been around for over 50 years.

Dimitri Bertsekas at MIT is a prolific author who has written many fine books on DP and MDPs, as well as a landmark 1996 book on “neuro-dynamic programming” (which is basically the term he invented for reinforcement learning), co-authored with John Tsitsiklis. The NDP book was actually the first book on RL ever published, and it still remains the most authoritative treatment 20 years later. If you *really* want to understand RL, and by that I mean *when* algorithms like Q-learning converge, and *why* nonlinear function approximation schemes like neural nets are highly unreliable and do not converge in general when combined with Q-learning, the NDP book is your desert island treatise. It won an award from the DP/OR community, and rightly so: it at once introduced a huge community to this new field called RL, and it gave RL a much-needed shot in the arm of mathematical respectability. Maei’s PhD thesis from the Univ. of Alberta also has a very nice summary of convergence results.

To summarize a huge literature on RL theory: RL methods converge (usually very slowly) when no function approximation is introduced, i.e., when you use a table to store state-action pairs and their values. Once you introduce even mild forms of function approximation (e.g., a linear architecture, where values are a dot product of state features and tunable weights), the news is generally bad. One can exhibit ridiculously small MDPs (e.g., Baird’s classic counterexample from the mid 1990s, an MDP with 6 or so states) on which Q-learning in effect “blows up”. So, this is as good a reason as any to look for alternatives. Fixes exist, but they are not easy to explain: one of my recent PhD students found just such a fix and developed a gradient form of Q-learning with genuine convergence guarantees, but only in special cases rather than in full generality, and it took him six years of effort.
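
To make the linear-architecture setting concrete, here is a minimal sketch (in Python/NumPy, not taken from the NDP book) of semi-gradient Q-learning in which Q(s, a) is a dot product of a feature vector and tunable weights; the `features` function and the way transitions are obtained are hypothetical placeholders. This is exactly the regime where the tabular convergence guarantees no longer apply.

```python
import numpy as np

# Minimal sketch: semi-gradient Q-learning with a linear architecture.
# Q(s, a) is approximated as a dot product of features and tunable weights w.
# `features(s, a)` is a hypothetical feature map supplied by the caller.

def q_value(w, features, s, a):
    return np.dot(w, features(s, a))

def q_learning_update(w, features, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    # Off-policy TD target: bootstrap from the greedy next action.
    q_next = max(q_value(w, features, s_next, a2) for a2 in actions)
    td_error = r + gamma * q_next - q_value(w, features, s, a)
    # Semi-gradient step: differentiate the prediction only, not the target.
    return w + alpha * td_error * features(s, a)
```

With one-hot (tabular) features this reduces to ordinary Q-learning and converges under the usual step-size conditions; with general linear or nonlinear features, the combination of bootstrapping, off-policy updates, and function approximation can make the weights diverge, which is the behavior Baird’s counterexample exhibits.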

So, what are the alternative learning paradigms to the one offered by the MDP/RL framework? Like most scientific theories, MDPs make many assumptions, which are fine in a factory setting but questionable as a model of animal or human decision making. Three of the most questionable assumptions are discussed below.

The world can be modeled as an MDP. By this we assume there exist “states” under which the world is Markov: a “state” summarizes past history, so it is a sufficient statistic for taking decisions. This idea is at the core of the MDP model. In a factory/OR setting, of course, one can control what is defined as a state. A machine used in a factory to make parts can do only a very limited set of things (paint, drill, etc.), so here states make perfect sense. When you apply the same concept to animal or human perception, huge problems arise. Is your current visual image the “state”? Of course not, because what you remember from a few minutes ago might be critically important to the action you take now. Take the famous Atari video game environment popularized by DeepMind (at the rate at which folks are burning energy running their GPU machines on this domain, global warming may actually happen a bit sooner than we had assumed!). Is the current image of the game a valid “state”? No! So, what the DeepMind researchers did was a bit of a hack: they looked at the past 4 images. Why 4? No particular reason, just a hack that works (sketched below). Unfortunately, this tells us nothing about how to generalize the approach to domains where the last 4 images are not enough. For example, you park your car in a huge parking lot and go to work (perhaps at one of these giant tech companies with tens of thousands of employees). In the evening, you need to find your car again. Guess what? You can’t if you only remember the last 4 images. You need to remember all the way back to the morning when you parked it! How do we know how far back to go? No one knows how to solve this problem, because it can’t be solved. In reality, there are always “unobservable confounders”, as researchers who study decision making in the health sciences, public policy, management, or any other real-world complex task will tell you.
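
For concreteness, the “past 4 images” trick amounts to treating a short sliding window of recent observations as the “state”. Here is a minimal sketch in Python (the window length and the way observations arrive are placeholders, not DeepMind’s actual code); note that k = 4 is precisely the arbitrary choice being criticized.

```python
from collections import deque
import numpy as np

# Sketch of the frame-stacking hack: approximate the "state" by the last k
# observations. Any task whose relevant history is longer than k frames
# (like remembering where you parked this morning) breaks this approximation.

class FrameStack:
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Fill the window with copies of the first observation.
        for _ in range(self.k):
            self.frames.append(first_obs)
        return np.stack(self.frames)

    def observe(self, obs):
        # Slide the window forward and return the stacked pseudo-"state".
        self.frames.append(obs)
        return np.stack(self.frames)
```
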
Tasks are defined purely by reward functions. So, when you teach your Atari agent the Enduro task, the only information you give it is the reward for solving the task. In reality, this is far from how humans actually learn to drive. I have been driving for over 30 years. When I moved to California and needed to get a new driver’s license, I had to pass a tough written exam (only 3 answers out of 36 questions can be wrong, or you take it again). You have to study a big handbook they give you. No, driving is not about motor control; it is a knowledge-based task with many do’s and even more don’ts! If you want to make a right turn and the right-hand lane is a bike lane, you are allowed to encroach on it only within a certain distance of the turn, and you need to know what that distance is. Of course, an RL agent could try to learn that from reward, rack up millions of dollars in tickets, and eventually “get it”, but humans have a simpler solution: we just read the handbook and memorize the answer. It’s faster than RL, and it’s what we often do.
Agents (animals, robots, and humans) should maximize expected utilities. This piece of fiction has been around for the past 40–50 years and has been thoroughly disproved in economics (leading to several Nobel prizes). Unfortunately, this bit of information has not yet percolated to the ML/RL community. Kahneman and Tversky spent their lives studying human decision making; imagine the novelty of that concept when you contrast it with what’s happening in RL! Instead of studying a mathematical abstraction of decision making, you actually do field work and see what people do: you put them in decision making situations and evaluate their choices. Kahneman and Tversky did exactly that, and found that people do not maximize expected utilities. Their work was so influential that it led to Kahneman winning the Nobel Prize (sadly, Amos Tversky had passed away, and Nobels are not awarded posthumously). Richard Thaler recently won another Nobel for extending their work and making behavioral economics one of the hottest areas in economics.
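
As a small illustration of how observed behavior departs from expectation maximization, here is a sketch: a hypothetical 50/50 gamble evaluated with the value-function form and the median parameter estimates reported in Tversky and Kahneman’s 1992 cumulative prospect theory paper (probability weighting omitted). It is meant only to show the shape of the idea, not their full model.

```python
# Loss aversion in a prospect-theory-style value function versus plain
# expectation. v(x) = x**alpha for gains, -lam * (-x)**beta for losses,
# with alpha = beta = 0.88 and lam = 2.25 (Tversky & Kahneman, 1992 estimates).

ALPHA, BETA, LAM = 0.88, 0.88, 2.25

def value(x):
    return x ** ALPHA if x >= 0 else -LAM * (-x) ** BETA

# Hypothetical gamble: a fair coin flip to win or lose $100.
gamble = [(0.5, 100.0), (0.5, -100.0)]

expected_value = sum(p * x for p, x in gamble)        # 0.0: expectation is indifferent
prospect_value = sum(p * value(x) for p, x in gamble) # about -36: the gamble is rejected

print(expected_value, prospect_value)
```

The specific numbers matter less than the qualitative point: in Kahneman and Tversky’s data, value is reference-dependent and losses loom larger than gains, which the standard expected-utility picture does not capture.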
So, if you are reading this out there in cyberspace, wherever you might be, I hope you are motivated to “think outside the MDP box” and begin the process of constructing a new paradigm. We don’t have to reinvent the wheel. Behavioral economics gives us some valuable insights. Similar insights have come from others, like Herbert Simon, another economics Nobel laureate, who spent his life showing that organizations do not optimize but find satisficing solutions, and Gerd Gigerenzer at the Max Planck Institute in Germany, who has explored how humans make decisions and written popular books with titles such as “Simple Heuristics That Make Us Smart”. A baseball player running to catch a long fly ball could do all kinds of mathematical calculations on his speed, the ball’s trajectory, and the wind, churning out partial differential equations in his head to decide how fast to run. In reality, he doesn’t, because he can’t. There is a simple heuristic that all ball players know, and it works surprisingly well: keep your gaze on the ball at a roughly constant angle and adjust your running speed accordingly.

Often, AI researchers are motivated by physics and math to look for simple, elegant mathematical theories. One of the biggest critics of this approach was Francis Crick, arguably the greatest biologist of the 20th century, who said: “don’t look for simplicity in biology”. What he meant was that, given the evolutionary history of every organism, including humans, what you are likely to find when you probe an organism’s behavior is a patchwork of solutions, one built on top of another, so that you get redundancy. Many studies of decision making in animals, from ants and bees to birds, dogs, and even humans, have found corroborating evidence for this view.

So, where does that leave RL research? I’ve spent almost 30 years working in this field, so I’m speaking from long experience when I say the field really needs a top-to-bottom overhaul, and the overhaul must begin from a basic principle: before constructing elaborate theories of sequential decision making, one should look at empirical field work to see how organisms actually make decisions. When one does that, as behavioral economists, ethologists, public policy experts, and many others have done, one finds very little support for the MDP model. This model was invented for decision making on the factory floor. That’s where it belongs. In the real world, we need a better solution.

I hope you will begin to explore alternative paradigms, since we could surely use some help in solving this problem! As Geoff Hinton put it so well, “science proceeds one funeral at a time”, and it is high time to give MDPs their funeral rites and move on. One final plea: before you run an Atari video game simulation on your very power-hungry GPU machine, please consider what impact you are having on the natural environment. Is it worth the cost? Do we need more papers on how to (re)solve the Atari video games? Why not look at real-world decision making by humans or animals instead?