Monday, January 28, 2013

Session 2

I started today by showing you the centipede game. The analysis of this game is by backward induction, similar to that used in finite-horizon dynamic programming.

In this class we had an important theorem (Theorem 3.1). It has an easy but nontrivial proof. Theorem 3.1 states that $F(x)$ satisfies the DP equation (3.7). The theorem holds under any one of the assumptions D (discounted), N (negative) or P (positive) programming. I explained what these three assumptions mean.
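
As a reminder of the shape of (3.7), using the deterministic set-up of the example below (plant equation $x_{t+1} = a(x_t,u_t)$, instantaneous cost $c(x,u)$, discount factor $\beta$), the DP equation takes the form
$$
F(x) = \inf_u \bigl\{ c(x,u) + \beta\, F\bigl(a(x,u)\bigr) \bigr\}
$$
(stated here for minimization of cost; the maximization version, relevant to P, is analogous).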

I gave an example, $c(x,u) = x$, $a(x,u) = -x$, $\beta = 1$, for which $F(x)$ does not exist. Note that a problem with this data fails to satisfy D, N or P.
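
To see why, note that in this example the control has no effect: starting from $x_0 = x$ the state alternates deterministically between $x$ and $-x$, and so does the instantaneous cost. The total cost over a horizon of $h$ steps is therefore
$$
\sum_{t=0}^{h-1} c(x_t,u_t) = x - x + x - \cdots =
\begin{cases}
x, & h \text{ odd},\\
0, & h \text{ even},
\end{cases}
$$
which (for $x \neq 0$) has no limit as $h \to \infty$. The costs also take both signs along the trajectory, which is why neither N nor P applies, and $\beta = 1$ rules out D.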

The example in Section 3.5 (of selling a tulip bulb collection) is very much like the secretary problem in Section 2.3. The difference is that now (i) we observe values (not just relative ranks), (ii) we wish to maximize the expected value of the selected offer (rather than the probability of choosing the best), and (iii) the number of offers is infinite, but there is a discount factor $\beta$.
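
As a small illustration of the sort of calculation involved (my own sketch, assuming for concreteness that offers are i.i.d. uniform on $[0,1]$; the distribution used in the notes may differ): the optimal policy is a reservation price, namely the value $V$ of continuing, which satisfies $V = \beta\,\mathbb{E}\max(X,V)$ and can be found by fixed-point (value) iteration.

```python
# Illustrative sketch (not from the notes): asset selling with discounting.
# Offers X are assumed i.i.d. Uniform(0,1); beta < 1 is the discount factor.
# Accept an offer x iff x >= V, where V = beta * E[max(X, V)].
# For Uniform(0,1), E[max(X, v)] = (1 + v**2) / 2, so we iterate that map.

def reservation_value(beta, tol=1e-12):
    v = 0.0
    while True:
        v_new = beta * (1 + v * v) / 2
        if abs(v_new - v) < tol:
            return v_new
        v = v_new

if __name__ == "__main__":
    for beta in (0.5, 0.9, 0.99):
        v = reservation_value(beta)
        print(f"beta = {beta}: accept the first offer >= {v:.4f}")
```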

One important way in which discounting by a factor $\beta$ can arise naturally is when a catastrophe can occur, with probability $1-\beta$ at each step, bringing the problem to an end. We also discussed the fact that the optimal policy does not change if past offers remain open.
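
Here is the calculation behind that interpretation (a sketch, assuming the catastrophe strikes independently with probability $1-\beta$ at each step): the problem is still alive at time $t$ with probability $\beta^t$, so the expected total cost incurred before the catastrophe time $T$ is
$$
\mathbb{E}\Bigl[\sum_{t=0}^{T-1} c_t\Bigr] = \sum_{t=0}^{\infty} \Pr(T > t)\, c_t = \sum_{t=0}^{\infty} \beta^{t} c_t,
$$
which is exactly the $\beta$-discounted cost of the problem without a catastrophe.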

In Section 4.3 (optimal gambling) we saw that timid play is optimal in the gambling problem when the game is favorable to the gambler ($p \geq 0.5$).

Consider this scenario: "A gambler has £2 and needs to increase it to £10 in a hurry. The gambler decides to use a bold strategy in which he stakes all his money if he has £5 or less, and otherwise stakes just enough to increase his capital, if he wins, to £10."

When $p \leq 0.5$, bold play maximizes the probability that the gambler reaches $N$ pounds. However, bold play may not be uniquely optimal.
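
To make the scenario above concrete, here is a small numerical sketch (mine, not from the notes). Under the stated bold strategy the gambler's capital moves from $i$ to $\min(2i,N)$ on a win and to $\max(2i-N,0)$ on a loss, so the success probabilities satisfy $P(i) = p\,P(\min(2i,N)) + (1-p)\,P(\max(2i-N,0))$ with $P(0)=0$, $P(N)=1$, which can be solved by successive substitution.

```python
# Illustrative sketch (not from the notes): probability that the bold strategy
# described above succeeds, starting with £2 and aiming for £10.
# From capital i: win -> min(2i, N), lose -> max(2i - N, 0).
# Solve P(i) = p*P(min(2i, N)) + (1-p)*P(max(2i - N, 0)) by repeated substitution.

def bold_success_probability(start, N, p, iterations=10_000):
    P = [0.0] * (N + 1)
    P[N] = 1.0
    for _ in range(iterations):
        for i in range(1, N):
            P[i] = p * P[min(2 * i, N)] + (1 - p) * P[max(2 * i - N, 0)]
    return P[start]

if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6):
        print(f"p = {p}: P(reach £10 from £2) = {bold_success_probability(2, 10, p):.4f}")
```

For $p = 0.5$ this returns $0.2 = 2/10$, consistent with the $i/N$ formula discussed below.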

When it was first discovered that bold play is not uniquely optimal, this was contrary to the accepted wisdom of the time (1955). The result impressed Leonard Savage when it was presented to him by a Ph.D. student named Lester Dubins. They developed a collaboration culminating in the famous monograph How to Gamble If You Must (Inequalities for Stochastic Processes). See the fourth paragraph of this obituary of Lester Dubins.

If $p=0.5$ all strategies are optimal. How could we prove that? Easy. Simply show that given any policy $\pi$ the value function is $F(\pi,i)=i/N$ and that this satisfies the DP equation. Then apply Theorem 4.1. You can read more about these so-called red and black games at the Virtual Laboratories in Probability and Statistics.
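
The key computation in that verification: if the gambler holds $i$ and stakes any allowable amount $s$, then with $p = 0.5$,
$$
\tfrac12\cdot\frac{i+s}{N} + \tfrac12\cdot\frac{i-s}{N} = \frac{i}{N},
$$
so $F(\pi,i) = i/N$ (with $F(\pi,0)=0$ and $F(\pi,N)=1$) is unaffected by the choice of stake and therefore satisfies the DP equation.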

In Section 4.5 (pharmaceutical trials) I introduced an important class of very practical problems. One obvious generalization is to a problem with $k$ drugs, each of which has a different unknown probability of success, and about which we will learn as we make trials of the drugs. This is called a multi-armed bandit problem. The name comes from thinking of a gaming machine (or fruit machine) having $k$ arms that you might pull, one at a time. The arms have different unknown probabilities of generating payouts. In today's lecture we considered a special case of the two-armed bandit problem in which one of the arms has a known success probability, $p$, and the second arm has an unknown success probability, $\theta$. I will say more about these problems in later classes. The table on page 18 was computed by value iteration. This table is from the book Multi-armed Bandit Allocation Indices (of which I am one of the authors). The amazon.co.uk page will let you look inside this book.
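
For anyone who wants to experiment, here is a rough sketch of a value-iteration calculation for the one-armed bandit of the lecture (my own illustration; I am not claiming this is how the table on page 18 was produced). It assumes a Beta(1,1) (uniform) prior on $\theta$, a reward of 1 per success, a discount factor $\beta$, and truncates the recursion after the unknown arm has been pulled many times.

```python
# Illustrative sketch (my own, not necessarily how the book's table was computed):
# value iteration for the one-armed bandit of Section 4.5.
# Arm 1 pays 1 with known probability p.  Arm 2 pays 1 with unknown probability
# theta, given a Beta(1,1) prior, so after s successes in n pulls the posterior
# mean of theta is (s + 1) / (n + 2).  Rewards are discounted by beta < 1.
# The recursion is truncated at n = n_max pulls of the unknown arm.

def one_armed_bandit_value(p, beta, n_max=500):
    known = p / (1 - beta)                 # value of playing the known arm forever
    # At the truncation point the posterior is nearly degenerate at its mean,
    # so use the better of the two arms as an approximate continuation value.
    F_next = [max(known, (s + 1) / (n_max + 2) / (1 - beta)) for s in range(n_max + 1)]
    for n in range(n_max - 1, -1, -1):     # backward induction in n
        F = []
        for s in range(n + 1):
            mean = (s + 1) / (n + 2)
            try_unknown = mean * (1 + beta * F_next[s + 1]) + (1 - mean) * beta * F_next[s]
            F.append(max(known, try_unknown))
        F_next = F
    return F_next[0]                       # value before any pull of the unknown arm

if __name__ == "__main__":
    print(f"F(0,0) for p = 0.6, beta = 0.9: {one_armed_bandit_value(0.6, 0.9):.4f}")
```

The reduction used in the code, that once the known arm is preferred it is optimal to play it forever (value $p/(1-\beta)$), holds because playing the known arm leaves the state unchanged.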