Saturday, February 16, 2013

Session 5

You will find that the policy improvement algorithm is useful in answering Problems 1 and 2 on Problems Sheet 3. In Problem 1, you will find $\lambda$ and $\phi$ for a certain policy ("On seeing a filling station, stop and fill the tank"), and then look for a condition under which the policy improvement algorithm cannot improve this policy (or, equivalently, under which this $\lambda$ and $\phi$ satisfy the optimality equation).
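For concreteness, here is a minimal sketch of that check, with illustrative names rather than the problem-sheet model: for a finite average-cost problem, a stationary policy is unimprovable exactly when its own action attains the minimum on the right-hand side of the optimality equation in every state.

```python
import numpy as np

# A minimal sketch, with illustrative names (not the problem-sheet model).
# For a finite average-cost MDP with costs c[x, u] and transition matrices
# P[u][x, y], a stationary policy f (with relative costs phi) is
# unimprovable iff, for every state x, the action f[x] attains
#     min_u { c(x, u) + sum_y P(y | x, u) phi(y) },
# which is the same as saying lambda and phi satisfy the optimality equation.

def is_unimprovable(c, P, f, phi, tol=1e-9):
    n_states, n_actions = c.shape
    for x in range(n_states):
        rhs = [c[x, u] + P[u][x] @ phi for u in range(n_actions)]
        if min(rhs) < rhs[f[x]] - tol:
            return False  # some action strictly improves on f(x)
    return True
```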

In Problem 2 you will use the policy improvement idea in the context of a discounted-cost problem. You find $F(x)$ for a simple policy (one in which the engineer allocates his time randomly), and then improve it by one step of the policy improvement algorithm. This leads to a policy in which the engineer puts all his maintenance effort into the machine with the greatest value of $c_i(x_i+q_i)/(1-\beta b_i)$. This policy is better, but may not yet be optimal.
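In code, the improved policy is just an argmax over an index. A one-line sketch, with hypothetical arrays standing in for the problem data:

```python
import numpy as np

# A sketch of the improved policy, with hypothetical arrays for the data:
# maintain the machine with greatest index c_i (x_i + q_i) / (1 - beta b_i).

def machine_to_maintain(c, x, q, b, beta):
    return int(np.argmax(c * (x + q) / (1 - beta * b)))
```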

I have a personal fondness for some of the things I talked about in the second half of today's lecture. Theorem 10.1 (and extensions of it to distributions beyond the exponential) was one of the things that I proved in my Ph.D. dissertation in 1980. The Lady's Nylon Stocking Problem was a fairly famous unsolved problem. It gave me something to talk about with non-mathematically inclined fellows when I first became a research fellow. The paper is "Scheduling jobs with stochastic processing requirements on parallel machines to minimize makespan or flow time", J. Appl. Prob. 19, 167–182, 1982.

The method of uniformization that I described in this lecture seems to have been introduced by Jensen (1953), but it was first used extensively in solving queueing problems in the late 1970s, by Grassmann (1977) and Keilson (1979).
Here is a quote from a recent paper by Rhonda Righter, "Stochastic sequential assignment problem with arrivals", Probability in the Engineering and Informational Sciences, 25, 2011, 477–485:

Proof. We consider the equivalent discrete-time problem by uniformizing with uniformization rate $\lambda + \gamma + \alpha = 1$, and we use value iteration (i.e., induction on a finite time horizon, $n$).

The Wikipedia article has more about it and summarises the method in these words:
The method involves the construction of an analogous discrete-time Markov chain, where transitions occur according to an exponential distribution with the same parameter in every state.
I have used this idea many dozens of times in my own research. It is usually the best first step to take when tackling a continuous-time Markov decision problem.
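For concreteness, here is a minimal sketch of the construction (in my own notation, not taken from any of the papers above): pick a rate $\Lambda$ at least as large as every state's total transition rate, and let the discrete-time chain take a step, possibly a fictitious self-loop, at each ring of an exponential($\Lambda$) clock.

```python
import numpy as np

# A minimal sketch of uniformization (my notation, not from a specific
# source). Given the generator matrix Q of a continuous-time Markov chain,
# choose a uniformization rate Lambda >= max_i |Q[i, i]| and form the
# one-step transition matrix of the analogous discrete-time chain,
#     P = I + Q / Lambda,
# so every state sees the same exponential(Lambda) transition clock, with
# self-loops standing in for states whose true rate is below Lambda.

def uniformize(Q, Lambda=None):
    Q = np.asarray(Q, dtype=float)
    if Lambda is None:
        Lambda = max(-np.diag(Q))        # the smallest valid rate
    P = np.eye(Q.shape[0]) + Q / Lambda  # rows of P sum to 1
    return P, Lambda

# Example: a two-state chain with rate 1 for 0 -> 1 and rate 2 for 1 -> 0.
Q = np.array([[-1.0, 1.0],
              [2.0, -2.0]])
P, Lambda = uniformize(Q)  # P = [[0.5, 0.5], [1.0, 0.0]], Lambda = 2
```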