Markov decision processes (MDPs) provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). A core body of research on Markov decision processes resulted from Ronald A. Howard's book published in 1960, ''Dynamic Programming and Markov Processes''. They are used in a wide range of disciplines, including robotics, automated control, economics, and manufacturing.

More precisely, a Markov decision process is a discrete-time stochastic control process. At each time step, the process is in some state <math>s</math>, and the decision maker may choose any action <math>a</math> that is available in state <math>s</math>. The process responds at the next time step by randomly moving into a new state <math>s'</math>, and giving the decision maker a corresponding reward <math>R_a(s,s')</math>.

The probability that the process moves into its new state <math>s'</math> is influenced by the chosen action. Specifically, it is given by the state transition function <math>P_a(s,s')</math>. Thus, the next state <math>s'</math> depends on the current state <math>s</math> and the decision maker's action <math>a</math>. But given <math>s</math> and <math>a</math>, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the ''Markov property''.

Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state and all rewards are the same (e.g., zero), a Markov decision process reduces to a Markov chain.

==Definition==
A Markov decision process is a 5-tuple <math>(S, A, P_a, R_a, \gamma)</math>, where
* <math>S</math> is a finite set of states,
* <math>A</math> is a finite set of actions (alternatively, <math>A_s</math> is the finite set of actions available from state <math>s</math>),
* <math>P_a(s,s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)</math> is the probability that action <math>a</math> in state <math>s</math> at time <math>t</math> will lead to state <math>s'</math> at time <math>t+1</math>,
* <math>R_a(s,s')</math> is the immediate reward (or expected immediate reward) received after transitioning to state <math>s'</math> from state <math>s</math> due to action <math>a</math>,
* <math>\gamma \in [0,1]</math> is the discount factor, which represents the difference in importance between future rewards and present rewards.

(Note: The theory of Markov decision processes does not state that <math>S</math> or <math>A</math> are finite, but the basic algorithms below assume that they are finite.)
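To make the definition concrete, the following is a minimal sketch of a finite MDP in Python. The two-state, two-action example, with its particular transition probabilities and rewards, is an invented illustration rather than anything prescribed by the theory; it only shows one way the 5-tuple <math>(S, A, P_a, R_a, \gamma)</math> can be represented and how a single step of the process is sampled.

<syntaxhighlight lang="python">
# Minimal sketch of a finite MDP as a 5-tuple (S, A, P, R, gamma).
# The two-state, two-action example below is made up for illustration.
import random

S = ["s0", "s1"]                      # finite set of states
A = ["stay", "go"]                    # finite set of actions

# P[a][s][s'] = probability that action a in state s leads to state s'
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1},
             "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8},
             "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s][s'] = immediate reward received after moving from s to s' under a
R = {
    "stay": {"s0": {"s0": 0.0, "s1": 1.0},
             "s1": {"s0": 0.0, "s1": 2.0}},
    "go":   {"s0": {"s0": 0.0, "s1": 5.0},
             "s1": {"s0": 1.0, "s1": 0.0}},
}

gamma = 0.95                          # discount factor

def step(state, action):
    """Sample the next state and the corresponding reward.

    The distribution of the next state depends only on the current state
    and the chosen action, never on earlier history -- the Markov property.
    """
    next_states = list(P[action][state].keys())
    probs = list(P[action][state].values())
    next_state = random.choices(next_states, weights=probs)[0]
    reward = R[action][state][next_state]
    return next_state, reward

# One short rollout under arbitrary (randomly chosen) actions.
state = "s0"
for t in range(5):
    action = random.choice(A)
    state, reward = step(state, action)
    print(f"t={t}: action={action}, next state={state}, reward={reward}")
</syntaxhighlight>

Note that <code>step</code> consults only the current state and the chosen action when sampling the next state, which is exactly the Markov property described above.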