Q-learning : English Wikipedia edition
Q-learning

Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. Once such an action-value function has been learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards without requiring any adaptations. It has been proven that for any finite MDP, Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward over all successive steps, starting from the current state, is the maximum achievable.
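As an illustration of how the optimal policy is recovered from the learned values, the following is a minimal sketch assuming the action-value function is stored as a Python dictionary; the states, actions, and values are hypothetical and serve only as an example.
<syntaxhighlight lang="python">
# Hypothetical learned action-value table: Q[state][action] -> expected utility.
Q = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.1},
}

def greedy_policy(Q, state):
    """Select the action with the highest learned value in the given state."""
    return max(Q[state], key=Q[state].get)

print(greedy_policy(Q, "s0"))  # -> "right"
print(greedy_policy(Q, "s1"))  # -> "left"
</syntaxhighlight>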
== Algorithm ==

The problem model consists of an agent, a set of states ''S'' and a set of actions per state ''A''. By performing an action a \in A, the agent can move from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total reward. It does this by learning which action is optimal for each state, namely the action with the highest expected long-term reward. This reward is a weighted sum of the expected rewards of all future steps starting from the current state, where the weight for a step \Delta t steps into the future is \gamma^{\Delta t}. Here, \gamma is a number between 0 and 1 (0 \le \gamma \le 1) called the discount factor, which trades off the importance of sooner versus later rewards. \gamma may also be interpreted as the probability to succeed (or survive) at every step \Delta t.
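For concreteness, a minimal sketch of this discounted sum; the reward sequence is made up purely for illustration, with rewards[dt] denoting the reward received \Delta t = dt steps into the future.
<syntaxhighlight lang="python">
# Discounted sum of future rewards: a reward dt steps ahead is weighted by gamma ** dt.
gamma = 0.9                     # discount factor, 0 <= gamma <= 1
rewards = [1.0, 0.0, 0.0, 5.0]  # hypothetical rewards 0, 1, 2, 3 steps into the future

discounted_return = sum((gamma ** dt) * r for dt, r in enumerate(rewards))
print(discounted_return)        # 1.0 + 0.9**3 * 5.0 = 4.645
</syntaxhighlight>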
The algorithm therefore has a function that calculates the Quality of a state-action combination:
:Q: S \times A \to \mathbb{R}
Before learning has started, ''Q'' returns an (arbitrary) fixed value chosen by the designer. Then, each time the agent selects an action and observes a reward and a new state that may depend on both the previous state and the selected action, ''Q'' is updated. The core of the algorithm is a simple value-iteration update: it takes the old value and makes a correction based on the new information.
:Q_{t+1}(s_t, a_t) = \underbrace{Q_t(s_t, a_t)}_{\text{old value}} + \underbrace{\alpha_t(s_t, a_t)}_{\text{learning rate}} \cdot \left( \overbrace{\underbrace{R_{t+1}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a} Q_t(s_{t+1}, a)}_{\text{estimate of optimal future value}}}^{\text{learned value}} - \underbrace{Q_t(s_t, a_t)}_{\text{old value}} \right)
where ''R_{t+1}'' is the reward observed after performing a_t in s_t, and where \alpha_t(s, a) (0 < \alpha \le 1) is the learning rate (may be the same for all pairs).
An episode of the algorithm ends when state s_{t+1} is a final state (or "absorbing state"). However, Q-learning can also learn in non-episodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.
Note that for all final states s_f, Q(s_f, a) is never updated and thus retains its initial value. In most cases, Q(s_f,a) can be taken to be equal to zero.
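Putting the pieces together, the following is a minimal tabular sketch of a learning loop built around the update rule above. The environment object with its reset/step interface, the ε-greedy exploration scheme, and the parameter values are assumptions made for illustration; they are not prescribed by the article.
<syntaxhighlight lang="python">
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch.

    Assumes a hypothetical environment with env.reset() -> start state and
    env.step(state, action) -> (reward, next_state, done).
    """
    Q = defaultdict(float)  # Q[(state, action)], initialised to an arbitrary fixed value (0)

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection: explore with probability epsilon
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            reward, next_state, done = env.step(state, action)

            # Core value-iteration style update from the formula above:
            # old value plus learning rate times (learned value minus old value).
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q
</syntaxhighlight>
Setting best_next to zero when a final state is reached corresponds to the remark above that Q(s_f, a) is never updated and can be taken to be zero.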

Excerpt source: the free encyclopedia Wikipedia
Read the full "Q-learning" article on Wikipedia


