
Tutoring Case - CSCI 360


CSCI 360 – Project #3

1 Introduction

In this project, you will learn about MDPs by playing Frozen Lake (https://gym.openai.com/envs/FrozenLake-v0/). The description of Frozen Lake from the OpenAI website is as follows:

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

Markov Decision Process Formulation

We formalize the problem as a Markov Decision Process (MDP). Before diving into the definitions of states, actions, transition probabilities, and rewards, we introduce the slightly different notation used in this project and the changes you need to make in order to apply the algorithms you learned in class to this project.

Notation

In the lecture, we defined the cost c(s) = c_i of state s = s_i as

    c(s) = \min_a \sum_{s'} p(s' \mid s, a) \bigl( c(s, a, s') + \gamma\, c(s') \bigr),    (1)

where c(s, a, s') is the action cost that you incur when you start in state s, execute action a, and transition to state s'. c(s) is the minimum expected total discounted cost that you incur when you start in state s and behave optimally. We further define the q-value q(s, a) of state s and action a as

    q(s, a) = \sum_{s'} p(s' \mid s, a) \bigl( c(s, a, s') + \gamma\, c(s') \bigr).    (2)

q(s, a) is the minimum expected total discounted cost that you incur when you start in state s, execute action a, and then behave optimally. Then, c(s) can be written as

    c(s) = \min_a q(s, a).    (3)

In this project, we deal with action rewards r(s, a, s') instead of action costs c(s, a, s'), where r(s, a, s') is the action reward that you incur when you start in state s, execute action a, and transition to state s'. Costs are simply negative rewards, so the objective changes from minimizing costs to maximizing rewards. We define the value v(s) of state s similarly to c(s) as

    v(s) = \max_a \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma\, v(s') \bigr),    (4)

and the q-value q(s, a) of state s and action a as

    q(s, a) = \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma\, v(s') \bigr).    (5)

In other words, except for the different notation, we only changed the minimization to a maximization in the value iteration, policy iteration, and q-learning algorithms learned in class.
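To see concretely why switching from costs to rewards only turns the minimization into a maximization, substitute c(s, a, s') = -r(s, a, s') and c(s) = -v(s) into Equation (1); the short derivation below uses only the symbols defined above and the identity min_a f(a) = -max_a (-f(a)):

    -v(s) = \min_a \sum_{s'} p(s' \mid s, a) \bigl( -r(s, a, s') - \gamma\, v(s') \bigr)
          = -\max_a \sum_{s'} p(s' \mid s, a) \bigl( r(s, a, s') + \gamma\, v(s') \bigr).

Multiplying both sides by -1 gives exactly Equation (4), and the same substitution turns Equation (2) into Equation (5).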
         y=0  y=1  y=2  y=3
    x=0  S1   F    F    S2
    x=1  F    H    F    H
    x=2  F    H    F    F
    x=3  F    R    H    G

    Figure 1: 4×4 Frozen Lake Example (x indexes rows from top to bottom, y indexes columns from left to right)

States

We show an example of a 4×4 frozen lake in Figure 1. The states are locations. The letters in the locations mean the following:

• S1 and S2 are start states. You will be placed with uniform probability in one of the start states when the game starts. Such states are safe; the reward for visiting them is zero.
• F is a state with a frozen surface. Such states are safe; the reward for visiting them is zero.
• R is a state with a reward. Such states are safe; the reward for visiting them is positive.
• H is a terminal state with a hole, where you lose (by falling into the lake) and the game terminates. The reward for visiting such states is negative.
• G is a terminal state with a frisbee, where you win (by retrieving the frisbee) and the game terminates. The reward for visiting such states is positive.

There can be multiple states of each kind in a map. We use S1(0, 0) to denote a start state at location (x = 0, y = 0) and G(3, 3) to denote the winning terminal state at location (x = 3, y = 3).

Actions

In each state, you can choose to move UP, DOWN, LEFT, or RIGHT, except for actions that would take you off the map. For example, in S2(0, 3), you cannot choose to move UP or RIGHT.

Transition Probabilities

The transition probability p(s'|s, a) encodes the probability with which you transition to state s' when you start in state s and execute action a. Since the ice is slippery, you move in an unintended direction with slip probability p. (If p = 0, then actions have deterministic rather than stochastic effects.) We assume that you move in the intended direction with probability 1 − p and slip to the left and to the right (if possible) with equal probability. For example, if you are in state F(1, 2) and choose to move DOWN, then you transition to F(2, 2) with probability 1 − p, to H(1, 1) with probability p/2, and to H(1, 3) with probability p/2. If you are in state S2(0, 3) and you choose to move LEFT, then you transition to F(0, 2) with probability 1 − p and to H(1, 3) with probability p.

Rewards

The rewards for visiting states R or G are positive, the rewards for visiting states S and F are zero, and the rewards for visiting states H are negative. Different maps have different rewards. The action reward r(s, a, s') is the reward for visiting state s'.

Objective

You want to maximize the expected total discounted reward for different frozen lakes using value iteration, policy iteration, and q-learning. Value iteration and policy iteration apply when the dynamics of the MDP are known a priori, that is, you know all transition probabilities and rewards before executing any actions. This allows you to predict the outcomes of action executions and enables you to plan offline, that is, before executing any actions. Q-learning can be used even when the dynamics of the MDP are unknown. You are then not able to predict the outcomes of action executions and have to execute actions to learn the dynamics of the MDP, which can be dangerous because you are (at least initially) not able to predict whether you might fall into the lake.

2 Tasks

2.1 Installing Python 3

The frozen lake environment is written in Python 3. It requires a Python 3 installation with numpy. You can ignore this part if you kept your Project 2 development environment.

On Ubuntu Linux, install Python by running:

    sudo apt install python3 python3-numpy

On Mac OS X, download Anaconda 2019.03 with Python 3.7 from https://www.anaconda.com/distribution/#download-section and follow the installer. You can verify your installation by running:

    python3 --version

On Windows, download Anaconda 2019.03 with Python 3.7 from https://www.anaconda.com/distribution/#download-section.

On Ubuntu Linux and Mac OS X, use python3 to run Python. On Windows, use python instead.

2.2 Experiments

Download the project folder and unzip it. Change the directory with cd to the frozen_lake folder and run:

    python3 play_game.py --map maps/bridge.json

You should see a GUI like the following:

    Figure 2: Example Frozen Lake Environment

The top two lines contain metadata, including the slip probability p and the mapping from states to the rewards for visiting them. You start a game episode by clicking RESET. Then you play the game by clicking the different MOVE buttons. Your current state is always highlighted in yellow.
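Before turning to the questions, it may help to see Equation (5) written out for a single slip-model transition. The numbers below (slip probability p = 0.2, discount γ = 0.9, and all rewards and successor values) are made up for illustration and do not correspond to any of the provided maps. Suppose that in some state s, action DOWN reaches successor s_d (reward 0, value 10) when you do not slip, and slips to s_l (reward -5, value 0) or s_r (reward 0, value 2). Then

    q(s, DOWN) = (1 - p)(0 + \gamma \cdot 10) + (p/2)(-5 + \gamma \cdot 0) + (p/2)(0 + \gamma \cdot 2)
               = 0.8 \cdot 9 + 0.1 \cdot (-5) + 0.1 \cdot 1.8
               = 6.88.

The value v(s) in Equation (4) is then the maximum of such q-values over the actions available in s.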
2.3 Questions [5 pts]

First, review the book chapters 17.2, 17.3, and 21.3 and the lecture slides. Always show all your work, not just your answers to the following questions. When specifying a policy, you only need to show the actions for non-terminal states.

2.3.1 bridge.json

Run python3 play_game.py --map maps/bridge.json and answer the following question for γ = 0.99:

• What is the optimal policy (that is, the mapping from states to actions) and the value v(S) of start state S? (You can compute the optimal policy backward from the winning terminal states.)

2.3.2 bridge_stochastic.json

Run python3 play_game.py --map maps/bridge_stochastic.json and answer the following questions for γ = 1.0:

• Which winning terminal state (G1 or G2) should you choose to move to if you are at S(1, 1)? Which winning terminal state (G1 or G2) should you choose to move to if you are at F(1, 3)? Explain why.
• What are the q-values q(S(1, 1), LEFT) of the start state S(1, 1) and action LEFT and q(F(1, 4), RIGHT) of state F(1, 4) and action RIGHT?
• What are the values v(S(1, 1)), v(F(1, 2)), v(F(1, 3)), and v(F(1, 4))?
2.3.3 bridge_near_stochastic.json

Run python3 play_game.py --map maps/bridge_near_stochastic.json and answer the following questions for γ = 1.0:

• What is the optimal policy?
• What is a value for the discount factor γ such that the optimal policy changes, and what is the optimal policy for this value of γ? Explain why the optimal policy changes using your understanding of the effect of the discount factor.

2.3.4 frozen_lake_8x8.json

Run python3 play_game.py --map maps/frozen_lake_8x8.json --not_fully_visible and answer the following question for γ = 0.99. The flag --not_fully_visible hides the dynamics from you:

• What exploration strategy did you use when playing this frozen lake environment? Remember that you want to achieve an expected total discounted reward that is as large as possible. This means that you need to balance between exploring actions and exploiting the currently seemingly best action.

3 Code Structure and APIs

The source code is in the src folder, and the test code is in the test folder. Try to understand the code in

• FrozenLake.hpp/cpp,
• ValueIterationAgent.hpp/cpp,
• PolicyIterationAgent.hpp/cpp, and
• QLearningAgent.hpp/cpp.

You probably want to look into the test folder when you debug. You are not required to look at any other support files.

3.1 APIs

The environment is implemented in FrozenLake.hpp/cpp. We implement two classes, FrozenLakeEnv and FrozenLakeMDP, where FrozenLakeMDP is a subclass of FrozenLakeEnv. You will work with FrozenLakeMDP when implementing value iteration and policy iteration and with FrozenLakeEnv when implementing q-learning. The only difference between them is that FrozenLakeMDP has access to all rewards and transition probabilities of the underlying MDP.

The APIs of FrozenLakeEnv are:

• GameState reset(), which resets the environment and returns your start state with uniform probability;
• GameState getNextState(const GameState &state, const Action &action), which samples the successor state for the given state and action;
• std::vector<Action> getPossibleActions(const GameState &state) const, which returns a vector of actions that can be executed in the given state;
• double getReward(const GameState &state, const Action &action, const GameState &nextState) const, which returns the reward when the given successor state results from executing the given action in the given state (here: the reward for visiting the given successor state); and
• bool isTerminal(const GameState &state) const, which returns whether the given state is a terminal state.

There are two additional APIs for FrozenLakeMDP:

• std::map<GameState, double> getTransitionStatesAndProbs(const GameState &state, const Action &action) const, which returns a mapping from states to the transition probabilities with which the execution of the given action in the given state results in those states as the successor states; and
• std::set<GameState> getStates() const, which returns a set of all states.

The above APIs are all the APIs that you need for your code.
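As an illustration of how these calls fit together, the sketch below computes the one-step backup from Equations (4) and (5) against the FrozenLakeMDP interface. It is not part of the project skeleton: the caller-maintained value table and the assumption that terminal successors contribute no future value are choices made only for this example.

    #include "FrozenLake.hpp"

    #include <algorithm>
    #include <limits>
    #include <map>
    #include <vector>

    // Illustrative helper: the q-value of (state, action) from Equation (5),
    // computed with the FrozenLakeMDP API and a caller-maintained value table.
    // "values" maps every state to its current value estimate v(s).
    double computeQValue(const FrozenLakeMDP &mdp, const std::map<GameState, double> &values,
                         const GameState &state, const Action &action, double gamma) {
      double q = 0.0;
      // One entry per possible successor s', holding p(s'|s, a).
      for (const auto &entry : mdp.getTransitionStatesAndProbs(state, action)) {
        const GameState &nextState = entry.first;
        double prob = entry.second;
        double reward = mdp.getReward(state, action, nextState);
        // Assumption of this sketch: terminal successors have no future value.
        double futureValue = mdp.isTerminal(nextState) ? 0.0 : values.at(nextState);
        q += prob * (reward + gamma * futureValue);
      }
      return q;
    }

    // Illustrative helper: v(s) = max_a q(s, a) over the actions allowed in s.
    double computeStateValue(const FrozenLakeMDP &mdp, const std::map<GameState, double> &values,
                             const GameState &state, double gamma) {
      if (mdp.isTerminal(state)) return 0.0;
      double best = -std::numeric_limits<double>::infinity();
      for (const Action &action : mdp.getPossibleActions(state)) {
        best = std::max(best, computeQValue(mdp, values, state, action, gamma));
      }
      return best;
    }

Value iteration (Section 4) essentially repeats such backups over all states returned by getStates() until no value changes by more than the error threshold or the iteration limit is reached.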
3.2 Building the Project

Build the project as follows:

    cd frozen_lake
    mkdir build; cd build; cmake ../
    make

which will create two targets, namely frozen_lake and run_test. The code has been tested on macOS Catalina 10.15.1, Ubuntu 18.04, and Windows 10 using Visual Studio.

frozen_lake is in the src folder. ./frozen_lake --help prints the help message. The program takes an agent, a map, and a set of parameters. It trains the agent on the map, tests the agent on the same map for test_episodes test episodes, and records the average and standard deviation of the total discounted reward over all episodes. For example,

    ./frozen_lake --agent v --map ../maps/cliff_stochastic.json --theta 1e-7 --iteration 50 --gamma 0.99 --test_episodes 10000

loads map ../maps/cliff_stochastic.json and solves the MDP using value iteration with a limit of 50 iterations, error threshold ε = 1e-7, and discount factor γ = 0.99. Afterwards, it tests on the same map for 10,000 episodes and prints the mean and standard deviation of the total discounted reward over all episodes.

run_test is in the test folder. It runs basic tests to verify your implementation. We will run these tests but also additional tests (for example, to catch attempts to hard-code policies) while grading.

4 Value Iteration [25 pts]

In this part, you will implement the ValueIterationAgent class and test it on various FrozenLakeMDP configurations. The constructor of the ValueIterationAgent class is:

    ValueIterationAgent::ValueIterationAgent(FrozenLakeMDP const &mdp, double gamma,
                                             int iterations, double threshold)
        : ValueEstimateAgent(gamma, iterations, threshold), m_mdp(mdp) {
      initialize();
      solve();
    }

The arguments of the constructor are:

• FrozenLakeMDP const &mdp, which is a frozen lake MDP instance defined in FrozenLake.hpp;
• double gamma, which is the discount factor;
• int iterations, which is the iteration limit; and
• double threshold, which is the error threshold.

The constructor calls initialize(), where you should initialize any variables that you want to define, and solve(), where you should solve the MDP. Your task is to implement the class methods marked with TODO in ValueIterationAgent.cpp:

• double ValueIterationAgent::getValue(const GameState &state), which returns the value of the given state;
• double ValueIterationAgent::getQValue(const GameState &state, const Action &action), which returns the q-value of the given state and the given action;
• Action ValueIterationAgent::getPolicy(const GameState &state), which returns the action to execute in the given state;
• void ValueIterationAgent::solve(), which solves the MDP with value iteration and stores the optimal policy in a private class member. The autograder will call getValue, getQValue, and getPolicy to verify its correctness. Value iteration should terminate once it reaches the given iteration limit m_iterations or the absolute change of the value of each state from one iteration to the next is less than the given error threshold m_threshold; and
• void ValueIterationAgent::initialize(), which initializes the internal variables. (You need to declare them as private class members before initializing them.)

You can run all value iteration tests using ./run_test --gtest_filter="ValueIteration*". You can run individual tests using CLion.

5 Policy Iteration [25 pts]

In this part, you will implement the PolicyIterationAgent class. The class methods look very similar to the ones of value iteration, with the following differences:

• You are not allowed to create any new class members. The optimal policy is stored in std::map<GameState, Action> m_policy. Remember to initialize this data structure in void PolicyIterationAgent::initialize().
• Implement the helper function virtual std::map<GameState, double> evaluateCurrentPolicy(), which evaluates the current policy and returns a map that stores the value of each state. Instead of solving linear equations using Gaussian elimination, you can choose to implement the iterative approach discussed in the lecture. The iterative approach should terminate when the absolute change of the value of each state from one iteration to the next is less than the given error threshold m_threshold.
• Policy iteration in void PolicyIterationAgent::solve() should terminate once it reaches the given iteration limit m_iterations or the policy does not change from one iteration to the next.

You can run all policy iteration tests using ./run_test --gtest_filter="PolicyIteration*".
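For the iterative variant of evaluateCurrentPolicy(), one possible shape is sketched below. It is only meant to show how the termination rule above plays out against the FrozenLakeMDP API; the function name, the zero initialization, and the zero value for terminal states are assumptions of this sketch, and in PolicyIterationAgent you would use the class members (m_mdp, m_policy, m_threshold, and the stored discount factor) instead of the parameters shown here.

    #include "FrozenLake.hpp"

    #include <algorithm>
    #include <cmath>
    #include <map>

    // Illustrative iterative policy evaluation: given a fixed policy (one action per
    // non-terminal state), repeatedly apply the backup
    //   v(s) <- sum_{s'} p(s'|s, pi(s)) (r(s, pi(s), s') + gamma * v(s'))
    // until no state value changes by threshold or more.
    std::map<GameState, double> evaluatePolicySketch(const FrozenLakeMDP &mdp,
                                                     const std::map<GameState, Action> &policy,
                                                     double gamma, double threshold) {
      std::map<GameState, double> values;
      for (const GameState &s : mdp.getStates()) values[s] = 0.0;  // start from all zeros

      double maxChange;
      do {
        maxChange = 0.0;
        std::map<GameState, double> newValues = values;
        for (const GameState &s : mdp.getStates()) {
          if (mdp.isTerminal(s)) continue;       // terminal states keep value 0
          const Action &a = policy.at(s);        // the action the current policy picks in s
          double v = 0.0;
          for (const auto &entry : mdp.getTransitionStatesAndProbs(s, a)) {
            const GameState &next = entry.first;
            double prob = entry.second;
            v += prob * (mdp.getReward(s, a, next) + gamma * values.at(next));
          }
          newValues[s] = v;
          maxChange = std::max(maxChange, std::fabs(v - values.at(s)));
        }
        values = newValues;
      } while (maxChange >= threshold);

      return values;
    }

solve() then alternates this evaluation with a greedy improvement step, stopping once the policy no longer changes or the iteration limit m_iterations is reached.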
6 Q-Learning [40 pts]

In this part, you will implement the QLearningAgent class. We have already implemented void QLearningAgent::solve() for you, which implements the following learning procedure for m_iterations iterations:

1. Reset the environment to obtain your start state.
2. Call getAction to obtain the next action to execute.
3. Call m_env.getNextState(state, action) to obtain the resulting successor state.
4. Call m_env.getReward(state, action, nextState) to obtain the resulting reward.
5. Call update(state, action, nextState, reward) to update the corresponding q-value. (You need to implement it.)
6. If the successor state is not a terminal state, then go to 2.
7. Evaluate the current policy and record its expected total discounted reward into a file so that you can plot it later. Do this by calling getPolicy instead of getAction. getPolicy executes the current policy, which consists of the seemingly best action in every state, while getAction also explores other actions.

We use the ε-greedy exploration strategy, where you choose a random action in your current state s with probability ε and the currently seemingly best action argmax_a q(s, a) with probability 1 − ε.
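For orientation, the sketch below shows one common way to realize the q-value table, the temporal-difference update behind step 5, and ε-greedy action selection. It is not the provided skeleton: the q-table layout, the random-number generator, and the zero value for terminal successors are assumptions, while the update(state, action, nextState, reward) signature, getPossibleActions, and isTerminal come from the APIs above.

    #include "FrozenLake.hpp"

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <map>
    #include <random>
    #include <utility>
    #include <vector>

    // Illustrative q-learning pieces: a q-table keyed by (state, action), the
    // standard temporal-difference update
    //   q(s,a) <- q(s,a) + alpha * (r + gamma * max_a' q(s',a') - q(s,a)),
    // and epsilon-greedy action selection. Unknown (state, action) pairs default to 0.
    struct QLearningSketch {
      std::map<std::pair<GameState, Action>, double> qTable;
      FrozenLakeEnv &env;
      double alpha, gamma, epsilon;
      std::mt19937 rng{std::random_device{}()};

      double getQValue(const GameState &s, const Action &a) {
        auto it = qTable.find({s, a});
        return it == qTable.end() ? 0.0 : it->second;
      }

      // max_a' q(s, a'); terminal states have no future value in this sketch.
      double getValue(const GameState &s) {
        if (env.isTerminal(s)) return 0.0;
        double best = -std::numeric_limits<double>::infinity();
        for (const Action &a : env.getPossibleActions(s))
          best = std::max(best, getQValue(s, a));
        return best;
      }

      // One q-learning update after observing (s, a, s', r).
      void update(const GameState &s, const Action &a, const GameState &next, double reward) {
        double sample = reward + gamma * getValue(next);
        double q = getQValue(s, a);
        qTable[{s, a}] = q + alpha * (sample - q);
      }

      // Epsilon-greedy: random legal action with probability epsilon, greedy action otherwise.
      Action getAction(const GameState &s) {
        std::vector<Action> actions = env.getPossibleActions(s);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng) < epsilon) {
          std::uniform_int_distribution<std::size_t> pick(0, actions.size() - 1);
          return actions[pick(rng)];
        }
        Action best = actions.front();
        for (const Action &a : actions)
          if (getQValue(s, a) > getQValue(s, best)) best = a;
        return best;
      }
    };

In QLearningAgent itself, the learning rate and exploration probability are the members m_alpha and m_epsilon described below, and the environment is m_env.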
Your task is to implement getValue, getQValue, getPolicy, getAction with the ε-greedy exploration strategy, update, and initialize. You can declare any variables inside the class. The class has two more members as hyperparameters:

• m_alpha, which is the learning rate of q-learning; and
• m_epsilon, which is the probability of choosing a random action in the ε-greedy exploration strategy.

We provide the performance of our code on various maps for your reference, but please note that the results of q-learning vary across different runs even with the same hyperparameters. The provided performance was achieved for most of our runs.

• ./frozen_lake --agent q --map ../maps/bridge.json --epsilon 0.99 --alpha 1.0 --iteration 500 --gamma 0.99 should achieve an expected total discounted reward of 9.703.
• ./frozen_lake --agent q --map ../maps/bridge_stochastic.json --epsilon 0.2 --alpha 0.05 --iteration 500 --gamma 0.99 should achieve an expected total discounted reward in the range [−20, −18].
• ./frozen_lake --agent q --map ../maps/bridge_near_stochastic.json --epsilon 0.2 --alpha 0.1 --iteration 200 --gamma 0.99 should achieve an expected total discounted reward in the range [70, 90].
• ./frozen_lake --agent q --map ../maps/cliff_stochastic.json --epsilon 0.2 --alpha 0.2 --iteration 100 --gamma 0.99 should achieve an expected total discounted reward of around 0.94.

Subtask 1: Experiment with α

Run frozen_lake_8x8.json with γ = 0.99, ε = 0.1, and α = 0.01, 0.05, 0.1, 0.2, and 0.4 for 2,000 iterations and plot a figure with 5 graphs, where the x-axis is the iteration and the y-axis is the expected total discounted reward during the evaluation of the policy (since the expected total discounted reward while learning the policy can be lower due to the exploration strategy). Each graph should be the average over 3 learning episodes since you might move differently during each learning episode. Make sure to clearly label each graph with the value of α.

Run value iteration (or policy iteration) on this environment and report the difference in expected total discounted reward between the q-learning runs with different values of α and value iteration (or policy iteration). Summarize your findings of how α affects the results of q-learning and why q-learning on large maps is difficult.

After learning, QLearningAgent will output a csv file named result.csv in the current working directory. You can either open it with Microsoft Excel or load it with the Python package pandas and plot it with matplotlib.

Subtask 2: Experiment with ε

Run frozen_lake_8x8.json with γ = 0.99, α = 0.05, and ε = 0.01, 0.1, 0.2, and 0.5 for 2,000 iterations and plot a figure with 4 graphs, where the x-axis is the iteration and the y-axis is the expected total discounted reward. Each graph should be the average over 3 learning episodes. Make sure to clearly label each graph with the value of ε. Summarize your findings of how ε affects the results of q-learning.

Subtask 3: Experiment with Counting-Based Exploration

Implement the counting-based exploration strategy, which prefers to execute actions that have not been executed many times yet. It augments the q-value with a bonus based on the number of times N(s, a) that action a has been executed in state s:

    q'(s, a) = q(s, a) + \beta\, N(s, a)^{-1/2}    [1].

The modified q-update rule becomes

    q'(s, a) = q'(s, a) + \alpha \bigl( r(s, a, s') + \gamma \max_{a'} q'(s', a') - q'(s, a) \bigr),    (6)

and the optimal policy is argmax_a q'(s, a). You will need to

• maintain N(s, a) for each state s and action a;
• update N(s, a) in update(); and
• modify getAction() and getQValue() to include the counting-based exploration strategy and meanwhile disable ε-greedy exploration.

Run frozen_lake_8x8.json with γ = 0.99, α = 0.05, and different values of β ≥ 0 for 2,000 iterations and plot a figure with 2 graphs, where the x-axis is the iteration and the y-axis is the expected total discounted reward. One graph is for q-learning with the ε-greedy exploration strategy for the best value of ε that you found, and the other graph is for q-learning with the counting-based exploration strategy for the best value of β that you found. Each graph should be the average over 3 learning episodes. Make sure to clearly label each graph with the exploration strategy and the value of ε or β used. You should be able to achieve the optimal policy for an appropriate β. We will grade based on how close your q-learning performance with counting-based exploration comes to the optimal performance, which can be computed by value/policy iteration. Summarize your findings of how the exploration strategy affects the results of q-learning.
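The counting bonus can be layered on top of the same table idea as in the earlier q-learning sketch. The code below follows one reading of Equation (6): the stored table holds the plain q-values, N(s, a) counts how often a was executed in s, and the bonus βN(s, a)^{-1/2} is added whenever the augmented value q' is read, both for action selection and inside the TD target. The table layout and the N = 1 floor for unvisited pairs are assumptions of this sketch, not requirements of the handout.

    #include "FrozenLake.hpp"

    #include <algorithm>
    #include <cmath>
    #include <limits>
    #include <map>
    #include <utility>
    #include <vector>

    // Illustrative counting-based exploration on top of a q-table: N(s, a) counts
    // executions of a in s, and beta / sqrt(N(s, a)) is added whenever q' is read.
    struct CountingExplorationSketch {
      std::map<std::pair<GameState, Action>, double> qTable;  // plain q(s, a)
      std::map<std::pair<GameState, Action>, int> counts;     // N(s, a)
      FrozenLakeEnv &env;
      double alpha, gamma, beta;

      double plainQ(const GameState &s, const Action &a) const {
        auto it = qTable.find({s, a});
        return it == qTable.end() ? 0.0 : it->second;
      }

      // q'(s, a) = q(s, a) + beta * N(s, a)^(-1/2); unvisited pairs are treated as
      // N = 1 here so that the bonus stays finite (a choice made for this sketch).
      double bonusQ(const GameState &s, const Action &a) const {
        auto it = counts.find({s, a});
        int n = std::max(it == counts.end() ? 0 : it->second, 1);
        return plainQ(s, a) + beta / std::sqrt(static_cast<double>(n));
      }

      // Greedy in the augmented q-values; epsilon-greedy randomization is disabled.
      Action getAction(const GameState &s) const {
        std::vector<Action> actions = env.getPossibleActions(s);
        Action best = actions.front();
        for (const Action &a : actions)
          if (bonusQ(s, a) > bonusQ(s, best)) best = a;
        return best;
      }

      // Equation (6): TD error computed with augmented values, applied to the stored table.
      void update(const GameState &s, const Action &a, const GameState &next, double reward) {
        counts[{s, a}] += 1;
        double nextValue = 0.0;  // terminal successors contribute nothing
        if (!env.isTerminal(next)) {
          nextValue = -std::numeric_limits<double>::infinity();
          for (const Action &na : env.getPossibleActions(next))
            nextValue = std::max(nextValue, bonusQ(next, na));
        }
        double tdError = reward + gamma * nextValue - bonusQ(s, a);
        qTable[{s, a}] += alpha * tdError;
      }
    };

In your agent, N(s, a) would live next to the q-values as an additional private member and be incremented in update(), as the bullet list above prescribes.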
Submissions

You need to submit your solutions through the Blackboard system by Monday, November 25th, 11:59pm (which is the time when the grace period starts). For the programming part, submit your code in a zip file named "code.zip" and your report as a PDF file named "report.pdf" that contains the answers to all questions in Section 2.3 and the figures and insights for the tasks in Section 6. Upload the two files individually and not as an archive.

If you have any questions about the project, you can post them on Piazza or ask a TA directly during office hours. If you have a bug in your code that you are unable to find, you can ask a CP for help.

References

[1] Strehl, A. L., and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci. 74 (2008), 1309–1331.
