Solve problem using RL (Q-Learning specifically)


Job Description

The environment is already implemented; you need to implement the player with the learning algorithm.

See the attached PDF for full details.

We would like to implement the game Snakes & Ladders.
Goal
Build an agent that can cope with a specific Markov Decision Process (MDP) using Reinforcement Learning.
This MDP models a version of the game Snakes & Ladders with stochastic snakes and ladders and without real dice.

In this game, the player moves on a one-dimensional board of squares of a given size, and in every move the player
chooses the number of steps he would like to move forward or backward. An action of 0 steps is possible. After a move
to square x, the stochastic snakes-and-ladders system of that square is activated, and the player is transferred to a
certain square on the board in accordance with the transition function that stands behind this stochastic system.
An example of such a transition function is sketched below.
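
For intuition, such a stochastic system can be thought of as one probability distribution per square. The table below
is a purely hypothetical illustration, not one of the supplied boards:

    # Hypothetical 5-square board (squares 0..4). After landing on square s,
    # the player is moved to square t with probability P[s][t].
    # This is NOT one of the supplied boards, only an illustration.
    P = [
        {0: 1.0},            # square 0: no snake or ladder
        {1: 0.7, 3: 0.3},    # square 1: a stochastic ladder up to square 3
        {2: 0.5, 0: 0.5},    # square 2: a stochastic snake down to square 0
        {3: 1.0},            # square 3: neutral
        {4: 1.0},            # square 4 (last): a move here sticks w.p. 1
    ]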
More rules:
* The game is limited to a known number of moves, and the goal is to accumulate as many prizes as possible throughout the game.
* When the game starts, a "higher power" positions the player on the first square of the board.
* The prize is only given at the last square of the board, and this last square is also special in 2 more aspects:
  - If the player makes a move to the last square, he stays there (hence getting the prize) with probability 1.
  - When a player's move ends in the last square, the player picks up his prize, is transferred by a "higher power"
    to the first square of the board, and continues the game from there.
* To make it clearer: the state of the player is fully described by the square he is in, and the player's action picks the
  number of steps and the direction from the current square (limited by a given range and also by the structure of the
  board); a sketch of the legal-action computation appears right after this list.
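
As a concrete illustration of that last rule, the set of legal actions from a square can be computed as follows
(a minimal sketch; it assumes squares are numbered 0 to board_size-1, matching the next_move interface described
under Implementation):

    def legal_actions(square_num, board_size, max_p_step, max_n_step):
        # A move of k steps is legal when it keeps the player on the board:
        # 0 <= square_num + k <= board_size - 1,
        # with k limited to the range [-max_n_step, max_p_step].
        lo = max(-max_n_step, -square_num)
        hi = min(max_p_step, board_size - 1 - square_num)
        return list(range(lo, hi + 1))

    # Example: legal_actions(8, 10, 3, 2) -> [-2, -1, 0, 1]
    # (from square 8 of a 10-square board, at most 1 step forward is possible)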

General Architecture of the system:
The system you will implement will communicate iteratively with the code that we supply. Our code simulates the
environment of the player, meaning the game board. Starting from the initial state (the player is on the first square
and has accumulated no prizes), in every iteration you choose an action to perform (the number of steps that the player
should move backward or forward, or remain in position), pass it to the environment code, and get back the updated
state of the player on the board and the accumulated prize. Technically, the state is represented by the number of the
square the player is in, and the accumulated prize is a natural number (accumulated prize >= 0). The goal of the player
is to accumulate as many prizes as he can.
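
Conceptually, the interaction looks like the loop below. The real driver lives in the supplied snl.py and check.py, so
EnvStub and its step() method are placeholders for illustration only, not their actual interface; Player is the class
sketched under Implementation:

    # Illustration of the protocol only -- the supplied code runs the real loop.
    board_size, max_p_step, max_n_step, max_steps = 10, 3, 2, 100  # example values

    class EnvStub:
        def step(self, action):
            # A real environment would apply the chosen move, then the square's
            # stochastic snakes-and-ladders system, then the last-square rules,
            # and would track the accumulated prize.
            return 0, False  # (square_num, new_round)

    env = EnvStub()
    player = Player(board_size, max_p_step, max_n_step, max_steps)
    square_num, new_round = 0, False   # initial state: first square, no prize
    for _ in range(max_steps):
        action = player.next_move(square_num, new_round)  # int in [-max_n_step, max_p_step]
        square_num, new_round = env.step(action)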

Implementation
You need to implement a class named Player inside the file ex3.py. This class should contain at least the following 2 functions:
* __init__(self, board_size, max_p_step, max_n_step, max_steps)
This is the constructor of the class and it gets the following parameters:
1. board_size - the size of the game board
2. max_p_step - the maximum size of a forward move
3. max_n_step - the maximum size of a backward move
4. max_steps - the number of moves that the player will perform in a game.

All the above parameters are non-negative. This function has to initialize the learning algorithm that you should implement.
Attention: this function will be executed with a time limit of 5 seconds, and it must finish within that limit.

The current time in Python (represented as the number of seconds since 1.1.1970) is obtained by calling time.time().
It is recommended to call this function at the beginning of __init__, and then to call it again at points that are
important to you during runtime, in order to calculate the time remaining to complete the calculations. A constructor
along these lines is sketched below.
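
A minimal sketch of such a constructor; the Q-table layout, the hyperparameters, and the 4.5-second training cutoff
are assumptions of this sketch, not part of the spec:

    import time

    class Player:
        def __init__(self, board_size, max_p_step, max_n_step, max_steps):
            self.start_time = time.time()  # anchor for the 5-second budget
            self.board_size = board_size
            self.max_p_step = max_p_step
            self.max_n_step = max_n_step
            self.max_steps = max_steps
            # One Q-value per (square, step) pair; a step k is indexed by
            # k + max_n_step, so index 0 is the largest backward move.
            self.n_actions = max_p_step + max_n_step + 1
            self.Q = [[0.0] * self.n_actions for _ in range(board_size)]
            # Assumed hyperparameters, used by the next_move sketch below.
            self.alpha, self.gamma, self.epsilon = 0.1, 0.95, 0.1
            self.prev_state, self.prev_action = None, None
            # Any offline pre-training must respect the remaining budget, e.g.:
            # while time.time() - self.start_time < 4.5:
            #     ...train against a simulated board...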

* next_move(self, square_num, new_round)
This function is used to update the current state (the one received after performing the previous action), to choose an
action, and to learn (unless new_round=True; discussed below). The test program will call this function with:
- square_num - the current position of the player on the board.
- new_round - a flag that is True if and only if square_num=0 as a result of the player being transferred to the first
  square by a "higher power".
This function should return an action to perform in the given state. The action is represented by an integer in the
range from (-1)*max_n_step to max_p_step, where 0 symbolizes the action of remaining in position. A sketch of such a
function appears below.
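
A sketch of next_move combining an epsilon-greedy policy with the tabular Q-learning update. The learning rate,
discount factor, exploration rate, and the reward signal are assumptions; in particular, inferring a reward of 1 from
new_round=True is only one possible reading of the rules above (the prize is collected exactly when the player is
teleported back to the first square), so check the attached PDF before relying on it:

    import random

    class Player:
        # ... __init__ as sketched above ...

        def next_move(self, square_num, new_round):
            # Assumed reward: a prize was just collected iff the player has
            # been teleported from the last square to square 0 (new_round=True).
            reward = 1 if new_round else 0
            if self.prev_state is not None:
                # Tabular Q-learning update:
                #   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
                # (for simplicity the max runs over all actions, including ones
                # that are illegal in square_num and keep their initial value 0)
                s, a = self.prev_state, self.prev_action
                target = reward + self.gamma * max(self.Q[square_num])
                self.Q[s][a] += self.alpha * (target - self.Q[s][a])
            # Legal steps: stay on the board, within [-max_n_step, max_p_step].
            lo = max(-self.max_n_step, -square_num)
            hi = min(self.max_p_step, self.board_size - 1 - square_num)
            legal = list(range(lo, hi + 1))
            # Epsilon-greedy choice over the legal actions.
            if random.random() < self.epsilon:
                step = random.choice(legal)
            else:
                step = max(legal, key=lambda k: self.Q[square_num][k + self.max_n_step])
            self.prev_state = square_num
            self.prev_action = step + self.max_n_step
            return step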

Additional files
The code you receive (snl.py and check.py) is the environment simulator: it will create an object of the class Player
that you should implement, and execute it as explained above. You should not rely on the boards that we supply; this
code merely runs the player you have built on 2 boards. It is recommended to experiment with other boards of different
sizes and ....
