In this project, I will construct an optimized Q-Learning driving agent that will navigate a Smartcab through its environment towards a goal. Since the Smartcab is expected to drive passengers from one location to another, the driving agent will be evaluated on two very important metrics: Safety and Reliability. A driving agent that gets the Smartcab to its destination while running red lights or narrowly avoiding accidents would be considered unsafe. Similarly, a driving agent that frequently fails to reach the destination in time would be considered unreliable. Maximizing the driving agent's safety and reliability would ensure that Smartcabs have a permanent place in the transportation industry.
Safety and Reliability are measured using a letter-grade system as follows:
Grade | Safety | Reliability |
---|---|---|
A+ | Agent commits no traffic violations, and always chooses the correct action. | Agent reaches the destination in time for 100% of trips. |
A | Agent commits few minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 90% of trips. |
B | Agent commits frequent minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 80% of trips. |
C | Agent commits at least one major traffic violation, such as driving through a red light. | Agent reaches the destination on time for at least 70% of trips. |
D | Agent causes at least one minor accident, such as turning left on green with oncoming traffic. | Agent reaches the destination on time for at least 60% of trips. |
F | Agent causes at least one major accident, such as driving through a red light with cross-traffic. | Agent fails to reach the destination on time for at least 60% of trips. |
To help evaluate these important metrics, visualization code will be used later in the project. The code cell below imports this code, which is necessary for the upcoming analysis.
# Import the visualization code
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
Before implementing my driving agent, it's necessary to first understand the world (environment) in which the Smartcab and driving agent operate. A major component of building a self-learning agent is understanding its characteristics, including how the agent operates. To begin, simply run the agent.py agent code in its "No Learning" state: open agent.py, scroll to the bottom to the run() method, comment out the active lines of code within this method, and uncomment the subsequent lines as per the instructions for each active line of code.
Then save and run the code either in a Jupyter Notebook from the top-level directory where this notebook is located using:
'run smartcab/agent.py'
or from the same top-level directory in a terminal window using:
'python smartcab/agent.py'
Let the resulting simulation run for some time to see the various working components. Note that in the visual simulation (if enabled), the white vehicle is the Smartcab.
When running the default agent.py agent code, some questions to consider:
Answer:
The white Smartcab does not move very often; over a span of about 5 minutes the cab remains in one spot, and is then randomly placed somewhere else.
The warning "!!Agent state not updated!" never changes. Underneath the heading Training Trial is the unchanging red warning phrase "Previous Trial: Failure". The rest of the grid changes with each update.
Underneath "!!Agent state not updated!" is printed the phrase "No action taken." that is either colored red when there is a negative reward for not going through a green light, or green for sitting idle during a red light. It is followed by the value of the positive or negative reward. If the light is green and the smartcab acts correctly without causing a violation or an accident there is a positive reward. There is also an additional reward for following the waypoint. Not following the waypoint when acting on a green can still yield positive reward.
Underneath that phrase is another phrase that describes the action the cab took and is colored red, yellow, or green depending on the appropriateness of the action.
Underneath that phrase is the statement "Agent not enforced to meet deadline."
In addition to understanding the world, it is also necessary to understand the code itself that governs how the world, simulation, and so on operate. Attempting to create a driving agent would be difficult without having at least explored the "hidden" devices that make everything work. In the /smartcab/ top-level directory, there are two folders: /logs/ (which will be used later) and /smartcab/. Open the /smartcab/ folder and explore each Python file included to understand how the code is structured.
From agent.py, consider the following flags:

- 'update_delay': continuous time (in seconds) between actions; the default is 2.0 seconds.
- 'display': set to False to disable the GUI if PyGame is enabled.
- 'log_metrics': set to True to log trial and simulation results to /logs/.
- 'learning': set to True to force the Learning Agent to use the Q-Learning table.
- 'enforce_deadline': set to True to enforce the deadline metric.
- 'alpha': continuous value for the learning rate in the update equation

  $\hspace{80px} Q(state, action) = (1 - \alpha) * Q(state, action) + \alpha * Reward$

  which determines what fraction of the newest reward is used to update the Q-Learning table.
- 'epsilon': continuous value for the exploration factor. I ultimately landed on the following form of the sigmoid function for this:

  $\hspace{80px}\epsilon = 1 - \frac{1}{1+e^{-0.02\,(num\_trials-250)}}$

  where the factors 0.02 and 250 were arrived at through repeated testing. Epsilon starts at 1 and, as the number of trials increases, the sigmoid function gradually decreases epsilon toward 0. In the choose_action() method, epsilon governs whether the Learning Agent takes a random action as its next step or uses the Q-Learning table to choose the next action. As epsilon decreases, this choice transitions from randomly chosen actions to learned actions (a small standalone sketch of this decay follows the list below).
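Purely as an illustration, here is a minimal standalone sketch of that sigmoid decay; the function name, default parameters, and printed trial numbers are my own choices, not the project's:

```python
import math

def sigmoid_epsilon(num_trials, midpoint=250, steepness=0.02):
    """epsilon = 1 - 1 / (1 + e^(-steepness * (num_trials - midpoint)))."""
    return 1.0 - 1.0 / (1.0 + math.exp(-steepness * (num_trials - midpoint)))

# Epsilon starts near 1 (mostly random exploration) and decays toward 0 (mostly learned actions).
for t in (0, 100, 250, 400, 500):
    print(t, round(sigmoid_epsilon(t), 3))
```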
From environment.py:

From simulator.py:

'render_text()' - "This is the non-GUI render display of the simulation." This function updates the various metrics according to the action that was taken and then determines violations and rewards. It also determines whether any accidents occurred, monitors the time against the deadline, and tracks the status of the agent: idling, driving, or learning.

'render()' - "This is the GUI render display of the simulation." This updates the display to show the positions of the cars, the traffic light statuses, and the environment.
From planner.py:

'planner.py' tests first for whether the destination is cardinally East-West of the current location.

The first step to creating an optimized Q-Learning driving agent is getting the agent to actually take valid actions. In this case, a valid action is one of None (do nothing), 'Left' (turn left), 'Right' (turn right), or 'Forward' (go forward). For the first implementation, I'll use the 'choose_action()' agent function to make the driving agent randomly choose one of these actions. Note that this method has access to several class variables that will help create this functionality, such as 'self.learning' and 'self.valid_actions'. Once implemented, I can run the agent file and simulation briefly to confirm that the driving agent is taking a random action each time step (a minimal sketch of this random choice is shown below).
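Assuming the valid actions match the project's 'self.valid_actions' list, a minimal standalone version of that random choice could look like this (the function name and the demo loop are illustrative only):

```python
import random

# Minimal stand-in for the agent's "No Learning" action choice.
# Assumption: the valid actions are None plus the three driving directions.
valid_actions = [None, 'forward', 'left', 'right']

def choose_random_action(actions):
    """Return one of the valid actions, chosen uniformly at random."""
    return random.choice(actions)

# Each time step, the non-learning agent simply does this:
for step in range(5):
    print(step, choose_random_action(valid_actions))
```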
To obtain results from the initial simulation, I will need to adjust the following flags:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file in /logs/.
- 'n_test' - Set this to 10 to perform 10 testing trials.

Optionally, I will disable the visual simulation (which can make the trials go faster) by setting the 'display' flag to False.
Once we have successfully completed the initial simulation (there should have been 20 training trials and 10 testing trials), run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!
# Load the 'sim_no-learning' log file from the initial simulation results
vs.plot_trials('sim_no-learning.csv')
Using the visualization above that was produced from this simulation, here's an analysis with observations about the driving agent:
The second step to creating an optimized Q-learning driving agent is defining a set of states that the agent can occupy in the environment. Depending on the input, sensory data, and additional variables available to the driving agent, a set of states can be defined for the agent so that it can eventually learn what action it should take when occupying a state. The condition of 'if state then action'
for each state is called a policy, and is ultimately what the driving agent is expected to learn. Without defining states, the driving agent would never understand which action is most optimal -- or even what environmental variables and conditions it cares about!
Inspecting the 'build_state()' agent function shows that the driving agent is given the following data from the environment:

- 'waypoint', which is the direction the Smartcab should drive leading to the destination, relative to the Smartcab's heading.
- 'inputs', which is the sensor data from the Smartcab. It includes:
  - 'light', the color of the light.
  - 'left', the intended direction of travel for a vehicle to the Smartcab's left. Returns None if no vehicle is present.
  - 'right', the intended direction of travel for a vehicle to the Smartcab's right. Returns None if no vehicle is present.
  - 'oncoming', the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
- 'deadline', which is the number of actions remaining for the Smartcab to reach the destination before running out of time.

Which features available to the agent are most relevant for learning both safety and efficiency? Why are these features appropriate for modeling the Smartcab in the environment? If you did not choose some features, why are those features not appropriate?
For learning safety, the key input from environment.sense() is the status of the light, since this is the central driver for determining violations. Running a red light also greatly increases the chances of causing an accident. Either of these will induce significant negative rewards, reduce reliability, and ultimately lead to failure.
As for the left, right, and oncoming inputs, these are all needed to determine whether the Smartcab causes an accident, which is laid out in the environment.act() method; without them the agent cannot learn to avoid accidents. Specifically, though, the Smartcab never needs to care about whether a car to its left is turning right, since no direction the cab travels can produce a collision with a car to the left that is turning right. Thus, we would only want to know whether the car to the left is going forward or left, so instead of 4 values for that input (none, forward, left, right) we would only need 3.
For efficiency, we will want the agent to make as few turns as possible so as to get to the destination as quickly as possible within the deadline. Clearly, we'll need the waypoints for this as well as the deadline clock.
These features are important for modeling the Smartcab in the environment because, for starters, the environment should be modeled independently of the Smartcab, so that the Smartcab has to learn to behave according to the rules of the environment as well as the behaviors of the other agents. In this way, the agent "looks" out into the environment via sense() to determine whether the light is green or red. These features are also the typical basics for this type of driving: although simplified here, traffic lights oscillate between red and green on a timed basis, and there are a variety of rules governing when it is acceptable to proceed and when we must wait, as well as rules for avoiding accidents and the like. All of this corresponds to a simplified version of the environment that any real cab would encounter.
When defining a set of states that the agent can occupy, it is necessary to consider the size of the state space. That is to say, if you expect the driving agent to learn a policy for each state, you would need to have an optimal action for every state the agent can occupy. If the number of all possible states is very large, it might be the case that the driving agent never learns what to do in some states, which can lead to uninformed decisions. For example, consider a case where the following features are used to define the state of the Smartcab:
('is_raining', 'is_foggy', 'is_red_light', 'turn_left', 'no_traffic', 'previous_turn_left', 'time_of_day')
.
How frequently would the agent occupy a state like (False, True, True, True, False, False, '3AM')
? Without a near-infinite amount of time for training, it's doubtful the agent would ever learn the proper action!
If a state is defined using the features selected from above, what would be the size of the state space? Given what we know about the evironment and how it is simulated, could the driving agent learn a policy for each possible state within a reasonable number of training trials?
Well, for next_waypoint there are 3 possibilities. For the traffic light there are 2 possibilities. Then for left, right, and oncoming there are 4 possibilities each: none, forward, left, right. This gives 3 x 2 x 4 x 4 x 4 = 384 states. We would then need the Smartcab to visit each of these at least 3-4 times to learn what the correct action should be, so we'd need to perform at least 1152-1536 actions to fully populate the state space.
However, I won't include the deadline here, as it would cause an explosion in the size of the state space. From Environment.reset() we see that a minimum of 25 time steps is allowed for each trial, so including the deadline would multiply the 384 states by at least 25, requiring more than 38,000 actions (384 x 25 x 4 ≈ 38,400) to fully populate the Q-Learning table and get desirable results.
This would take far too long and require far too many training trials.
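As a quick back-of-the-envelope check of those numbers (the 3-4 visits per state is the assumption stated above):

```python
# State-space size without the deadline feature.
waypoints = 3            # 'forward', 'left', 'right'
lights = 2               # 'green', 'red'
vehicle_inputs = 4       # None, 'forward', 'left', 'right' for each of left, right, oncoming

states = waypoints * lights * vehicle_inputs ** 3
print(states)                    # 384
print(3 * states, 4 * states)    # 1152 1536 actions to visit each state 3-4 times

# Including even the minimum deadline of 25 steps blows this up:
print(states * 25 * 4)           # 38400 actions
```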
For the second implementation, I'll update the 'build_state()'
agent function. With the justification I've provided above, I will now set the 'state'
variable to a tuple of all the features necessary for Q-Learning.
Here, I will define the state as state = (waypoint, str(inputs)), using the waypoint and the sensor inputs but not the deadline (a minimal sketch is shown below).
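The following is a small illustrative sketch of such a 'build_state()', written as a standalone function rather than the project's agent method; the example dictionary values are made up:

```python
# Illustrative build_state(): the state is the waypoint plus the stringified sensor inputs.
# Assumption: inputs is a dict like {'light': 'green', 'oncoming': None, 'right': None, 'left': None}.
def build_state(waypoint, inputs):
    """Return the state tuple used as a key in the Q-table; the deadline is deliberately excluded."""
    return (waypoint, str(inputs))

example_inputs = {'light': 'green', 'oncoming': None, 'right': None, 'left': None}
print(build_state('forward', example_inputs))
```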
The third step to creating an optimized Q-Learning agent is to begin implementing the functionality of Q-Learning itself. The concept of Q-Learning is fairly straightforward: For every state the agent visits, create an entry in the Q-table for all state-action pairs available. Then, when the agent encounters a state and performs an action, update the Q-value associated with that state-action pair based on the reward received and the iterative update rule implemented. Of course, additional benefits come from Q-Learning, such that we can have the agent choose the best action for each state based on the Q-values of each state-action pair possible. For this project, I will be implementing a decaying, $\epsilon$-greedy Q-learning algorithm with no discount factor according to this formula:
$\hspace{50px} Q(state, action) = (1 - \alpha) * Q(state, action) + \alpha * Reward$
Here, the agent attribute self.Q
is a dictionary: This is how the Q-table will be formed. Each state will be a key of the self.Q
dictionary, and each value will then be another dictionary that holds the action and Q-value (reward). Here is an example:
{ 'state-1': {
'action-1' : Qvalue-1,
'action-2' : Qvalue-2,
...
},
'state-2': {
'action-1' : Qvalue-1,
...
},
...
}
Furthermore, note that I will be using a decaying $\epsilon$ (exploration) factor as described above:
$\hspace{50px}\epsilon = 1 - \frac{1}{1+e^{-0.02\,(num\_trials-250)}}$
Hence, as the number of trials increases, $\epsilon$ will decrease towards 0. This is because the agent is expected to learn from its behavior and begin acting on its learned behavior. Additionally, the agent will be tested on what it has learned after $\epsilon$ has fallen below a certain threshold (the default threshold is 0.05). For this initial Q-Learning implementation, I will be implementing a linearly decaying function for $\epsilon$ (see below).
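To tie the pieces above together, here is a small self-contained sketch of the update rule, the dictionary-of-dictionaries Q-table, and the $\epsilon$-greedy choice; the helper function names are mine, not the project's:

```python
import random

valid_actions = [None, 'forward', 'left', 'right']

def create_state(Q, state):
    """Add a new state to the Q-table with all action values initialized to 0.0."""
    if state not in Q:
        Q[state] = {action: 0.0 for action in valid_actions}

def learn(Q, state, action, reward, alpha):
    """Q(state, action) <- (1 - alpha) * Q(state, action) + alpha * reward (no discount factor)."""
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * reward

def choose_action(Q, state, epsilon):
    """Epsilon-greedy choice: explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    best = max(Q[state].values())
    return random.choice([a for a, q in Q[state].items() if q == best])

# Tiny usage example with a made-up state and reward:
Q = {}
state = ('forward', "{'light': 'green', 'oncoming': None, 'right': None, 'left': None}")
create_state(Q, state)
learn(Q, state, 'forward', reward=2.0, alpha=0.5)
print(Q[state])
```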
To obtain results from the initial Q-Learning implementation, you will need to adjust the following flags and setup:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'n_test' - Set this to 10 to perform 10 testing trials.
- 'learning' - Set this to True to tell the driving agent to use your Q-Learning implementation.

In addition, use the following decay function for $\epsilon$ (a small sketch follows below):

$$ \epsilon_{t+1} = \epsilon_{t} - 0.05, \hspace{10px}\textrm{for trial number } t$$

If you have difficulty getting your implementation to work, try setting the 'verbose' flag to True to help debug. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation!
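A minimal sketch of that linear decay, applied once per training trial (the function name and the floor at 0 are my own additions):

```python
def linear_decay(epsilon, step=0.05):
    """epsilon_{t+1} = epsilon_t - 0.05, floored at 0."""
    return max(0.0, epsilon - step)

# Applied once per training trial: epsilon walks from 1.0 down toward 0,
# and testing begins once it falls below the tolerance threshold (default 0.05).
epsilon = 1.0
for trial in range(1, 6):
    epsilon = linear_decay(epsilon)
    print(trial, round(epsilon, 2))
```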
Once you have successfully completed the initial Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!
# Load the 'sim_default-learning' file from the default Q-Learning simulation
vs.plot_trials('sim_default-learning.csv')
Analysis of the visualization above that was produced from the default Q-Learning simulation. Note that the simulation produced the Q-table in a text file which can help make observations about the agent's learning.
The final step to creating an optimized Q-Learning agent is to perform the optimization! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it's necessary to tune settings and adjust learning parameters so the driving agent learns both safety and efficiency. Typically this step will require a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that it takes: in theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to transition from experimenting with unlearned behavior to acting on learned behavior. For example, always allowing the agent to perform a random action during training (if $\epsilon = 1$ and never decays) will certainly make it learn, but never let it act.
To obtain results from the improved Q-Learning implementation, you will need to adjust the following flags and setup:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'learning' - Set this to True to tell the driving agent to use your Q-Learning implementation.
- 'optimized' - Set this to True to tell the driving agent you are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:

- 'n_test' - Set this to some positive number (previously 10) to perform that many testing trials.
- 'alpha' - Set this to a real number between 0 and 1 to adjust the learning rate of the Q-Learning algorithm.
- 'epsilon' - Set this to a real number between 0 and 1 to adjust the starting exploration factor of the Q-Learning algorithm.
- 'tolerance' - Set this to some small value larger than 0 (the default was 0.05) to set the epsilon threshold for testing.

Furthermore, I will test several different decaying functions for $\epsilon$ (the exploration factor):
$$ \epsilon = a^t, \textrm{ for } 0 < a < 1 \hspace{40px} \epsilon = \frac{1}{t^2} \hspace{40px} \epsilon = e^{-at}, \textrm{ for } 0 < a < 1 \hspace{40px} \epsilon = \cos(at), \textrm{ for } 0 < a < 1 \hspace{40px} \epsilon = 1 - \frac{1}{1+e^{-at}} $$

I've also tried a decaying function for $\alpha$ (the learning rate) without much success.
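For reference, here is a small sketch of those candidate decay schedules as plain Python functions of the trial number t; all constants below are illustrative, not the tuned values, and the cosine schedule is floored at 0 here for convenience:

```python
import math

# Candidate epsilon-decay schedules as functions of the trial number t.
schedules = {
    'power':      lambda t: 0.95 ** t,                                # epsilon = a^t, 0 < a < 1
    'inverse_sq': lambda t: 1.0 / (t ** 2) if t > 0 else 1.0,         # epsilon = 1 / t^2
    'exp':        lambda t: math.exp(-0.02 * t),                      # epsilon = e^(-a t)
    'cosine':     lambda t: max(0.0, math.cos(0.02 * t)),             # epsilon = cos(a t), floored at 0
    'sigmoid':    lambda t: 1.0 - 1.0 / (1.0 + math.exp(-0.02 * t)),  # epsilon = 1 - 1/(1 + e^(-a t))
}

# The final chosen form additionally shifts t by 250: 1 - 1/(1 + e^(-0.02 * (t - 250))).
for name, f in schedules.items():
    print(name, [round(f(t), 3) for t in (1, 50, 100, 250)])
```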
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
vs.plot_trials('sim_improved-learning.csv')
Analysis of the visualization above that was produced from the improved Q-Learning simulation:
$\hspace{60px} \epsilon = 1 - \frac{1}{1 + e^{-0.02(t - 250)}}$
Further observations can be made from the sim_improved-learning.txt file that accompanies this last run. In nearly every state we can see that the Q-learner is learning policies consistent with the reward/penalty logic established in environment.py. For instance, it has learned to stop at nearly every red light and even to turn right on red under appropriate conditions.

Sometimes, the answer to the important question "what am I trying to get my agent to learn?" only has a theoretical answer and cannot be concretely described. Here, however, we can concretely define what it is the agent is trying to learn, and that is the U.S. right-of-way traffic laws. Since these laws are known information, we can further define, for each state the Smartcab occupies, the optimal action for the driving agent based on these laws. In that case, we call the set of optimal state-action pairs an optimal policy. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly", not only by the reward (penalty) it receives, but also by pure observation. If the agent drives through a red light, we both see it receive a negative reward and know that this is not the correct behavior. This can be used to our advantage for verifying whether the policy the driving agent has learned is the correct one, or a suboptimal policy.
Here are a few examples (using the states I've defined) of what an optimal policy for this problem would look like. I'm using the 'sim_improved-learning.txt'
text file to see the results of the improved Q-Learning algorithm.
For each state that has been recorded from the simulation, is the policy (the action with the highest value) correct for the given state? Are there any states where the policy is different than what would be expected from an optimal policy?
The following highlights from the above file have the format:
- (state)
-- action-1 : Qvalue-1,
-- action-2 : Qvalue-2,
-- action-3 : Qvalue-3,
-- action-4 : Qvalue-4
('forward', "{'light': 'green', 'oncoming': 'left', 'right': None, 'left': None}")
-- forward : 1.85
-- right : 0.21
-- None : 0.12
-- left : 0.23
('right', "{'light': 'red', 'oncoming': 'forward', 'right': 'left', 'left': None}")
-- forward : -10.36
-- right : 1.96
-- None : 0.96
-- left : -10.23
('left', "{'light': 'green', 'oncoming': 'forward', 'right': None, 'left': None}")
-- forward : -0.01
-- right : 1.14
-- None : -5.54
-- left : -19.87
Discount Factor, 'gamma'

For this project, as part of the Q-Learning algorithm, I did not use the discount factor, 'gamma', in the implementation. Including future rewards in the algorithm aids in propagating positive rewards backwards from a future state to the current state. Essentially, if the driving agent is given the option to take several actions to arrive at different states, including future rewards will bias the agent towards states that could provide even more rewards. An example of this would be the driving agent moving towards a goal: with all actions and rewards equal, moving towards the goal would theoretically yield better rewards if there is an additional reward for reaching the goal. However, even though in this project the driving agent is trying to reach a destination in the allotted time, including future rewards will not benefit the agent. In fact, if the agent were given many trials to learn, it could negatively affect the Q-values!
Since the Smartcab doesn't know where it is or where the destination is, and since a lot of randomness is introduced early in training, adding future states would make little sense: by the next trial that destination no longer exists, yet the agent would be compelled to chase it through the future-state term in its reward calculation. Using the future-rewards component of the learner would not allow the agent to learn an appropriate policy, as it would push the agent to always prefer the future state rather than learn from its immediate environment.
Also, since the destination is constantly changing, the future actions to choose from would be inappropriate, as they would only be good for trying to reach the same destination from the same starting point. Using any of those future actions would therefore negatively impact the learner for any other destination/starting-point pair.
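For contrast, here is a small sketch of the standard discounted Q-Learning update next to the undiscounted form used in this project; the function name and toy states are illustrative, and the max-over-next-state term is exactly what setting gamma to 0 removes:

```python
# Standard Q-Learning update with a discount factor gamma:
#   Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))
# With gamma = 0 (as in this project) the future-reward term vanishes:
#   Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * reward

def q_update(Q, state, action, reward, alpha, gamma=0.0, next_state=None):
    """Generic Q-Learning update; with gamma = 0.0 the next_state term is ignored."""
    future = max(Q[next_state].values()) if (gamma and next_state in Q) else 0.0
    return (1 - alpha) * Q[state][action] + alpha * (reward + gamma * future)

# Toy comparison with two made-up states:
Q = {'s': {'forward': 0.0}, 's_next': {'forward': 5.0}}
print(q_update(Q, 's', 'forward', reward=2.0, alpha=0.5))                                   # undiscounted: 1.0
print(q_update(Q, 's', 'forward', reward=2.0, alpha=0.5, gamma=0.9, next_state='s_next'))   # discounted: 3.25
```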