In this project, I will construct an optimized Q-Learning driving agent that will navigate a Smartcab through its environment towards a goal. Since the Smartcab is expected to drive passengers from one location to another, the driving agent will be evaluated on two very important metrics: Safety and Reliability. A driving agent that gets the Smartcab to its destination while running red lights or narrowly avoiding accidents would be considered unsafe. Similarly, a driving agent that frequently fails to reach the destination in time would be considered unreliable. Maximizing the driving agent's safety and reliability would ensure that Smartcabs have a permanent place in the transportation industry.
Safety and Reliability are measured using a letter-grade system as follows:
Grade | Safety | Reliability |
---|---|---|
A+ | Agent commits no traffic violations, and always chooses the correct action. | Agent reaches the destination in time for 100% of trips. |
A | Agent commits few minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 90% of trips. |
B | Agent commits frequent minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 80% of trips. |
C | Agent commits at least one major traffic violation, such as driving through a red light. | Agent reaches the destination on time for at least 70% of trips. |
D | Agent causes at least one minor accident, such as turning left on green with oncoming traffic. | Agent reaches the destination on time for at least 60% of trips. |
F | Agent causes at least one major accident, such as driving through a red light with cross-traffic. | Agent fails to reach the destination on time for at least 60% of trips. |
To help evaluate these important metrics, visualization code will be used later in the project. The code cell below imports this code, which is necessary for the upcoming analysis.
# Import the visualization code
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
Before implementing my driving agent, it's necessary to first understand the world (environment) in which the Smartcab and driving agent operate. A major component of building a self-learning agent is understanding its characteristics, including how the agent operates. To begin, simply run the agent.py agent code in its "No Learning" state: open agent.py, scroll to the bottom to the run() method, comment out the active lines of code within this method, and uncomment the subsequent lines as per the instructions for each active line of code.
Then save and run the code either in a Jupyter Notebook from the top-level directory where this notebook is located using:
'run smartcab/agent.py'
or from the same top-level directory in a terminal window using:
'python smartcab/agent.py'
Let the resulting simulation run for some time to see the various working components. Note that in the visual simulation (if enabled), the white vehicle is the Smartcab.
When running the default agent.py agent code, some questions to consider:
Answer:
The white Smartcab does not move very often; over a span of about 5 minutes the cab remains in one spot, and is then randomly placed somewhere else.
The warning "!!Agent state not updated!" never changes. Underneath the heading Training Trial is the unchanging red warning phrase "Previous Trial: Failure". The rest of the grid changes with each update.
Underneath "!!Agent state not updated!" is printed the phrase "No action taken." that is either colored red when there is a negative reward for not going through a green light, or green for sitting idle during a red light. It is followed by the value of the positive or negative reward. If the light is green and the smartcab acts correctly without causing a violation or an accident there is a positive reward. There is also an additional reward for following the waypoint. Not following the waypoint when acting on a green can still yield positive reward.
Underneath that phrase is another phrase that describes the action the cab took and is colored red, yellow, or green depending on the appropriateness of the action.
Underneath that phrase is the statement "Agent not enforced to meet deadline."
In addition to understanding the world, it is also necessary to understand the code itself that governs how the world, simulation, and so on operate. Attempting to create a driving agent would be difficult without having at least explored the "hidden" devices that make everything work. In the /smartcab/ top-level directory, there are two folders: /logs/ (which will be used later) and /smartcab/. Open the /smartcab/ folder and explore each Python file included to understand how the code is structured.
From agent.py, consider the following flags:

- 'update_delay': continuous time (in seconds) between actions; the default is 2.0 seconds.
- 'display': set to False to disable the GUI if PyGame is enabled.
- 'log_metrics': set to True to log trial and simulation results to /logs/.
- 'learning': set to True to force the Learning Agent to use the Q-Learning table.
- 'enforce_deadline': set to True to enforce the deadline metric.
- 'alpha': continuous value for the learning rate in the update equation

  $\hspace{80px} Q(state, action) = (1 - \alpha) * Q(state, action) + \alpha * Reward$

  which determines what fraction of the newest reward is used to update the Q-Learning table.
- 'epsilon': continuous value for the exploration factor. I ultimately landed on the following form of the sigmoid function for this:

  $\hspace{80px}\epsilon = 1 - \frac{1}{1+e^{-0.02\,(num\_trials-250)}}$

  where the factors 0.02 and 250 were arrived at through repeated testing. Epsilon starts at 1 and, as the number of trials increases, the sigmoid function gradually decreases epsilon toward 0. In the choose_action() method, epsilon governs whether the Learning Agent takes a random action as its next step or uses the Q-Learning table to choose the next action. As epsilon decreases, this choice transitions from randomly chosen actions to learned actions (a small standalone sketch of this decay follows the list below).
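Purely as an illustration, here is a minimal standalone sketch of that sigmoid decay; the function name, default parameters, and printed trial numbers are my own choices, not the project's:

```python
import math

def sigmoid_epsilon(num_trials, midpoint=250, steepness=0.02):
    """epsilon = 1 - 1 / (1 + e^(-steepness * (num_trials - midpoint)))."""
    return 1.0 - 1.0 / (1.0 + math.exp(-steepness * (num_trials - midpoint)))

# Epsilon starts near 1 (mostly random exploration) and decays toward 0 (mostly learned actions).
for t in (0, 100, 250, 400, 500):
    print(t, round(sigmoid_epsilon(t), 3))
```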
From environment.py:

From simulator.py:

'render_text()' - "This is the non-GUI render display of the simulation." This function updates the various metrics according to the action that was taken and then determines violations and rewards. It also determines whether any accidents occurred, monitors the time against the deadline, and tracks the status of the agent: idling, driving, or learning.

'render()' - "This is the GUI render display of the simulation." This updates the display to show the positions of the cars, the traffic light statuses, and the environment.
From planner.py:

'planner.py' tests first for whether the destination is cardinally East-West of the current location.

The first step to creating an optimized Q-Learning driving agent is getting the agent to actually take valid actions. In this case, a valid action is one of None (do nothing), 'Left' (turn left), 'Right' (turn right), or 'Forward' (go forward). For the first implementation, I'll use the 'choose_action()' agent function to make the driving agent randomly choose one of these actions. Note that this method has access to several class variables that will help create this functionality, such as 'self.learning' and 'self.valid_actions'. Once implemented, I can run the agent file and simulation briefly to confirm that the driving agent is taking a random action each time step (a minimal sketch of this random choice is shown below).
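Assuming the valid actions match the project's 'self.valid_actions' list, a minimal standalone version of that random choice could look like this (the function name and the demo loop are illustrative only):

```python
import random

# Minimal stand-in for the agent's "No Learning" action choice.
# Assumption: the valid actions are None plus the three driving directions.
valid_actions = [None, 'forward', 'left', 'right']

def choose_random_action(actions):
    """Return one of the valid actions, chosen uniformly at random."""
    return random.choice(actions)

# Each time step, the non-learning agent simply does this:
for step in range(5):
    print(step, choose_random_action(valid_actions))
```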
To obtain results from the initial simulation, I will need to adjust the following flags:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file in /logs/.
- 'n_test' - Set this to 10 to perform 10 testing trials.

Optionally, I will disable the visual simulation (which can make the trials go faster) by setting the 'display' flag to False.
Once we have successfully completed the initial simulation (there should have been 20 training trials and 10 testing trials), run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!
# Load the 'sim_no-learning' log file from the initial simulation results
vs.plot_trials('sim_no-learning.csv')
Using the visualization above that was produced from this simulation, here's an analysis with observations about the driving agent:
The second step to creating an optimized Q-learning driving agent is defining a set of states that the agent can occupy in the environment. Depending on the input, sensory data, and additional variables available to the driving agent, a set of states can be defined for the agent so that it can eventually learn what action it should take when occupying a state. The condition of 'if state then action'
for each state is called a policy, and is ultimately what the driving agent is expected to learn. Without defining states, the driving agent would never understand which action is most optimal -- or even what environmental variables and conditions it cares about!
Inspecting the 'build_state()' agent function shows that the driving agent is given the following data from the environment:

- 'waypoint', which is the direction the Smartcab should drive leading to the destination, relative to the Smartcab's heading.
- 'inputs', which is the sensor data from the Smartcab. It includes:
  - 'light', the color of the light.
  - 'left', the intended direction of travel for a vehicle to the Smartcab's left. Returns None if no vehicle is present.
  - 'right', the intended direction of travel for a vehicle to the Smartcab's right. Returns None if no vehicle is present.
  - 'oncoming', the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
- 'deadline', which is the number of actions remaining for the Smartcab to reach the destination before running out of time.

Which features available to the agent are most relevant for learning both safety and efficiency? Why are these features appropriate for modeling the Smartcab in the environment? If you did not choose some features, why are those features not appropriate?
For learning safety, the key input from environment.sense() is the status of the light, since this is the central driver for determining violations. Running a red light also greatly increases the chances of causing an accident. Either of these will induce significant negative rewards, reduce reliability, and ultimately lead to failure.
As for the left, right, and oncoming inputs, these are all needed to determine whether the Smartcab causes an accident, which is laid out in the environment.act() method; without them the agent cannot learn to avoid accidents. Specifically, though, the Smartcab never needs to care about whether a car to its left is turning right, since no direction the cab travels can produce a collision with a car to the left that is turning right. Thus, we would only want to know whether the car to the left is going forward or left, so instead of 4 values for that input (none, forward, left, right) we would only need 3.
For efficiency, we will want the agent to make as few turns as possible so as to get to the destination as quickly as possible within the deadline. Clearly, we'll need the waypoints for this as well as the deadline clock.
These features are important for modeling the Smartcab in the environment because, for starters, the environment should be modeled independently of the Smartcab, so that the Smartcab has to learn to behave according to the rules of the environment as well as the behaviors of the other agents. In this way, the agent "looks" out into the environment via sense() to determine whether the light is green or red. These features are also the typical basics for this type of driving: although simplified here, traffic lights oscillate between red and green on a timed basis, and there are a variety of rules governing when it is acceptable to proceed and when we must wait, as well as rules for avoiding accidents and the like. All of this corresponds to a simplified version of the environment that any real cab would encounter.
When defining a set of states that the agent can occupy, it is necessary to consider the size of the state space. That is to say, if you expect the driving agent to learn a policy for each state, you would need to have an optimal action for every state the agent can occupy. If the number of all possible states is very large, it might be the case that the driving agent never learns what to do in some states, which can lead to uninformed decisions. For example, consider a case where the following features are used to define the state of the Smartcab:
('is_raining', 'is_foggy', 'is_red_light', 'turn_left', 'no_traffic', 'previous_turn_left', 'time_of_day')
.
How frequently would the agent occupy a state like (False, True, True, True, False, False, '3AM')
? Without a near-infinite amount of time for training, it's doubtful the agent would ever learn the proper action!
If a state is defined using the features selected from above, what would be the size of the state space? Given what we know about the evironment and how it is simulated, could the driving agent learn a policy for each possible state within a reasonable number of training trials?
Well, for next_waypoint there are 3 possibilities. For the traffic light there are 2 possibilities. Then for left, right, and oncoming there are 4 possibilities each: none, forward, left, right. This gives 3 x 2 x 4 x 4 x 4 = 384 states. We would then need the Smartcab to visit each of these at least 3-4 times to learn what the correct action should be, so we'd need to perform at least 1152-1536 actions to fully populate the state space.
However, I won't include the deadline here, as it would cause an explosion in the size of the state space. From Environment.reset() we see that a minimum of 25 time steps is allowed for each trial, so including the deadline would multiply the 384 states by at least 25, requiring more than 38,000 actions (384 x 25 x 4 ≈ 38,400) to fully populate the Q-Learning table and get desirable results.
This would take far too long and require far too many training trials.
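As a quick back-of-the-envelope check of those numbers (the 3-4 visits per state is the assumption stated above):

```python
# State-space size without the deadline feature.
waypoints = 3            # 'forward', 'left', 'right'
lights = 2               # 'green', 'red'
vehicle_inputs = 4       # None, 'forward', 'left', 'right' for each of left, right, oncoming

states = waypoints * lights * vehicle_inputs ** 3
print(states)                    # 384
print(3 * states, 4 * states)    # 1152 1536 actions to visit each state 3-4 times

# Including even the minimum deadline of 25 steps blows this up:
print(states * 25 * 4)           # 38400 actions
```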
For the second implementation, I'll update the 'build_state()'
agent function. With the justification I've provided above, I will now set the 'state'
variable to a tuple of all the features necessary for Q-Learning.
Here, I will define the state as state = (waypoint, str(inputs)), using the waypoint and the sensor inputs but not the deadline (a minimal sketch is shown below).
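The following is a small illustrative sketch of such a 'build_state()', written as a standalone function rather than the project's agent method; the example dictionary values are made up:

```python
# Illustrative build_state(): the state is the waypoint plus the stringified sensor inputs.
# Assumption: inputs is a dict like {'light': 'green', 'oncoming': None, 'right': None, 'left': None}.
def build_state(waypoint, inputs):
    """Return the state tuple used as a key in the Q-table; the deadline is deliberately excluded."""
    return (waypoint, str(inputs))

example_inputs = {'light': 'green', 'oncoming': None, 'right': None, 'left': None}
print(build_state('forward', example_inputs))
```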
The third step to creating an optimized Q-Learning agent is to begin implementing the functionality of Q-Learning itself. The concept of Q-Learning is fairly straightforward: For every state the agent visits, create an entry in the Q-table for all state-action pairs available. Then, when the agent encounters a state and performs an action, update the Q-value associated with that state-action pair based on the reward received and the iterative update rule implemented. Of course, additional benefits come from Q-Learning, such that we can have the agent choose the best action for each state based on the Q-values of each state-action pair possible. For this project, I will be implementing a decaying, $\epsilon$-greedy Q-learning algorithm with no discount factor according to this formula:
$\hspace{50px} Q(state, action) = (1 - \alpha) * Q(state, action) + \alpha * Reward$
Here, the agent attribute self.Q
is a dictionary: This is how the Q-table will be formed. Each state will be a key of the self.Q
dictionary, and each value will then be another dictionary that holds the action and Q-value (reward). Here is an example:
{ 'state-1': {
'action-1' : Qvalue-1,
'action-2' : Qvalue-2,
...
},
'state-2': {
'action-1' : Qvalue-1,
...
},
...
}
Furthermore, note that I will be using a decaying $\epsilon$ (exploration) factor as described above:
$\hspace{50px}\epsilon = 1 - \frac{1}{1+e^{-0.02\,(num\_trials-250)}}$
Hence, as the number of trials increases, $\epsilon$ will decrease towards 0. This is because the agent is expected to learn from its behavior and begin acting on its learned behavior. Additionally, the agent will be tested on what it has learned after $\epsilon$ has fallen below a certain threshold (the default threshold is 0.05). For this initial Q-Learning implementation, I will be implementing a linearly decaying function for $\epsilon$ (see below).
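To tie the pieces above together, here is a small self-contained sketch of the update rule, the dictionary-of-dictionaries Q-table, and the $\epsilon$-greedy choice; the helper function names are mine, not the project's:

```python
import random

valid_actions = [None, 'forward', 'left', 'right']

def create_state(Q, state):
    """Add a new state to the Q-table with all action values initialized to 0.0."""
    if state not in Q:
        Q[state] = {action: 0.0 for action in valid_actions}

def learn(Q, state, action, reward, alpha):
    """Q(state, action) <- (1 - alpha) * Q(state, action) + alpha * reward (no discount factor)."""
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * reward

def choose_action(Q, state, epsilon):
    """Epsilon-greedy choice: explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    best = max(Q[state].values())
    return random.choice([a for a, q in Q[state].items() if q == best])

# Tiny usage example with a made-up state and reward:
Q = {}
state = ('forward', "{'light': 'green', 'oncoming': None, 'right': None, 'left': None}")
create_state(Q, state)
learn(Q, state, 'forward', reward=2.0, alpha=0.5)
print(Q[state])
```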
To obtain results from the initial Q-Learning implementation, you will need to adjust the following flags and setup:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'n_test' - Set this to 10 to perform 10 testing trials.
- 'learning' - Set this to True to tell the driving agent to use your Q-Learning implementation.

In addition, use the following decay function for $\epsilon$ (a small sketch follows below):

$$ \epsilon_{t+1} = \epsilon_{t} - 0.05, \hspace{10px}\textrm{for trial number } t$$

If you have difficulty getting your implementation to work, try setting the 'verbose' flag to True to help debug. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation!
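A minimal sketch of that linear decay, applied once per training trial (the function name and the floor at 0 are my own additions):

```python
def linear_decay(epsilon, step=0.05):
    """epsilon_{t+1} = epsilon_t - 0.05, floored at 0."""
    return max(0.0, epsilon - step)

# Applied once per training trial: epsilon walks from 1.0 down toward 0,
# and testing begins once it falls below the tolerance threshold (default 0.05).
epsilon = 1.0
for trial in range(1, 6):
    epsilon = linear_decay(epsilon)
    print(trial, round(epsilon, 2))
```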
Once you have successfully completed the initial Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!
# Load the 'sim_default-learning' file from the default Q-Learning simulation
vs.plot_trials('sim_default-learning.csv')
Analysis of the visualization above that was produced from the default Q-Learning simulation. Note that the simulation produced the Q-table in a text file which can help make observations about the agent's learning.
The final step to creating an optimized Q-Learning agent is to perform the optimization! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it's necessary to tune settings and adjust learning parameters so the driving agent learns both safety and efficiency. Typically this step will require a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that it takes: in theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to transition from experimenting with unlearned behavior to acting on learned behavior. For example, always allowing the agent to perform a random action during training (if $\epsilon = 1$ and never decays) will certainly make it learn, but never let it act.
To obtain results from the improved Q-Learning implementation, you will need to adjust the following flags and setup:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'learning' - Set this to True to tell the driving agent to use your Q-Learning implementation.
- 'optimized' - Set this to True to tell the driving agent you are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:

- 'n_test' - Set this to some positive number (previously 10) to perform that many testing trials.
- 'alpha' - Set this to a real number between 0 and 1 to adjust the learning rate of the Q-Learning algorithm.
- 'epsilon' - Set this to a real number between 0 and 1 to adjust the starting exploration factor of the Q-Learning algorithm.
- 'tolerance' - Set this to some small value larger than 0 (the default was 0.05) to set the epsilon threshold for testing.

Furthermore, I will test several different decaying functions for $\epsilon$ (the exploration factor):
$$ \epsilon = a^t, \textrm{ for } 0 < a < 1 \hspace{40px} \epsilon = \frac{1}{t^2} \hspace{40px} \epsilon = e^{-at}, \textrm{ for } 0 < a < 1 \hspace{40px} \epsilon = \cos(at), \textrm{ for } 0 < a < 1 \hspace{40px} \epsilon = 1 - \frac{1}{1+e^{-at}} $$

I've also tried a decaying function for $\alpha$ (the learning rate) without much success.
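For reference, here is a small sketch of those candidate decay schedules as plain Python functions of the trial number t; all constants below are illustrative, not the tuned values, and the cosine schedule is floored at 0 here for convenience:

```python
import math

# Candidate epsilon-decay schedules as functions of the trial number t.
schedules = {
    'power':      lambda t: 0.95 ** t,                                # epsilon = a^t, 0 < a < 1
    'inverse_sq': lambda t: 1.0 / (t ** 2) if t > 0 else 1.0,         # epsilon = 1 / t^2
    'exp':        lambda t: math.exp(-0.02 * t),                      # epsilon = e^(-a t)
    'cosine':     lambda t: max(0.0, math.cos(0.02 * t)),             # epsilon = cos(a t), floored at 0
    'sigmoid':    lambda t: 1.0 - 1.0 / (1.0 + math.exp(-0.02 * t)),  # epsilon = 1 - 1/(1 + e^(-a t))
}

# The final chosen form additionally shifts t by 250: 1 - 1/(1 + e^(-0.02 * (t - 250))).
for name, f in schedules.items():
    print(name, [round(f(t), 3) for t in (1, 50, 100, 250)])
```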
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
vs.plot_trials('sim_improved-learning.csv')
Analysis of the visualization above that was produced from the improved Q-Learning simulation:
$\hspace{60px} \epsilon = 1 - \frac{1}{1 + e^{-0.02(t - 250)}}$
Further observations can be made from the sim_improved-learning.txt file that accompanies this last run. In nearly every state we can see that the Q-learner is learning policies consistent with the reward/penalty logic established in environment.py. For instance, it has learned to stop at nearly every red light and even to turn right on red under appropriate conditions.

Sometimes, the answer to the important question "what am I trying to get my agent to learn?" only has a theoretical answer and cannot be concretely described. Here, however, we can concretely define what it is the agent is trying to learn, and that is the U.S. right-of-way traffic laws. Since these laws are known information, we can further define, for each state the Smartcab occupies, the optimal action for the driving agent based on these laws. In that case, we call the set of optimal state-action pairs an optimal policy. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly", not only by the reward (penalty) it receives, but also by pure observation. If the agent drives through a red light, we both see it receive a negative reward and know that this is not the correct behavior. This can be used to our advantage for verifying whether the policy the driving agent has learned is the correct one, or a suboptimal policy.
Here are a few examples (using the states I've defined) of what an optimal policy for this problem would look like. I'm using the 'sim_improved-learning.txt'
text file to see the results of the improved Q-Learning algorithm.
For each state that has been recorded from the simulation, is the policy (the action with the highest value) correct for the given state? Are there any states where the policy is different than what would be expected from an optimal policy?
The following highlights from the above file have the format:
- (state)
-- action-1 : Qvalue-1,
-- action-2 : Qvalue-2,
-- action-3 : Qvalue-3,
-- action-4 : Qvalue-4
('forward', "{'light': 'green', 'oncoming': 'left', 'right': None, 'left': None}")
-- forward : 1.85
-- right : 0.21
-- None : 0.12
-- left : 0.23
('right', "{'light': 'red', 'oncoming': 'forward', 'right': 'left', 'left': None}")
-- forward : -10.36
-- right : 1.96
-- None : 0.96
-- left : -10.23
('left', "{'light': 'green', 'oncoming': 'forward', 'right': None, 'left': None}")
-- forward : -0.01
-- right : 1.14
-- None : -5.54
-- left : -19.87
Discount Factor, 'gamma'

For this project, as part of the Q-Learning algorithm, I did not use the discount factor, 'gamma', in the implementation. Including future rewards in the algorithm aids in propagating positive rewards backwards from a future state to the current state. Essentially, if the driving agent is given the option to take several actions to arrive at different states, including future rewards will bias the agent towards states that could provide even more rewards. An example of this would be the driving agent moving towards a goal: with all actions and rewards equal, moving towards the goal would theoretically yield better rewards if there is an additional reward for reaching the goal. However, even though in this project the driving agent is trying to reach a destination in the allotted time, including future rewards will not benefit the agent. In fact, if the agent were given many trials to learn, it could negatively affect the Q-values!
Since the Smartcab doesn't know where it is or where the destination is, and since a lot of randomness is introduced early in training, adding future states would make little sense: by the next trial that destination no longer exists, yet the agent would be compelled to chase it through the future-state term in its reward calculation. Using the future-rewards component of the learner would not allow the agent to learn an appropriate policy, as it would push the agent to always prefer the future state rather than learn from its immediate environment.
Also, since the destination is constantly changing, the future actions to choose from would be inappropriate, as they would only be good for trying to reach the same destination from the same starting point. Using any of those future actions would therefore negatively impact the learner for any other destination/starting-point pair.
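For contrast, here is a small sketch of the standard discounted Q-Learning update next to the undiscounted form used in this project; the function name and toy states are illustrative, and the max-over-next-state term is exactly what setting gamma to 0 removes:

```python
# Standard Q-Learning update with a discount factor gamma:
#   Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))
# With gamma = 0 (as in this project) the future-reward term vanishes:
#   Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * reward

def q_update(Q, state, action, reward, alpha, gamma=0.0, next_state=None):
    """Generic Q-Learning update; with gamma = 0.0 the next_state term is ignored."""
    future = max(Q[next_state].values()) if (gamma and next_state in Q) else 0.0
    return (1 - alpha) * Q[state][action] + alpha * (reward + gamma * future)

# Toy comparison with two made-up states:
Q = {'s': {'forward': 0.0}, 's_next': {'forward': 5.0}}
print(q_update(Q, 's', 'forward', reward=2.0, alpha=0.5))                                   # undiscounted: 1.0
print(q_update(Q, 's', 'forward', reward=2.0, alpha=0.5, gamma=0.9, next_state='s_next'))   # discounted: 3.25
```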