Probability and Sampling
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline
1. Distributions
Probability Distribution
We can use probability rules to work out analytically how often each possible value is expected to occur, which gives a probability distribution like the following:
# Sums of all possible combinations of two dice rolls.
# (The first few entries illustrate how we constructed these combinations.)
outcomes = make_array(1+1,1+2,2+1,1+3,2+2,3+1,5,5,5,5,6,6,6,6,6,
                      7,7,7,7,7,7,8,8,8,8,8,9,9,9,9,10,10,10,11,11,12)
outcome_bins = np.arange(1.5, 13.5, 1)
plot = Table().with_columns('Sum of two dice rolls', outcomes).hist(bins=outcome_bins)
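The probability rules mentioned above can be applied by enumerating the 36 equally likely ordered outcomes of two fair dice. The following is a minimal sketch using only the Python standard library; the name `exact_two_dice_distribution` is hypothetical and is not part of the course libraries.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def exact_two_dice_distribution():
    """Return {sum: probability} for the sum of two fair six-sided dice."""
    faces = range(1, 7)
    # Each of the 36 ordered pairs (first die, second die) is equally likely.
    counts = Counter(a + b for a, b in product(faces, repeat=2))
    return {total: Fraction(count, 36) for total, count in sorted(counts.items())}

distribution = exact_two_dice_distribution()
for total, probability in distribution.items():
    print(total, probability, round(float(probability), 3))
```

The most likely sum is 7, with probability 6/36 = 1/6, and the probabilities fall off symmetrically toward 2 and 12.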
Empirical Distribution
dice = np.arange(1,7)
dice
Let’s roll a die twice and add the values.
two_dice = np.random.choice(dice, 2)
print('two dice=', two_dice)
print('sum=', sum(two_dice))
Let’s put this together in a function that simulate can use as an input.
def sum_two_dice():
    dice = np.arange(1,7)
    two_dice = np.random.choice(dice, 2)
    return sum(two_dice)
Use simulate (from our inference library) to create an empirical distribution.
def simulate(make_one_outcome, num_trials):
    """
    Return an array of num_trials values, each
    of which was created by calling make_one_outcome().
    """
    outcomes = make_array()
    for i in np.arange(0, num_trials):
        outcome = make_one_outcome()
        outcomes = np.append(outcomes, outcome)
    return outcomes
num_trials = 10
simulate(sum_two_dice, num_trials)
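The `simulate` function above builds its result one trial at a time. As an alternative sketch that does not rely on the `datascience` or `cs104` packages, the same experiment can be simulated with a single vectorized NumPy call. The function name `sum_two_dice_many` and the fixed seed are assumptions made for illustration only.

```python
import numpy as np

def sum_two_dice_many(num_trials, rng):
    """Return an array holding the sum of two fair dice for each trial."""
    # integers(1, 7) draws from 1..6 inclusive; each row is one roll of two dice.
    rolls = rng.integers(1, 7, size=(num_trials, 2))
    return rolls.sum(axis=1)

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable
sums = sum_two_dice_many(2000, rng)
print(sums[:10])
```

This version swaps the explicit loop for array operations, so it is a different implementation of the same idea rather than the library's `simulate`.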
num_trials = 2000
all_outcomes = simulate(sum_two_dice, num_trials)
simulated_results = Table().with_column('Sum of two dice rolls', all_outcomes)
simulated_results
plot = simulated_results.hist(bins=outcome_bins)
plot.set_title('Empirical (approximate) distribution \n num_trials='+str(num_trials));
plot.set_ylim(0,0.175)
Law of Averages
In our simulation, there is one parameter we can control: num_trials. Does this parameter matter?
To find out, we can write a function that takes num_trials as input.
def simulate_and_plot_summing_two_dice(num_trials):
    """
    Simulates rolling two dice, repeats this num_trials times, and
    plots the empirical distribution.
    """
    all_outcomes = simulate(sum_two_dice, num_trials)
    simulated_results = Table().with_column('Sum of two dice rolls', all_outcomes)
    outcome_bins = np.arange(1.5, 13.5, 1)
    plot = simulated_results.hist(bins=outcome_bins)
    plot.set_title('Empirical (approximate) distribution \n num_trials='+str(num_trials))
    plot.set_ylim(0,0.18)
simulate_and_plot_summing_two_dice(2000)
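One way to put a number on this question is to measure how far the empirical proportions are from the exact probabilities as `num_trials` grows. The sketch below is an illustration in plain NumPy rather than the course's `simulate` function; the helper `largest_gap` and the fixed seed are hypothetical.

```python
import numpy as np

# Exact probabilities for the sums 2..12: 1/36, 2/36, ..., 6/36, ..., 1/36.
exact = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36

def largest_gap(num_trials, rng):
    """Largest absolute difference between empirical and exact probabilities."""
    sums = rng.integers(1, 7, size=(num_trials, 2)).sum(axis=1)
    # bincount counts occurrences of 0..12; keep only the counts for sums 2..12.
    counts = np.bincount(sums, minlength=13)[2:]
    return np.abs(counts / num_trials - exact).max()

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable
for num_trials in [10, 100, 1000, 10000, 100000]:
    print(num_trials, round(largest_gap(num_trials, rng), 4))
```

The gap tends to shrink as the number of trials increases, although any single run can fluctuate.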
Here are a few plots for different numbers of trials:
with Figure(1,3, sharey=True):
    simulate_and_plot_summing_two_dice(100)
    simulate_and_plot_summing_two_dice(500)
interact(simulate_and_plot_summing_two_dice, num_trials = Slider(1,1000))
Here is an animation that shows how the number of trials affects the result over a wider range of values.
# The first line of this cell is not shown in the source;
# `def gen():` is inferred from the call to animate below.
def gen():
    for num_trials in [10,20,30,40,50,60, 70, 80, 90,
                       100, 200, 300, 400, 500, 600, 700, 800, 900,
                       1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000,
                       10000]:
        yield locals()

animate(simulate_and_plot_summing_two_dice, gen, default_mode='reflect', show_params=False, interval=200)
2. Random Sampling: Florida Votes in 2016
Load the data for voting in Florida in 2016. These give us the true parameters, as if we had been able to poll every person who turned out to vote:
Proportion voting for (Trump, Clinton, Johnson, other) = (0.49, 0.478, 0.022, 0.01)
Raw counts:
Trump: 4,617,886
Clinton: 4,504,975
Johnson: 207,043
Other: 90,135
The data is based on the actual votes cast in the election.
votes = Table().read_table('data/florida_2016.csv')
votes = votes.with_column('Vote', votes.apply(make_array("Trump", "Clinton", "Johnson", "Other").item, "Vote"))
votes.show(5)
Here’s the total number of votes cast in the election.
votes.num_rows
We can pick a “convenience sample”: the first 10 voters who show up in line.
votes.take(np.arange(10))
Since we are analyzing this after the election, we actually know the votes for the full population and we can compute the true parameter.
sum(votes.column('Vote') == 'Trump') / votes.num_rows
But suppose this is before the election and we actually can’t ask every person in the state how they will vote…
In that case, we can imagine we are a pollster, and sample 50 people.
We can use .sample(n) to randomly sample n rows from a table.
sample = votes.sample(50)
sample
sum(sample.column('Vote') == 'Trump') / sample.num_rows
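For readers without the data file, a single poll can be approximated directly from the raw counts listed earlier. This is a sketch under a stated assumption: because the population is so much larger than the sample, drawing each respondent independently with the population proportions stands in for sampling rows from the table. The names `poll_once`, `candidates`, and `counts` are hypothetical.

```python
import numpy as np

candidates = np.array(["Trump", "Clinton", "Johnson", "Other"])
counts = np.array([4_617_886, 4_504_975, 207_043, 90_135])  # raw counts from the text
proportions = counts / counts.sum()

def poll_once(sample_size, rng):
    """Return an array of sample_size simulated responses."""
    # Assumption: independent draws with the population proportions.
    return rng.choice(candidates, size=sample_size, p=proportions)

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable
responses = poll_once(50, rng)
print(np.mean(responses == "Trump"))  # proportion of the sample choosing Trump
```

Running `poll_once` repeatedly gives a different sample proportion each time, which is the variability the rest of this section explores.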
Let’s write functions to do this!
A function that takes a sample of a given size.
A function that computes the statistic (proportion of the sample that voted for Trump).
def sample_votes(sample_size):
    return votes.sample(sample_size)

def proportion_vote_trump(sample):
    return sum(sample.column('Vote') == 'Trump') / sample.num_rows
sample = sample_votes(100)
proportion_vote_trump(sample)
proportion_vote_trump(sample_votes(100))
proportion_vote_trump(sample_votes(1000))
So far, we’ve been using a simulate function. Let’s extend this to a function that can also take a sample size. We’ll call this function simulate_sample_statistic.
def simulate_sample_statistic(make_one_sample, sample_size,
                              compute_sample_statistic, num_trials):
    """
    Simulates num_trials sampling steps and returns an array of the
    statistic for those samples. The parameters are:
    - make_one_sample: a function that takes an integer n and returns a
      sample as an array of n elements.
    - sample_size: the size of the samples to use in the simulation.
    - compute_sample_statistic: a function that takes a sample as
      an array and returns the statistic for that sample.
    - num_trials: the number of simulation steps to perform.
    """
    simulated_statistics = make_array()
    for i in np.arange(0, num_trials):
        simulated_sample = make_one_sample(sample_size)
        sample_statistic = compute_sample_statistic(simulated_sample)
        simulated_statistics = np.append(simulated_statistics, sample_statistic)
    return simulated_statistics
Let’s use our simulation algorithm to create an empirical distribution.
Suppose there are 1,000 polling companies and each uses a sample of 100 people.
num_trials = 1000   # 1,000 polling companies
sample_size = 100   # 100 people sampled by each polling company
all_outcomes = simulate_sample_statistic(sample_votes, sample_size,
                                         proportion_vote_trump, num_trials)
all_outcomes
simulated_results = Table().with_column('Proportion voting for Trump', all_outcomes)
plot = simulated_results.hist()
title = 'Empirical (approximate) distribution \n num_trials='+str(num_trials)+ '\n sample_size='+str(sample_size)
plot.set_title(title)
Let’s make a function with our two free parameters, num_trials and sample_size.
def simulate_and_plot_trump_pollster(num_trials, sample_size):
    all_outcomes = simulate_sample_statistic(sample_votes, sample_size,
                                             proportion_vote_trump, num_trials)
    simulated_results = Table().with_column('Proportion voting for Trump', all_outcomes)
    plot = simulated_results.hist(bins=np.arange(0.3,0.71,0.025))
    title = 'Empirical (approximate) distribution \n num_trials='+str(num_trials)+ '\n sample_size='+str(sample_size)
    plot.set_title(title)
Here are a few choices for parameters. Notice how each impacts the resulting histogram.
with Figure(2,2, sharey=True, sharex=True):
    import matplotlib.pyplot as plots
    simulate_and_plot_trump_pollster(100, 200)
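The effect of `sample_size` can also be summarized numerically. As a sketch, the spread of the simulated proportions can be compared with the textbook formula for the standard deviation of a sample proportion, sqrt(p(1-p)/n). The code assumes independent draws with p = 0.49 (the rounded proportion quoted earlier) and uses NumPy's `Generator.binomial` instead of the course's `simulate_sample_statistic`; `simulated_proportions` is a hypothetical helper.

```python
import numpy as np

p = 0.49  # rounded population proportion from the text (an assumption of this sketch)

def simulated_proportions(num_trials, sample_size, rng):
    """Return num_trials sample proportions, each from sample_size independent draws."""
    return rng.binomial(sample_size, p, size=num_trials) / sample_size

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable
for sample_size in [10, 100, 1000, 5000]:
    props = simulated_proportions(1000, sample_size, rng)
    predicted = np.sqrt(p * (1 - p) / sample_size)
    print(sample_size, round(props.std(), 4), round(predicted, 4))
```

Increasing `num_trials` makes the histogram smoother but does not narrow it, while increasing `sample_size` narrows it roughly in proportion to 1/sqrt(sample_size).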
interact(simulate_and_plot_trump_pollster,
num_trials = Choice(make_array(1,10,100,1000,5000)),
sample_size = Choice(make_array(1,10,100,1000,5000)))
As another way to look at it, here’s a visualization showing the empirical distribution for a small sample size with varying numbers of trials.
# bounded_simulate_and_plot_trump_pollster fixes the axes with
# plots.xlim(0.3, 0.7) and plots.ylim(0,30).
interact(bounded_simulate_and_plot_trump_pollster,
         num_trials = Slider(10,10000),
         sample_size = Fixed(100))
And here’s one showing the empirical distribution for varying sample sizes.
# bounded_simulate_and_plot_trump_pollster fixes the axes with
# plots.xlim(0.3, 0.7) and plots.ylim(0,70).
interact(bounded_simulate_and_plot_trump_pollster,
         num_trials = Fixed(1000),
         sample_size = Slider(10,5000))
And here’s an animation showing the empirical distribution for varying sample sizes.
Big picture questions about sampling:
Why wouldn’t we always just take really big samples, since they converge to the true distribution?
Big picture questions about simulations:
What are we abstracting away when we’re writing code? What are we re-using over and over?