Inference with Statistics
Contents
Inference with Statistics#
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[2], line 1
----> 1 from datascience import *
2 from cs104 import *
3 import numpy as np
ModuleNotFoundError: No module named 'datascience'
1. Snoopy’s Fleet#
We will create a simulation for the empirical distribution of our chosen statistic.
At true inference time, we do not know N. However, let’s create N now so that we can evaluate how “good” our chosen models and statistics could be.
N = 300
population = np.arange(1, N+1)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[4], line 1
----> 1 population = np.arange(1, N+1)
NameError: name 'np' is not defined
Suppose we didn’t know N. Then all we could do is observe samples of planes flying overhead.
We’ll use our model assumption: Plane numbers are uniform random sample drawn with replacement from 1 to N.
def sample_plane_fleet(sample_size):
return np.random.choice(population, sample_size)
sample = sample_plane_fleet(10)
sample
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[6], line 1
----> 1 sample = sample_plane_fleet(10)
2 sample
Cell In[5], line 2, in sample_plane_fleet(sample_size)
1 def sample_plane_fleet(sample_size):
----> 2 return np.random.choice(population, sample_size)
NameError: name 'np' is not defined
Option 1: Sample Statistic is the max plane number#
Let’s choose a stastic as the max of as sample. Luckily, max is simple and we don’t have to write a new function here. We can use Python’s built-in function!
max(sample)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[7], line 1
----> 1 max(sample)
NameError: name 'sample' is not defined
Let’s put this all together using our simulation algorithm.
sample_size = 10
num_trials = 1000
all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size,
max, num_trials)
results = Table().with_columns("Statistic Option 1: Max Sample", all_outcomes)
plot = results.hist(bins=np.arange(1, 2 * N, 10))
plot.set_title("Empirical distribution\n"+
"Sample Size ="+str(sample_size)+
"\nNum trials ="+str(num_trials))
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[9], line 1
----> 1 all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size,
2 max, num_trials)
4 results = Table().with_columns("Statistic Option 1: Max Sample", all_outcomes)
5 plot = results.hist(bins=np.arange(1, 2 * N, 10))
NameError: name 'simulate_sample_statistic' is not defined
Let’s generalize a bit and create function that takes the statistic and runs the whole experiment.
Option 2: Sample Statistic is Twice the Mean#
Let’s generalize a bit and create function that takes the statistic and runs the whole experiment
def planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials):
"""
Simulates multiple trials of a statistic on our simulated fleet of N planes
Inputs
- sample_size: number of planes we see in our sample
- compute_sample_statistic: a function that takes an array
and returns a summary statistic (e.g. max)
- num_trials: number of simulation trials
Output
Histogram of the results
"""
all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size,
compute_sample_statistic, num_trials)
results = Table().with_columns('Empirical Distribution, Statistic: ' + compute_sample_statistic.__name__,
all_outcomes)
plot = results.hist(bins=np.arange(1, 2 * N, 10))
plot.set_title("Empirical distribution\n"+
"Sample Size ="+str(sample_size)+
"\nNum trials ="+str(num_trials))
# Red line is the True N
plot.line(x=N, color='r', linestyle='--')
planes_empirical_statistic_distribution(10, max, 1000)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[11], line 1
----> 1 planes_empirical_statistic_distribution(10, max, 1000)
Cell In[10], line 14, in planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials)
1 def planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials):
2 """
3 Simulates multiple trials of a statistic on our simulated fleet of N planes
4
(...)
12 Histogram of the results
13 """
---> 14 all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size,
15 compute_sample_statistic, num_trials)
17 results = Table().with_columns('Empirical Distribution, Statistic: ' + compute_sample_statistic.__name__,
18 all_outcomes)
19 plot = results.hist(bins=np.arange(1, 2 * N, 10))
NameError: name 'simulate_sample_statistic' is not defined
Our second option for the statistic:
def twice_mean(sample):
"""Twice the sample mean"""
return 2 * np.mean(sample)
planes_empirical_statistic_distribution(10, twice_mean, 1000)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[12], line 5
2 """Twice the sample mean"""
3 return 2 * np.mean(sample)
----> 5 planes_empirical_statistic_distribution(10, twice_mean, 1000)
Cell In[10], line 14, in planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials)
1 def planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials):
2 """
3 Simulates multiple trials of a statistic on our simulated fleet of N planes
4
(...)
12 Histogram of the results
13 """
---> 14 all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size,
15 compute_sample_statistic, num_trials)
17 results = Table().with_columns('Empirical Distribution, Statistic: ' + compute_sample_statistic.__name__,
18 all_outcomes)
19 plot = results.hist(bins=np.arange(1, 2 * N, 10))
NameError: name 'simulate_sample_statistic' is not defined
2.Effects of Sample Size and Simulation Rounds#
First interaction: Let’s just look at the number of trials just for our twice mean statistic with everything else held constant.
interact(planes_empirical_statistic_distribution,
sample_size=Fixed(20),
compute_sample_statistic=Fixed(twice_mean),
num_trials=Slider(1, 20010, 10))
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[14], line 2
1 interact(planes_empirical_statistic_distribution,
----> 2 sample_size=Fixed(20),
3 compute_sample_statistic=Fixed(twice_mean),
4 num_trials=Slider(1, 20010, 10))
NameError: name 'Fixed' is not defined
Here are a few sample distrubtions with different numbers of trials.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[15], line 28
25 plot.line(x=N, color='r', linestyle='--')
26 plot.set_xlim(200,400)
---> 28 with Figure(1,3, sharey=True):
29 planes_empirical_statistic_distribution2(100, twice_mean, 10)
30 planes_empirical_statistic_distribution2(100, twice_mean, 100)
NameError: name 'Figure' is not defined
We can also examinge the impact of changing the sample size and N using the following function.
def visualize_distributions(N, sample_size, num_trials):
"""A single function to run our simulation for a given N, sample_size, and num_trials."""
import matplotlib.pyplot as plt
population = np.arange(1, N+1)
# Builds up our outcomes table one row at a time. We do this to ensure
# we can apply both statistics to the same samples.
outcomes_table = Table(["Max", "2*Mean"])
for i in np.arange(num_trials):
sample = np.random.choice(population, sample_size)
outcomes_table.append(make_array(max(sample), 2 * np.mean(sample)))
plot = outcomes_table.hist(bins=np.arange(N/2, 3*N/2, 2))
plot.set_title("Empirical distribution (N="+str(N)+")\n"+
"Sample size ="+str(sample_size)+
"\nNum trials ="+str(num_trials))
# Red line is the True N
plot.line(x=N, color='r', linestyle='--')
interact(visualize_distributions,
N=Slider(100, 450, 50),
sample_size=Slider(10, 101, 1),
num_trials=Slider(10, 10100, 100))
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[18], line 2
1 interact(visualize_distributions,
----> 2 N=Slider(100, 450, 50),
3 sample_size=Slider(10, 101, 1),
4 num_trials=Slider(10, 10100, 100))
NameError: name 'Slider' is not defined
Here’s a visualization of how sample size impacts the distribution of our statistic for random samples.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[19], line 24
21 for sample_size in np.arange(2,40,2):
22 yield locals()
---> 24 animate(visualize_distributions, gen, default_mode="reflect", interval=200)
NameError: name 'animate' is not defined
Here’s a visualization of how the number of trials impacts the distribution of our statistic for random samples.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[20], line 19
15 plot.set_ylim(0,0.07)
16 plot.line(x=N, color='r', linestyle='--')
---> 19 tens = make_array(10,20,30,40,50,60,70,80,90)
21 def gen():
22 N = 300
NameError: name 'make_array' is not defined
Here’s a visualization of how the parameter N impacts the distribution of our statistic for random samples.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[21], line 19
15 plot.set_ylim(0,0.07)
16 plot.line(x=N, color='r', linestyle='--')
---> 19 tens = make_array(10,20,30,40,50,60,70,80,90)
21 def gen():
22 sample_size=30
NameError: name 'make_array' is not defined
3. Mendel and Pea Flowers#
Looking ahead, we know we’ll be using our simulate sample statistc! So let’s build and check all these pieces.
simulate_sample_statistic(make_sample,
sample_size,
compute_statistic,
num_trials)
Observed sample: In Mendel’s one sample (his own garden), he had 929 second new generation pea plants, of which 709 had purple flowers. We compute the percent purple he observed:
observed_sample_size = 929
observed_purples = 709 / observed_sample_size * 100
observed_purples
76.31862217438106
Model: Mendel hypothesized (based on his preliminary theories of genetics) that he should have gotten 75% purple and 25% white.
hypothesized_purple_proportion = 0.75
hypothesized_proportions = make_array(hypothesized_purple_proportion,
1 - hypothesized_purple_proportion)
hypothesized_proportions
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[23], line 2
1 hypothesized_purple_proportion = 0.75
----> 2 hypothesized_proportions = make_array(hypothesized_purple_proportion,
3 1 - hypothesized_purple_proportion)
4 hypothesized_proportions
NameError: name 'make_array' is not defined
In the Python library reference, we see can use the function sample_proportions(sample_size, model_proportions).
sample = sample_proportions(observed_sample_size, hypothesized_proportions)
sample
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[24], line 1
----> 1 sample = sample_proportions(observed_sample_size, hypothesized_proportions)
2 sample
NameError: name 'sample_proportions' is not defined
Let’s put it into a function.
def flowers_make_sample(sample_size):
"""
Return the percents of purple flowers and white flowers in an array
"""
hypothesized_purple_proportion = 0.75
hypothesized_proportions = make_array(hypothesized_purple_proportion,
1 - hypothesized_purple_proportion)
sample = 100 * sample_proportions(sample_size, hypothesized_proportions)
return sample
flowers_make_sample(observed_sample_size)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[26], line 1
----> 1 flowers_make_sample(observed_sample_size)
Cell In[25], line 6, in flowers_make_sample(sample_size)
2 """
3 Return the percents of purple flowers and white flowers in an array
4 """
5 hypothesized_purple_proportion = 0.75
----> 6 hypothesized_proportions = make_array(hypothesized_purple_proportion,
7 1 - hypothesized_purple_proportion)
8 sample = 100 * sample_proportions(sample_size, hypothesized_proportions)
9 return sample
NameError: name 'make_array' is not defined
Each item in the array corresponds to the proportion of times it was sampled out of sample_size
times. So the proportion purple is
def stat_percent_purple(sample):
return sample.item(0)
stat_percent_purple(sample)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[27], line 4
1 def stat_percent_purple(sample):
2 return sample.item(0)
----> 4 stat_percent_purple(sample)
NameError: name 'sample' is not defined
Now let’s use our function simulate_sample_statistic
.
num_trials = 1000
all_outcomes = simulate_sample_statistic(flowers_make_sample, observed_sample_size,
stat_percent_purple, num_trials)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[28], line 3
1 num_trials = 1000
----> 3 all_outcomes = simulate_sample_statistic(flowers_make_sample, observed_sample_size,
4 stat_percent_purple, num_trials)
NameError: name 'simulate_sample_statistic' is not defined
results = Table().with_column('Percent purple flowers', all_outcomes)
results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[29], line 1
----> 1 results = Table().with_column('Percent purple flowers', all_outcomes)
2 results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
NameError: name 'Table' is not defined
Connecting these pieces together:
In Mendel’s model, he hypothesized getting purple flowers was like flipping a biased coin and getting heads 75% of the time.
We simulated outcomes under this hypothesis.
Now let’s check if the observed data (that there were 76.3% purple flowers in one sample, Mendel’s own garden) “fits” with the simulated outcomes under the model
plot = results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
plot.line(x=75,lw=4,linestyle="dashed")
plot.dot(observed_purples)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[30], line 1
----> 1 plot = results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
2 plot.line(x=75,lw=4,linestyle="dashed")
3 plot.dot(observed_purples)
NameError: name 'results' is not defined
differences_from_model = Table().with_column('abs(sample statistic - model parameter)',
abs(all_outcomes - 75))
diff_plot = differences_from_model.hist(title = 'Differences from Model \n Parameter (75% Purple)')
diff_plot.dot(abs(observed_purples - 75))
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[31], line 1
----> 1 differences_from_model = Table().with_column('abs(sample statistic - model parameter)',
2 abs(all_outcomes - 75))
4 diff_plot = differences_from_model.hist(title = 'Differences from Model \n Parameter (75% Purple)')
5 diff_plot.dot(abs(observed_purples - 75))
NameError: name 'Table' is not defined
Once again, let’s use computation to abstract away the variables that we have the power to change, and the parts of the code that never change. Here is a new function to do that.
def pea_plants_simulation(observed_sample_size, observed_purples_count,
hypothesized_purple_percent, num_trials):
"""
Parameters:
- observed_sample_size: number of plants in our experiment
- observed_purples_count: count of plants in our experiment w/ purple flowers
- hypothesized_purple_proportion: our model parameter (hypothesis for
proportion of plants will w/ purple flowers).
- num_trials: number of simulation rounds to run
Outputs two plots:
- Empirical distribution of the percent of plants w/ purple flowers
from our simulation trials
- Empirical distribution of how far off each trial was from our hypothesized model
"""
observed_purples = 100 * observed_purples_count / observed_sample_size
hypothesized_proportions = make_array(hypothesized_purple_percent/100,
1 - hypothesized_purple_percent/100)
assert 0 <= hypothesized_proportions[0] <= 1, str(hypothesized_proportions)
assert 0 <= hypothesized_proportions[1] <= 1, str(hypothesized_proportions)
all_outcomes = make_array()
for i in np.arange(0, num_trials):
simulated_sample = 100 * sample_proportions(observed_sample_size, hypothesized_proportions)
outcome = simulated_sample.item(0)
all_outcomes = np.append(all_outcomes, outcome)
#plots
with Figure(1,2):
percent_purple = Table().with_column('% of purple flowers', all_outcomes)
pp_plot = percent_purple.hist(title='Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
pp_plot.dot(observed_purples)
distance_from_model = Table().with_column('abs(sample statistic - model parameter)',
abs(all_outcomes - hypothesized_purple_percent))
dist_plot = distance_from_model.hist(title='Differences from Model \n Parameter (' + str(hypothesized_purple_percent) + '% Purple)')
dist_plot.dot(abs(observed_purples - hypothesized_purple_percent))
Here is a scenario to match Mendel’s garden. 705 out of 929 plants had purple flowers, and he proposed 75% should be purple.
pea_plants_simulation(929, 705, 75, 1000)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[33], line 1
----> 1 pea_plants_simulation(929, 705, 75, 1000)
Cell In[32], line 17, in pea_plants_simulation(observed_sample_size, observed_purples_count, hypothesized_purple_percent, num_trials)
3 """
4 Parameters:
5 - observed_sample_size: number of plants in our experiment
(...)
13 - Empirical distribution of how far off each trial was from our hypothesized model
14 """
16 observed_purples = 100 * observed_purples_count / observed_sample_size
---> 17 hypothesized_proportions = make_array(hypothesized_purple_percent/100,
18 1 - hypothesized_purple_percent/100)
20 assert 0 <= hypothesized_proportions[0] <= 1, str(hypothesized_proportions)
21 assert 0 <= hypothesized_proportions[1] <= 1, str(hypothesized_proportions)
NameError: name 'make_array' is not defined
Here is a different scenario. Suppose only 506 out of 929 plants had purple flowers, and but our model indicates that 75% should be purple.
pea_plants_simulation(929, 506, 75, 1000)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[34], line 1
----> 1 pea_plants_simulation(929, 506, 75, 1000)
Cell In[32], line 17, in pea_plants_simulation(observed_sample_size, observed_purples_count, hypothesized_purple_percent, num_trials)
3 """
4 Parameters:
5 - observed_sample_size: number of plants in our experiment
(...)
13 - Empirical distribution of how far off each trial was from our hypothesized model
14 """
16 observed_purples = 100 * observed_purples_count / observed_sample_size
---> 17 hypothesized_proportions = make_array(hypothesized_purple_percent/100,
18 1 - hypothesized_purple_percent/100)
20 assert 0 <= hypothesized_proportions[0] <= 1, str(hypothesized_proportions)
21 assert 0 <= hypothesized_proportions[1] <= 1, str(hypothesized_proportions)
NameError: name 'make_array' is not defined
interact(pea_plants_simulation,
observed_sample_size = Fixed(929),
observed_purples_count = Slider(0,929),
hypothesized_purple_percent = Slider(0,100,1),
num_trials=Slider(10, 5000, 100))
#
#
#
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[36], line 2
1 interact(pea_plants_simulation,
----> 2 observed_sample_size = Fixed(929),
3 observed_purples_count = Slider(0,929),
4 hypothesized_purple_percent = Slider(0,100,1),
5 num_trials=Slider(10, 5000, 100))
7 #
8 #
9 #
NameError: name 'Fixed' is not defined