Inference with Statistics#

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 from datascience import *
      2 from cs104 import *
      3 import numpy as np

ModuleNotFoundError: No module named 'datascience'

1. Snoopy’s Fleet#

We will create a simulation for the empirical distribution of our chosen statistic.

At true inference time, we do not know N. However, let’s create N now so that we can evaluate how “good” our chosen models and statistics could be.

N = 300

population = np.arange(1, N+1)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 population = np.arange(1, N+1)

NameError: name 'np' is not defined

Suppose we didn’t know N. Then all we could do is observe samples of planes flying overhead.

We’ll use our model assumption: Plane numbers are uniform random sample drawn with replacement from 1 to N.

def sample_plane_fleet(sample_size):
    return np.random.choice(population, sample_size)

sample = sample_plane_fleet(10)
sample

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 sample = sample_plane_fleet(10)
      2 sample

Cell In[5], line 2, in sample_plane_fleet(sample_size)
      1 def sample_plane_fleet(sample_size):
----> 2     return np.random.choice(population, sample_size)

NameError: name 'np' is not defined

Option 1: Sample Statistic is the max plane number#

Let’s choose a stastic as the max of as sample. Luckily, max is simple and we don’t have to write a new function here. We can use Python’s built-in function!

max(sample)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 max(sample)

NameError: name 'sample' is not defined

Let’s put this all together using our simulation algorithm.

sample_size = 10
num_trials = 1000

all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size, 
                                         max, num_trials)

results = Table().with_columns("Statistic Option 1: Max Sample", all_outcomes)
plot = results.hist(bins=np.arange(1, 2 * N, 10))
plot.set_title("Empirical distribution\n"+
               "Sample Size ="+str(sample_size)+ 
               "\nNum trials ="+str(num_trials))

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size, 
      2                                          max, num_trials)
      4 results = Table().with_columns("Statistic Option 1: Max Sample", all_outcomes)
      5 plot = results.hist(bins=np.arange(1, 2 * N, 10))

NameError: name 'simulate_sample_statistic' is not defined

Let’s generalize a bit and create function that takes the statistic and runs the whole experiment.

Option 2: Sample Statistic is Twice the Mean#

Let’s generalize a bit and create function that takes the statistic and runs the whole experiment

def planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials):
    """
    Simulates multiple trials of a statistic on our simulated fleet of N planes 
    
    Inputs 
        - sample_size: number of planes we see in our sample
        - compute_sample_statistic: a function that takes an array 
            and returns a summary statistic (e.g. max)
        - num_trials: number of simulation trials 
        
    Output 
        Histogram of the results 
    """
    all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size, 
                                             compute_sample_statistic, num_trials)
        
    results = Table().with_columns('Empirical Distribution, Statistic: ' + compute_sample_statistic.__name__, 
                                   all_outcomes)
    plot = results.hist(bins=np.arange(1, 2 * N, 10))
    plot.set_title("Empirical distribution\n"+
               "Sample Size ="+str(sample_size)+ 
               "\nNum trials ="+str(num_trials))
    
    # Red line is the True N    
    plot.line(x=N, color='r', linestyle='--')    

planes_empirical_statistic_distribution(10, max, 1000)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 planes_empirical_statistic_distribution(10, max, 1000)

Cell In[10], line 14, in planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials)
      1 def planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials):
      2     """
      3     Simulates multiple trials of a statistic on our simulated fleet of N planes 
      4     
   (...)
     12         Histogram of the results 
     13     """
---> 14     all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size, 
     15                                              compute_sample_statistic, num_trials)
     17     results = Table().with_columns('Empirical Distribution, Statistic: ' + compute_sample_statistic.__name__, 
     18                                    all_outcomes)
     19     plot = results.hist(bins=np.arange(1, 2 * N, 10))

NameError: name 'simulate_sample_statistic' is not defined

Our second option for the statistic:

def twice_mean(sample):
    """Twice the sample mean"""
    return 2 * np.mean(sample)

planes_empirical_statistic_distribution(10, twice_mean, 1000)    

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[12], line 5
      2     """Twice the sample mean"""
      3     return 2 * np.mean(sample)
----> 5 planes_empirical_statistic_distribution(10, twice_mean, 1000)    

Cell In[10], line 14, in planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials)
      1 def planes_empirical_statistic_distribution(sample_size, compute_sample_statistic, num_trials):
      2     """
      3     Simulates multiple trials of a statistic on our simulated fleet of N planes 
      4     
   (...)
     12         Histogram of the results 
     13     """
---> 14     all_outcomes = simulate_sample_statistic(sample_plane_fleet, sample_size, 
     15                                              compute_sample_statistic, num_trials)
     17     results = Table().with_columns('Empirical Distribution, Statistic: ' + compute_sample_statistic.__name__, 
     18                                    all_outcomes)
     19     plot = results.hist(bins=np.arange(1, 2 * N, 10))

NameError: name 'simulate_sample_statistic' is not defined

2.Effects of Sample Size and Simulation Rounds#

First interaction: Let’s just look at the number of trials just for our twice mean statistic with everything else held constant.

interact(planes_empirical_statistic_distribution,
         sample_size=Fixed(20), 
         compute_sample_statistic=Fixed(twice_mean),
         num_trials=Slider(1, 20010, 10))

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 2
      1 interact(planes_empirical_statistic_distribution,
----> 2          sample_size=Fixed(20), 
      3          compute_sample_statistic=Fixed(twice_mean),
      4          num_trials=Slider(1, 20010, 10))

NameError: name 'Fixed' is not defined

Here are a few sample distrubtions with different numbers of trials.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 28
     25     plot.line(x=N, color='r', linestyle='--')  
     26     plot.set_xlim(200,400)
---> 28 with Figure(1,3, sharey=True):
     29     planes_empirical_statistic_distribution2(100, twice_mean, 10)
     30     planes_empirical_statistic_distribution2(100, twice_mean, 100)    

NameError: name 'Figure' is not defined

We can also examinge the impact of changing the sample size and N using the following function.

def visualize_distributions(N, sample_size, num_trials):
    """A single function to run our simulation for a given N, sample_size, and num_trials."""
    import matplotlib.pyplot as plt
    
    population = np.arange(1, N+1)
    
    # Builds up our outcomes table one row at a time.  We do this to ensure
    # we can apply both statistics to the same samples.
    outcomes_table = Table(["Max", "2*Mean"])
    
    for i in np.arange(num_trials):
        sample = np.random.choice(population, sample_size)
        outcomes_table.append(make_array(max(sample), 2 * np.mean(sample)))
        
    plot = outcomes_table.hist(bins=np.arange(N/2, 3*N/2, 2))
    plot.set_title("Empirical distribution (N="+str(N)+")\n"+
               "Sample size ="+str(sample_size)+ 
               "\nNum trials ="+str(num_trials))
    # Red line is the True N
    plot.line(x=N, color='r', linestyle='--')

interact(visualize_distributions, 
         N=Slider(100, 450, 50), 
         sample_size=Slider(10, 101, 1), 
         num_trials=Slider(10, 10100, 100))

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 2
      1 interact(visualize_distributions, 
----> 2          N=Slider(100, 450, 50), 
      3          sample_size=Slider(10, 101, 1), 
      4          num_trials=Slider(10, 10100, 100))

NameError: name 'Slider' is not defined

Here’s a visualization of how sample size impacts the distribution of our statistic for random samples.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[19], line 24
     21     for sample_size in np.arange(2,40,2):
     22         yield locals()
---> 24 animate(visualize_distributions, gen, default_mode="reflect", interval=200)

NameError: name 'animate' is not defined

Here’s a visualization of how the number of trials impacts the distribution of our statistic for random samples.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 19
     15     plot.set_ylim(0,0.07)
     16     plot.line(x=N, color='r', linestyle='--')    
---> 19 tens = make_array(10,20,30,40,50,60,70,80,90)
     21 def gen():
     22     N = 300

NameError: name 'make_array' is not defined

Here’s a visualization of how the parameter N impacts the distribution of our statistic for random samples.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[21], line 19
     15     plot.set_ylim(0,0.07)
     16     plot.line(x=N, color='r', linestyle='--')    
---> 19 tens = make_array(10,20,30,40,50,60,70,80,90)
     21 def gen():
     22     sample_size=30

NameError: name 'make_array' is not defined

3. Mendel and Pea Flowers#

Looking ahead, we know we’ll be using our simulate sample statistc! So let’s build and check all these pieces.

simulate_sample_statistic(make_sample,
   sample_size,
   compute_statistic,
   num_trials)

Observed sample: In Mendel’s one sample (his own garden), he had 929 second new generation pea plants, of which 709 had purple flowers. We compute the percent purple he observed:

observed_sample_size = 929
observed_purples = 709 / observed_sample_size * 100
observed_purples

76.31862217438106

Model: Mendel hypothesized (based on his preliminary theories of genetics) that he should have gotten 75% purple and 25% white.

hypothesized_purple_proportion = 0.75
hypothesized_proportions = make_array(hypothesized_purple_proportion, 
                                      1 - hypothesized_purple_proportion) 
hypothesized_proportions

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[23], line 2
      1 hypothesized_purple_proportion = 0.75
----> 2 hypothesized_proportions = make_array(hypothesized_purple_proportion, 
      3                                       1 - hypothesized_purple_proportion) 
      4 hypothesized_proportions

NameError: name 'make_array' is not defined

In the Python library reference, we see can use the function sample_proportions(sample_size, model_proportions).

sample = sample_proportions(observed_sample_size, hypothesized_proportions)
sample

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 sample = sample_proportions(observed_sample_size, hypothesized_proportions)
      2 sample

NameError: name 'sample_proportions' is not defined

Let’s put it into a function.

def flowers_make_sample(sample_size): 
    """
    Return the percents of purple flowers and white flowers in an array
    """
    hypothesized_purple_proportion = 0.75
    hypothesized_proportions = make_array(hypothesized_purple_proportion, 
                                          1 - hypothesized_purple_proportion) 
    sample = 100 * sample_proportions(sample_size, hypothesized_proportions)
    return sample

flowers_make_sample(observed_sample_size)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 flowers_make_sample(observed_sample_size)

Cell In[25], line 6, in flowers_make_sample(sample_size)
      2 """
      3 Return the percents of purple flowers and white flowers in an array
      4 """
      5 hypothesized_purple_proportion = 0.75
----> 6 hypothesized_proportions = make_array(hypothesized_purple_proportion, 
      7                                       1 - hypothesized_purple_proportion) 
      8 sample = 100 * sample_proportions(sample_size, hypothesized_proportions)
      9 return sample

NameError: name 'make_array' is not defined

Each item in the array corresponds to the proportion of times it was sampled out of sample_size times. So the proportion purple is

def stat_percent_purple(sample): 
    return sample.item(0)

stat_percent_purple(sample)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[27], line 4
      1 def stat_percent_purple(sample): 
      2     return sample.item(0)
----> 4 stat_percent_purple(sample)

NameError: name 'sample' is not defined

Now let’s use our function simulate_sample_statistic.

num_trials = 1000

all_outcomes = simulate_sample_statistic(flowers_make_sample, observed_sample_size,
                                         stat_percent_purple, num_trials)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[28], line 3
      1 num_trials = 1000
----> 3 all_outcomes = simulate_sample_statistic(flowers_make_sample, observed_sample_size,
      4                                          stat_percent_purple, num_trials)

NameError: name 'simulate_sample_statistic' is not defined

results = Table().with_column('Percent purple flowers', all_outcomes)
results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[29], line 1
----> 1 results = Table().with_column('Percent purple flowers', all_outcomes)
      2 results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))

NameError: name 'Table' is not defined

Connecting these pieces together:

In Mendel’s model, he hypothesized getting purple flowers was like flipping a biased coin and getting heads 75% of the time.
We simulated outcomes under this hypothesis.
Now let’s check if the observed data (that there were 76.3% purple flowers in one sample, Mendel’s own garden) “fits” with the simulated outcomes under the model

plot = results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
plot.line(x=75,lw=4,linestyle="dashed")
plot.dot(observed_purples)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[30], line 1
----> 1 plot = results.hist(title = 'Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
      2 plot.line(x=75,lw=4,linestyle="dashed")
      3 plot.dot(observed_purples)

NameError: name 'results' is not defined

differences_from_model = Table().with_column('abs(sample statistic - model parameter)', 
                                             abs(all_outcomes - 75))

diff_plot = differences_from_model.hist(title = 'Differences from Model \n Parameter (75% Purple)')
diff_plot.dot(abs(observed_purples - 75))

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[31], line 1
----> 1 differences_from_model = Table().with_column('abs(sample statistic - model parameter)', 
      2                                              abs(all_outcomes - 75))
      4 diff_plot = differences_from_model.hist(title = 'Differences from Model \n Parameter (75% Purple)')
      5 diff_plot.dot(abs(observed_purples - 75))

NameError: name 'Table' is not defined

Once again, let’s use computation to abstract away the variables that we have the power to change, and the parts of the code that never change. Here is a new function to do that.

def pea_plants_simulation(observed_sample_size, observed_purples_count, 
                          hypothesized_purple_percent, num_trials): 
    """
    Parameters:
        - observed_sample_size: number of plants in our experiment
        - observed_purples_count: count of plants in our experiment w/ purple flowers
        - hypothesized_purple_proportion: our model parameter (hypothesis for 
              proportion of plants will w/ purple flowers).
        - num_trials: number of simulation rounds to run
    Outputs two plots:
        - Empirical distribution of the percent of plants w/ purple flowers 
          from our simulation trials
        - Empirical distribution of how far off each trial was from our hypothesized model
    """
    
    observed_purples = 100 * observed_purples_count / observed_sample_size
    hypothesized_proportions = make_array(hypothesized_purple_percent/100, 
                                          1 - hypothesized_purple_percent/100)
    
    assert 0 <= hypothesized_proportions[0] <= 1, str(hypothesized_proportions)
    assert 0 <= hypothesized_proportions[1] <= 1, str(hypothesized_proportions)
    
    all_outcomes = make_array()
    for i in np.arange(0, num_trials):
        simulated_sample = 100 * sample_proportions(observed_sample_size, hypothesized_proportions)
        outcome = simulated_sample.item(0)
        all_outcomes = np.append(all_outcomes, outcome)
    
    #plots
    with Figure(1,2):
        percent_purple = Table().with_column('% of purple flowers', all_outcomes)
        pp_plot = percent_purple.hist(title='Simulated Outcomes\n observed sample size=' + str(observed_sample_size))
        pp_plot.dot(observed_purples)

        distance_from_model = Table().with_column('abs(sample statistic - model parameter)', 
                                                  abs(all_outcomes - hypothesized_purple_percent))
        dist_plot = distance_from_model.hist(title='Differences from Model \n Parameter (' + str(hypothesized_purple_percent) + '% Purple)')
        dist_plot.dot(abs(observed_purples - hypothesized_purple_percent))

Here is a scenario to match Mendel’s garden. 705 out of 929 plants had purple flowers, and he proposed 75% should be purple.

pea_plants_simulation(929, 705, 75, 1000)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[33], line 1
----> 1 pea_plants_simulation(929, 705, 75, 1000)

Cell In[32], line 17, in pea_plants_simulation(observed_sample_size, observed_purples_count, hypothesized_purple_percent, num_trials)
      3 """
      4 Parameters:
      5     - observed_sample_size: number of plants in our experiment
   (...)
     13     - Empirical distribution of how far off each trial was from our hypothesized model
     14 """
     16 observed_purples = 100 * observed_purples_count / observed_sample_size
---> 17 hypothesized_proportions = make_array(hypothesized_purple_percent/100, 
     18                                       1 - hypothesized_purple_percent/100)
     20 assert 0 <= hypothesized_proportions[0] <= 1, str(hypothesized_proportions)
     21 assert 0 <= hypothesized_proportions[1] <= 1, str(hypothesized_proportions)

NameError: name 'make_array' is not defined

Here is a different scenario. Suppose only 506 out of 929 plants had purple flowers, and but our model indicates that 75% should be purple.

pea_plants_simulation(929, 506, 75, 1000)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 pea_plants_simulation(929, 506, 75, 1000)

Cell In[32], line 17, in pea_plants_simulation(observed_sample_size, observed_purples_count, hypothesized_purple_percent, num_trials)
      3 """
      4 Parameters:
      5     - observed_sample_size: number of plants in our experiment
   (...)
     13     - Empirical distribution of how far off each trial was from our hypothesized model
     14 """
     16 observed_purples = 100 * observed_purples_count / observed_sample_size
---> 17 hypothesized_proportions = make_array(hypothesized_purple_percent/100, 
     18                                       1 - hypothesized_purple_percent/100)
     20 assert 0 <= hypothesized_proportions[0] <= 1, str(hypothesized_proportions)
     21 assert 0 <= hypothesized_proportions[1] <= 1, str(hypothesized_proportions)

NameError: name 'make_array' is not defined

interact(pea_plants_simulation,  
         observed_sample_size = Fixed(929), 
         observed_purples_count = Slider(0,929), 
         hypothesized_purple_percent = Slider(0,100,1),
         num_trials=Slider(10, 5000, 100))

#
#
#

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[36], line 2
      1 interact(pea_plants_simulation,  
----> 2          observed_sample_size = Fixed(929), 
      3          observed_purples_count = Slider(0,929), 
      4          hypothesized_purple_percent = Slider(0,100,1),
      5          num_trials=Slider(10, 5000, 100))
      7 #
      8 #
      9 #

NameError: name 'Fixed' is not defined

CSCI 104: Data Science and Computing for All

Inference with Statistics