Probability and Sampling

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

1. Distributions

Probability Distribution

We can use probability rules to write down analytically how often each possible value is expected to occur, which gives a probability distribution like the following:

# Sums of all possible combinations of two dice rolls.
# (The first few entries illustrate how we constructed these combinations.)
outcomes = make_array(1+1,1+2,2+1,1+3,2+2,3+1,5,5,5,5,6,6,6,6,6,
                      7,7,7,7,7,7,8,8,8,8,8,9,9,9,9,10,10,10,11,11,12)
outcome_bins = np.arange(1.5, 13.5, 1)
plot = Table().with_columns('Sum of two dice rolls', outcomes).hist(bins=outcome_bins)
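
The same distribution can also be derived without typing the 36 sums by hand. The following is a minimal sketch that uses only NumPy (it does not rely on the datascience library); the names faces, sums, and probabilities are chosen here for illustration.

```python
import numpy as np

# Enumerate all 36 equally likely (first die, second die) pairs.
faces = np.arange(1, 7)
sums = np.add.outer(faces, faces).ravel()   # 36 sums, ranging from 2 to 12

# Count how many pairs produce each sum, then divide by the number of pairs.
values, counts = np.unique(sums, return_counts=True)
probabilities = counts / sums.size

for value, prob in zip(values, probabilities):
    print(f"P(sum = {value:2d}) = {prob:.4f}")
```

Because every pair is equally likely, the probability of a sum is just the number of pairs that produce it divided by 36.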

Empirical Distribution

dice = np.arange(1,7)
dice

Let’s roll the dice twice and add the values.

two_dice = np.random.choice(dice, 2)
print('two dice=', two_dice)
print('sum=', sum(two_dice))

Let’s put this together in a function that we can pass to simulate as an input.

def sum_two_dice(): 
    dice = np.arange(1,7)
    two_dice = np.random.choice(dice, 2)
    return sum(two_dice)

Use simulate (from our inference library) to create an empirical distribution.

def simulate(make_one_outcome, num_trials):
    """
    Return an array of num_trials values, each 
    of which was created by calling make_one_outcome().
    """
    outcomes = make_array()
    for i in np.arange(0, num_trials):
        outcome = make_one_outcome()
        outcomes = np.append(outcomes, outcome)

    return outcomes

num_trials = 10 
simulate(sum_two_dice, num_trials)
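
If the datascience library is not installed, a rough equivalent of simulate can be written with NumPy alone. This is only a sketch under that assumption, not the course library's implementation; simulate_np, sum_two_dice_np, and the seeded rng are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the example is repeatable

def sum_two_dice_np():
    """Roll two six-sided dice and return their sum."""
    return int(rng.integers(1, 7, size=2).sum())  # upper bound 7 is exclusive

def simulate_np(make_one_outcome, num_trials):
    """Return an array of num_trials values, each created by calling make_one_outcome()."""
    return np.array([make_one_outcome() for _ in range(num_trials)])

outcomes = simulate_np(sum_two_dice_np, 10)
print(outcomes)
```

Collecting the outcomes in a list and converting once avoids the repeated copying that np.append performs on every iteration, which starts to matter for large num_trials.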

num_trials = 2000 
all_outcomes = simulate(sum_two_dice, num_trials)

simulated_results = Table().with_column('Sum of two dice rolls', all_outcomes)
simulated_results

plot = simulated_results.hist(bins=outcome_bins)
plot.set_title('Empirical (approximate) distribution \n num_trials='+str(num_trials));
plot.set_ylim(0,0.175)

Law of Averages

In our simulation, there is one parameter that we can control: num_trials. Does this parameter matter?

To find out, we can write a function that takes as input the num_trials parameter.

def simulate_and_plot_summing_two_dice(num_trials):
    """
    Simulates rolling two dice, repeats this num_trials times, and
    plots the empirical distribution.
    """
    all_outcomes = simulate(sum_two_dice, num_trials)
    simulated_results = Table().with_column('Sum of two dice rolls', all_outcomes)

    outcome_bins = np.arange(1.5, 13.5, 1)
    plot = simulated_results.hist(bins=outcome_bins)
    plot.set_title('Empirical (approximate) distribution \n num_trials='+str(num_trials))
    plot.set_ylim(0,0.18)

simulate_and_plot_summing_two_dice(2000)
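
One way to put a number on how much num_trials matters is to measure how far the empirical distribution is from the exact one. The sketch below uses only NumPy and assumes total variation distance as the measure of closeness; that choice, the seed, and the helper names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=104)  # arbitrary seed, for repeatability

# Exact probabilities for the sums 2..12: 1/36, 2/36, ..., 6/36, ..., 1/36.
exact = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36

def empirical_distribution(num_trials):
    """Roll two dice num_trials times and return the proportion of each sum 2..12."""
    rolls = rng.integers(1, 7, size=(num_trials, 2)).sum(axis=1)
    counts = np.bincount(rolls, minlength=13)[2:]  # drop the unused bins 0 and 1
    return counts / num_trials

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    return np.abs(p - q).sum() / 2

for num_trials in [10, 100, 1000, 10000, 100000]:
    tvd = total_variation_distance(empirical_distribution(num_trials), exact)
    print(f"num_trials={num_trials:>6}: distance from exact = {tvd:.4f}")
```

The distance generally shrinks as num_trials grows, although any single run can fluctuate because the rolls are random.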

Here are a couple of plots for different numbers of trials:

[Figure: histograms of the empirical distribution for several values of num_trials, including 100 and 500, side by side with a shared y-axis.]

interact(simulate_and_plot_summing_two_dice, num_trials = Slider(1,1000))

Here is an animation also showing how the number of trials impacts the result for a wider range of values.

[Animation: the empirical distribution of the sum of two dice as num_trials grows from 10 to 10,000.]
2. Random Sampling: Florida Votes in 2016

Load data for voting in Florida in 2016. These data give us the true parameters, as if we had been able to poll every person who turned out to vote:

Proportion voting for (Trump, Clinton, Johnson, other) = (0.49, 0.478, 0.022, 0.01)
Raw counts:
- Trump: 4,617,886
- Clinton: 4,504,975
- Johnson: 207,043
- Other: 90,135

The data are based on the actual votes cast in the election.

votes = Table().read_table('data/florida_2016.csv')
votes = votes.with_column('Vote', votes.apply(make_array("Trump", "Clinton", "Johnson", "Other").item, "Vote"))
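
If data/florida_2016.csv is not at hand, a stand-in population can be rebuilt from the raw counts listed above. This is a sketch that assumes one entry per voter and integer codes 0 to 3 for the four candidates, in the same order as the lookup array in the cell above; it uses NumPy only and is not the course's data file.

```python
import numpy as np

# Raw counts quoted above, in the order Trump, Clinton, Johnson, Other.
candidates = np.array(["Trump", "Clinton", "Johnson", "Other"])
counts = np.array([4_617_886, 4_504_975, 207_043, 90_135])

# One small integer per voter keeps the roughly 9.4 million entries compact.
population = np.repeat(np.arange(4, dtype=np.int8), counts)

print("number of voters:", population.size)
for code, name in enumerate(candidates):
    # np.mean of a boolean array is the proportion of True values.
    print(f"{name:8s} {np.mean(population == code):.3f}")
```

The printed proportions should match the ones quoted above up to rounding.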

votes.show(5)

Here’s the total number of votes cast in the election.

votes.num_rows

We can pick a “convenience sample”: the first 10 voters who show up in line.

votes.take(np.arange(10))

Since we are analyzing this after the election, we actually know the votes for the full population and we can compute the true parameter.

sum(votes.column('Vote') == 'Trump') / votes.num_rows

But suppose this is before the election and we actually can’t ask every person in the state how they will vote…

In that case, we can imagine we are a pollster, and sample 50 people.

We can use .sample(n) to randomly sample n rows from a table.

sample = votes.sample(50)
sample

sum(sample.column('Vote') == 'Trump') / sample.num_rows

Let’s write functions to do this!

- A function that takes a sample.
- A function that computes the statistic (the proportion of the sample that voted for Trump).

def sample_votes(sample_size): 
    return votes.sample(sample_size)

def proportion_vote_trump(sample): 
    return sum(sample.column('Vote') == 'Trump') / sample.num_rows

sample = sample_votes(100)
proportion_vote_trump(sample)

proportion_vote_trump(sample_votes(100))

proportion_vote_trump(sample_votes(1000))

So far, we’ve been using a simulate function. Let’s extend this to a function that can also take a sample size. We’ll call this function simulate_sample_statistic.

def simulate_sample_statistic(make_one_sample, sample_size,
                              compute_sample_statistic, num_trials):
    """
    Simulates num_trials sampling steps and returns an array of the
    statistic for those samples.  The parameters are:

    - make_one_sample: a function that takes an integer n and returns a 
                   sample as an array of n elements.
    
    - sample_size: the size of the samples to use in the simulation.
    
    - compute_sample_statistic: a function that takes a sample as 
                         an array and returns the statistic for that sample. 
    
    - num_trials: the number of simulation steps to perform.
    """

    simulated_statistics = make_array()
    for i in np.arange(0, num_trials):
        simulated_sample = make_one_sample(sample_size)
        sample_statistic = compute_sample_statistic(simulated_sample)
        simulated_statistics = np.append(simulated_statistics, sample_statistic)
    return simulated_statistics

Let’s use our simulation algorithm to create an empirical distribution.

Suppose there are 1,000 polling companies and each uses a sample of 100 people.

num_trials = 1000 #1,000 polling companies
sample_size = 100 #100 people sampled by each polling company 

all_outcomes = simulate_sample_statistic(sample_votes, sample_size,
                                         proportion_vote_trump, num_trials)
all_outcomes
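
The same pipeline can be sketched without the datascience library or the CSV file. The version below is an illustration only: it assumes that drawing a voter at random from the full population is equivalent to drawing a candidate code with probability equal to that candidate's share of the raw counts, and the _np function names are hypothetical stand-ins for the functions defined above.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for repeatability

# Candidate shares computed from the raw counts (Trump, Clinton, Johnson, Other).
counts = np.array([4_617_886, 4_504_975, 207_043, 90_135])
proportions = counts / counts.sum()

def sample_votes_np(sample_size):
    """Return sample_size candidate codes (0..3), drawn with replacement."""
    return rng.choice(4, size=sample_size, p=proportions)

def proportion_vote_trump_np(sample):
    """Proportion of the sample with code 0 (Trump)."""
    return np.mean(sample == 0)

def simulate_sample_statistic_np(make_one_sample, sample_size,
                                 compute_sample_statistic, num_trials):
    """Return an array holding the statistic from each of num_trials samples."""
    return np.array([compute_sample_statistic(make_one_sample(sample_size))
                     for _ in range(num_trials)])

# 1,000 hypothetical polling companies, each sampling 100 people.
all_outcomes_np = simulate_sample_statistic_np(sample_votes_np, 100,
                                               proportion_vote_trump_np, 1000)
print("mean of the sample proportions:", all_outcomes_np.mean())
print("smallest and largest:", all_outcomes_np.min(), all_outcomes_np.max())
```

Individual polls can land well away from the true proportion, even though their average sits close to it.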

simulated_results = Table().with_column('Proportion voting for Trump', all_outcomes)
plot = simulated_results.hist()

title = 'Empirical (approximate) distribution \n num_trials='+str(num_trials)+ '\n sample_size='+str(sample_size)
plot.set_title(title)

Let’s make a function with our two free parameters, num_trials and sample_size.

def simulate_and_plot_trump_pollster(num_trials, sample_size): 
    all_outcomes = simulate_sample_statistic(sample_votes, sample_size,
                        proportion_vote_trump, num_trials)
    simulated_results = Table().with_column('Proportion voting for Trump', all_outcomes)
    plot = simulated_results.hist(bins=np.arange(0.3,0.71,0.025))
    title = 'Empirical (approximate) distribution \n num_trials='+str(num_trials)+ '\n sample_size='+str(sample_size)
    plot.set_title(title)    

Here are a few choices for parameters. Notice how each impacts the resulting histogram.

[Figure: a 2x2 grid of histograms, with shared axes, for different choices of num_trials and sample_size.]
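
The effect of sample_size can also be checked numerically. The sketch below takes a shortcut that is an assumption of this example rather than the method used above: the number of Trump voters in a sample drawn with replacement follows a binomial distribution, so rng.binomial can generate all the polls at once. It compares the spread of the simulated proportions with the textbook value sqrt(p(1-p)/n).

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed, for repeatability
p_trump = 4_617_886 / 9_420_039      # true proportion, from the raw counts

def simulated_proportions(num_trials, sample_size):
    """Sample proportions for num_trials polls of sample_size voters each."""
    return rng.binomial(sample_size, p_trump, size=num_trials) / sample_size

for sample_size in [10, 100, 1000, 5000]:
    outcomes = simulated_proportions(num_trials=1000, sample_size=sample_size)
    predicted_sd = np.sqrt(p_trump * (1 - p_trump) / sample_size)
    print(f"sample_size={sample_size:>4}: mean={outcomes.mean():.3f}, "
          f"SD={outcomes.std():.4f}, sqrt(p(1-p)/n)={predicted_sd:.4f}")
```

Increasing num_trials mostly makes the histogram smoother, while increasing sample_size makes it narrower; the spread shrinks in proportion to 1/sqrt(sample_size), so each further gain in precision needs a much larger sample.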

interact(simulate_and_plot_trump_pollster, 
         num_trials = Choice(make_array(1,10,100,1000,5000)), 
         sample_size = Choice(make_array(1,10,100,1000,5000)))

As another way to look at it, here’s a visualization showing the empirical distribution for a small sample size with varying numbers of trials.

[Interactive plot: num_trials ranges from 10 to 10,000 with sample_size fixed at 100.]

And here’s one showing the empirical distribution for varying sample sizes.

[Interactive plot: sample_size ranges from 10 to 5,000 with num_trials fixed at 1,000.]

And here’s an animation showing the empirical distribution for varying sample sizes.

Big-picture questions about sampling:

Why wouldn’t we always just take really big samples, since they converge to the true distribution?

Big-picture questions about simulations:

What are we abstracting away when we’re writing code? What are we re-using over and over?

CSCI 104: Data Science and Computing for All