Bootstrapping#

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

1. Salary Data#

This is a dataset of salaries that came from a real-world survey. You can read more about this dataset here or here.

salaries = Table().read_table("data/salaries_clean.csv")
salaries = salaries.with_columns('Job title', salaries.apply(str.lower, 'Job title'))
salaries.show(5)
Job title | Yearly Salary (USD) | Years Work Experience
research and instruction librarian | 55000 | 5-7 years
marketing specialist | 34000 | 2 - 4 years
program manager | 62000 | 8 - 10 years
accounting manager | 60000 | 8 - 10 years
scholarly publishing librarian | 62000 | 8 - 10 years

... (23224 rows omitted)

data_jobs = salaries.where("Job title", are.containing("data"))
data_jobs.sample(5)
Job title | Yearly Salary (USD) | Years Work Experience
data analyst | 100000 | 8 - 10 years
principal project data manager | 124000 | 21 - 30 years
sr database administrator | 106000 | 11 - 20 years
member data specialist | 38000 | 2 - 4 years
product owner: data, reporting, & analytics | 102550 | 8 - 10 years
data_jobs.num_rows
506
data_job_salaries = data_jobs.column("Yearly Salary (USD)")

Mean and median salaries#

The mean is the “balancing point” of a histogram. The median is the “half-way point” of the data. For symmetric distributions, they are very close, but as a distribution becomes skewed toward larger or smaller values, they can become quite different:

../_images/24-bootstrapping_11_0.png
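
To see the same effect with numbers, here is a small sketch using made-up salary values (they are not taken from the survey data). Replacing one value with an extreme salary pulls the mean upward, while the median does not move.

```python
import numpy as np

# Hypothetical, made-up salaries (not from the survey data).
symmetric = np.array([40000, 50000, 60000, 70000, 80000])
# Same values, except the largest is replaced by an extreme salary.
skewed = np.array([40000, 50000, 60000, 70000, 500000])

print("symmetric: mean =", np.mean(symmetric), " median =", np.median(symmetric))
print("skewed:    mean =", np.mean(skewed), " median =", np.median(skewed))
```

In the symmetric case both statistics are 60000. In the skewed case the median stays at 60000 but the mean rises to 144000.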

Let’s explore the mean and median salaries for data jobs.

data_salary_mean = np.mean(data_job_salaries)
data_salary_mean
99742.75494071146
data_salary_median = np.median(data_job_salaries)
data_salary_median
92000.0

Here’s the mean (triangle) and median (circle) for our ‘Jobs with Data’ sample.

plot = data_jobs.hist("Yearly Salary (USD)")
plot.set_title("Salary of 'Jobs with Data' \n Sample Size="+str(data_jobs.num_rows))
plot.dot(data_salary_median)
plot.dot(data_salary_mean, marker='^')
../_images/24-bootstrapping_16_0.png

2. Bootstrapping#

Given our sample, can we estimate the median salary for data jobs for the entire population?

Let’s draw one resample by sampling with replacement from our original sample.

print("Size of original sample", len(data_job_salaries))
Size of original sample 506
simulated_resample = np.random.choice(data_job_salaries, len(data_job_salaries))
print("Size after we resampled the original sample=", len(simulated_resample))
Size after we resampled the original sample= 506
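
Because `np.random.choice` samples with replacement by default, a resample has the same size as the original, but it generally repeats some values and leaves others out. Here is a minimal sketch on a tiny made-up array (the values and the seed are assumptions chosen only for illustration):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the sketch is repeatable

tiny_sample = np.array([10, 20, 30, 40, 50, 60, 70, 80])  # made-up values

# replace=True is the default; it is spelled out here for emphasis.
tiny_resample = np.random.choice(tiny_sample, len(tiny_sample), replace=True)

print("original:", tiny_sample)
print("resample:", tiny_resample)
print("distinct values in resample:", len(np.unique(tiny_resample)))
```

Every value in the resample comes from the original array, so the number of distinct values can never exceed the original size, and it is usually smaller.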
# Run many times 
simulated_resample = np.random.choice(data_job_salaries, len(data_job_salaries))
median = np.median(simulated_resample)

table = Table().with_columns("Yearly Salary (USD)", simulated_resample)
plot = table.hist("Yearly Salary (USD)", bins=np.arange(0,300000,25000))
plot.set_title("Resample -- Salary of 'Jobs with Data' \n Median="+str(median))
plot.dot(median)
../_images/24-bootstrapping_22_0.png

Here are some more resamples.

../_images/24-bootstrapping_24_0.png

Let’s build a new simulation function to compute a statistic for many resamples. We’ll write it here, and it’s in our library for you to use.

def bootstrap_statistic(sample, compute_statistic, num_trials): 
    """
    Creates num_trials resamples of the initial sample.
    Returns an array of the provided statistic for those samples.

    * sample: the initial sample, as an array.
    
    * compute_statistic: a function that takes a sample as 
                         an array and returns the statistic for that
                         sample. 
    
    * num_trials: the number of bootstrap samples to create.

    """
    statistics = make_array()
    
    for i in np.arange(0, num_trials): 
        #Key: in bootstrapping we must always sample with replacement 
        simulated_resample = np.random.choice(sample, len(sample))
        
        resample_statistic = compute_statistic(simulated_resample)
        statistics = np.append(statistics, resample_statistic)
    
    return statistics
results = bootstrap_statistic(data_job_salaries, np.median, 1000)
table = Table().with_columns("Yearly Salary (USD)", results)
plot = table.hist("Yearly Salary (USD)", bins=np.arange(87000,100000,1000))
plot.set_title("Bootstrap 1000 Times \n Sample Size="+str(data_jobs.num_rows))
plot.dot(data_salary_median)
../_images/24-bootstrapping_28_0.png
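
Since `compute_statistic` is just a function, the same loop works for other statistics, such as the mean. The sketch below is a NumPy-only variant of the function above that fills a preallocated array instead of appending. It runs on made-up, right-skewed data rather than the survey salaries. The name `bootstrap_statistic_np` and the synthetic data are assumptions for illustration; they are not part of the course library.

```python
import numpy as np

def bootstrap_statistic_np(sample, compute_statistic, num_trials):
    """Hypothetical NumPy-only variant: returns num_trials resample statistics."""
    statistics = np.empty(num_trials)
    for i in range(num_trials):
        # Sample with replacement, keeping the original sample size.
        resample = np.random.choice(sample, len(sample))
        statistics[i] = compute_statistic(resample)
    return statistics

np.random.seed(104)  # arbitrary seed so the sketch is repeatable

# Made-up, right-skewed "salaries" (not the survey data).
fake_salaries = np.random.lognormal(mean=11.4, sigma=0.4, size=500)

boot_medians = bootstrap_statistic_np(fake_salaries, np.median, 1000)
boot_means = bootstrap_statistic_np(fake_salaries, np.mean, 1000)

print("sample median:", np.median(fake_salaries))
print("range of resample medians:", boot_medians.min(), "to", boot_medians.max())
print("sample mean:", np.mean(fake_salaries))
print("range of resample means:", boot_means.min(), "to", boot_means.max())
```

Each resample statistic lands near the corresponding statistic of the original sample, which is why the histogram above is centered near the sample median.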

The histogram above captures the variability of the median salary across our resamples. That variability approximates the variability we would see if we repeatedly drew new samples from the whole population. In the next lecture, we’ll see how to quantify this variability when estimating the median salary for the whole population.
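
We can check the link between resampling variability and sampling variability in a setting where, unlike with real surveys, we have the whole population. The sketch below builds a made-up population (an assumption for illustration, not the survey data), then compares the spread of medians from many fresh samples against the spread of medians from resamples of a single sample.

```python
import numpy as np

rng = np.random.default_rng(104)  # arbitrary seed so the sketch is repeatable

# Made-up, right-skewed "population" of salaries (assumption for illustration).
population = rng.lognormal(mean=11.4, sigma=0.4, size=100_000)
sample_size = 500
num_trials = 1000

# (a) Medians of many fresh samples drawn from the population.
#     In practice we usually cannot do this, since we only have one sample.
population_medians = np.array([
    np.median(rng.choice(population, sample_size, replace=False))
    for _ in range(num_trials)
])

# (b) Medians of many resamples (with replacement) of one single sample.
one_sample = rng.choice(population, sample_size, replace=False)
bootstrap_medians = np.array([
    np.median(rng.choice(one_sample, sample_size, replace=True))
    for _ in range(num_trials)
])

print("SD of medians, fresh samples:", np.std(population_medians))
print("SD of medians, bootstrap:    ", np.std(bootstrap_medians))
```

The two spreads should be of similar size, though not identical. Note that the bootstrap medians are centered on the median of `one_sample`, not on the population median; it is the spread that carries over, not the center.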