Bootstrapping
Bootstrapping#
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline
1. Salary Data#
This is a dataset of salaries that came from a real-world survey. You can read more about this dataset here or here.
salaries = Table.read_table("data/salaries_clean.csv")
salaries = salaries.with_columns('Job title', salaries.apply(str.lower, 'Job title'))
salaries.show(5)
Job title | Yearly Salary (USD) | Years Work Experience |
---|---|---|
research and instruction librarian | 55000 | 5-7 years |
marketing specialist | 34000 | 2 - 4 years |
program manager | 62000 | 8 - 10 years |
accounting manager | 60000 | 8 - 10 years |
scholarly publishing librarian | 62000 | 8 - 10 years |
... (23224 rows omitted)
data_jobs = salaries.where("Job title", are.containing("data"))
data_jobs.sample(5)
Job title | Yearly Salary (USD) | Years Work Experience |
---|---|---|
data analyst | 100000 | 8 - 10 years |
principal project data manager | 124000 | 21 - 30 years |
sr database administrator | 106000 | 11 - 20 years |
member data specialist | 38000 | 2 - 4 years |
product owner: data, reporting, & analytics | 102550 | 8 - 10 years |
data_jobs.num_rows
506
data_job_salaries = data_jobs.column("Yearly Salary (USD)")
Mean and median salaries#
The mean is the “balancing point” of a histogram. The median is the “half-way point” of the data. For symmetric distributions, they are very close, but as a distribution becomes skewed toward larger or smaller values, they can become quite different.
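To make the difference concrete, here is a minimal sketch using two small made-up arrays (hypothetical values, not taken from the salary survey). It only illustrates how one large value pulls the mean upward while leaving the median in place.

```python
import numpy as np

# Hypothetical values for illustration only; not from the salary data.
symmetric = np.array([40, 50, 60, 70, 80])
right_skewed = np.array([40, 50, 60, 70, 300])  # one very large value

# In the symmetric case the mean and median coincide.
print("symmetric:    mean =", np.mean(symmetric), " median =", np.median(symmetric))

# In the skewed case the large value drags the mean up; the median stays put.
print("right-skewed: mean =", np.mean(right_skewed), " median =", np.median(right_skewed))
```

The same pattern, a mean above the median, is what we would expect for salary data with a long right tail.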
Let’s explore the mean and median salaries for data jobs.
data_salary_mean = np.mean(data_job_salaries)
data_salary_mean
99742.75494071146
data_salary_median = np.median(data_job_salaries)
data_salary_median
92000.0
Here’s the mean (triangle) and median (circle) for our ‘Jobs with Data’ sample.
plot = data_jobs.hist("Yearly Salary (USD)")
plot.set_title("Salary of 'Jobs with Data' \n Sample Size="+str(data_jobs.num_rows))
plot.dot(data_salary_median)
plot.dot(data_salary_mean, marker='^')
2. Bootstrapping#
Given our sample, can we estimate the median salary for data jobs for the entire population?
Let’s draw one resample from our sample, sampling with replacement.
print("Size of original sample", len(data_job_salaries))
Size of original sample 506
simulated_resample = np.random.choice(data_job_salaries, len(data_job_salaries))
print("Size after we resampled the original sample=", len(simulated_resample))
Size after we resampled the original sample= 506
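The resample has the same size as the original because `np.random.choice` draws with replacement by default (`replace=True`). As a rough sketch on a tiny made-up array (the values are hypothetical), compare that default with `replace=False`, which can only reorder the original values:

```python
import numpy as np

# Hypothetical five-value sample, for illustration only.
tiny_sample = np.array([10, 20, 30, 40, 50])

# Default behavior: sampling WITH replacement, so values may repeat
# and some values may be left out entirely.
with_replacement = np.random.choice(tiny_sample, len(tiny_sample))

# replace=False: every value appears exactly once, so this is only a shuffle.
without_replacement = np.random.choice(tiny_sample, len(tiny_sample), replace=False)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

Because a full-size draw without replacement always contains the same values, its median never changes. Sampling with replacement is what lets each resample differ from the original.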
# Run many times
simulated_resample = np.random.choice(data_job_salaries, len(data_job_salaries))
median = np.median(simulated_resample)
table = Table().with_columns("Yearly Salary (USD)", simulated_resample)
plot = table.hist("Yearly Salary (USD)", bins=np.arange(0,300000,25000))
plot.set_title("Resample -- Salary of 'Jobs with Data' \n Median="+str(median))
plot.dot(median)
Here are some more resamples.
Let’s build a new simulation function that computes a statistic for many resamples. We write it out here, and it is also available in our library for you to use.
def bootstrap_statistic(sample, compute_statistic, num_trials):
    """
    Creates num_trials resamples of the initial sample.
    Returns an array of the provided statistic for those samples.

    * sample: the initial sample, as an array.

    * compute_statistic: a function that takes a sample as
                         an array and returns the statistic for that
                         sample.

    * num_trials: the number of bootstrap samples to create.
    """
    statistics = make_array()
    for i in np.arange(0, num_trials):
        # Key: in bootstrapping we must always sample with replacement
        simulated_resample = np.random.choice(sample, len(sample))
        resample_statistic = compute_statistic(simulated_resample)
        statistics = np.append(statistics, resample_statistic)
    return statistics
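The loop above draws one resample per trial. As an alternative sketch (not the version in the course library), all resamples can be drawn in a single call by asking `np.random.choice` for a 2-D array. The helper name `bootstrap_medians_vectorized` and the toy values are hypothetical, and this shortcut only works for statistics such as `np.median` that accept an `axis` argument.

```python
import numpy as np

def bootstrap_medians_vectorized(sample, num_trials):
    """Hypothetical helper: medians of num_trials resamples, drawn in one call."""
    sample = np.asarray(sample)
    # Each row is one resample, the same size as the original sample,
    # drawn with replacement (the default for np.random.choice).
    resamples = np.random.choice(sample, size=(num_trials, len(sample)))
    # Compute one median per row.
    return np.median(resamples, axis=1)

# Made-up salaries for illustration only; not from the survey data.
toy_sample = np.array([52000, 61000, 75000, 88000, 92000, 104000, 130000])
medians = bootstrap_medians_vectorized(toy_sample, 1000)
print(len(medians), medians.min(), medians.max())
```

The loop version is more general, since it accepts any `compute_statistic` function. The vectorized form trades that flexibility for speed.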
results = bootstrap_statistic(data_job_salaries, np.median, 1000)
table = Table().with_columns("Yearly Salary (USD)", results)
plot = table.hist("Yearly Salary (USD)", bins=np.arange(87000,100000,1000))
plot.set_title("Bootstrap 1000 Times \n Sample Size="+str(data_jobs.num_rows))
plot.dot(data_salary_median)
The histogram above captures the variability of the median salary across our resamples. That variability approximates the variability we would see if we repeatedly drew new samples from the whole population. In the next lecture, we’ll see how to quantify this variability when estimating the median salary for the whole population.
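The claim about variability can be checked informally when the population is known. The sketch below builds a synthetic, right-skewed population (the lognormal parameters are an arbitrary assumption, not fitted to the survey). It compares the spread of medians from repeated population samples with the spread of medians from bootstrapping a single sample. It uses NumPy's seeded `default_rng` generator rather than `np.random.choice` so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is repeatable

# Synthetic right-skewed "population"; the parameters are made up.
population = rng.lognormal(mean=11.4, sigma=0.4, size=100_000)
sample_size = 506
num_trials = 1000

# (a) Repeatedly draw fresh samples from the population itself.
population_medians = np.array([
    np.median(rng.choice(population, sample_size, replace=False))
    for _ in range(num_trials)
])

# (b) Draw ONE sample, then bootstrap it (resample with replacement).
one_sample = rng.choice(population, sample_size, replace=False)
bootstrap_medians = np.array([
    np.median(rng.choice(one_sample, sample_size))
    for _ in range(num_trials)
])

print("SD of medians, repeated sampling:", np.std(population_medians))
print("SD of medians, bootstrap:        ", np.std(bootstrap_medians))
```

The two standard deviations should come out on a similar scale, though they will not match exactly. The bootstrap distribution is centered near the median of the one sample we drew, not necessarily the population median.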