Functions
Contents
Functions#
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline
1. Overlaid Histograms#
Here’s a dataset of adults and the heights of their parents.
heights_original = Table().read_table('data/galton.csv')
heights_original.show(5)
family | father | mother | midparentHeight | children | childNum | gender | childHeight |
---|---|---|---|---|---|---|---|
1 | 78.5 | 67 | 75.43 | 4 | 1 | male | 73.2 |
1 | 78.5 | 67 | 75.43 | 4 | 2 | female | 69.2 |
1 | 78.5 | 67 | 75.43 | 4 | 3 | female | 69 |
1 | 78.5 | 67 | 75.43 | 4 | 4 | female | 69 |
2 | 75.5 | 66.5 | 73.66 | 4 | 1 | male | 73.5 |
... (929 rows omitted)
Let’s focus on the female adult children first.
heights = heights_original.where('gender', 'female').select('father', 'mother', 'childHeight')
heights = heights.relabeled('childHeight', 'daughter')
heights
father | mother | daughter |
---|---|---|
78.5 | 67 | 69.2 |
78.5 | 67 | 69 |
78.5 | 67 | 69 |
75.5 | 66.5 | 65.5 |
75.5 | 66.5 | 65.5 |
75 | 64 | 68 |
75 | 64 | 67 |
75 | 64 | 64.5 |
75 | 64 | 63 |
75 | 58.5 | 66.5 |
... (443 rows omitted)
plot = heights.hist('daughter')
plot.set_xlabel("Daughter height (inches)")
What are some common heights we can see from the histogram above?
A concentration around 63 inches (5’4”).
plot = heights.hist('mother')
plot.set_xlabel("Mother height (inches)")
Recall, we can use overlaid histograms to compare the distribution of two variables that have the same units. (Note: We’ve seen how to group by a column containing a categorical. It’s also possible to just call hist
with multiple column names to produce an overlaid histogram.)
plot = heights.hist('daughter', 'mother')
plot.set_xlabel('Height (inches)')
plot = heights.hist()
plot.set_xlabel('Height (inches)')
We can specify bins ourselves to make them have a different width and number.
our_bins = np.arange(55, 81, 1)
print("bins=", our_bins)
plot = heights.hist(bins=our_bins)
plot.set_xlabel('Height (inches)')
bins= [55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
79 80]
Let’s now take a different slice of the original data
heights_original.show(3)
family | father | mother | midparentHeight | children | childNum | gender | childHeight |
---|---|---|---|---|---|---|---|
1 | 78.5 | 67 | 75.43 | 4 | 1 | male | 73.2 |
1 | 78.5 | 67 | 75.43 | 4 | 2 | female | 69.2 |
1 | 78.5 | 67 | 75.43 | 4 | 3 | female | 69 |
... (931 rows omitted)
heights = heights_original.select('father', 'mother', 'childHeight').relabeled('childHeight', 'child')
heights
father | mother | child |
---|---|---|
78.5 | 67 | 73.2 |
78.5 | 67 | 69.2 |
78.5 | 67 | 69 |
78.5 | 67 | 69 |
75.5 | 66.5 | 73.5 |
75.5 | 66.5 | 72.5 |
75.5 | 66.5 | 65.5 |
75.5 | 66.5 | 65.5 |
75 | 64 | 71 |
75 | 64 | 68 |
... (924 rows omitted)
plot = heights.hist(bins=np.arange(55,80,2))
plot.set_xlabel('Height (inches)')
Question: Why is the maximum height of a bar for child smaller than that for mother or father?
A: We have a larger spread because we have both male and female children.
2. Functions#
We use functions all the time. They do computation for us without us describing every single step. That saves us time – we don’t have to write the code – and let’s us perform those operations without even caring how they are implemented. Example: max
: we have an idea of how we’d take the maximum of a list of numbers, but we can just use that function in Python without explicitely describing how it works.
Can we do the same for other computations? Yes! It’s a core principle of programming: define functions for tasks you do often so you never have to repeat writing the code.
Defining and calling our own functions#
def double(x):
""" Double x """
return 2*x
double(5)
10
double(double(5))
20
Scoping: parameter only “visible” inside the function definition
x #should throw an error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[16], line 1
----> 1 x #should throw an error
NameError: name 'x' is not defined
double(5/4)
2.5
y = 5
double(y/4)
2.5
x # we still can't access the parameter
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[19], line 1
----> 1 x # we still can't access the parameter
NameError: name 'x' is not defined
x = 1.5
double(x)
3.0
x
1.5
What happens if I double an array?
double(make_array(3,4,5))
array([ 6, 8, 10])
What happens if I double a string?
double("string")
'stringstring'
5*"string"
'stringstringstringstringstring'
More functions#
Think-pair-share:
What is this function below doing?
How would you rewrite this function (the name of the function, the docstring, the parameters) in order to make it more clear?
def f(s):
total = sum(s)
return np.round(s / total * 100, 2)
A: Always use meaningul names.
def percents(counts):
"""Convert the counts to percents out of the total."""
total = sum(counts)
return np.round(counts / total * 100, 2)
Note that we have a local variable total
in our definition….
f(make_array(2, 4, 8, 6, 10))
array([ 6.67, 13.33, 26.67, 20. , 33.33])
percents(make_array(2, 4, 8, 6, 10))
array([ 6.67, 13.33, 26.67, 20. , 33.33])
Remember scoping here too!
total
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[29], line 1
----> 1 total
NameError: name 'total' is not defined
Accessing global variables#
Global variables are defined outside any function – we’ve been using them all along. You can access global variables that you have defined inside your functions. Always define globals before functions that use them to avoid confusion and surprising results when you rerun your whole notebook!
heights.show(3)
father | mother | child |
---|---|---|
78.5 | 67 | 73.2 |
78.5 | 67 | 69.2 |
78.5 | 67 | 69 |
... (931 rows omitted)
def children_under_height(height):
"""Proportion of children in our data set that are no taller than the given height."""
return heights.where("child", are.below_or_equal_to(height)).num_rows / heights.num_rows
children_under_height(65)
0.3747323340471092
Functions with more than one parameter#
We can add functions with more than one parameter.
#original function
def percents(counts):
"""Convert the counts to percents out of the total."""
total = sum(counts)
return np.round(counts / total * 100, 2)
#function with two parameters
def percents_two_params(counts, decimals_to_round):
"""Convert the counts to percents out of the total."""
total = sum(counts)
return np.round(counts / total * 100, decimals_to_round)
counts = make_array(2, 4, 8, 6, 10)
percents(counts)
array([ 6.67, 13.33, 26.67, 20. , 33.33])
percents_two_params(counts, 2)
array([ 6.67, 13.33, 26.67, 20. , 33.33])
percents_two_params(counts, 1)
array([ 6.7, 13.3, 26.7, 20. , 33.3])
percents_two_params(counts, 0)
array([ 7., 13., 27., 20., 33.])
percents_two_params(counts, 3)
array([ 6.667, 13.333, 26.667, 20. , 33.333])
Let’s write a function that given the unique id of an observation (a row) gives us the value of a particular column.
heights_id = heights.with_columns('id', np.arange(heights.num_rows))
heights_id.show(5)
father | mother | child | id |
---|---|---|---|
78.5 | 67 | 73.2 | 0 |
78.5 | 67 | 69.2 | 1 |
78.5 | 67 | 69 | 2 |
78.5 | 67 | 69 | 3 |
75.5 | 66.5 | 73.5 | 4 |
... (929 rows omitted)
def find_a_value(table, observation_id, column_name):
return table.where('id', are.equal_to(observation_id)).column(column_name).item(0)
find_a_value(heights_id, 2, 'mother')
67.0
find_a_value(heights_id, 200, 'mother')
63.0
Great! Now we can keeping using a function we wrote throughout this class to speed up work in the same way we’re using functions built-in to Python, e.g. max
, or the datascience package, e.g. .take()
3. Apply#
There are times we want to perform mathematical operations columns of the table but can’t use array broadcasting…
min(make_array(70, 73, 69), 72) #should be an error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[45], line 1
----> 1 min(make_array(70, 73, 69), 72) #should be an error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
def cut_off_at_72(x):
"""The smaller of x and 72"""
return min(x, 72)
cut_off_at_72(62)
62
cut_off_at_72(72)
72
cut_off_at_72(78)
72
The table apply
method applies a function to every entry in a column.
heights
father | mother | child |
---|---|---|
78.5 | 67 | 73.2 |
78.5 | 67 | 69.2 |
78.5 | 67 | 69 |
78.5 | 67 | 69 |
75.5 | 66.5 | 73.5 |
75.5 | 66.5 | 72.5 |
75.5 | 66.5 | 65.5 |
75.5 | 66.5 | 65.5 |
75 | 64 | 71 |
75 | 64 | 68 |
... (924 rows omitted)
heights.hist('child')
cut_off = heights.apply(cut_off_at_72, 'child')
height2 = heights.with_columns('child', cut_off)
height2.hist('child')
Like we did with variables, we can call functions and their types. In Python, help
prints out the docstring of a function.
cut_off_at_72
<function __main__.cut_off_at_72(x)>
type(cut_off_at_72)
function
help(cut_off_at_72)
Help on function cut_off_at_72 in module __main__:
cut_off_at_72(x)
The smaller of x and 72
Apply with multiple columns#
heights.show(6)
father | mother | child |
---|---|---|
78.5 | 67 | 73.2 |
78.5 | 67 | 69.2 |
78.5 | 67 | 69 |
78.5 | 67 | 69 |
75.5 | 66.5 | 73.5 |
75.5 | 66.5 | 72.5 |
... (928 rows omitted)
parent_max = heights.apply(max, 'mother', 'father')
parent_max.take(np.arange(0, 6))
array([78.5, 78.5, 78.5, 78.5, 75.5, 75.5])
def average(x, y):
"""Compute the average of two values"""
return (x+y)/2
parent_avg = heights.apply(average, 'mother', 'father')
parent_avg.take(np.arange(0, 6))
array([72.75, 72.75, 72.75, 72.75, 71. , 71. ])