STAT 19000: Project 7 — Spring 2021
Motivation: There is one major topic that we have yet to explore in Python: functions! A key component of writing efficient code is writing functions, which let us reuse coding steps instead of repeating them. If you find yourself writing the same code over and over, a function may be a good way to reduce many lines of code.
Context: We are taking a small hiatus from our pandas and numpy focused series to learn about and write our own functions in Python!
Scope: python, functions, pandas
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/yelp/data/parquet
Questions
Question 1
You’ve been given a path to a folder for a dataset. Explore the files. Give a brief description of the files and what each file contains.
Take a look at the size of each of the files.
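As a sketch, file sizes can be collected with pathlib (the helper name and the throwaway directory below are illustrative; for the project, point it at the dataset folder instead):

```python
from pathlib import Path
import tempfile

# illustrative helper: map each file's name to its size in bytes
def list_file_sizes(directory: str) -> dict:
    return {p.name: p.stat().st_size
            for p in sorted(Path(directory).iterdir()) if p.is_file()}

# demo on a throwaway directory; for the project, pass
# "/class/datamine/data/yelp/data/parquet" instead
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "example.parquet").write_bytes(b"x" * 10)
    print(list_file_sizes(d))  # {'example.parquet': 10}
```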
- Python code used to solve the problem.
- Output from running your code.
- The name of each dataset and a brief summary of each dataset (no more than 1-2 sentences about each).
Question 2
Read the businesses.parquet file into a pandas DataFrame called businesses. Take a look at the hours and attributes columns. If you look closely, you’ll observe that both columns contain a lot more than a single feature. In fact, the attributes column contains 39 features and the hours column contains 7!
len(businesses.loc[:, "attributes"].iloc[0].keys()) # 39
len(businesses.loc[:, "hours"].iloc[0].keys()) # 7
Let’s start by writing a simple function. Create a function called has_attributes that takes a business_id as an argument, and returns True if the business has any attributes and False otherwise. Test it with the following code:
print(has_attributes('f9NumwFMBDn751xgFiRbNA')) # True
print(has_attributes('XNoUzKckATkOD1hP6vghZg')) # False
print(has_attributes('Yzvjg0SayhoZgCljUJRF9Q')) # True
print(has_attributes('7uYJJpwORUbCirC1mz8n9Q')) # False
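One possible sketch of has_attributes, shown here against a tiny stand-in DataFrame (the real function would use the businesses DataFrame loaded from the parquet file; the business ids below are made up):

```python
import pandas as pd

# tiny stand-in for the businesses DataFrame; column names mirror the dataset
businesses = pd.DataFrame({
    "business_id": ["abc123", "xyz789"],
    "attributes": [{"WiFi": "free"}, None],
})

def has_attributes(business_id: str) -> bool:
    # grab this business's attributes cell and test whether it is missing
    attrs = businesses.loc[businesses["business_id"] == business_id, "attributes"].iloc[0]
    return attrs is not None

print(has_attributes("abc123"))  # True
print(has_attributes("xyz789"))  # False
```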
While this is useful for checking whether a single business has any attributes, if you wanted to apply this check to the entire attributes column/Series, you would just use the notna method:
businesses.loc[:, "attributes"].notna()
Make sure your return value is of type bool.
- Python code used to solve the problem.
- Output from running the provided "test" code.
Question 3
Take a look at the attributes of the first business:
businesses.loc[:, "attributes"].iloc[0]
What is the type of the value? Let’s assume the company you work for gets data formatted like businesses each week, but your boss wants the 39 features in attributes and the 7 features in hours to become their own columns. Write a function called fix_businesses_data that accepts an argument called data_path (of type str): a full path to a parquet file in the exact same format as businesses.parquet. In addition to the data_path argument, fix_businesses_data should accept another argument called output_dir (of type str): the path to the directory where you want your "fixed" parquet file to be written. fix_businesses_data should return None.
The result of your function should be a new file called new_businesses.parquet saved in output_dir. The data in this file should no longer contain either the attributes or hours columns; instead, each row should contain 39+7=46 new columns. Test your function out:
from pathlib import Path
my_username = "kamstut" # replace "kamstut" with YOUR username
fix_businesses_data(data_path="/class/datamine/data/yelp/data/parquet/businesses.parquet", output_dir=f"/scratch/scholar/{my_username}")
# see if output exists
p = Path(f"/scratch/scholar/{my_username}").glob('**/*')
files = [x for x in p if x.is_file()]
print(files)
Make sure your function works whether or not output_dir ends with a trailing slash.
from pathlib import Path
import pandas as pd

def fix_businesses_data(data_path: str, output_dir: str) -> None:
    """
    fix_businesses_data accepts a parquet file that contains data in a
    specific format, and "explodes" the attributes and hours columns into
    39+7=46 new columns.

    Args:
        data_path (str): Full path to a file in the same format as businesses.parquet.
        output_dir (str): Path to a directory where new_businesses.parquet should be output.
    """
    # read in original parquet file
    businesses = pd.read_parquet(data_path)

    # unnest the attributes column

    # unnest the hours column

    # output new file
    businesses.to_parquet(str(Path(output_dir).joinpath("new_businesses.parquet")))
    return None
Check out the code below; notice how using Path.joinpath produces the same path whether or not the directory string ends with a trailing slash:
from pathlib import Path
print(Path("/class/datamine/data/").joinpath("my_file.txt"))  # /class/datamine/data/my_file.txt
print(Path("/class/datamine/data").joinpath("my_file.txt"))   # /class/datamine/data/my_file.txt
You can test out your function on businesses.parquet (the full path is shown in the test code above).
This stackoverflow post should be very useful! Specifically, run the code it suggests and take a look at the output.
Notice that some rows have json, and others have None:
businesses.loc[0, "attributes"] # has json
businesses.loc[2, "attributes"] # has None
This method allows us to handle both cases. If the row has json it converts the values, if it has None it just puts each column with a value of None.
Here is an example that shows you how to concatenate (combine) DataFrames.
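A minimal sketch of that pattern: apply(pd.Series) expands the dicts into columns, and pd.concat attaches the new columns alongside the old ones. The toy column names below are illustrative:

```python
import pandas as pd

# toy frame: one row has a dict in the nested column, the other has None
df = pd.DataFrame({
    "business_id": ["a", "b"],
    "attributes": [{"WiFi": "free", "HasTV": "True"}, None],
})

# apply(pd.Series) expands each dict into columns; a None row becomes all-NaN
expanded = df["attributes"].apply(pd.Series)

# drop the original nested column and concatenate the new columns alongside
result = pd.concat([df.drop(columns=["attributes"]), expanded], axis=1)
print(result.columns.tolist())  # ['business_id', 'WiFi', 'HasTV']
```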
- Python code used to solve the problem.
- Output from running your code.
Question 4
That’s a pretty powerful function, and could definitely be useful. What if, instead of working on just our specifically formatted parquet file, we wrote a function that worked for any pandas DataFrame? Write a function called unnest that accepts a pandas DataFrame as an argument (let’s call this argument myDF), and a list of columns (let’s call this argument columns), and returns a DataFrame where the provided columns are unnested.
Test your unnest function on the businesses DataFrame to make sure it works.
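A sketch of what unnest might look like, assuming every value in each requested column is a dict (Question 5 adds the safety checks); the toy DataFrame is illustrative:

```python
import pandas as pd

# naive sketch: assumes every value in each requested column is a dict
def unnest(myDF: pd.DataFrame, columns: list) -> pd.DataFrame:
    result = myDF.copy()
    for col in columns:
        # expand each dict into its own columns, then swap them in
        expanded = result[col].apply(pd.Series)
        result = pd.concat([result.drop(columns=[col]), expanded], axis=1)
    return result

# illustrative toy DataFrame
toy = pd.DataFrame({
    "id": [1, 2],
    "info": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
})
print(unnest(toy, ["info"]).columns.tolist())  # ['id', 'a', 'b']
```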
- Python code used to solve the problem.
- Output from running the provided code.
Question 5
Try out the code below. If a provided column isn’t already nested, the column name is ruined and the data is changed. If the column doesn’t already exist, a KeyError is thrown. Modify our function from question (4) to skip unnesting if the column doesn’t exist. In addition, modify the function from question (4) to skip the column if the column isn’t nested. Let’s consider a column nested if the value of the column is a dict, and not nested otherwise.
businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
new_businesses_df = unnest(businesses, ["doesntexist",]) # KeyError
new_businesses_df = unnest(businesses, ["postal_code",]) # not nested
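A sketch of the guarded version, with a toy DataFrame standing in for businesses; here "nested" is checked via isinstance on the column's first value, which is one reasonable interpretation:

```python
import pandas as pd

# guarded sketch: skip columns that are missing, and skip columns whose
# first value is not a dict (one reasonable reading of "nested")
def unnest(myDF: pd.DataFrame, columns: list) -> pd.DataFrame:
    result = myDF.copy()
    for col in columns:
        if col not in result.columns:
            continue  # column doesn't exist: skip instead of raising KeyError
        if not isinstance(result[col].iloc[0], dict):
            continue  # column isn't nested: leave it alone
        expanded = result[col].apply(pd.Series)
        result = pd.concat([result.drop(columns=[col]), expanded], axis=1)
    return result

# toy stand-in for businesses
toy = pd.DataFrame({
    "id": [1, 2],
    "postal": ["47906", "47907"],
    "info": [{"a": 1}, {"a": 2}],
})
out = unnest(toy, ["doesntexist", "postal", "info"])
print(out.columns.tolist())  # ['id', 'postal', 'a']
```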
To test your code, run the following. The result should be a DataFrame where attributes has been unnested, and that is it.
businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet")
results = unnest(businesses, ["doesntexist", "postal_code", "attributes"])
results.shape # (209393, 39)
results.head()
To see if a variable is a dict, you can use the isinstance function, e.g. isinstance(value, dict).
- Python code used to solve the problem.
- Output from running the provided code.