STAT 19000: Project 14 — Spring 2022
Motivation: We covered a lot this year! When dealing with data-driven projects, it is useful to explore the data and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, so in this project we are going to practice using some of the skills you’ve learned and review topics and languages in a generic way.
Context: We are on the final stretch of two projects where there will be an assortment of "random" questions that may involve various datasets (and languages/tools). We may even ask a question that asks you to use a tool you haven’t used before — but don’t worry, if we do, we will provide you with extra guidance.
Scope: Python, R, bash, unix, computers
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/airbnb/**/reviews.csv.gz
-
/depot/datamine/data/election/itcont2022.txt
-
/depot/datamine/data/death_records/DeathRecords.csv
Questions
Question 1
Scan through the reviews.csv.gz files in /depot/datamine/data/airbnb/* and find the 10 most common reviewer_name values.
The […] In particular, check out the example(s) in the basic use section. The […]
You can read […]
The following is an example of one way you could sum the values of a dictionary.
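A minimal sketch of one such approach, using a made-up dictionary of name counts (the names and numbers are purely illustrative):

```python
# Hypothetical counts; the names and values are invented for illustration.
name_counts = {"Alex": 3, "Maria": 5, "John": 2}

# sum() over .values() adds up every value in the dictionary.
total = sum(name_counts.values())
print(total)  # 10

# To merge counts from a second dictionary, add each value into a running total.
other_counts = {"Maria": 1, "Wei": 4}
combined = dict(name_counts)
for name, count in other_counts.items():
    combined[name] = combined.get(name, 0) + count
print(combined)
```

The second half shows the same idea applied across dictionaries, which is handy when each file produces its own set of counts.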
Test your code on a few of the […]
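One way to try the logic on a handful of files before scanning everything is to put the counting in a function that accepts a list of paths. The sketch below assumes pathlib for finding the files and pandas for reading them; `count_reviewer_names` is a hypothetical helper name, and the slice size of 3 is arbitrary.

```python
from collections import Counter
from pathlib import Path

import pandas as pd


def count_reviewer_names(paths):
    """Tally reviewer_name values across the given reviews.csv.gz files."""
    counts = Counter()
    for path in paths:
        # pandas infers gzip compression from the .gz extension.
        names = pd.read_csv(path, usecols=["reviewer_name"])["reviewer_name"]
        # value_counts() skips missing names by default.
        counts.update(names.value_counts().to_dict())
    return counts


if __name__ == "__main__":
    # Assumption: the reviews.csv.gz files sit somewhere below this directory.
    files = sorted(Path("/depot/datamine/data/airbnb").glob("**/reviews.csv.gz"))
    # Try a small subset first; pass `files` once the output looks right.
    print(count_reviewer_names(files[:3]).most_common(10))
```

Reading only the `reviewer_name` column with `usecols` keeps memory use down, since the review text itself is not needed for counting.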
-
Code used to solve this problem.
-
Output from running the code.
Question 2
After completing question 1, it is likely you have a solid understanding of how the data is organized. Add some logic to your code from question 1 to instead print the 5 most common names for each country.
If your $HOME country (haha) is in the list — do the names sound about right? What kind of bias does this data likely show?
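A sketch of the per-country logic, assuming the country is the first directory below /depot/datamine/data/airbnb. That layout, and the helper names `country_of` and `top_names_by_country`, are assumptions for illustration, so it is worth printing a few real paths to confirm before relying on it.

```python
from collections import Counter, defaultdict
from pathlib import Path

import pandas as pd

ROOT = Path("/depot/datamine/data/airbnb")


def country_of(path, root=ROOT):
    """Return the first directory name below root (assumed to be the country)."""
    return Path(path).relative_to(root).parts[0]


def top_names_by_country(paths, root=ROOT, n=5):
    """Map each country to its n most common reviewer_name values."""
    counts = defaultdict(Counter)
    for path in paths:
        names = pd.read_csv(path, usecols=["reviewer_name"])["reviewer_name"]
        counts[country_of(path, root)].update(names.value_counts().to_dict())
    return {country: c.most_common(n) for country, c in counts.items()}
```

Under those assumptions, `top_names_by_country(ROOT.glob("**/reviews.csv.gz"))` would return one short list of (name, count) pairs per country.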
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Check out the newest set of election data, /depot/datamine/data/election/itcont2022.txt. Let’s say we are interested in all entries (rows) that contain the word "purdue" (of course, this may include entries that don’t relate to Purdue University, but we are okay with that error).
This is around 5 GB of data, and only a small fraction of that has relevant information. In pandas, there is not an ergonomic way to check if a row of data has a string in it. This is where knowing how to use multiple tools will come in handy!
There is a tool called grep that can very quickly search large text files for certain text. We will learn more about grep (and other useful command line utilities) in STAT 29000. With that being said, why not figure out how to use grep to create a subset of data to read into pandas that is already filtered — it isn’t too bad!
Use grep to create a subset of data called my_election_data.txt. my_election_data.txt should contain only the rows that contain the word "purdue", and it should live in your $HOME directory: /home/purduealias/my_election_data.txt.
-
Use grep to find only rows with the word "purdue" in them (case insensitive). Use redirection to save the output to $HOME/my_election_data.txt.
You can use the -i flag to make your grep search case insensitive, so rows with "Purdue", "purdue", or "PuRdUe" would all be found.
You can run grep from within Jupyter Notebooks using the %%bash magic. For example, the following would find the word "apple" in a dataset and create a new file called "my_new_file.csv" in my $HOME directory.
    %%bash
    grep "apple" /depot/datamine/data/yelp/data/json/yelp_academic_dataset_review.json > $HOME/my_new_file.csv
In order to insert the header line into your newly created file, you can run the following sed command directly after your grep command.
    sed -i '1 i\CMTE_ID|AMNDT_IND|RPT_TP|TRANSACTION_PGI|IMAGE_NUM|TRANSACTION_TP|ENTITY_TP|NAME|CITY|STATE|ZIP_CODE|EMPLOYER|OCCUPATION|TRANSACTION_DT|TRANSACTION_AMT|OTHER_ID|TRAN_ID|FILE_NUM|MEMO_CD|MEMO_TEXT|SUB_ID' $HOME/my_election_data.txt
-
Use pandas to read in your newly created, much smaller dataset, $HOME/my_election_data.txt.
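The header line inserted by the sed command shows that the fields are separated by `|`, so pandas needs to be told about the delimiter. A minimal sketch, assuming the file was created as described above; `read_election_subset` is a hypothetical helper name, and reading ZIP_CODE as text (to keep leading zeros) is a choice made here rather than something the project requires.

```python
from pathlib import Path

import pandas as pd


def read_election_subset(path):
    """Read the pipe-delimited, grep-filtered election file into a DataFrame."""
    return pd.read_csv(path, sep="|", dtype={"ZIP_CODE": str})


if __name__ == "__main__":
    # Assumption: the filtered file lives directly in the home directory.
    subset_path = Path.home() / "my_election_data.txt"
    if subset_path.exists():
        election = read_election_subset(subset_path)
        print(election.shape)
```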
Finally, print the EMPLOYER, NAME, OCCUPATION, and TRANSACTION_AMT for the 10 largest donations.
You may notice that each row represents a single donation. Group the data by the NAME column to get the total amount donated by each individual. What is the NAME of the top donor?
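These two steps could look roughly like the sketch below. It runs on a tiny invented DataFrame that reuses column names from the header line so that it is self-contained; the people and amounts are made up.

```python
import pandas as pd

# Invented rows for illustration; real data comes from my_election_data.txt.
election = pd.DataFrame({
    "NAME": ["DOE, JANE", "ROE, RICHARD", "DOE, JANE", "POE, ALEX"],
    "EMPLOYER": ["PURDUE UNIVERSITY"] * 4,
    "OCCUPATION": ["PROFESSOR", "ENGINEER", "PROFESSOR", "STUDENT"],
    "TRANSACTION_AMT": [250, 400, 300, 50],
})

# Largest individual donations (10 requested; only 4 rows exist here).
cols = ["EMPLOYER", "NAME", "OCCUPATION", "TRANSACTION_AMT"]
top_donations = election.nlargest(10, "TRANSACTION_AMT")[cols]
print(top_donations)

# Total donated per NAME, largest first.
totals = (
    election.groupby("NAME")["TRANSACTION_AMT"]
    .sum()
    .sort_values(ascending=False)
)
top_donor = totals.index[0]
print(top_donor, totals.iloc[0])
```

In this toy data the largest single donation and the largest total come from different people, which is why the grouping step is worth doing separately.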
-
Code used to solve this problem.
-
Output from running the code.
Question 4
What is the average age at death for individuals whose marital status was married, single, divorced, widowed, or unknown?
Further split the data by Sex — do the same patterns hold? Dig in a bit and notice that how we look at the data can make a very big difference!
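A sketch of the grouping on a small invented DataFrame. The column names MaritalStatus, Sex, and Age and the single-letter codes are assumptions about DeathRecords.csv, so check them against the real header (for example with `pd.read_csv(path, nrows=5)`) before relying on them.

```python
import pandas as pd

# Invented records; column names and codes are assumed, not taken from the file.
deaths = pd.DataFrame({
    "MaritalStatus": ["M", "M", "S", "S", "W", "W"],
    "Sex": ["F", "M", "F", "M", "F", "M"],
    "Age": [80, 70, 40, 30, 90, 84],
})

# Average age at death for each marital status.
by_status = deaths.groupby("MaritalStatus")["Age"].mean()
print(by_status)

# The same average, split further by Sex (one column per Sex value).
by_status_sex = (
    deaths.groupby(["MaritalStatus", "Sex"])["Age"].mean().unstack("Sex")
)
print(by_status_sex)
```

With the real data, it is worth checking how ages are encoded (units, placeholder values for unknown ages) before trusting the averages, since that is one way the view of the data can change the answer.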
-
Code used to solve this problem.
-
Output from running the code.
Question 5
It has been a fun year. We hope that you learned something new!
-
Write down 3 (or more) of your least favorite topics and/or projects from this past year (for STAT 19000).
-
Write down 3 (or more) of your favorite projects and/or topics you wish you were able to learn more about.
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.