STAT 19000: Project 9 — Spring 2022
Motivation: Learning how to wrangle and clean up data using pandas is extremely useful. It takes lots of practice to start to feel comfortable.
Context: At this point in the semester, we have a solid grasp of the basics of Python, and are looking to build our pandas skills by using it to solve data-driven problems.
Scope: Python, pandas
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/disney/total.parquet
Questions
Question 1
Let’s start by reading in the cleaned-up, combined dataset. It is essentially what you ended up with after much of your processing in project 7.
How many rows of data are there for each ride?
- Code used to solve this problem.
- Output from running the code.
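As a rough sketch of the idea (not the required solution), counting rows per ride could look something like the following. The toy dataframe is made up so the example runs anywhere; on the cluster you would read the real file instead. The column name `ride_name` is taken from Question 3, and it is an assumption that the parquet file uses that name.

```python
import pandas as pd

# Made-up stand-in for the real data. On the cluster you would instead use:
#   df = pd.read_parquet("/depot/datamine/data/disney/total.parquet")
# The column names below are assumptions about that file.
df = pd.DataFrame({
    "ride_name": ["a", "a", "b", "a", "b"],
    "SPOSTMIN": [10.0, None, 20.0, 15.0, None],
    "SACTMIN": [None, 12.0, None, None, 18.0],
})

# value_counts() tallies how many rows each distinct ride name has
rows_per_ride = df["ride_name"].value_counts()
print(rows_per_ride)
```

`df.groupby("ride_name").size()` would give the same counts, sorted by ride name rather than by frequency.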
Question 2
Recall that a single row of data either has a value for SPOSTMIN or SACTMIN, but not both. How many rows of data are there in total? How many non-null rows for SPOSTMIN? How many non-null rows for SACTMIN? Create a new dataframe called reduced where:
- Each row has a value for both SPOSTMIN and SACTMIN. The value in the SPOSTMIN column is the SPOSTMIN value closest in time (in seconds) to the datetime shown for the SACTMIN value.
- There is a new column called time_diff that is the difference (in seconds) between the SACTMIN value and the associated closest SPOSTMIN value.
This is the toughest question for this project, so it is OK if it takes you a bit more time to think of a solution.
Check out the [...] Don’t worry too much about edge cases. As long as you are close, you will get full credit.
- Code used to solve this problem.
- Output from running the code.
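One possible way to approach the nearest-time matching is a "nearest" as-of join, sketched below on made-up data. This is only an illustration under assumptions, not necessarily the intended solution: the column names `datetime` and `ride_name` are guesses about the real file, `posted_datetime` is a helper column invented here, and the sign convention for time_diff (actual time minus posted time) is a choice.

```python
import pandas as pd

# Made-up data: each row has either SPOSTMIN or SACTMIN, never both.
df = pd.DataFrame({
    "ride_name": ["a", "a", "a", "b", "b"],
    "datetime": pd.to_datetime([
        "2022-01-01 10:00:00",
        "2022-01-01 10:05:00",
        "2022-01-01 10:20:00",
        "2022-01-01 10:00:00",
        "2022-01-01 10:02:00",
    ]),
    "SPOSTMIN": [10.0, None, 30.0, 20.0, None],
    "SACTMIN": [None, 12.0, None, None, 18.0],
})

total_rows = len(df)
n_posted = df["SPOSTMIN"].notna().sum()
n_actual = df["SACTMIN"].notna().sum()
print(total_rows, n_posted, n_actual)

# Split into posted and actual observations; merge_asof needs both sorted by the key.
posted = (df.loc[df["SPOSTMIN"].notna(), ["ride_name", "datetime", "SPOSTMIN"]]
          .sort_values("datetime"))
# Keep a copy of the posted timestamp so it survives the merge (hypothetical helper column).
posted = posted.assign(posted_datetime=posted["datetime"])
actual = (df.loc[df["SACTMIN"].notna(), ["ride_name", "datetime", "SACTMIN"]]
          .sort_values("datetime"))

# For every actual wait time, find the posted wait time for the same ride
# whose timestamp is nearest (before or after).
reduced = pd.merge_asof(actual, posted, on="datetime", by="ride_name",
                        direction="nearest")

# Rides with no posted value at all would come back with a missing SPOSTMIN; drop them.
reduced = reduced.dropna(subset=["SPOSTMIN"])

# Difference in seconds: actual timestamp minus the matched posted timestamp.
reduced["time_diff"] = (
    reduced["datetime"] - reduced["posted_datetime"]
).dt.total_seconds()
print(reduced)
```

In this toy data, the actual wait for ride "a" at 10:05 is matched to the posted wait from 10:00 (300 seconds away) rather than the one from 10:20 (900 seconds away).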
Question 3
How many fewer rows does reduced have than the original dataset? What does the time_diff column look like?
In project 7 you calculated the median SPOSTMIN and SACTMIN by ride_name. Perform the same operation on reduced. Are the SACTMIN and SPOSTMIN medians closer together or further apart than they were in the not-cleaned data from project 7?
Do you think that, overall, the data in reduced is close enough (by time) to be able to draw comparisons? Why or why not?
- Code used to solve this problem.
- Output from running the code.
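The comparisons in this question could be sketched roughly as follows. The numbers are invented purely for illustration, `original_rows` is a hypothetical stand-in for the row count of the full dataset, and the `gap` column is a helper added here to make the medians easier to compare.

```python
import pandas as pd

# Invented stand-in for the reduced dataframe from Question 2.
reduced = pd.DataFrame({
    "ride_name": ["a", "a", "a", "b", "b"],
    "SPOSTMIN": [10.0, 20.0, 30.0, 5.0, 15.0],
    "SACTMIN": [8.0, 18.0, 40.0, 5.0, 9.0],
    "time_diff": [30.0, -120.0, 600.0, 45.0, 4000.0],
})
original_rows = 12  # hypothetical; use len() of the original dataframe instead

# How many fewer rows does reduced have?
fewer_rows = original_rows - len(reduced)
print(fewer_rows)

# A quick look at the distribution of time_diff
print(reduced["time_diff"].describe())

# Median posted and actual wait per ride, plus the absolute gap between them
medians = reduced.groupby("ride_name")[["SPOSTMIN", "SACTMIN"]].median()
medians["gap"] = (medians["SPOSTMIN"] - medians["SACTMIN"]).abs()
print(medians)
```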
Question 4
Any observation where the (absolute) time_diff is greater than an hour is probably not very high quality. Remove said observations from reduced. How many rows are left in reduced?
Finally, explore the refined dataset, reduced, some more. Write down a question you would like to have answered and what you think the answer will be, then do your best to use the dataset to answer your question.
Your analysis should include: a question, your hypothesis, at least 1 graphic, any and all code you used, and your conclusions. You will not be graded on whether or not you are correct, but rather the effort you put into your analysis. Any good effort including the requirements will receive full credit. Have fun!
- Code used to solve this problem.
- Output from running the code.
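A minimal sketch of the filtering step, again on invented values: it keeps the rows whose absolute time_diff is at most 3600 seconds (one hour). Treating exactly one hour as acceptable is an assumption based on the wording "greater than an hour".

```python
import pandas as pd

# Invented stand-in for reduced; time_diff is in seconds and may be negative.
reduced = pd.DataFrame({
    "ride_name": ["a", "a", "b", "b", "b"],
    "time_diff": [30.0, -120.0, 3600.0, -4000.0, 7200.0],
})

# Keep only observations matched within one hour in either direction.
reduced = reduced.loc[reduced["time_diff"].abs() <= 3600]
rows_left = len(reduced)
print(rows_left)
```

For the graphic, something like `reduced.plot.scatter(x="SPOSTMIN", y="SACTMIN")` (which requires matplotlib) might be one starting point, depending on the question you choose.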
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.