TDM 10200: Project 3 — 2024
Motivation: Learning about Big Data. When working with large data sets, it is important to know how we can use control flow to find our information, a little bit at a time, without reading in all of the files at once. Control flow is the order that your code runs.
Scope: Python, Control Flow, if statements, for loops
Readings and Resources
|
We added some videos to help you with Project 3. |
Questions
Question 1 (2 pts)
-
Explore the files in the provided data set directory. Find out how many years are included in the data set. Briefly describe the contents of the files.
-
Import
pandasandpathlibusing:import pandas as pdand alsofrom pathlib import Path -
Create a list named
myfiles, to holdPathobjects from1880.csvto1883.csvin the data set folder usinglist comprehension. You can start with the following sample code (below), but you need to modify this for loop, to uselist comprehension.Following is the sample code that will return a "Path" object for the file
1750.csv.Path("/anvil/projects/tdm/data/noaa/1750.csv")You can start with a for loop, to create a list of Path objects, as follows, BUT we want you to modify this example, to use
list comprehension.myfiles=[] for year in range (1880, 1884): file= Path(f'/anvil/projects/tdm/data/noaa/{year}.csv') myfiles.append(file) print(myfiles)
Question 2 (2 pts)
-
Calculate how many records are in the file
1880.csv. (Each line is one record.)The following is the sample code to calculate records in one sample file object named
file:with open(file,"r") as f: mycount = 0 for line in f: mycount += 1 print(f'There are {mycount} records in the file called {file}')There are 370779 records in the file called /anvil/projects/tdm/data/noaa/1880.csv
-
Calculate how many records there are (altogether) in the 4 files from
1880.csvto1883.csv. Use the listmyfilesthat you created in Question 1. Your output should give the total number of records altogether, so it should say something like:
There are [put your number of records here] records in the 4 files altogether.
|
Question 3 (2 pts)
-
Run the following statement, to read in the first file from the list
myfilesinto a DataFrame usingmyDF = pd.read_csv(myfiles[0]). Display the column names formyDF. Look at the head and tail ofmyDF. Do you see anything unexpected? -
Please modify your work from Question 3a, to correct the problem that you found. What are the column names now? Hint: Using the
header=Noneargument will be useful. -
Now let us add these 7 column names:
id,date,element_code,value,mflag,qflag,sflag, andobstimeto the data frame. You can do this using:pd.read_csv(myfiles[0],names = ["id","date","element_code","value","mflag","qflag","sflag","obstime"]) -
Make a list called
mydataframes(of length 4) that contains 4 data frames, one for each year, from1880.csvto1883.csv. Starting with the sample code (above) for reading in the first file, modify our example, so that you have a "for" loop that reads in all 4 files. Test your work with aforloop that displays the column names of each of the four data frames inmydataframes. You can show the column names ofmyDFusingmyDF.columns.
Question 4 (2 pts)
Let’s look at the column element_code. Use a loop to solve the following questions for all 4 DataFrames:
-
Print out the (unique) elements of the column
element_code(i.e., show each element just one time). -
Find the number of times that
SNOWoccurs in theelement_codecolumn.
|
Question 5 (2 pts)
Now let us practice using the chunksize feature for big data. You may refer to this document, to get more information about chunksize.
-
Try to run the following 2 programs, to find the number of times that
SNOWoccurs in theelement_codecolumn, from the year 1880 to year 1883. Explain your understanding ofchunksize.
Pre-work for the programs:
import pandas as pd
from pathlib import Path
myfiles=[]
for year in range (1880, 1884):
file= Path(f'/anvil/projects/tdm/data/noaa/{year}.csv')
myfiles.append(file)
Version 1 of the program:
count = 0
for file in myfiles:
for myDF in pd.read_csv(file,names=["id","date","element_code","value","mflag","qflag","sflag","obstime"],chunksize =10000):
count += len(myDF[myDF['element_code'] == 'SNOW'])
print(count)
Version 2 of the program:
count = 0
for file in myfiles:
for myDF in pd.read_csv(file,names=["id","date","element_code","value","mflag","qflag","sflag","obstime"],chunksize =10000):
for index, row in myDF.iterrows():
if row['element_code'] == 'SNOW':
count += 1
print(count)
Project 03 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and output for the assignment
-
firstname-lastname-project03.ipynb.
-
-
Python file with code and comments for the assignment
-
firstname-lastname-project03.py
-
-
Submit files through Gradescope
|
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |