Understanding the Data

# Always remember to import the packages you need!
import pandas as pd
import numpy as np

# ------ DO NOT REMOVE ------ #

grades_df = pd.DataFrame({
    "Name": ["John", "Bob", "Charlie", "Diana", "Edward", "Fiona", "George",
             "Hannah", "Ian", "Ian", "Liam", "Mia", "Noah", "Olivia", "Sophia"],
    "Homework": [95.0, 82.5, 75.3, 91.2, np.nan, 88.6, 92.3, 85.7, 70.1, 70.1,
                 78.9, 94.2, np.nan, 86.7, 90.8],
    "Exam Score": [85.5, "Not Graded", 65.3, 88.9, 72.2, 92.1, 94.1, 83, 70, 70, 79.7, 91.4, 80.1, 87.4, "Not Graded"],
    "Office Hours": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No",
                     "Yes", "No", "Yes", "No", "Yes"],
    "Final Grade": [90.5, 78.2, 65.8, 88.3, 72.4, 91.0, 94.3, 83.5, 68.6, 68.6,
                    79.7, 90.1, 80.5, 87.4, 82.3]
})

# ------ DO NOT REMOVE ------ #

# 1. Use the .info() method to identify data types and missing values
grades_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          15 non-null     object 
 1   Homework      13 non-null     float64
 2   Exam Score    15 non-null     object 
 3   Office Hours  15 non-null     object 
 4   Final Grade   15 non-null     float64
dtypes: float64(2), object(3)
memory usage: 732.0+ bytes

# 2. Display the dataset to observe its structure
grades_df

	Name	Homework	Exam Score	Office Hours	Final Grade
0	John	95.0	85.5	Yes	90.5
1	Bob	82.5	Not Graded	No	78.2
2	Charlie	75.3	65.3	Yes	65.8
3	Diana	91.2	88.9	Yes	88.3
4	Edward	NaN	72.2	No	72.4
5	Fiona	88.6	92.1	No	91.0
6	George	92.3	94.1	Yes	94.3
7	Hannah	85.7	83	Yes	83.5
8	Ian	70.1	70	No	68.6
9	Ian	70.1	70	No	68.6
10	Liam	78.9	79.7	Yes	79.7
11	Mia	94.2	91.4	No	90.1
12	Noah	NaN	80.1	Yes	80.5
13	Olivia	86.7	87.4	No	87.4
14	Sophia	90.8	Not Graded	Yes	82.3

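Our data is a bit messy! The “Homework” column has two missing values (NaN), and the “Exam Score” column has two entries recorded as the text “Not Graded”, which is why .info() reports it as a fully non-null object column. The “Office Hours” column holds categorical Yes/No data.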
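Because .info() and isna() only count real NaN/None values, a text placeholder like “Not Graded” goes unnoticed. The following is a minimal sketch of one way to surface such entries. It uses a small hypothetical frame called `sample` (not part of the original notebook) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the grades data, for illustration only.
sample = pd.DataFrame({
    "Homework": [95.0, np.nan, 75.3],
    "Exam Score": [85.5, "Not Graded", 65.3],
})

# isna() only counts real NaN/None values, so the text placeholder is missed.
nan_counts = sample.isna().sum()

# Coercing to numeric turns anything that cannot be parsed into NaN,
# which lets us pull out the offending original values.
coerced = pd.to_numeric(sample["Exam Score"], errors="coerce")
placeholders = sample.loc[coerced.isna(), "Exam Score"]

print(nan_counts)
print(placeholders.tolist())
```

Here `nan_counts` reports no missing values for “Exam Score”, while `placeholders` lists the text entry that still needs cleaning.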

Duplicate Values

# 3. Check how many duplicate rows are in the dataset (if any)
grades_df.duplicated().sum()

# 4. Drop duplicate rows (if any)
grades_df.drop_duplicates(inplace=True)
grades_df

	Name	Homework	Exam Score	Office Hours	Final Grade
0	John	95.0	85.5	Yes	90.5
1	Bob	82.5	Not Graded	No	78.2
2	Charlie	75.3	65.3	Yes	65.8
3	Diana	91.2	88.9	Yes	88.3
4	Edward	NaN	72.2	No	72.4
5	Fiona	88.6	92.1	No	91.0
6	George	92.3	94.1	Yes	94.3
7	Hannah	85.7	83	Yes	83.5
8	Ian	70.1	70	No	68.6
10	Liam	78.9	79.7	Yes	79.7
11	Mia	94.2	91.4	No	90.1
12	Noah	NaN	80.1	Yes	80.5
13	Olivia	86.7	87.4	No	87.4
14	Sophia	90.8	Not Graded	Yes	82.3
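Note that the row labels now jump from 8 to 10, because drop_duplicates keeps the original index. The sketch below, on a hypothetical `sample` frame, shows one way to inspect the duplicated rows before dropping them and, optionally, to renumber the rows afterwards. The tutorial itself keeps the original labels, so the reset is an assumption about what you might prefer, not a required step.

```python
import pandas as pd

# Hypothetical miniature of the grades data, for illustration only.
sample = pd.DataFrame({
    "Name": ["Hannah", "Ian", "Ian", "Liam"],
    "Final Grade": [83.5, 68.6, 68.6, 79.7],
})

# keep=False marks every copy of a duplicated row, not just the later ones.
dupes = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first copy; reset_index(drop=True) renumbers rows.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(dupes)
print(deduped)
```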

Missing Values

The missing values in the “Homework” column are represented by “NaN”, which pandas recognizes as missing, so they are easy to fill.

# 5. Replace missing values in the "Homework" column with the average
# homework score
# Assign the result back to the column. Calling fillna(inplace=True) on a
# selected column raises a FutureWarning and will stop working in pandas 3.0.
grades_df["Homework"] = grades_df["Homework"].fillna(
    round(grades_df["Homework"].mean(), 2))
grades_df

	Name	Homework	Exam Score	Office Hours	Final Grade
0	John	95.00	85.5	Yes	90.5
1	Bob	82.50	Not Graded	No	78.2
2	Charlie	75.30	65.3	Yes	65.8
3	Diana	91.20	88.9	Yes	88.3
4	Edward	85.94	72.2	No	72.4
5	Fiona	88.60	92.1	No	91.0
6	George	92.30	94.1	Yes	94.3
7	Hannah	85.70	83	Yes	83.5
8	Ian	70.10	70	No	68.6
10	Liam	78.90	79.7	Yes	79.7
11	Mia	94.20	91.4	No	90.1
12	Noah	85.94	80.1	Yes	80.5
13	Olivia	86.70	87.4	No	87.4
14	Sophia	90.80	Not Graded	Yes	82.3

Next, we must also handle the missing values in the “Exam Score” column. Because they are recorded as the text “Not Graded” rather than “None” or “NaN”, pandas does not treat them as missing, so we first need to convert them to NaN.

# 6. Replace missing values in the "Exam Score" column with the average
# exam score
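# "Not Graded" is text, so convert the column to numbers first (text becomes NaN)
grades_df["Exam Score"] = pd.to_numeric(grades_df["Exam Score"], errors="coerce")

# Fill only the "Exam Score" column, not the whole DataFrame
grades_df["Exam Score"] = grades_df["Exam Score"].fillna(grades_df["Exam Score"].mean())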

grades_df

	Name	Homework	Exam Score	Office Hours	Final Grade
0	John	95.00	85.500	Yes	90.5
1	Bob	82.50	82.475	No	78.2
2	Charlie	75.30	65.300	Yes	65.8
3	Diana	91.20	88.900	Yes	88.3
4	Edward	85.94	72.200	No	72.4
5	Fiona	88.60	92.100	No	91.0
6	George	92.30	94.100	Yes	94.3
7	Hannah	85.70	83.000	Yes	83.5
8	Ian	70.10	70.000	No	68.6
10	Liam	78.90	79.700	Yes	79.7
11	Mia	94.20	91.400	No	90.1
12	Noah	85.94	80.100	Yes	80.5
13	Olivia	86.70	87.400	No	87.4
14	Sophia	90.80	82.475	Yes	82.3
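Steps 5 and 6 can also be combined. DataFrame.fillna accepts a dictionary that maps column names to fill values, so each fill stays scoped to its own column. The following is a sketch under that approach, using a hypothetical `sample` frame with made-up scores rather than the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the grades data, for illustration only.
sample = pd.DataFrame({
    "Homework": [90.0, np.nan, 80.0],
    "Exam Score": [70.0, "Not Graded", 90.0],
    "Final Grade": [80.0, 75.0, 85.0],
})

# Convert the text placeholder to NaN so the column becomes numeric.
sample["Exam Score"] = pd.to_numeric(sample["Exam Score"], errors="coerce")

# Each listed column is filled with its own mean; other columns are untouched.
sample = sample.fillna({
    "Homework": round(sample["Homework"].mean(), 2),
    "Exam Score": sample["Exam Score"].mean(),
})

print(sample)
```

Scoping the fill this way avoids accidentally writing the exam average into a different column that happens to contain NaN.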

Encoding Categorical Data

# 7. Convert the categorical data in the "Office Hours" column to numeric values (Yes = 1, No = 0)
grades_df["Office Hours"] = np.where(grades_df["Office Hours"] == "Yes", 1, 0)
grades_df

	Name	Homework	Exam Score	Office Hours	Final Grade
0	John	95.00	85.500	1	90.5
1	Bob	82.50	82.475	0	78.2
2	Charlie	75.30	65.300	1	65.8
3	Diana	91.20	88.900	1	88.3
4	Edward	85.94	72.200	0	72.4
5	Fiona	88.60	92.100	0	91.0
6	George	92.30	94.100	1	94.3
7	Hannah	85.70	83.000	1	83.5
8	Ian	70.10	70.000	0	68.6
10	Liam	78.90	79.700	1	79.7
11	Mia	94.20	91.400	0	90.1
12	Noah	85.94	80.100	1	80.5
13	Olivia	86.70	87.400	0	87.4
14	Sophia	90.80	82.475	1	82.3
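One thing to keep in mind is that np.where(... == "Yes", 1, 0) sends every value other than "Yes" to 0, including typos or blanks. A possible alternative, sketched below on a hypothetical `office_hours` Series, is Series.map with an explicit dictionary. Values missing from the dictionary become NaN, which makes unexpected entries easy to spot.

```python
import pandas as pd

# Hypothetical column with one inconsistent entry (lowercase "yes").
office_hours = pd.Series(["Yes", "No", "yes", "No"])

# Only the listed labels are encoded; anything else becomes NaN.
encoded = office_hours.map({"Yes": 1, "No": 0})

# Recover the original values that the mapping did not recognize.
unexpected = office_hours[encoded.isna()]

print(encoded.tolist())
print(unexpected.tolist())
```

For a column that is known to contain only "Yes" and "No", as in this dataset, both approaches give the same encoding.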

Removing Columns

# 8. Remove the "Name" column from our dataset
grades_df.drop(columns="Name", inplace=True)
grades_df

	Homework	Exam Score	Office Hours	Final Grade
0	95.00	85.500	1	90.5
1	82.50	82.475	0	78.2
2	75.30	65.300	1	65.8
3	91.20	88.900	1	88.3
4	85.94	72.200	0	72.4
5	88.60	92.100	0	91.0
6	92.30	94.100	1	94.3
7	85.70	83.000	1	83.5
8	70.10	70.000	0	68.6
10	78.90	79.700	1	79.7
11	94.20	91.400	0	90.1
12	85.94	80.100	1	80.5
13	86.70	87.400	0	87.4
14	90.80	82.475	1	82.3
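With the “Name” column removed, every remaining column should be numeric and free of missing values. A quick sanity check along these lines can confirm that. The `cleaned` frame below is a hypothetical two-row stand-in for the final `grades_df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the cleaned grades data.
cleaned = pd.DataFrame({
    "Homework": [95.0, 82.5],
    "Exam Score": [85.5, 82.475],
    "Office Hours": [1, 0],
    "Final Grade": [90.5, 78.2],
})

# True only if every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(cleaned[col]) for col in cleaned.columns)

# True only if no cell in the frame is NaN.
no_missing = not cleaned.isna().any().any()

print(all_numeric, no_missing)
```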