# Import NumPy and Pandas
import pandas as pd
import numpy as np
Getting Started
Create a copy of the worksheet by going to the top left and selecting File -> Save a copy in Drive.
movies = pd.DataFrame({
    "Movie": ["Cars 2", "The Exorcist", "Click"],
    "Rating": [4.2, 4.1, 4.7],
    "Genre": ["Family", "Horror", "Comedy"]
})
movies.iloc[0]
|  | 0 |
---|---|
Movie | Cars 2 |
Rating | 4.2 |
Genre | Family |
NumPy Arrays
NumPy arrays look and behave a lot like Python lists! We can construct them using np.array() and index them the exact same way.
# Create two NumPy arrays; one from 1-5, and another from 6-10
my_array1 = np.array([1,2,3,4,5])
my_array2 = np.array([6,7,8,9,10])
# Retrieve the first element from your arrays
my_array1[0], my_array2[0]
(1, 6)
We can do element-wise mathematical operations between arrays, such as addition and subtraction!
# Add, subtract and multiply your arrays
my_array1 + my_array2
array([ 7,  9, 11, 13, 15])
my_array1 - my_array2
array([-5, -5, -5, -5, -5])
my_array1 * my_array2
array([ 6, 14, 24, 36, 50])
# Retrieve the first 3 elements from your arrays
my_array1[0:3]
array([1, 2, 3])
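Slicing also accepts negative positions, just like Python lists. A small extra sketch, not part of the original exercise:

# Retrieve the last two elements using a negative start index
my_array1[-2:]
array([4, 5])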
# Find the sum of all elements in your arrays
my_array2.sum()
40
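Other reductions work the same way; for example, .mean() and .max() are standard NumPy array methods (shown here as an extra sketch, not part of the original worksheet):

# Average and largest value of the first array
my_array1.mean(), my_array1.max()
(3.0, 5)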
# Use np.linspace() to create an array of values from 0 to 20, in intervals of 5
fives = np.linspace(start=0, stop=20, num=5)
fives
array([ 0., 5., 10., 15., 20.])
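If you know the step size rather than the number of points, np.arange() produces the same values; a small sketch (np.arange is standard NumPy but was not used in the workshop, and note that its stop value is exclusive):

# Same sequence built from an explicit step of 5 (the stop of 21 is exclusive)
np.arange(0, 21, 5)
array([ 0,  5, 10, 15, 20])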
# Create a 3x3 array of values from 1-9
big_array = np.array([[1,2,3],
                      [4,5,6],
                      [7,8,9]])
big_array
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
# Confirm that your array has dimensions of 3x3
big_array.shape
(3, 3)
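The same 3x3 array can also be built without typing every value by combining np.arange() with .reshape(); a minimal sketch, not part of the original exercise:

# Build the values 1-9 and reshape them into 3 rows and 3 columns
np.arange(1, 10).reshape(3, 3)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])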
# Recreate the same array but with a float data type instead of integer
big_array2 = np.array([[1,2,3],
                       [4,5,6],
                       [7,8,9]], dtype="float")
big_array2
array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])
# Confirm that your array has a float data type
big_array2.dtype
dtype('float64')
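Alternatively, an existing integer array can be converted with .astype(); this is a standard NumPy method shown here as a sketch rather than as part of the exercise:

# Convert the original integer array to floats without retyping it
big_array.astype("float").dtype
dtype('float64')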
Pandas Data Frames
Pandas Data Frames are built much like Python dictionaries. We can create them by passing a dictionary of columns to pd.DataFrame() and index them the exact same way.
# Create a data frame with two columns; one with your three favorite foods and
# another with the values 1-3
my_df = pd.DataFrame({
    "favorite foods": ["pizza", "chicken bake", "melatonin gummies"],
    "numbers": [1,2,3]
})
my_df
|  | favorite foods | numbers |
---|---|---|
0 | pizza | 1 |
1 | chicken bake | 2 |
2 | melatonin gummies | 3 |
# Retrieve your favorite foods from your data frame
"favorite foods"] my_df[
|  | favorite foods |
---|---|
0 | pizza |
1 | chicken bake |
2 | melatonin gummies |
# Retrieve the first row of your data frame
my_df.iloc[0]
|  | 0 |
---|---|
favorite foods | pizza |
numbers | 1 |
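For reference, .loc selects rows by index label rather than by position. Since my_df uses the default labels 0, 1, 2, the sketch below returns the same row as .iloc[0]; it would differ if the index were relabeled:

# Label-based selection; identical to .iloc[0] here because the labels are 0-2
my_df.loc[0]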
Important Functions:

Examining your data:
- .head(): shows the first 5 observations of the data
- .info(): shows the number of rows and columns, blank ("null") values, and the data type of each variable
- .dtype: shows the underlying data type (integers, floats, etc.)

Analyzing your data:
- .min(): Minimum value of a column
- .max(): Maximum value of a column
- .mean(): Average value of a column
- .median(): Median value of a column
- .sum(): Sum of all values in a column
- .corr(): Shows the correlation between columns (NOTE: negative correlation does not mean less correlation; refer to the workshop slides!)
- .value_counts(): Shows the number of observations per value in a column
- .nunique(): Shows the number of unique values in a column
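As a quick illustration of a few of these functions (using the small my_df frame from above, so the results are easy to check by hand):

# Minimum, maximum, mean and sum of the "numbers" column of my_df
my_df["numbers"].min(), my_df["numbers"].max(), my_df["numbers"].mean(), my_df["numbers"].sum()
(1, 3, 2.0, 6)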
# Import the bestsellers dataset and view the head
books = pd.read_csv("bestsellers with categories.csv")
books.head()
|  | Name | Author | User Rating | Reviews | Price | Year | Genre |
---|---|---|---|---|---|---|---|
0 | 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350 | 8 | 2016 | Non Fiction |
1 | 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011 | Fiction |
2 | 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979 | 15 | 2018 | Non Fiction |
3 | 1984 (Signet Classics) | George Orwell | 4.7 | 21424 | 6 | 2017 | Fiction |
4 | 5,000 Awesome Facts (About Everything!) (Natio... | National Geographic Kids | 4.8 | 7665 | 12 | 2019 | Non Fiction |
books.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550 entries, 0 to 549
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 550 non-null object
1 Author 550 non-null object
2 User Rating 550 non-null float64
3 Reviews 550 non-null int64
4 Price 550 non-null int64
5 Year 550 non-null int64
6 Genre 550 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 30.2+ KB
# Let's check the type of our dataset!
type(books)
pandas.core.frame.DataFrame
# Retrieve the "User Rating" column from our dataset
"User Rating"].head() books[
|  | User Rating |
---|---|
0 | 4.7 |
1 | 4.6 |
2 | 4.7 |
3 | 4.7 |
4 | 4.8 |
# Find the *data type* of the "User Rating" column
"User Rating"].dtype books[
dtype('float64')
# Find the average number of reviews in this dataset
books["Reviews"].mean()
11953.281818181818
# Find the most expensive and cheapest book prices in our dataset
books["Price"].max(), books["Price"].min()
(105, 0)
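A related shortcut is .describe(), which reports the count, mean, standard deviation, minimum, quartiles, and maximum of a numeric column in one call; shown here as a sketch with its output omitted:

# One-call summary of the Price column
books["Price"].describe()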
# Find the names of the cheapest and most expensive books
"Price"] == books["Price"].max()]["Name"] books[books[
|  | Name |
---|---|
69 | Diagnostic and Statistical Manual of Mental Di... |
70 | Diagnostic and Statistical Manual of Mental Di... |
"Price"] == books["Price"].min()]["Name"] books[books[
|  | Name |
---|---|
42 | Cabin Fever (Diary of a Wimpy Kid, Book 6) |
71 | Diary of a Wimpy Kid: Hard Luck, Book 8 |
116 | Frozen (Little Golden Book) |
193 | JOURNEY TO THE ICE P |
219 | Little Blue Truck |
358 | The Constitution of the United States |
381 | The Getaway |
461 | The Short Second Life of Bree Tanner: An Eclip... |
505 | To Kill a Mockingbird |
506 | To Kill a Mockingbird |
507 | To Kill a Mockingbird |
508 | To Kill a Mockingbird |
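If only a single title is needed rather than every tie, .idxmax() and .idxmin() return the index label of the first maximum or minimum, which can then be passed to .loc. A sketch, with the output omitted since the full titles are truncated above:

# Name of the first book at the maximum price (later ties are skipped)
books.loc[books["Price"].idxmax(), "Name"]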
# Which authors have produced the most bestselling books?
books["Author"].value_counts().head()
| Author | count |
|---|---|
Jeff Kinney | 12 |
Gary Chapman | 11 |
Rick Riordan | 11 |
Suzanne Collins | 11 |
American Psychological Association | 10 |
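Closely related is .nunique(), which returns just the number of distinct values; a minimal sketch (the exact count is not shown in the workshop output, so it is omitted here):

# How many distinct authors appear in the dataset?
books["Author"].nunique()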
# Find the books that have a user rating less than 4
"User Rating"] < 4] books[books[
|  | Name | Author | User Rating | Reviews | Price | Year | Genre |
---|---|---|---|---|---|---|---|
22 | Allegiant | Veronica Roth | 3.9 | 6310 | 13 | 2013 | Fiction |
106 | Fifty Shades of Grey: Book One of the Fifty Sh... | E L James | 3.8 | 47265 | 14 | 2012 | Fiction |
107 | Fifty Shades of Grey: Book One of the Fifty Sh... | E L James | 3.8 | 47265 | 14 | 2013 | Fiction |
132 | Go Set a Watchman: A Novel | Harper Lee | 3.6 | 14982 | 19 | 2015 | Fiction |
353 | The Casual Vacancy | J.K. Rowling | 3.3 | 9372 | 12 | 2012 | Fiction |
392 | The Goldfinch: A Novel (Pulitzer Prize for Fic... | Donna Tartt | 3.9 | 33844 | 20 | 2013 | Fiction |
393 | The Goldfinch: A Novel (Pulitzer Prize for Fic... | Donna Tartt | 3.9 | 33844 | 20 | 2014 | Fiction |
# How many of those books have less than 10,000 reviews?
"User Rating"] < 4) & (books["Reviews"] < 10000)] books[(books[
|  | Name | Author | User Rating | Reviews | Price | Year | Genre |
---|---|---|---|---|---|---|---|
22 | Allegiant | Veronica Roth | 3.9 | 6310 | 13 | 2013 | Fiction |
353 | The Casual Vacancy | J.K. Rowling | 3.3 | 9372 | 12 | 2012 | Fiction |
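To answer the "how many" question directly, wrap the same filter in len() (or check .shape); this matches the two rows displayed above:

# Count the matching rows instead of displaying them
len(books[(books["User Rating"] < 4) & (books["Reviews"] < 10000)])
2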
# Find the correlation between User Rating and Reviews
"User Rating", "Reviews"]].corr() books[[
|  | User Rating | Reviews |
---|---|---|
User Rating | 1.000000 | -0.001729 |
Reviews | -0.001729 | 1.000000 |