# Import NumPy and Pandas
import pandas as pd
import numpy as np
Getting Started
Create a copy of the worksheet by going to the top left and selecting File -> Save a copy in Drive.
movies = pd.DataFrame({
    "Movie": ["Cars 2", "The Exorcist", "Click"],
    "Rating": [4.2, 4.1, 4.7],
    "Genre": ["Family", "Horror", "Comedy"]
})
movies.iloc[0]
|  | 0 |
---|---|
Movie | Cars 2 |
Rating | 4.2 |
Genre | Family |
NumPy Arrays
NumPy arrays look and behave a lot like Python lists! We can construct them using np.array() and index them the exact same way.
# Create two NumPy arrays; one from 1-5, and another from 6-10
my_array1 = np.array([1,2,3,4,5])
my_array2 = np.array([6,7,8,9,10])
# Retrieve the first element from your arrays
my_array1[0], my_array2[0]
(1, 6)
We can do element-wise mathematical operations between arrays, such as addition and subtraction!
# Add, subtract and multiply your arrays
my_array1 + my_array2
array([ 7,  9, 11, 13, 15])
my_array1 - my_array2
array([-5, -5, -5, -5, -5])
my_array1 * my_array2
array([ 6, 14, 24, 36, 50])
# Retrieve the first 3 elements from your arrays
my_array1[0:3]
array([1, 2, 3])
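Slicing also accepts negative positions, just like Python lists. A small extra sketch, not part of the original exercise:

# Retrieve the last two elements using a negative start index
my_array1[-2:]
array([4, 5])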
# Find the sum of all elements in your arrays
my_array2.sum()
40
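Other reductions work the same way; for example, .mean() and .max() are standard NumPy array methods (shown here as an extra sketch, not part of the original worksheet):

# Average and largest value of the first array
my_array1.mean(), my_array1.max()
(3.0, 5)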
# Use np.linspace() to create an array of values from 0 to 20, in intervals of 5
fives = np.linspace(start=0, stop=20, num=5)
fives
array([ 0., 5., 10., 15., 20.])
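If you know the step size rather than the number of points, np.arange() produces the same values; a small sketch (np.arange is standard NumPy but was not used in the workshop, and note that its stop value is exclusive):

# Same sequence built from an explicit step of 5 (the stop of 21 is exclusive)
np.arange(0, 21, 5)
array([ 0,  5, 10, 15, 20])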
# Create a 3x3 array of values from 1-9
big_array = np.array([[1,2,3],
                      [4,5,6],
                      [7,8,9]])
big_array
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
# Confirm that your array has dimensions of 3x3
big_array.shape
(3, 3)
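The same 3x3 array can also be built without typing every value by combining np.arange() with .reshape(); a minimal sketch, not part of the original exercise:

# Build the values 1-9 and reshape them into 3 rows and 3 columns
np.arange(1, 10).reshape(3, 3)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])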
# Recreate the same array but with a float data type instead of integer
big_array2 = np.array([[1,2,3],
                       [4,5,6],
                       [7,8,9]], dtype="float")
big_array2
array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])
# Confirm that your array has a float data type
big_array2.dtype
dtype('float64')
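Alternatively, an existing integer array can be converted with .astype(); this is a standard NumPy method shown here as a sketch rather than as part of the exercise:

# Convert the original integer array to floats without retyping it
big_array.astype("float").dtype
dtype('float64')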
Pandas Data Frames
Pandas Data Frames are built much like Python dictionaries. We can create them by passing a dictionary of columns to pd.DataFrame() and index them the exact same way.
# Create a data frame with two columns; one with your three favorite foods and
# another with the values 1-3
my_df = pd.DataFrame({
    "favorite foods": ["pizza", "chicken bake", "melatonin gummies"],
    "numbers": [1,2,3]
})
my_df
|  | favorite foods | numbers |
---|---|---|
0 | pizza | 1 |
1 | chicken bake | 2 |
2 | melatonin gummies | 3 |
# Retrieve your favorite foods from your data frame
"favorite foods"] my_df[
|  | favorite foods |
---|---|
0 | pizza |
1 | chicken bake |
2 | melatonin gummies |
# Retrieve the first row of your data frame
my_df.iloc[0]
|  | 0 |
---|---|
favorite foods | pizza |
numbers | 1 |
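For reference, .loc selects rows by index label rather than by position. Since my_df uses the default labels 0, 1, 2, the sketch below returns the same row as .iloc[0]; it would differ if the index were relabeled:

# Label-based selection; identical to .iloc[0] here because the labels are 0-2
my_df.loc[0]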
Important Functions:

Examining your data:
- .head(): shows the first 5 observations of the data
- .info(): shows the number of rows and columns, blank ("null") values, and the data type of each variable
- .dtype: shows the underlying data type (integers, floats, etc.)

Analyzing your data:
- .min(): Minimum value of a column
- .max(): Maximum value of a column
- .mean(): Average value of a column
- .median(): Median value of a column
- .sum(): Sum of all values in a column
- .corr(): Shows the correlation between columns (NOTE: negative correlation does not mean less correlation; refer to the workshop slides!)
- .value_counts(): Shows the number of observations per value in a column
- .nunique(): Shows the number of unique values in a column
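As a quick illustration of a few of these functions (using the small my_df frame from above, so the results are easy to check by hand):

# Minimum, maximum, mean and sum of the "numbers" column of my_df
my_df["numbers"].min(), my_df["numbers"].max(), my_df["numbers"].mean(), my_df["numbers"].sum()
(1, 3, 2.0, 6)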
# Import the bestsellers dataset and view the head
books = pd.read_csv("bestsellers with categories.csv")
books.head()
|  | Name | Author | User Rating | Reviews | Price | Year | Genre |
---|---|---|---|---|---|---|---|
0 | 10-Day Green Smoothie Cleanse | JJ Smith | 4.7 | 17350 | 8 | 2016 | Non Fiction |
1 | 11/22/63: A Novel | Stephen King | 4.6 | 2052 | 22 | 2011 | Fiction |
2 | 12 Rules for Life: An Antidote to Chaos | Jordan B. Peterson | 4.7 | 18979 | 15 | 2018 | Non Fiction |
3 | 1984 (Signet Classics) | George Orwell | 4.7 | 21424 | 6 | 2017 | Fiction |
4 | 5,000 Awesome Facts (About Everything!) (Natio... | National Geographic Kids | 4.8 | 7665 | 12 | 2019 | Non Fiction |
books.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550 entries, 0 to 549
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 550 non-null object
1 Author 550 non-null object
2 User Rating 550 non-null float64
3 Reviews 550 non-null int64
4 Price 550 non-null int64
5 Year 550 non-null int64
6 Genre 550 non-null object
dtypes: float64(1), int64(3), object(3)
memory usage: 30.2+ KB
# Let's check the type of our dataset!
type(books)
pandas.core.frame.DataFrame
# Retrieve the "User Rating" column from our dataset
"User Rating"].head() books[
|  | User Rating |
---|---|
0 | 4.7 |
1 | 4.6 |
2 | 4.7 |
3 | 4.7 |
4 | 4.8 |
# Find the *data type* of the "User Rating" column
"User Rating"].dtype books[
dtype('float64')
# Find the average number of reviews in this dataset
books["Reviews"].mean()
11953.281818181818
# Find the most expensive and cheapest book prices in our dataset
books["Price"].max(), books["Price"].min()
(105, 0)
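A related shortcut is .describe(), which reports the count, mean, standard deviation, minimum, quartiles, and maximum of a numeric column in one call; shown here as a sketch with its output omitted:

# One-call summary of the Price column
books["Price"].describe()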
# Find the names of the cheapest and most expensive books
"Price"] == books["Price"].max()]["Name"] books[books[
|  | Name |
---|---|
69 | Diagnostic and Statistical Manual of Mental Di... |
70 | Diagnostic and Statistical Manual of Mental Di... |
"Price"] == books["Price"].min()]["Name"] books[books[
|  | Name |
---|---|
42 | Cabin Fever (Diary of a Wimpy Kid, Book 6) |
71 | Diary of a Wimpy Kid: Hard Luck, Book 8 |
116 | Frozen (Little Golden Book) |
193 | JOURNEY TO THE ICE P |
219 | Little Blue Truck |
358 | The Constitution of the United States |
381 | The Getaway |
461 | The Short Second Life of Bree Tanner: An Eclip... |
505 | To Kill a Mockingbird |
506 | To Kill a Mockingbird |
507 | To Kill a Mockingbird |
508 | To Kill a Mockingbird |
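If only a single title is needed rather than every tie, .idxmax() and .idxmin() return the index label of the first maximum or minimum, which can then be passed to .loc. A sketch, with the output omitted since the full titles are truncated above:

# Name of the first book at the maximum price (later ties are skipped)
books.loc[books["Price"].idxmax(), "Name"]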
# Which authors have produced the most bestselling books?
books["Author"].value_counts().head()
| Author | count |
|---|---|
Jeff Kinney | 12 |
Gary Chapman | 11 |
Rick Riordan | 11 |
Suzanne Collins | 11 |
American Psychological Association | 10 |
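Closely related is .nunique(), which returns just the number of distinct values; a minimal sketch (the exact count is not shown in the workshop output, so it is omitted here):

# How many distinct authors appear in the dataset?
books["Author"].nunique()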
# Find the books that have a user rating less than 4
"User Rating"] < 4] books[books[
|  | Name | Author | User Rating | Reviews | Price | Year | Genre |
---|---|---|---|---|---|---|---|
22 | Allegiant | Veronica Roth | 3.9 | 6310 | 13 | 2013 | Fiction |
106 | Fifty Shades of Grey: Book One of the Fifty Sh... | E L James | 3.8 | 47265 | 14 | 2012 | Fiction |
107 | Fifty Shades of Grey: Book One of the Fifty Sh... | E L James | 3.8 | 47265 | 14 | 2013 | Fiction |
132 | Go Set a Watchman: A Novel | Harper Lee | 3.6 | 14982 | 19 | 2015 | Fiction |
353 | The Casual Vacancy | J.K. Rowling | 3.3 | 9372 | 12 | 2012 | Fiction |
392 | The Goldfinch: A Novel (Pulitzer Prize for Fic... | Donna Tartt | 3.9 | 33844 | 20 | 2013 | Fiction |
393 | The Goldfinch: A Novel (Pulitzer Prize for Fic... | Donna Tartt | 3.9 | 33844 | 20 | 2014 | Fiction |
# How many of those books have less than 10,000 reviews?
"User Rating"] < 4) & (books["Reviews"] < 10000)] books[(books[
|  | Name | Author | User Rating | Reviews | Price | Year | Genre |
---|---|---|---|---|---|---|---|
22 | Allegiant | Veronica Roth | 3.9 | 6310 | 13 | 2013 | Fiction |
353 | The Casual Vacancy | J.K. Rowling | 3.3 | 9372 | 12 | 2012 | Fiction |
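To answer the "how many" question directly, wrap the same filter in len() (or check .shape); this matches the two rows displayed above:

# Count the matching rows instead of displaying them
len(books[(books["User Rating"] < 4) & (books["Reviews"] < 10000)])
2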
# Find the correlation between User Rating and Reviews
"User Rating", "Reviews"]].corr() books[[
|  | User Rating | Reviews |
---|---|---|
User Rating | 1.000000 | -0.001729 |
Reviews | -0.001729 | 1.000000 |