Data Visualization: Stats R. Fun, #3

DS Collaborative, Fall 2024. Presented by: Ethan P. Marzban

## DO NOT EDIT
import pandas as pd
import numpy as np
import altair as alt
import scipy.stats as sps

from vega_datasets import data
alt.renderers.enable('html')

Part 1: Basic Visualizations in Altair

As mentioned during the workshop lecture, today’s workshop will primarily make use of the plotting library “Altair”. You can read more about Altair at https://altair-viz.github.io/.

Let’s start by making a simple scatterplot. First, we need data! Create a pandas dataframe consisting of two columns, called x and y. Populate the x column with all integers between -3 and 3 (inclusive), and populate the y column with the squares of the corresponding x values. Assign the data frame to a variable (pick the name yourself!). The first few rows of your data frame should look like:

 x  y
-3  9
-2  4
-1  1

## Replace this cell with your answers
my_df = pd.DataFrame({
    'x': np.arange(-3, 4),
    'y': np.arange(-3, 4) ** 2
})

my_df
x y
0 -3 9
1 -2 4
2 -1 1
3 0 0
4 1 1
5 2 4
6 3 9

Now that we have data, we can make a graph! Remember that Altair follows the Grammar of Graphics, which posits that visualizations can be broken down into:

- Axes: the axes (or axis) of the plot
- Geoms: the shapes/objects that comprise the plot (e.g. points for scatterplots, bars for bargraphs and histograms, etc.)
- Aesthetics: mappings from the data to the geoms (e.g. coordinates for the points in the scatterplot, length of bars in a bargraph, etc.)

Further recall that different variable types best correspond to different plot types. Given that our data frame from above consists of two numerical variables, what type of plot is most appropriate to visualize the data? Once you’ve answered this question, use Altair to generate the appropriate plot. Don’t worry about formatting too much yet.

## Replace this cell with your answers
alt.Chart(
    my_df
).mark_point().encode(
    x = 'x',
    y = 'y'
)

Though this is a nice enough plot, perhaps we want to highlight the fact that these points lie along a parabola. As such, let’s work toward adding a parabolic curve to our plot that passes through all the points.

There are several ways to do this; let’s work through one together. First, let’s focus on generating the graph of the function \(f(x) = x^2\) on the interval \([-3, 3]\). One way to do this is to create a very fine grid of x-values between \(-3\) and \(3\), square each, and then draw a line graph through the resulting points. Because the x-values are so close together, the individual line segments blend into what looks like a smooth curve.

## Replace this cell with your answers
temp_df = pd.DataFrame({
    'x': np.linspace(-3, 3, num = 100),
    'y':  np.linspace(-3, 3, num = 100) ** 2
})

parabola1 = alt.Chart(temp_df).encode(x = 'x', y = 'y')
parabola1.mark_line()
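To get a feel for how fine this grid is, the following sketch inspects the values np.linspace produces (the names xs, ys, and step are chosen here for illustration). Unlike np.arange, np.linspace takes the number of points rather than the step size, and it includes both endpoints by default.

```python
import numpy as np

xs = np.linspace(-3, 3, num=100)   # 100 evenly spaced values, both endpoints included
ys = xs ** 2                       # square each value elementwise

step = xs[1] - xs[0]               # spacing between neighbouring x-values
print(len(xs), xs[0], xs[-1])
print(step)
```

The spacing works out to 6/99, or about 0.06, which is small enough that the straight segments between neighbouring points are hard to tell apart from a true curve.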

Next, we can layer our parabola onto our initial plot. Though we could use the alt.layer() function, we can also use the + operator!

## Replace this cell with your answers

scatter1 = alt.Chart(my_df).encode(
    x = 'x',
    y = 'y'
)

scatter1.mark_point(size = 100, filled = True) + parabola1.mark_line()

Finally, let’s add some formatting to this plot (and a title)!

## Replace this cell with your answers
plot1 = scatter1.mark_point(size = 100, filled = True) + parabola1.mark_line()
plot1.properties(title = "My First Plot").configure_axis(
    labelFontSize = 14,
    titleFontSize = 16
).configure_title(
    fontSize = 18
)

Part 2: Iowa Electricity Dataset

Let’s now work with some real data! To avoid having to worry about downloading and uploading data from external sources, we’ll be working with one of the built-in datasets from the vega_datasets module.

Our specific dataset contains the annual net generation of electricity in the state of Iowa by source, in thousands of megawatt-hours, between 2001 and 2017. To import and save the dataframe as a variable called iowa, run the following cell:

iowa = data.iowa_electricity()

Now, display the first 10 rows of the dataframe. (Yes, this is a bit of a review from previous workshops!)

## Replace this cell with your answers
iowa.head(10)
year source net_generation
0 2001-01-01 Fossil Fuels 35361
1 2002-01-01 Fossil Fuels 35991
2 2003-01-01 Fossil Fuels 36234
3 2004-01-01 Fossil Fuels 36205
4 2005-01-01 Fossil Fuels 36883
5 2006-01-01 Fossil Fuels 37014
6 2007-01-01 Fossil Fuels 41389
7 2008-01-01 Fossil Fuels 42734
8 2009-01-01 Fossil Fuels 38620
9 2010-01-01 Fossil Fuels 42750

Aggregating Across Years; Comparing Across Sectors

Let’s work toward visualizing the net generation across the three sources (Fossil Fuels, Nuclear Energy, and Renewables), aggregated across all 17 years included in the dataset.

First, what type of plot do you think would be most appropriate?

Now, what we would like to do is aggregate across years, but within each source. We’ll talk more about how to do this in the next workshop (on Data Tidying); for now, I’ll just mention that we can achieve our result by grouping the dataframe with its .groupby() method and then summing net_generation within each group.

# Select net_generation before summing so that only that column is totaled
aggregate_gen = iowa.groupby('source', as_index=False)['net_generation'].sum()
aggregate_gen
source net_generation
0 Fossil Fuels 620129
1 Nuclear Energy 80103
2 Renewables 164220
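To make the grouping step a little less mysterious, here is a minimal sketch on a tiny made-up data frame in the same long format as iowa (the frame toy and its values are hypothetical). Rows sharing a source are collected into one group, and their net_generation values are added up.

```python
import pandas as pd

# Hypothetical toy data in the same long format as iowa
toy = pd.DataFrame({
    'year': [2001, 2002, 2001, 2002],
    'source': ['Wind', 'Wind', 'Solar', 'Solar'],
    'net_generation': [10, 20, 1, 2]
})

# One row per source; as_index=False keeps 'source' as an ordinary column
totals = toy.groupby('source', as_index=False)['net_generation'].sum()
print(totals)
```

Keeping source as a regular column (rather than the index) is convenient here, since Altair encodings refer to columns by name.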

Finally, use this dataframe to generate the desired plot. Adjust the plot to have appropriate font sizes and dimensions - also include a title.

## Replace this cell with your answers
alt.Chart(
    aggregate_gen,
    title = "Aggregated Generation Across Sectors"
).mark_bar().encode(
    x = 'source',
    y = 'net_generation'
).properties(
    width = 500
).configure_axis(
    labelFontSize = 16,
    titleFontSize = 18
).configure_title(
    fontSize = 18
)

Changes Over Time

It may be interesting to view how the annual net generation from each source changes over time. Produce an appropriate graphic, and interpret.

## Replace this cell with your answers
alt.Chart(
    iowa,
    title = "Net Generation Over Time"
).mark_line(point = True).encode(
    x = 'year',
    y = 'net_generation',
    color = 'source'
).properties(
    width = 500
).configure_axis(
    labelFontSize = 16,
    titleFontSize = 18
).configure_title(
    fontSize = 18
)