DS Collaborative, Fall 2024. Presented by: Ethan P. Marzban
## DO NOT EDIT
import pandas as pd
import numpy as np
import altair as alt
import scipy.stats as sps
from vega_datasets import data
alt.renderers.enable('html')
RendererRegistry.enable('html')
As mentioned during the workshop lecture, today’s workshop will primarily make use of the plotting library “Altair”. You can read more about Altair at https://altair-viz.github.io/.
Let’s start by making a simple scatterplot. First, we need data! Create a pandas dataframe consisting of two columns, called `x` and `y`. Populate the `x` column with all integers between -3 and 3 (inclusive), and populate the `y` column with the squares of the corresponding `x` values. Assign the data frame to a variable (pick the name yourself!). The first few rows of your data frame should look like:
x | y |
---|---|
-3 | 9 |
-2 | 4 |
-1 | 1 |
## Replace this cell with your answers
my_df = pd.DataFrame({
    'x': np.arange(-3, 4),
    'y': np.arange(-3, 4) ** 2
})
my_df
| | x | y |
---|---|---|
0 | -3 | 9 |
1 | -2 | 4 |
2 | -1 | 1 |
3 | 0 | 0 |
4 | 1 | 1 |
5 | 2 | 4 |
6 | 3 | 9 |
Now that we have data, we can make a graph! Remember that Altair follows the Grammar of Graphics, which posits that visualizations can be broken down into:

- Axes: the axes (or axis) of the plot
- Geoms: the shapes/objects that comprise the plot (e.g. points for scatterplots, bars for bargraphs and histograms, etc.)
- Aesthetics: mappings from the data to the geoms (e.g. coordinates for the points in the scatterplot, length of bars in a bargraph, etc.)
Further recall that different variable types best correspond to different plot types. Given that our data frame from above consists of two numerical variables, what type of plot is most appropriate to visualize the data? Once you’ve answered this question, use Altair to generate the appropriate plot. Don’t worry about formatting too much yet.
Though this is a nice enough plot, perhaps we want to highlight the fact that these points lie along a parabola. As such, let’s work toward adding a parabolic curve to our plot that passes through all the points.
There are several ways to do this; let’s work through one together. First, let’s focus on generating the graph of the function \(f(x) = x^2\) on the interval \([-3, 3]\). One way to do this is to create a very fine grid of points between \(-3\) and \(3\), square each, and then create a line graph of the resulting points. Because the `x`-values are so close together, the line segments will appear smooth when viewed holistically.
## Replace this cell with your answers
temp_df = pd.DataFrame({
    'x': np.linspace(-3, 3, num = 100),
    'y': np.linspace(-3, 3, num = 100) ** 2
})
parabola1 = alt.Chart(temp_df).encode(x = 'x', y = 'y')
parabola1.mark_line()
Finally, we can layer our parabola onto our initial plot. Though we could use the `alt.layer()` function, we can actually also use the `+` operator!
## Replace this cell with your answers
scatter1 = alt.Chart(my_df).encode(
    x = 'x',
    y = 'y'
)
scatter1.mark_point(size = 100, filled = True) + parabola1.mark_line()
Finally, let’s add some formatting to this plot (and a title)!
Let’s now work with some real data! To avoid having to worry about downloading and uploading data from external sources, we’ll be working with one of the built-in datasets from the `vega_datasets` module.
Our specific dataset contains the annual net generation of electricity in the state of Iowa by source, in thousands of megawatt-hours, between 2001 and 2017. To import and save the dataframe as a variable called `iowa`, run the following cell:
Now, display the first 10 rows of the dataframe. (Yes, this is a bit of a review from previous workshops!)
| | year | source | net_generation |
---|---|---|---|
0 | 2001-01-01 | Fossil Fuels | 35361 |
1 | 2002-01-01 | Fossil Fuels | 35991 |
2 | 2003-01-01 | Fossil Fuels | 36234 |
3 | 2004-01-01 | Fossil Fuels | 36205 |
4 | 2005-01-01 | Fossil Fuels | 36883 |
5 | 2006-01-01 | Fossil Fuels | 37014 |
6 | 2007-01-01 | Fossil Fuels | 41389 |
7 | 2008-01-01 | Fossil Fuels | 42734 |
8 | 2009-01-01 | Fossil Fuels | 38620 |
9 | 2010-01-01 | Fossil Fuels | 42750 |
Let’s work toward visualizing the net generation across the three sources (Fossil Fuels, Nuclear Energy, and Renewables), aggregated across all 17 years included in the dataset.
First, what type of plot do you think would be most appropriate?
Now, what we would like to do is aggregate across years, but within each source. We’ll talk more about how to do this in the next workshop (on Data Tidying); for now, I’ll just mention that we can achieve our result by grouping the dataframe using its `groupby()` method.
| | source | net_generation |
---|---|---|
0 | Fossil Fuels | 620129 |
1 | Nuclear Energy | 80103 |
2 | Renewables | 164220 |
Finally, use this dataframe to generate the desired plot. Adjust the plot to have appropriate font sizes and dimensions, and include a title.
## Replace this cell with your answers
alt.Chart(
    aggregate_gen,
    title = "Aggregated Generation Across Sectors"
).mark_bar().encode(
    x = 'source',
    y = 'net_generation'
).properties(
    width = 500
).configure_axis(
    labelFontSize = 16,
    titleFontSize = 18
).configure_title(
    fontSize = 18
)
It may be interesting to view how the annual net generation from each source changes over time. Produce an appropriate graphic, and interpret it.
## Replace this cell with your answers
alt.Chart(
    iowa,
    title = "Net Generation Over Time"
).mark_line(point = True).encode(
    x = 'year',
    y = 'net_generation',
    color = 'source'
).properties(
    width = 500
).configure_axis(
    labelFontSize = 16,
    titleFontSize = 18
).configure_title(
    fontSize = 18
)