Choropleth Map Plots Analysis of Relationship Between Sales and Income Data#
In this NoteBook, we will be analyzing the relationship between sales of tobacco and median income in each state in the years 2013, 2014, 2016, 2017, 2018, and 2019. We omitted the year 2015 because the original data source U.S._Chronic_Disease_Indicators__Tobacco.csv lacks the average sales data we need for analysis. We first restructured the average sales and the median income data into our desired data frame and displayed them. We then plot 6 choropleth graphs for average sales in each state in the years of 2013 to 2019 (excluding 2015), and 7 choropleth graphs for median income in each state in the years of 2013 to 2019. We put them into the outputs folder of the project. Finally, we compared the graphs that we generated, analyze them, and drew a conclusion from the graphs.
import plotly.express as px
from IPython.display import HTML
from IPython.display import Image
from project_tools.dataformat import *
Loading and Processing Data#
Average Sales Data#
# import cleaned-up tobacco average sales data
states_sales = tobacco_data_import()
# change format of data
sub = states_sales.groupby(['LocationDesc', 'LocationAbbr', 'DataValueUnit']).apply(lambda row: pd.Series({"2013": row.DataValue[row.YearStart == 2013].values[0],
"2014": row.DataValue[row.YearStart == 2014].values[0],
# ignoring 2015 because missing data in the original source
"2016": row.DataValue[row.YearStart == 2016].values[0],
"2017": row.DataValue[row.YearStart == 2017].values[0],
"2018": row.DataValue[row.YearStart == 2018].values[0],
"2019": row.DataValue[row.YearStart == 2019].values[0]}))
sub =sub.reset_index(['LocationDesc', 'LocationAbbr', 'DataValueUnit'], drop=False)
sub.head()
LocationDesc | LocationAbbr | DataValueUnit | 2013 | 2014 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | AL | pack sales per capita | 64.6 | 61.7 | 60.1 | 58.1 | 54.7 | 53.1 |
1 | Alaska | AK | pack sales per capita | 39 | 37.2 | 35.6 | 32.9 | 31.1 | 30.4 |
2 | Arizona | AZ | pack sales per capita | 24.4 | 23 | 23.1 | 22.5 | 21.8 | 20.7 |
3 | Arkansas | AR | pack sales per capita | 57.5 | 54.4 | 54.2 | 52.4 | 50.2 | 47.6 |
4 | California | CA | pack sales per capita | 23.9 | 22.7 | 22 | 20.5 | 16.6 | 15.8 |
Median Income Data#
median_income = income_data_import()
median_income.head()
Location | 2019 Median income | 2018 Median income | 2017 Median income | 2016 Median income | 2015 Median income | 2014 Median income | 2013 Median income | avg_med_income | state | |
---|---|---|---|---|---|---|---|---|---|---|
1 | Alabama | 56200.0 | 49936.0 | 51113.0 | 47221.0 | 44509.0 | 42278.0 | 41381.0 | 47519.714286 | AL |
2 | Alaska | 78394.0 | 68734.0 | 72231.0 | 75723.0 | 75112.0 | 67629.0 | 61137.0 | 71280.000000 | AK |
3 | Arizona | 70674.0 | 62283.0 | 61125.0 | 57100.0 | 52248.0 | 49254.0 | 50602.0 | 57612.285714 | AZ |
4 | Arkansas | 54539.0 | 49781.0 | 48829.0 | 45907.0 | 42798.0 | 44922.0 | 39919.0 | 46670.714286 | AR |
5 | California | 78105.0 | 70489.0 | 69759.0 | 66637.0 | 63636.0 | 60487.0 | 57528.0 | 66663.000000 | CA |
Map plot for avg sales per state from 2013 - 2019 (6 graphs; 2015 omitted)#
# avg sales (sales per capita) for each year (6 in total)
# Year 2015 omitted
Years = [2013, 2014, 2016, 2017, 2018, 2019]
year = 0
# plotting 6 graphs in 2013, 2014, 2016, 2017, 2018, and 2019
for index in range(3, 9):
ddf = sub.iloc[:, index].astype(float)
fig = px.choropleth(sub,
locations="LocationAbbr",
hover_name="LocationDesc",
color= ddf.values,
locationmode='USA-states',
color_continuous_scale='greens',
range_color = (80000, 40000))
fig.update_layout(title_text = f"Sales per capita in {Years[year]}", geo_scope='usa')
fig.write_html(f"output/avgSales{Years[year]}.html")
year += 1
# fig.show()
# show output plot for 2019
HTML(filename = 'output/avgSales2019.html')
Map plot for median income from 2013 - 2019 (7 graphs)#
# median income graph for each year (6 in total)
Years1 = [2013, 2014, 2015, 2016, 2017, 2018, 2019]
year = 0
# plotting 7 graphs in 2013 to 2019
for index in reversed(range(1, 8)):
ddf = median_income.iloc[:, index].astype(float)
fig = px.choropleth(median_income,
locations="state",
hover_name="Location",
color=ddf.values,
locationmode='USA-states',
color_continuous_scale='reds',
range_color = (80000, 40000))
fig.update_layout(title_text = f"Median Income in {Years1[year]}", geo_scope='usa')
fig.write_html(f"output/medianIncome{Years1[year]}.html")
year += 1
# fig.show()
# show output plot for 2019
HTML(filename = 'output/medianIncome2019.html')
Analysis#
We only displayed one graph each for average sales and median income, otherwise the notebook is too big to display. The readers can find other graphs generated for years 2014, 2015, 2016, 2017, 2018, and 2019 for average sales and median income in the output folder.
As we can see from the graphs of average sales, the allocation of which states have the most tobacco pack sales per capita hasn’t changed much, with the state of New Hampshire, Kentucky, West Virginia, and Missouri having relatively the most packs sold in the years between 2013 and 2019 (excluding 2015). However, in the Median income graphs we see that the states with the highest median income are usually New Hampshire, Massachusetts, Connecticut, New Jersey, and Maryland. Just by looking at these data, it seems like there’s no direct relationship between the average amount of pack sold and median income in that state. A higher median income of a state doesn’t necessarily correlate with more packs of tobacco sold, and same with the opposite. We can see that the states with least amount of packs sold (New York, Utah, Washington, and California) are not the ones with the lowest median income (Mississippi, New Mexico, Louisiana, Arkansas, West Virginia, etc).
Therefore, we failed to find any significant or meaningful relationship between the sales and income in each state just by visualizing the graphs and understanding which states has higher sales per capita and which states has higher median income. And thus we went one step further and conducted a regression analysis on the two data in the next Notebook.