Four-Year Retention Rates#

import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import seaborn as sns
import pandas as pd
import scipy.stats as stats

from tools.utils import filtered_df_two_columns, pearson_corr_coef, prediction_analysis, regression_analysis_results, calculate_MSE

Given the preliminary feature analysis conducted in EDA.ipynb, we found that the following variables are associated with retention rates.

  • Four-year retention rates

    • SAT_AVG: Average SAT equivalent score of students admitted

    • AVGFACSAL: Average faculty salary

    • PAR_ED_PCT_HS: Percent of students whose parents’ highest educational level is high school

    • PAR_ED_PCT_PS: Percent of students whose parents’ highest educational level is some form of postsecondary education

  • Less-than-four-year retention rates

    • Ethnic Diversity (UGDS): Total share of enrollment of undergraduate degree-seeking students who are [specific race]

Now, we are going to conduct further analysis including linear regression in order to explore the true relationship between these features and retention rates.

Loading in Data#

# Load in cleaned data
data = pd.read_csv("data/Most-Recent-Cohorts-Institution-filtered.csv")
data = data.drop('Unnamed: 0', axis=1)

SAT_AVG: Average SAT equivalent score of students admitted#

# Keep only institutions that report the relevant retention rate
FT4_institutions = data[data['RET_FT4'].notnull()][['RET_FT4', 'SAT_AVG']]
FTL4_institutions = data[data['RET_FTL4'].notnull()][['RET_FTL4', 'SAT_AVG']]
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,4))
# Left panel: distribution of average SAT scores at four-year institutions
sns.histplot(data=FT4_institutions, x='SAT_AVG', edgecolor='white', stat='density', ax=ax1)
ax1.set_title('Distribution of Average SAT Scores')
# Right panel: average SAT score against four-year retention rate
sns.scatterplot(data=FT4_institutions, x='SAT_AVG', y='RET_FT4', ax=ax2)
ax2.set_title('Average SAT Scores vs. Retention Rates')
plt.savefig('figures/SAT_AVG_VS_RET_FT4.png');
[Figure: left, Distribution of Average SAT Scores; right, Average SAT Scores vs. Retention Rates]

The average SAT score at four-year institutions is around 1100. There is a positive relationship between average SAT score and retention rate: institutions with higher average SAT scores tend to have higher retention rates.

SAT_score_FT4 = filtered_df_two_columns(data, 'SAT_AVG', 'RET_FT4')
SAT_score_FT4.head()
   SAT_AVG  RET_FT4
0    959.0   0.5403
1   1245.0   0.8640
3   1300.0   0.8180
4    938.0   0.6202
5   1262.0   0.8723
pearson_corr_coef(SAT_score_FT4.SAT_AVG, SAT_score_FT4.RET_FT4)
array([[1.20105587e+03, 6.97777529e-01],
       [6.97777529e-01, 8.34092182e-04]])
regression_analysis_results(SAT_score_FT4.RET_FT4, prediction_analysis(SAT_score_FT4), 'SATactual_vs_SATpredicted.png')
[Figure: actual vs. predicted retention rates for the SAT_AVG model]
0.4860228610596219

Based on the R^2 value, a linear model using average SAT score explains about 49% of the variance in retention rates.
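The helpers imported from tools.utils are not shown on this page, so the following is only a sketch of how an R^2 value of this kind can be computed, assuming an ordinary least-squares fit of retention on a single feature. It uses the five rows printed by SAT_score_FT4.head() above purely as illustration, so the numbers it produces are not the notebook's results.

```python
import numpy as np

# Five rows copied from the SAT_score_FT4.head() output; illustration only.
sat = np.array([959.0, 1245.0, 1300.0, 938.0, 1262.0])
ret = np.array([0.5403, 0.8640, 0.8180, 0.6202, 0.8723])

# Ordinary least-squares line: ret is approximated by slope * sat + intercept.
slope, intercept = np.polyfit(sat, ret, deg=1)
predicted = slope * sat + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((ret - predicted) ** 2)
ss_tot = np.sum((ret - ret.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# With one predictor and an intercept, R^2 equals the squared Pearson r.
r = np.corrcoef(sat, ret)[0, 1]
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

On the full data, the off-diagonal entry of the pearson_corr_coef output above (about 0.698) squares to about 0.487, which is close to the reported R^2 of 0.486.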

calculate_MSE(SAT_score_FT4.RET_FT4, prediction_analysis(SAT_score_FT4))
0.017792116415510515

Given the low MSE, we believe that average SAT score is a good predictor of four-year retention rates. Based on the regression, a higher average SAT score is associated with a higher retention rate.
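MSE is expressed in squared units of the target, so it can be easier to judge on the original scale. As a small worked step (not part of the original analysis), taking the square root of the reported MSE gives the root-mean-square error in retention-rate units:

```python
import math

# MSE reported above for the SAT_AVG model (RET_FT4 is a proportion in [0, 1]).
mse_sat = 0.017792116415510515

# RMSE is the square root of MSE, in the same units as RET_FT4.
rmse_sat = math.sqrt(mse_sat)
print(f"RMSE = {rmse_sat:.3f}")  # about 0.133, i.e. roughly 13 percentage points
```

Whether an error of this size counts as low depends on how much retention rates vary across institutions, so the RMSE is best read alongside the R^2 value.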

AVGFACSAL: Average faculty salary#

AVGFACSAL_FT4 = filtered_df_two_columns(data, 'AVGFACSAL', 'RET_FT4')
AVGFACSAL_FT4.head()
   AVGFACSAL  RET_FT4
0     7599.0   0.5403
1    11380.0   0.8640
2     4545.0   0.5000
3     9697.0   0.8180
4     7194.0   0.6202

Before using linear models, we visualize the relationship between average faculty salary and student retention and find that it is roughly linear.

sns.scatterplot(data=AVGFACSAL_FT4, x="AVGFACSAL", y="RET_FT4")
plt.savefig("figures/avgfacsal_retft4_scatter.png")
[Figure: scatter plot of AVGFACSAL vs. RET_FT4]
pearson_corr_coef(AVGFACSAL_FT4.AVGFACSAL, AVGFACSAL_FT4.RET_FT4)
array([[1.66224401e+04, 3.66145436e-01],
       [3.66145436e-01, 6.02185469e-05]])
regression_analysis_results(AVGFACSAL_FT4.RET_FT4, prediction_analysis(AVGFACSAL_FT4), 'AVGFACSALactual_vs_AVGFACSALpredicted.png')
[Figure: actual vs. predicted retention rates for the AVGFACSAL model]
0.1339313356752908

Based on the R^2 value, a linear model using average faculty salary explains about 13% of the variance in retention rates. While it is not the best predictor, there is a general upward trend in the actual vs. predicted values, which suggests some predictive power.

calculate_MSE(AVGFACSAL_FT4.RET_FT4, prediction_analysis(AVGFACSAL_FT4))
0.03094448077224073

Taking the low MSE into account as well, we conclude that there is a weak but present relationship between faculty salary and retention at four-year institutions: as faculty salary increases, retention rates generally increase as well.

PAR_ED_PCT_HS and PAR_ED_PCT_PS: Percent of students whose parents’ highest educational level is high school/postsecondary school#

As noted in our EDA, these two features contain many null and “PrivacySuppressed” values. We need to drop these values in order to continue our analysis.

PAR_HS_RET_FT4 = data[["PAR_ED_PCT_HS", "RET_FT4"]].replace("PrivacySuppressed", np.nan).dropna().astype(float)
PAR_PS_RET_FT4 = data[["PAR_ED_PCT_PS", "RET_FT4"]].replace("PrivacySuppressed", np.nan).dropna().astype(float)
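To make the effect of this cleaning step concrete, here is a minimal sketch on a made-up frame (the values below are hypothetical and not taken from the dataset). It applies the same replace, dropna, astype chain as above, and also shows pd.to_numeric with errors="coerce" as one possible alternative that turns any non-numeric entry into NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw columns; values are made up for illustration.
toy = pd.DataFrame({
    "PAR_ED_PCT_HS": ["0.31", "PrivacySuppressed", "0.45", None],
    "RET_FT4": [0.80, 0.75, np.nan, 0.62],
})

# Same steps as above: mark suppressed cells as missing, drop incomplete rows,
# then convert the remaining strings to floats.
cleaned = toy.replace("PrivacySuppressed", np.nan).dropna().astype(float)

# Alternative: coerce anything non-numeric to NaN column by column, then drop.
coerced = toy.apply(pd.to_numeric, errors="coerce").dropna()

print(cleaned)
print(len(toy) - len(cleaned), "rows dropped")
```

Printing the number of dropped rows is a cheap way to see how much of the sample each feature loses to suppression.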

Next, we find the correlation of retention at four-year institutions with the share of students whose parents’ highest education is high school, and with the share whose parents have some postsecondary education.

no_privacy_suppressed = data[["PAR_ED_PCT_PS", "PAR_ED_PCT_HS", "RET_FT4"]].replace("PrivacySuppressed", np.nan).dropna().astype(float)

corr_HS, p_value_HS = stats.pearsonr(no_privacy_suppressed['PAR_ED_PCT_HS'],  no_privacy_suppressed['RET_FT4'])
corr_PS, p_value_PS = stats.pearsonr(no_privacy_suppressed["PAR_ED_PCT_PS"],  no_privacy_suppressed['RET_FT4'])

print("Correlation of proportion of parents' being highschool educated with retention: " , corr_HS)
print("Correlation of proportion of parents' having some college education with retention: " , corr_PS)
Correlation of proportion of parents' being highschool educated with retention:  -0.5113620230360538
Correlation of proportion of parents' having some college education with retention:  0.47308519546666317

Both also have linear relationships with retention as shown in EDA. Therefore, we can continue with a linear model.
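stats.pearsonr returns both the coefficient and a two-sided p-value; the code above stores the p-values but does not print them. A minimal sketch on made-up data (the arrays below are hypothetical, not drawn from the dataset) shows how the sign of the coefficient gives the direction of the relationship and how the p-value could be reported alongside it:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical data: y falls as x rises, plus some noise.
rng = np.random.default_rng(0)
x = np.linspace(0.1, 0.6, 50)
y = 0.9 - 0.5 * x + rng.normal(scale=0.05, size=x.size)

# pearsonr returns the correlation coefficient and a two-sided p-value.
corr, p_value = stats.pearsonr(x, y)

# The sign of the coefficient gives the direction of the linear relationship.
direction = "negative" if corr < 0 else "positive"
print(f"r = {corr:.3f} ({direction}), p-value = {p_value:.3g}")
```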

regression_analysis_results(PAR_HS_RET_FT4.RET_FT4, prediction_analysis(PAR_HS_RET_FT4), 'PAR_ED_HSactual_vs_PAR_ED_HSpredicted.png')
[Figure: actual vs. predicted retention rates for the PAR_ED_PCT_HS model]
0.2614911186035256

Based on the R^2 value, a linear model using Parent Education (high school) explains about 26% of the variance in retention rates. While this R^2 value is low, the relationship between actual and predicted values shows some predictive power.

calculate_MSE(PAR_HS_RET_FT4.RET_FT4, prediction_analysis(PAR_HS_RET_FT4))
0.03096682761276435

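Thus, the low MSE indicates that our predicted retention values based on Parent Education (high school) are not far off from the actual values. Given the negative correlation found above (about -0.51), this is a decreasing linear relationship: the larger the share of students whose parents’ highest educational level is high school, the lower the retention rate tends to be.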

regression_analysis_results(PAR_PS_RET_FT4.RET_FT4, prediction_analysis(PAR_PS_RET_FT4), 'PAR_ED_PSactual_vs_PAR_ED_PSpredicted.png')
[Figure: actual vs. predicted retention rates for the PAR_ED_PCT_PS model]
0.19753850951573393

Based on the R^2 value, a linear model using Parent Education (postsecondary school) explains about 20% of the variance in retention rates. This R^2 value is lower than that for Parent Education (high school), but the relationship between actual and predicted values still shows some predictive power, though not as strong.

calculate_MSE(PAR_PS_RET_FT4.RET_FT4, prediction_analysis(PAR_PS_RET_FT4))
0.02902789934662409

However, the MSE of the retention predictions based on Parent Education (postsecondary school) is actually lower than the MSE of those based on Parent Education (high school). This indicates that these predicted values have less error than the previous ones, although the two models may be fit on slightly different sets of institutions, so the MSE values are not strictly comparable. With this in mind, we conclude that both Parent Education variables have some predictive power for retention rates, each with different strengths.
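One way to see how a model can have both a lower MSE and a lower R^2 is to look at the spread of the target in each sample. The sketch below assumes, without confirmation from the helper code, that each R^2 was computed on the same rows as its MSE using R^2 = 1 - MSE / Var(RET_FT4). Under that assumption the variance of retention in each sample can be backed out from the two reported numbers:

```python
# Values reported above for the two parent-education models.
results = {
    "PAR_ED_PCT_HS": {"r2": 0.2614911186035256, "mse": 0.03096682761276435},
    "PAR_ED_PCT_PS": {"r2": 0.19753850951573393, "mse": 0.02902789934662409},
}

# Assumption: R^2 = 1 - MSE / Var(y) on the same rows, so Var(y) = MSE / (1 - R^2).
implied_variance = {
    name: vals["mse"] / (1.0 - vals["r2"]) for name, vals in results.items()
}

for name, var in implied_variance.items():
    print(f"{name}: implied variance of RET_FT4 = {var:.4f}")
```

If the assumption holds, the high school sample has a larger spread of retention rates than the postsecondary sample, which would make its MSE larger even though its model explains a greater share of the variance.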