Feature Analysis on Numerical Variables

Many variables in the dataset contain “Privacy Suppressed” values. These values are spread across so many rows that keeping the affected variables and dropping the suppressed rows would leave a very small dataset, so we exclude these variables from the analysis.
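The trade-off described above can be illustrated with a minimal sketch on a made-up frame. The marker spelling, the column `EXAMPLE_EARNINGS`, and the helper `suppressed_share` are assumptions for illustration only; the marker should be adjusted to whatever spelling the actual file uses.

```python
import pandas as pd

# Assumed spelling of the suppression marker; adjust to match the file.
MARKER = "PrivacySuppressed"


def suppressed_share(df: pd.DataFrame, marker: str = MARKER) -> pd.Series:
    """Return the fraction of rows in each column equal to the marker."""
    return df.eq(marker).mean()


# Toy frame with made-up values standing in for the real dataset.
toy = pd.DataFrame({
    "SAT_AVG": [1100.0, 1250.0, 980.0, 1310.0],
    "EXAMPLE_EARNINGS": [MARKER, "41000", MARKER, MARKER],  # hypothetical column
    "RET_FT4": [0.81, 0.90, 0.72, 0.93],
})

share = suppressed_share(toy)

# Option A: keep every column, drop rows with any suppressed value.
rows_kept = toy[~toy.eq(MARKER).any(axis=1)]

# Option B: drop the affected columns, keep every row.
cols_kept = toy.loc[:, share == 0]

print(share)
print(f"Option A keeps {len(rows_kept)} rows; option B keeps {len(cols_kept)} rows")
```

In this toy case, option A keeps only one of four rows, while option B keeps all rows at the cost of one column, which is the choice made in this analysis.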

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

from tools.utils import combine_columns, compute_feature_importance, standard_units
data = pd.read_csv("data/Most-Recent-Cohorts-Institution-filtered.csv").drop('Unnamed: 0', axis=1)
fouryr_features = ['HIGHDEG', 'ADM_RATE', 'ST_FIPS', 'LOCALE', 'SAT_AVG', 'CCUGPROF', 'CCSIZSET',
            'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN' , 
            'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA','UGDS_UNKN', 
            'TUITIONFEE_IN', 'TUITIONFEE_OUT', 'INEXPFTE', 'AVGFACSAL' , 'PFTFAC',
            'PCTPELL', 'PCTFLOAN', 'AGE_ENTRY', 
            'FAMINC','MD_FAMINC', 'ADMCON7', 'UGDS_MEN', 'UGDS_WOMEN', 'ANP',
            ## four year specific
            'RET_FT4']

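# Merge NPT4_PUB and NPT4_PRIV into a single 'ANP' column, keep only the
# selected features, and drop rows with missing values.
clean_data = combine_columns(data, 'NPT4_PUB', 'NPT4_PRIV', 'ANP')[fouryr_features].dropna()
# Rank every feature by its importance for predicting four-year retention.
compute_feature_importance(clean_data, 'RET_FT4')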
Feature Importance
4 SAT_AVG 0.251925
19 AVGFACSAL 0.138262
5 CCUGPROF 0.084027
24 FAMINC 0.055792
21 PCTPELL 0.048481
10 UGDS_ASIAN 0.036045
6 CCSIZSET 0.034190
25 MD_FAMINC 0.031441
18 INEXPFTE 0.024311
27 UGDS_MEN 0.019107
28 UGDS_WOMEN 0.018132
15 UGDS_UNKN 0.017468
23 AGE_ENTRY 0.017002
1 ADM_RATE 0.015627
12 UGDS_NHPI 0.015596
22 PCTFLOAN 0.015444
11 UGDS_AIAN 0.015414
7 UGDS_WHITE 0.014427
20 PFTFAC 0.013866
17 TUITIONFEE_OUT 0.013818
8 UGDS_BLACK 0.013761
2 ST_FIPS 0.013431
9 UGDS_HISP 0.013159
29 ANP 0.012983
14 UGDS_NRA 0.012859
0 HIGHDEG 0.012233
13 UGDS_2MOR 0.012126
16 TUITIONFEE_IN 0.011745
3 LOCALE 0.008975
26 ADMCON7 0.008351

Based on the feature analysis with four-year retention rates, SAT_AVG (0.25) and AVGFACSAL (0.14) are the two most important features for determining retention rates. Every other feature scores below 0.09.
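The implementation of `compute_feature_importance` is not shown on this page. Because the scores in the table sum to 1, one plausible reading is impurity-based importance from a tree ensemble. The sketch below assumes a random forest regressor and runs on synthetic data, so it illustrates the idea rather than the project's actual helper; the name `rank_features` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rank_features(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Fit a random forest on every column except `target` and rank the
    predictors by impurity-based importance (assumed stand-in, not the
    project's helper)."""
    X = df.drop(columns=[target])
    y = df[target]
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X, y)
    ranking = pd.DataFrame({
        "Feature": X.columns,
        "Importance": model.feature_importances_,
    })
    # Sorting keeps the original positional index, as in the table above.
    return ranking.sort_values("Importance", ascending=False)


# Synthetic data: one informative predictor and two pure-noise predictors.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
toy["target"] = 3.0 * toy["signal"] + rng.normal(scale=0.1, size=n)

ranking = rank_features(toy, "target")
print(ranking)
```

Impurity-based scores measure how much a feature helps the model split the data, not the direction of its effect, so a high score for SAT_AVG does not by itself say whether higher SAT averages raise or lower retention.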

# Same predictors as the four-year model; only the target column changes
# from RET_FT4 to RET_FTL4 (less-than-four-year retention).
less_fouryr_features = fouryr_features[:-1] + ['RET_FTL4']

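# Rebuild the cleaned frame for less-than-four-year institutions and rank
# every feature by its importance for predicting RET_FTL4.
clean_data = combine_columns(data, 'NPT4_PUB', 'NPT4_PRIV', 'ANP')[less_fouryr_features].dropna()
compute_feature_importance(clean_data, 'RET_FTL4')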
Feature Importance
9 UGDS_HISP 0.127525
5 CCUGPROF 0.104900
14 UGDS_NRA 0.101073
12 UGDS_NHPI 0.098338
11 UGDS_AIAN 0.082627
18 INEXPFTE 0.064682
8 UGDS_BLACK 0.056532
19 AVGFACSAL 0.043537
29 ANP 0.039380
17 TUITIONFEE_OUT 0.028478
26 ADMCON7 0.024814
3 LOCALE 0.023343
27 UGDS_MEN 0.022833
4 SAT_AVG 0.020617
13 UGDS_2MOR 0.019967
28 UGDS_WOMEN 0.019036
23 AGE_ENTRY 0.016918
16 TUITIONFEE_IN 0.016467
1 ADM_RATE 0.016071
15 UGDS_UNKN 0.013840
2 ST_FIPS 0.013370
22 PCTFLOAN 0.012931
10 UGDS_ASIAN 0.012137
6 CCSIZSET 0.007103
20 PFTFAC 0.005756
21 PCTPELL 0.004512
25 MD_FAMINC 0.001694
7 UGDS_WHITE 0.001222
24 FAMINC 0.000297
0 HIGHDEG 0.000000

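Based on the feature analysis with less-than-four-year retention rates, the race and ethnicity enrollment shares matter most: UGDS_HISP, UGDS_NRA, UGDS_NHPI, and UGDS_AIAN are four of the five most important features, with CCUGPROF ranking second. In contrast to the four-year results, SAT_AVG contributes little here (0.02).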
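To put a number on the combined weight of this group, the short sketch below sums the importances of the nine race and ethnicity share columns, with values copied from the less-than-four-year table above. Treating exactly these nine `UGDS_*` columns as one group (and leaving out UGDS_MEN and UGDS_WOMEN) is a choice made for this illustration.

```python
# Importances copied from the less-than-four-year table above.
race_ethnicity_importance = {
    "UGDS_HISP": 0.127525,
    "UGDS_NRA": 0.101073,
    "UGDS_NHPI": 0.098338,
    "UGDS_AIAN": 0.082627,
    "UGDS_BLACK": 0.056532,
    "UGDS_2MOR": 0.019967,
    "UGDS_UNKN": 0.013840,
    "UGDS_ASIAN": 0.012137,
    "UGDS_WHITE": 0.001222,
}

# Combined importance of the group, out of a table total of 1.
group_total = sum(race_ethnicity_importance.values())
print(f"{group_total:.3f}")  # prints 0.513
```

Taken together, these columns carry about half of the total importance in the less-than-four-year model.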