Feature Analysis on Numerical Variables
Many variables in the dataset contain “Privacy Suppressed” values. Including these variables would force us to drop every institution with a suppressed value and leave a very small dataset, so we exclude them from the analysis.
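As a rough illustration of this exclusion step, the share of suppressed entries in each column can be measured before deciding which variables to keep. The sketch below runs on a toy DataFrame. The column names `EARNINGS_EXAMPLE` and `DEBT_EXAMPLE`, both helper functions, and the marker string `"PrivacySuppressed"` (written without a space) are assumptions made for illustration, not details taken from the notebook's data.

```python
import pandas as pd

# Toy stand-in for the real data; column names and values are illustrative only.
toy = pd.DataFrame({
    "SAT_AVG": [1200, 1100, 1350, 980],
    "EARNINGS_EXAMPLE": ["PrivacySuppressed", "41000",
                         "PrivacySuppressed", "PrivacySuppressed"],
    "DEBT_EXAMPLE": ["12000", "PrivacySuppressed", "9000", "15000"],
})

SUPPRESSED = "PrivacySuppressed"  # assumed marker string; adjust to match the CSV


def suppressed_share(df, marker=SUPPRESSED):
    """Fraction of rows in each column equal to the marker, largest first."""
    return df.eq(marker).mean().sort_values(ascending=False)


def drop_suppressed_columns(df, marker=SUPPRESSED, max_share=0.0):
    """Keep only columns whose suppressed share is at most max_share."""
    shares = suppressed_share(df, marker)
    keep = set(shares[shares <= max_share].index)
    return df[[col for col in df.columns if col in keep]]


shares = suppressed_share(toy)
kept = drop_suppressed_columns(toy)
print(shares)
print(list(kept.columns))
```

With the default `max_share=0.0`, any column containing even one suppressed value is dropped, which matches the strict exclusion described above. A looser threshold would keep more columns at the cost of losing rows later.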
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from tools.utils import combine_columns, compute_feature_importance, standard_units
data = pd.read_csv("data/Most-Recent-Cohorts-Institution-filtered.csv").drop('Unnamed: 0', axis=1)
fouryr_features = [
    'HIGHDEG', 'ADM_RATE', 'ST_FIPS', 'LOCALE', 'SAT_AVG', 'CCUGPROF', 'CCSIZSET',
    'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
    'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN',
    'TUITIONFEE_IN', 'TUITIONFEE_OUT', 'INEXPFTE', 'AVGFACSAL', 'PFTFAC',
    'PCTPELL', 'PCTFLOAN', 'AGE_ENTRY',
    'FAMINC', 'MD_FAMINC', 'ADMCON7', 'UGDS_MEN', 'UGDS_WOMEN', 'ANP',
    # four-year specific target
    'RET_FT4',
]
# Build the 'ANP' column from NPT4_PUB and NPT4_PRIV, keep only the selected
# features, and drop every row that has a missing value.
clean_data = combine_columns(data, 'NPT4_PUB', 'NPT4_PRIV', 'ANP')[fouryr_features].dropna()
# Rank the features by their importance for the four-year retention rate.
compute_feature_importance(clean_data, 'RET_FT4')
Index | Feature | Importance |
---|---|---|
4 | SAT_AVG | 0.251925 |
19 | AVGFACSAL | 0.138262 |
5 | CCUGPROF | 0.084027 |
24 | FAMINC | 0.055792 |
21 | PCTPELL | 0.048481 |
10 | UGDS_ASIAN | 0.036045 |
6 | CCSIZSET | 0.034190 |
25 | MD_FAMINC | 0.031441 |
18 | INEXPFTE | 0.024311 |
27 | UGDS_MEN | 0.019107 |
28 | UGDS_WOMEN | 0.018132 |
15 | UGDS_UNKN | 0.017468 |
23 | AGE_ENTRY | 0.017002 |
1 | ADM_RATE | 0.015627 |
12 | UGDS_NHPI | 0.015596 |
22 | PCTFLOAN | 0.015444 |
11 | UGDS_AIAN | 0.015414 |
7 | UGDS_WHITE | 0.014427 |
20 | PFTFAC | 0.013866 |
17 | TUITIONFEE_OUT | 0.013818 |
8 | UGDS_BLACK | 0.013761 |
2 | ST_FIPS | 0.013431 |
9 | UGDS_HISP | 0.013159 |
29 | ANP | 0.012983 |
14 | UGDS_NRA | 0.012859 |
0 | HIGHDEG | 0.012233 |
13 | UGDS_2MOR | 0.012126 |
16 | TUITIONFEE_IN | 0.011745 |
3 | LOCALE | 0.008975 |
26 | ADMCON7 | 0.008351 |
Based on the feature analysis with four-year retention rates, SAT_AVG (importance of about 0.25) and AVGFACSAL (about 0.14) are the most important features in determining retention rates. No other feature exceeds 0.09.
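The implementation of `compute_feature_importance` is not shown here. The importances in the table sum to roughly 1, which is consistent with impurity-based importances from a tree ensemble, but that is only an inference. The sketch below assumes a random forest and runs on synthetic data. The function name `feature_importance_sketch`, the model choice, and the columns `strong`, `weak`, and `noise` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def feature_importance_sketch(df, target, random_state=0):
    """Hypothetical stand-in for compute_feature_importance.

    Fits a random forest on every column except `target` and returns the
    impurity-based importances, largest first.
    """
    X = df.drop(columns=[target])
    y = df[target]
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    table = pd.DataFrame({
        "Feature": X.columns,
        "Importance": model.feature_importances_,
    })
    # Sorting keeps the original column position as the row index,
    # which mirrors the unlabeled index column in the table above.
    return table.sort_values("Importance", ascending=False)


# Synthetic data: the target depends heavily on `strong`, slightly on `weak`,
# and not at all on `noise`.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["target"] = (3.0 * demo["strong"] + 0.5 * demo["weak"]
                  + rng.normal(scale=0.1, size=n))

ranking = feature_importance_sketch(demo, "target")
print(ranking)
```

Impurity-based importances describe how much the fitted model relies on each feature, not a causal effect, and correlated features (such as FAMINC and MD_FAMINC) can split importance between them.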
less_fouryr_features = [
    'HIGHDEG', 'ADM_RATE', 'ST_FIPS', 'LOCALE', 'SAT_AVG', 'CCUGPROF', 'CCSIZSET',
    'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
    'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN',
    'TUITIONFEE_IN', 'TUITIONFEE_OUT', 'INEXPFTE', 'AVGFACSAL', 'PFTFAC',
    'PCTPELL', 'PCTFLOAN', 'AGE_ENTRY',
    'FAMINC', 'MD_FAMINC', 'ADMCON7', 'UGDS_MEN', 'UGDS_WOMEN', 'ANP',
    # less-than-four-year specific target
    'RET_FTL4',
]
# Same preparation as before, now keeping the less-than-four-year retention rate.
clean_data = combine_columns(data, 'NPT4_PUB', 'NPT4_PRIV', 'ANP')[less_fouryr_features].dropna()
# Rank the features by their importance for the less-than-four-year retention rate.
compute_feature_importance(clean_data, 'RET_FTL4')
Index | Feature | Importance |
---|---|---|
9 | UGDS_HISP | 0.127525 |
5 | CCUGPROF | 0.104900 |
14 | UGDS_NRA | 0.101073 |
12 | UGDS_NHPI | 0.098338 |
11 | UGDS_AIAN | 0.082627 |
18 | INEXPFTE | 0.064682 |
8 | UGDS_BLACK | 0.056532 |
19 | AVGFACSAL | 0.043537 |
29 | ANP | 0.039380 |
17 | TUITIONFEE_OUT | 0.028478 |
26 | ADMCON7 | 0.024814 |
3 | LOCALE | 0.023343 |
27 | UGDS_MEN | 0.022833 |
4 | SAT_AVG | 0.020617 |
13 | UGDS_2MOR | 0.019967 |
28 | UGDS_WOMEN | 0.019036 |
23 | AGE_ENTRY | 0.016918 |
16 | TUITIONFEE_IN | 0.016467 |
1 | ADM_RATE | 0.016071 |
15 | UGDS_UNKN | 0.013840 |
2 | ST_FIPS | 0.013370 |
22 | PCTFLOAN | 0.012931 |
10 | UGDS_ASIAN | 0.012137 |
6 | CCSIZSET | 0.007103 |
20 | PFTFAC | 0.005756 |
21 | PCTPELL | 0.004512 |
25 | MD_FAMINC | 0.001694 |
7 | UGDS_WHITE | 0.001222 |
24 | FAMINC | 0.000297 |
0 | HIGHDEG | 0.000000 |
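Based on the feature analysis with less-than-four-year retention rates, the racial and ethnic makeup of the student body (the UGDS_* enrollment shares) is the most important factor in determining retention rates: UGDS_HISP, UGDS_NRA, UGDS_NHPI, and UGDS_AIAN are four of the five most important features, alongside CCUGPROF. SAT_AVG, the top feature for four-year institutions, has an importance of only about 0.02 here.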
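To make the contrast between the two rankings explicit, the importances can be placed side by side. The sketch below copies the values for six features from the two tables above. The variable names and the `difference` column are illustrative additions, not part of the original analysis.

```python
import pandas as pd

# Importances copied from the four-year (RET_FT4) table.
four_year = pd.Series({
    "SAT_AVG": 0.251925, "AVGFACSAL": 0.138262, "CCUGPROF": 0.084027,
    "FAMINC": 0.055792, "UGDS_HISP": 0.013159, "UGDS_NRA": 0.012859,
})
# Importances copied from the less-than-four-year (RET_FTL4) table.
less_than_four_year = pd.Series({
    "SAT_AVG": 0.020617, "AVGFACSAL": 0.043537, "CCUGPROF": 0.104900,
    "FAMINC": 0.000297, "UGDS_HISP": 0.127525, "UGDS_NRA": 0.101073,
})

comparison = pd.DataFrame({
    "four_year": four_year,
    "less_than_four_year": less_than_four_year,
})
# Positive values mean the feature matters more for less-than-four-year schools.
comparison["difference"] = (comparison["less_than_four_year"]
                            - comparison["four_year"])
comparison = comparison.sort_values("difference")
print(comparison)
```

Because each set of importances comes from a model fit on a different subset of institutions and sums to 1 within its own table, the differences are best read as shifts in relative ranking, not as directly comparable effect sizes.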