## Feature Analysis on Numerical Variables

Many variables in the dataset contain "Privacy Suppressed" values. **Because rows with suppressed values cannot be used, including these variables would leave us with a very small dataset, so we exclude them from the analysis.**
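To gauge how much data each variable would cost, one option is to count the suppressed entries per column before deciding what to drop. The sketch below is a minimal illustration on a toy frame. The marker string `"PrivacySuppressed"`, the helper name `suppressed_share`, the placeholder columns `VAR_A` and `VAR_B`, and the 0.5 cutoff are all assumptions for illustration, not values taken from this analysis.

```python
import pandas as pd

def suppressed_share(df, marker="PrivacySuppressed"):
    """Return the fraction of rows in each column that equal the suppression marker."""
    # Comparing the frame to a string yields booleans; their mean is the share of matches.
    return (df == marker).mean().sort_values(ascending=False)

# Toy data standing in for the real file (hypothetical columns and values).
toy = pd.DataFrame({
    "SAT_AVG": [1100, 1250, 980, 1320],
    "VAR_A": ["PrivacySuppressed", "41000", "PrivacySuppressed", "PrivacySuppressed"],
    "VAR_B": ["12000", "PrivacySuppressed", "9500", "15000"],
})

share = suppressed_share(toy)
# Drop columns where more than half of the rows are suppressed (arbitrary cutoff).
kept = toy.drop(columns=share[share > 0.5].index)
print(share)
print(list(kept.columns))
```

A per-column share like this makes the trade-off explicit: a variable with only a few suppressed rows might be worth keeping, while one that is mostly suppressed would shrink the dataset after `dropna()`.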

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

from tools.utils import combine_columns, compute_feature_importance, standard_units

In [3]:
data = pd.read_csv("data/Most-Recent-Cohorts-Institution-filtered.csv").drop('Unnamed: 0', axis=1)

In [4]:
fouryr_features = ['HIGHDEG', 'ADM_RATE', 'ST_FIPS', 'LOCALE', 'SAT_AVG', 'CCUGPROF', 'CCSIZSET',
            'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN' , 
            'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA','UGDS_UNKN', 
            'TUITIONFEE_IN', 'TUITIONFEE_OUT', 'INEXPFTE', 'AVGFACSAL' , 'PFTFAC',
            'PCTPELL', 'PCTFLOAN', 'AGE_ENTRY', 
            'FAMINC','MD_FAMINC', 'ADMCON7', 'UGDS_MEN', 'UGDS_WOMEN', 'ANP',
            ## four year specific
            'RET_FT4']

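# Combine public (NPT4_PUB) and private (NPT4_PRIV) net price into one 'ANP' column,
# keep only the selected features, and drop rows with any missing value.
clean_data = combine_columns(data, 'NPT4_PUB', 'NPT4_PRIV', 'ANP')[fouryr_features].dropna()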

In [5]:
compute_feature_importance(clean_data, 'RET_FT4')

| | Feature | Importance |
|---|---|---|
| 4 | SAT_AVG | 0.251925 |
| 19 | AVGFACSAL | 0.138262 |
| 5 | CCUGPROF | 0.084027 |
| 24 | FAMINC | 0.055792 |
| 21 | PCTPELL | 0.048481 |
| 10 | UGDS_ASIAN | 0.036045 |
| 6 | CCSIZSET | 0.03419 |
| 25 | MD_FAMINC | 0.031441 |
| 18 | INEXPFTE | 0.024311 |
| 27 | UGDS_MEN | 0.019107 |


**Based on the feature analysis with four-year retention rates, `SAT_AVG` (0.25) and `AVGFACSAL` (0.14) are the two most important features in determining retention rates; no other feature scores above 0.09.**
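Importance scores rank features but do not show whether a feature raises or lowers retention. One possible cross-check, sketched here on synthetic data rather than the real dataset, is to compare the ranking with signed Pearson correlations against the target. The helper name `signed_correlations`, the `NOISE` column, and all toy values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def signed_correlations(df, target):
    """Pearson correlation of every other column with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Order by absolute strength but keep the sign so the direction stays visible.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic stand-in data (not the actual dataset values).
rng = np.random.default_rng(0)
n = 200
sat = rng.normal(1100, 120, n)
pell = rng.uniform(0.1, 0.8, n)
toy = pd.DataFrame({
    "SAT_AVG": sat,
    "PCTPELL": pell,
    "NOISE": rng.normal(0, 1, n),
    # In this made-up example retention rises with SAT and falls with Pell share.
    "RET_FT4": 0.5 + 0.0003 * (sat - 1100) - 0.2 * (pell - 0.45) + rng.normal(0, 0.01, n),
})

result = signed_correlations(toy, "RET_FT4")
print(result)
```

On the real data the same call would be `signed_correlations(clean_data, 'RET_FT4')`. Correlation only captures linear, pairwise relationships, so it complements the importance ranking rather than replacing it.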

In [6]:
less_fouryr_features = ['HIGHDEG', 'ADM_RATE', 'ST_FIPS', 'LOCALE', 'SAT_AVG', 'CCUGPROF', 'CCSIZSET',
            'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN' , 
            'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA','UGDS_UNKN', 
            'TUITIONFEE_IN', 'TUITIONFEE_OUT', 'INEXPFTE', 'AVGFACSAL' , 'PFTFAC',
            'PCTPELL', 'PCTFLOAN', 'AGE_ENTRY', 
            'FAMINC','MD_FAMINC', 'ADMCON7', 'UGDS_MEN', 'UGDS_WOMEN', 'ANP',
            ## less than four year specific
            'RET_FTL4']

# Same preparation as above, now with the less-than-four-year retention target (RET_FTL4).
clean_data = combine_columns(data, 'NPT4_PUB', 'NPT4_PRIV', 'ANP')[less_fouryr_features].dropna()

In [7]:
compute_feature_importance(clean_data, 'RET_FTL4')

| | Feature | Importance |
|---|---|---|
| 9 | UGDS_HISP | 0.127525 |
| 5 | CCUGPROF | 0.1049 |
| 14 | UGDS_NRA | 0.101073 |
| 12 | UGDS_NHPI | 0.098338 |
| 11 | UGDS_AIAN | 0.082627 |
| 18 | INEXPFTE | 0.064682 |
| 8 | UGDS_BLACK | 0.056532 |
| 19 | AVGFACSAL | 0.043537 |
| 29 | ANP | 0.03938 |
| 17 | TUITIONFEE_OUT | 0.028478 |


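**Based on the feature analysis with less-than-four-year retention rates, the ethnic composition of the student body is the most important factor in determining retention rates: five of the ten highest-ranked features are `UGDS_*` enrollment shares (`UGDS_HISP`, `UGDS_NRA`, `UGDS_NHPI`, `UGDS_AIAN`, `UGDS_BLACK`), while `SAT_AVG` no longer appears in the top ten.**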
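The weight of the enrollment-share features can be checked by summing their importances. The sketch below re-enters the ten rows reported for `RET_FTL4` and groups them by whether the feature name starts with `UGDS_`. The grouping rule and variable names are illustrative assumptions, and only the ten listed features are included.

```python
import pandas as pd

# Top-10 importances for RET_FTL4, copied from the output above.
importance = pd.DataFrame({
    "Feature": ["UGDS_HISP", "CCUGPROF", "UGDS_NRA", "UGDS_NHPI", "UGDS_AIAN",
                "INEXPFTE", "UGDS_BLACK", "AVGFACSAL", "ANP", "TUITIONFEE_OUT"],
    "Importance": [0.127525, 0.1049, 0.101073, 0.098338, 0.082627,
                   0.064682, 0.056532, 0.043537, 0.03938, 0.028478],
})

# Label each feature by group: enrollment-share columns share the UGDS_ prefix.
is_share = importance["Feature"].str.startswith("UGDS_")
group = is_share.map({True: "enrollment share (UGDS_*)", False: "other"})

# Total importance per group.
grouped = importance.groupby(group)["Importance"].sum()
print(grouped.round(3))
```

Across the ten listed rows, the `UGDS_*` group sums to about 0.47 and the remaining five features to about 0.28.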