Analysis Overview#
In this project, we aim to answer:
Which locations (down to the street, borough, and ZIP code level) in NYC have a higher proportion of car accidents?
# !pip install geopandas
# !pip install geoplot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from tool import utils
%matplotlib inline
from datetime import datetime
Importing Data#
The dataset we are working with comes from the City of New York’s open data catalog on motor vehicle crashes. It contains information on all police-reported motor vehicle collisions in NYC from 2012 to 2023. To keep memory usage manageable, we only use the 2022 data. The complete dataset can be downloaded from the catalog; be aware that it is quite large, at roughly 403 MB.
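Loading goes through a small project helper, utils.read_csv_of_year, defined in tool/utils.py. Its implementation is not shown in this notebook; a hypothetical sketch of such a loader (the file path and column handling here are assumptions, not the project’s actual code) might look like:
# hypothetical sketch of a year-filtering loader; the real helper lives in tool/utils.py
def read_csv_of_year_sketch(year, path="Motor_Vehicle_Collisions_-_Crashes.csv"):
    df = pd.read_csv(path, low_memory=False)
    df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"])
    return df[df["CRASH DATE"].dt.year == year].reset_index(drop=True)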
# read csv into dataframe, specifying the year
crashes = utils.read_csv_of_year(2022)
crashes.head(5)
Unnamed: 0 | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2022-03-26 | 11:45 | NaN | NaN | NaN | NaN | NaN | QUEENSBORO BRIDGE UPPER | NaN | ... | NaN | NaN | NaN | NaN | 4513547 | Sedan | NaN | NaN | NaN | NaN |
1 | 2 | 2022-06-29 | 6:55 | NaN | NaN | NaN | NaN | NaN | THROGS NECK BRIDGE | NaN | ... | Unspecified | NaN | NaN | NaN | 4541903 | Sedan | Pick-up Truck | NaN | NaN | NaN |
2 | 34 | 2022-06-29 | 16:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA | NaN | ... | Unspecified | NaN | NaN | NaN | 4542336 | Motorscooter | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
3 | 37 | 2022-07-12 | 17:50 | BROOKLYN | 11225.0 | 40.663303 | -73.96049 | (40.663303, -73.96049) | NaN | NaN | ... | Unspecified | NaN | NaN | NaN | 4545699 | Sedan | NaN | NaN | NaN | NaN |
4 | 38 | 2022-03-23 | 10:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4512922 | Bike | NaN | NaN | NaN | NaN |
5 rows × 30 columns
Mapping#
Dropping NaNs in the LOCATION column#
crashes["LOCATION"].isna().sum()
8938
crashes["LOCATION"].isna().sum()/crashes.shape[0]*100
8.611868538448938
# about 8.6% of rows lack coordinates; drop them (copy to avoid chained-assignment warnings)
crashes_locations_map = crashes.dropna(subset=["LOCATION"]).copy()
# crashes_locations_map.head()
Plotting total casualties#
crashes_locations_map["TOTAL CASUALTIES"] = crashes_locations_map["NUMBER OF PERSONS INJURED"] + crashes_locations_map["NUMBER OF PERSONS KILLED"]
# keep only crashes with more than two casualties so the map stays readable
crashes_locations_map_2plus = crashes_locations_map[crashes_locations_map["TOTAL CASUALTIES"] > 2]
fig = px.scatter_mapbox(crashes_locations_map_2plus,
                        lat="LATITUDE",
                        lon="LONGITUDE",
                        color="TOTAL CASUALTIES",
                        size="TOTAL CASUALTIES",
                        zoom=8,
                        height=800,
                        width=1000,
                        color_continuous_scale='thermal')
fig.update_layout(mapbox_style="open-street-map")
fig.show()
Clustering Analysis#
features = crashes_locations_map[['LATITUDE', 'LONGITUDE']]
features
LATITUDE | LONGITUDE | |
---|---|---|
3 | 40.663303 | -73.960490 |
5 | 40.607685 | -74.138920 |
6 | 40.855972 | -73.869896 |
7 | 40.790276 | -73.939600 |
8 | 40.642986 | -74.016210 |
... | ... | ... |
103782 | 0.000000 | 0.000000 |
103783 | 40.877476 | -73.836610 |
103784 | 40.690180 | -73.935600 |
103785 | 40.694485 | -73.937350 |
103786 | 40.687750 | -73.790390 |
94849 rows × 2 columns
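Note the placeholder coordinates visible above (e.g., row 103782 sits at (0.000000, 0.000000)): some records default to zero instead of NaN. A minimal filter one might apply before clustering (the analysis below uses features as-is):
# drop rows whose coordinates are placeholder zeros rather than real locations
valid = (features["LATITUDE"] != 0) & (features["LONGITUDE"] != 0)
features_clean = features[valid]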
from sklearn.cluster import KMeans
# create kmeans model/object
kmeans = KMeans(
    init="random",
    n_clusters=8,
    n_init=10,
    max_iter=300,
    random_state=42
)
# do clustering
kmeans.fit(features)
# save results
labels = kmeans.labels_
# write the cluster labels back into the dataframe
crashes_locations_map['cluster'] = labels
# display the number of members in each cluster
_clusters = crashes_locations_map.groupby('cluster').count()['COLLISION_ID']
# print(_clusters)
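The choice of n_clusters=8 above is a judgment call rather than something the data dictates. A quick inertia (“elbow”) curve, sketched below and not part of the original analysis, is one way to sanity-check it:
# sketch: inertia for a range of k values, to sanity-check n_clusters=8
inertias = []
for k in range(2, 13):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    inertias.append(km.fit(features).inertia_)
plt.plot(range(2, 13), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.title("Elbow Curve for KMeans")
plt.show()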
fig = px.scatter_mapbox(crashes_locations_map,
                        lat="LATITUDE",
                        lon="LONGITUDE",
                        color="cluster",
                        size="TOTAL CASUALTIES",
                        zoom=8,
                        height=800,
                        width=1000,
                        color_continuous_scale='thermal')
fig.update_layout(mapbox_style="open-street-map")
fig.show()
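The fitted cluster centers give rough geographic anchors for each cluster; printing them is a quick sanity check:
# latitude/longitude of each fitted cluster center
centers = pd.DataFrame(kmeans.cluster_centers_, columns=["LATITUDE", "LONGITUDE"])
print(centers)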
EDA#
Collisions by Geography#
# function to add value labels
# credit: GeeksforGeeks
def addlabels(x, y):
    for i in range(len(x)):
        plt.text(i, y[i]*1.01, y[i], ha="center")
Boroughs#
boroughs_data = crashes.groupby("BOROUGH").count()["COLLISION_ID"]
# population source: http://www.citypopulation.de/en/usa/newyorkcity/
population = pd.Series(data={'BRONX': 1379946, 'BROOKLYN': 2590516, 'MANHATTAN': 1596273, 'QUEENS': 2278029, 'STATEN ISLAND': 491133})
To compare boroughs fairly, we normalize collision counts per capita (collisions per thousand residents):
x = boroughs_data.index
y = round(boroughs_data.values / population * 1000, 2)
plt.bar(x, y)
addlabels(x, y)
plt.xlabel("Boroughs")
plt.ylabel("Collisions per Thousand")
plt.title("Collisions per Thousand Teople by Borough")
plt.ylim((4, 10))
plt.savefig("Charts/Collisions per Thousand People by Borough.png")
plt.show()

Zip Code#
zc_data = crashes.groupby("ZIP CODE").count()["CRASH DATE"]
zc_df = pd.DataFrame(zc_data).reset_index()
zc_df.rename(columns={"CRASH DATE": "COUNT"}, inplace=True)
zc_df["ZIP CODE"] = zc_df["ZIP CODE"].astype(int)
zc_df.sort_values(by="COUNT", ascending=False)
ZIP CODE | COUNT | |
---|---|---|
123 | 11207 | 1763 |
128 | 11212 | 1128 |
151 | 11236 | 1111 |
124 | 11208 | 1099 |
119 | 11203 | 996 |
... | ... | ... |
55 | 10153 | 1 |
56 | 10154 | 1 |
58 | 10158 | 1 |
59 | 10165 | 1 |
57 | 10155 | 1 |
210 rows × 2 columns
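The counts are heavily skewed toward a handful of Brooklyn ZIP codes (11207, 11212, 11236, ...). A simple bar chart of the top ten, sketched here and not among the original charts, makes the skew easier to read:
# sketch: bar chart of the ten ZIP codes with the most collisions
top10 = zc_df.sort_values(by="COUNT", ascending=False).head(10)
plt.figure(figsize=(8, 6))
plt.bar(top10["ZIP CODE"].astype(str), top10["COUNT"])
plt.xlabel("ZIP Code")
plt.ylabel("Collisions")
plt.title("Top 10 ZIP Codes by Collision Count")
plt.xticks(rotation=45)
plt.show()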

Collisions by Demography#
# credit: https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html
types = ("Pedestrians", "Motorists", "Cyclists")
counts = {"Injured": (crashes["NUMBER OF PEDESTRIANS INJURED"].sum(), crashes["NUMBER OF MOTORIST INJURED"].sum(), crashes["NUMBER OF CYCLIST INJURED"].sum()),
"Killed": (crashes["NUMBER OF PEDESTRIANS KILLED"].sum(), crashes["NUMBER OF MOTORIST KILLED"].sum(), crashes["NUMBER OF CYCLIST KILLED"].sum()),
}
x = np.arange(len(types)) # the label locations
width = 0.25 # the width of the bars
multiplier = 0
fig, ax = plt.subplots(layout='constrained')
for i, j in counts.items():
    offset = width * multiplier
    rects = ax.bar(x + offset, j, width, label=i)
    ax.bar_label(rects, padding=3)
    multiplier += 1
ax.set_ylabel('Count')
ax.set_title('Casualties by Type of Person')
ax.set_xticks(x + width*0.5, types)
ax.legend()
ax.set_ylim(0, 42000)
plt.savefig("Charts/Collisons by Vehicle")
plt.show()

Collisions by Causes#
Top 10 Contributing Factor Vehicle 1#
# top 10 factors; iloc[1:11] skips the first row (presumably the "Unspecified" category)
cfv = utils.get_contributing_factor(crashes).sort_values("CRASH DATE", ascending=False).iloc[1:11]
labels = cfv.index
sizes = cfv["CRASH DATE"]
fig, ax = plt.subplots(figsize=(10, 7))
ax.pie(sizes, labels=labels, autopct='%1.1f%%')
# plt.legend(labels, bbox_to_anchor=(1.5, 0.25, 0.5, 0.5))
plt.savefig("Charts/Collisions by Contributing Factor")
plt.show()

Top 10 Vehicle Type Code 1#
# vtc1 = crashes.groupby("VEHICLE TYPE CODE 1").count()["CRASH DATE"].sort_values(ascending=False).head(10)
vtc = utils.get_vehicle_type(crashes).sort_values("CRASH DATE", ascending=False).iloc[:10]
labels = vtc.index
sizes = vtc["CRASH DATE"]
fig, ax = plt.subplots(figsize=(10, 7))
ax.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.savefig("Charts/Collisions by Vehicle Type Code 1")
plt.show()

Collisions by Time#
# add month, day and hour to the dataframe
utils.time_processing(crashes)
crashes['CRASH HOUR'][0]
11
crashes
Unnamed: 0 | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | ... | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | CRASH MONTH | CRASH DAY | CRASH HOUR | DAYOFWEEK | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2022-03-26 | 11:45 | NaN | NaN | NaN | NaN | NaN | QUEENSBORO BRIDGE UPPER | NaN | ... | 4513547 | Sedan | NaN | NaN | NaN | NaN | 03 | 26 | 11 | 5 |
1 | 2 | 2022-06-29 | 6:55 | NaN | NaN | NaN | NaN | NaN | THROGS NECK BRIDGE | NaN | ... | 4541903 | Sedan | Pick-up Truck | NaN | NaN | NaN | 06 | 29 | 6 | 2 |
2 | 34 | 2022-06-29 | 16:00 | NaN | NaN | NaN | NaN | NaN | WILLIAMSBURG BRIDGE OUTER ROADWA | NaN | ... | 4542336 | Motorscooter | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | 06 | 29 | 16 | 2 |
3 | 37 | 2022-07-12 | 17:50 | BROOKLYN | 11225.0 | 40.663303 | -73.96049 | (40.663303, -73.96049) | NaN | NaN | ... | 4545699 | Sedan | NaN | NaN | NaN | NaN | 07 | 12 | 17 | 1 |
4 | 38 | 2022-03-23 | 10:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4512922 | Bike | NaN | NaN | NaN | NaN | 03 | 23 | 10 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103782 | 1984602 | 2022-04-24 | 13:00 | BROOKLYN | 11233.0 | 0.000000 | 0.00000 | (0.0, 0.0) | NaN | NaN | ... | 4621496 | Sedan | Station Wagon/Sport Utility Vehicle | Sedan | NaN | NaN | 04 | 24 | 13 | 6 |
103783 | 1984724 | 2022-12-06 | 15:52 | BRONX | 10475.0 | 40.877476 | -73.83661 | (40.877476, -73.83661) | BAYCHESTER AVENUE | GIVAN AVENUE | ... | 4621506 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | 12 | 06 | 15 | 1 |
103784 | 1984763 | 2022-05-26 | 0:00 | BROOKLYN | 11221.0 | 40.690180 | -73.93560 | (40.69018, -73.9356) | NaN | NaN | ... | 4621498 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | 05 | 26 | 0 | 3 |
103785 | 1984817 | 2022-07-18 | 19:00 | BROOKLYN | 11206.0 | 40.694485 | -73.93735 | (40.694485, -73.93735) | LEWIS AVENUE | HART STREET | ... | 4621495 | Bike | NaN | NaN | NaN | NaN | 07 | 18 | 19 | 0 |
103786 | 1985218 | 2022-12-11 | 12:00 | QUEENS | 11434.0 | 40.687750 | -73.79039 | (40.68775, -73.79039) | LINDEN BOULEVARD | 157 STREET | ... | 4622208 | Station Wagon/Sport Utility Vehicle | fire truck | NaN | NaN | NaN | 12 | 11 | 12 | 6 |
103787 rows × 34 columns
Month#
month_data = crashes.groupby("CRASH MONTH").count()["CRASH TIME"]
plt.figure(figsize=(8, 6))
# sns.lineplot(month_data.index, month_data.values)
plt.plot(month_data.index, month_data.values)
plt.title("Collisions by Month")
plt.xlabel("Month")
plt.ylabel("Count")
plt.savefig("Charts/Collisions by Month.png")
plt.show()

Day#
day_data = crashes.groupby("DAYOFWEEK").count()["CRASH TIME"]
plt.figure(figsize=(8, 6))
plt.bar(day_data.index, day_data.values)
addlabels(day_data.index, day_data.values)
plt.title("Collisions by Day")
plt.xlabel("Day")
plt.ylabel("Count")
plt.ylim((10000, 18000))
plt.savefig("Charts/Collisions by Day")
plt.show()

Hour#
# average per day type: totals divided by 5 weekday and 2 weekend days per week
weekday_hour_data = crashes[crashes['DAYOFWEEK'] < 5].groupby("CRASH HOUR").count()["CRASH TIME"]/5
weekend_hour_data = crashes[crashes['DAYOFWEEK'] > 4].groupby("CRASH HOUR").count()["CRASH TIME"]/2
plt.figure(figsize=(8, 6))
plt.plot(weekday_hour_data.index, weekday_hour_data.values, label='weekday')
plt.plot(weekend_hour_data.index, weekend_hour_data.values, label='weekend')
plt.title("Collisions Weekday vs. Weekend")
plt.xlabel("Hour")
plt.ylabel("Count")
plt.xticks(np.linspace(0, 23, 24))
plt.legend()
plt.savefig("Charts/Collisions Weekday vs. Weekend.png")
plt.show()

Bivariate Analysis#
We investigate relationships between pairs of features using scatter plots, box plots, or violin plots. For example, one can explore the relationship between contributing factors and the number of persons injured, or between crash time and crash severity, as the sketch below illustrates.
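As one sketch of such a bivariate view (not among the original charts), a box plot of persons injured against crash hour, using the CRASH HOUR column added earlier by utils.time_processing:
# sketch: distribution of persons injured at each crash hour
crashes.boxplot(column="NUMBER OF PERSONS INJURED", by="CRASH HOUR", figsize=(10, 5))
plt.suptitle("")  # clear the automatic "grouped by" suptitle
plt.title("Persons Injured by Crash Hour")
plt.xlabel("Hour")
plt.ylabel("Persons Injured")
plt.show()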
# load 2014-2022 data, add time features, and compute total casualties per crash
α = 1  # weight on fatalities relative to injuries
crashes_14_22 = dict.fromkeys(range(2014, 2023))
for year in crashes_14_22.keys():
    crashes_14_22[year] = utils.read_csv_of_year(year)
    utils.time_processing(crashes_14_22[year])
    crashes_14_22[year]["TOTAL CASUALTIES"] = crashes_14_22[year]["NUMBER OF PERSONS INJURED"] + α * crashes_14_22[year]["NUMBER OF PERSONS KILLED"]
casualties_14_22 = dict.fromkeys(range(2014, 2023))
for year in casualties_14_22.keys():
    casualties_14_22[year] = crashes_14_22[year].groupby("CRASH HOUR")["TOTAL CASUALTIES"].mean()
plt.figure(figsize=(8, 6))
for year in casualties_14_22.keys():
    plt.plot(casualties_14_22[year].index, casualties_14_22[year].values, label=year)
    plt.text(23.5, casualties_14_22[year].values[23], year)
plt.title("Mean Casualties per Collision by Hour")
plt.xlabel("Hour")
plt.ylabel("Mean Casualties per Collision")
plt.xticks(np.linspace(0, 23, 24))
plt.xlim((-1, 25))
plt.savefig("Charts/Casualties by Hour 2014-2022.png")
plt.show()

numkilled_14_22 = dict.fromkeys(range(2014, 2023))
for year in crashes_14_22.keys():
    numkilled_14_22[year] = crashes_14_22[year]["NUMBER OF PERSONS KILLED"].sum()
plt.figure(figsize=(8, 6))
for year in numkilled_14_22.keys():
    plt.bar(year, numkilled_14_22[year], label=year)
    plt.text(year, numkilled_14_22[year]+2, numkilled_14_22[year], ha="center")
plt.title("Fatalities by Year")
plt.xlabel("Year")
plt.ylabel("Number of Fatalities")
plt.ylim((200, 300))
plt.savefig("Charts/Fatalities by Year.png")
plt.show()

Predicting car crash severity#
Let’s develop a classification model using a random forest. We define car accident severity on three levels according to the casualties:
minor injury level: 2 or fewer people injured, no deaths reported. Label: 0
major injury level: 3 or more people injured, no deaths reported. Label: 1
fatal injury level: at least one person killed. Label: 2
injury_conditions = [
    (crashes["NUMBER OF PERSONS KILLED"] >= 1),
    (crashes["NUMBER OF PERSONS INJURED"] >= 3) & (crashes["NUMBER OF PERSONS KILLED"] == 0),
    (crashes["NUMBER OF PERSONS INJURED"] <= 2) & (crashes["NUMBER OF PERSONS KILLED"] == 0),
]
injury_values = [2, 1, 0]
time_conditions = [
    (crashes['CRASH HOUR'] >= 6) & (crashes['CRASH HOUR'] <= 11),
    (crashes['CRASH HOUR'] >= 12) & (crashes['CRASH HOUR'] <= 17),
    (crashes['CRASH HOUR'] <= 5) | (crashes['CRASH HOUR'] >= 18),
]
time_values = ['morning', 'afternoon', 'after sunset']
# map raw vehicle-type strings to coarse categories; order matters, first match wins
vehicle_keywords = ['Sedan', 'Sport', 'Truck', 'Taxi', 'Bus', 'Motor', 'Bike', 'Pick-up']
vehicle_values = ['sedan', 'suv', 'truck', 'taxi', 'bus', 'motor', 'bike', 'pick-up']
vehicle1_conditions = [crashes['VEHICLE TYPE CODE 1'].str.contains(k, na=False) for k in vehicle_keywords]
vehicle2_conditions = [crashes['VEHICLE TYPE CODE 2'].str.contains(k, na=False) for k in vehicle_keywords]
# add labels for injury severity, time of day, and vehicle categories
crashes['INJURY LEVEL'] = np.select(injury_conditions, injury_values)
crashes['TIME OF DAY'] = np.select(time_conditions, time_values)
crashes['VEHICLE TYPE 1'] = np.select(vehicle1_conditions, vehicle_values, default='other')
crashes['VEHICLE TYPE 2'] = np.select(vehicle2_conditions, vehicle_values, default='other')
# approximate the number of vehicles involved by counting non-null contributing-factor columns
col_list = ['CONTRIBUTING FACTOR VEHICLE 1',
            'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
            'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5']
crashes['NUM OF CARS'] = 5 - crashes[col_list].isna().sum(axis=1)
# exclude unclassified vehicle types, then downsample the dominant minor-injury class
data = crashes[crashes['VEHICLE TYPE 1'] != 'other']
df0 = data[data['INJURY LEVEL'] == 0].sample(20000)
df1 = data[data['INJURY LEVEL'] == 1]
df2 = data[data['INJURY LEVEL'] == 2]
data = pd.concat([df0, df1, df2])
data = data.dropna(subset=['BOROUGH'])
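Before modelling, it is worth confirming what the downsampling left us with; a quick check of the label distribution (not in the original notebook):
# label distribution after downsampling the minor-injury class
print(data['INJURY LEVEL'].value_counts())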
features = ['TIME OF DAY', 'NUM OF CARS', 'DAYOFWEEK', 'VEHICLE TYPE 1', 'VEHICLE TYPE 2', 'BOROUGH']
X = data[features]
y = data['INJURY LEVEL']
# one hot encoding
encoded_X = pd.get_dummies(X, drop_first=True)
encoded_X
NUM OF CARS | DAYOFWEEK | TIME OF DAY_afternoon | TIME OF DAY_morning | VEHICLE TYPE 1_bus | VEHICLE TYPE 1_motor | VEHICLE TYPE 1_sedan | VEHICLE TYPE 1_suv | VEHICLE TYPE 1_taxi | VEHICLE TYPE 1_truck | ... | VEHICLE TYPE 2_motor | VEHICLE TYPE 2_other | VEHICLE TYPE 2_sedan | VEHICLE TYPE 2_suv | VEHICLE TYPE 2_taxi | VEHICLE TYPE 2_truck | BOROUGH_BROOKLYN | BOROUGH_MANHATTAN | BOROUGH_QUEENS | BOROUGH_STATEN ISLAND | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12672 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
89057 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
92419 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
85587 | 2 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
86954 | 2 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
103667 | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
103734 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
103735 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
103736 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
103738 | 1 | 6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
14826 rows × 21 columns
Random Forest Model#
# import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score
# train and test split and building baseline model to predict target features
X_train, X_test, y_train, y_test = train_test_split(encoded_X, y, test_size=0.2, random_state=42)
# modelling using random forest baseline
rf = RandomForestClassifier(n_estimators=500, max_depth=30, random_state=42)
rf.fit(X_train, y_train)
# predicting on test data
predics = rf.predict(X_test)
# train score
print('Training accuracy: %.2f %%' % (rf.score(X_train, y_train)*100))
# test score
print('Testing accuracy: %.2f %%' % (rf.score(X_test, y_test)*100))
Training accuracy: 91.77 %
Testing accuracy: 88.37 %
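Accuracy alone can flatter a model when one class dominates. The metrics imported above (confusion_matrix, classification_report) were never applied; a quick look at them, for instance:
# per-class precision/recall and the confusion matrix on the test set
print(confusion_matrix(y_test, predics))
print(classification_report(y_test, predics))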
Logistic Regression Model#
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix
scaler = StandardScaler()
lr = LogisticRegression()
model1 = Pipeline([('standardize', scaler), ('log_reg', lr)])
model1.fit(X_train, y_train)
Pipeline(steps=[('standardize', StandardScaler()),
                ('log_reg', LogisticRegression())])
y_train_hat = model1.predict(X_train)
# with three classes, predict_proba returns one column per class
y_train_hat_probs = model1.predict_proba(X_train)
train_accuracy = accuracy_score(y_train, y_train_hat)*100
# print('Confusion matrix:\n', confusion_matrix(y_train, y_train_hat))
print('Training accuracy: %.2f %%' % train_accuracy)
Training accuracy: 89.59 %
y_test_hat = model1.predict(X_test)
y_test_hat_probs = model1.predict_proba(X_test)
test_accuracy = accuracy_score(y_test, y_test_hat)*100
print('Testing accuracy: %.2f %%' % test_accuracy)
Testing accuracy: 90.22 %
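roc_auc_score and classification_report were imported above but never used. For a three-class target, ROC AUC needs the full probability matrix and a one-vs-rest averaging scheme; a sketch:
# multiclass ROC AUC (one-vs-rest) plus per-class metrics on the test set
print('ROC AUC (ovr): %.3f' % roc_auc_score(y_test, y_test_hat_probs, multi_class='ovr'))
print(classification_report(y_test, y_test_hat))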
A test accuracy of around 90% means the models correctly predict the injury severity of a car accident about 90% of the time. This beats random guessing handily, but the minor-injury class still dominates the sample, so accuracy should be read alongside per-class metrics, and there is room for improvement. A few suggestions:
Feature engineering: The selected features might not be sufficient. Consider adding more features, such as weather conditions, road conditions, speed limits, or driver demographics.
Hyperparameter tuning: Random forest models have several hyperparameters that significantly impact performance. Use techniques like grid search or random search to find optimal settings (e.g., the number of trees, maximum depth, minimum samples per leaf); a minimal sketch follows this list.
Feature selection: Some features might not contribute to improving the model’s performance, or they could be causing multicollinearity. Apply feature selection techniques like recursive feature elimination, LASSO regression, or chi-squared test to find the most relevant features for your model.
Ensemble methods: Combine multiple models using techniques like bagging, boosting, or stacking to improve the overall performance.
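As referenced above, a minimal grid-search sketch (the parameter ranges here are illustrative assumptions, reusing the X_train/y_train split from earlier):
from sklearn.model_selection import GridSearchCV
# small grid over two random-forest hyperparameters, scored by macro F1
param_grid = {'n_estimators': [100, 300, 500], 'max_depth': [10, 20, 30]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='f1_macro')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)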