Income and Police Shootings

By: Marc Ojalvo and Kyle Strougo

Marc Ojalvo & Kyle Strougo

Since 2015, there has been almost 5000 cases of police brutality in the United States. With the death of George Floyd this past summer, there has been a call to action to refrom the policing system to prevent unecessary deaths from occuring. (Futher reading about these issues attached at the end of notebook)

Our project hopes to find patters in police brulaity to infrom and educate others on the most suspectible groups of potential victims. We hope to use data to find patters and valuable statitics.

Our project will be looking at the correlation between cities' income and where police shootings occur. We are attempting to answer the following questions:

1. Is there a correlation between income and police brutality?

2. Is there a correlation between income inequality (standard deviation of income) and police brutality cases?

3. What are the demographics of police brutality vicitims? Are certain groups disproportionately targeted?

We hypothesize that there are more police shootings in low-income areas than in higher-income areas and that there is a correlation between income inequality and police shootings.

The primary data set we will be looking at is the mean income of all U.S cities & towns as of 2019. The data sets include more information than just the mean income, such as the county, type, longitude, latitude, median, and standard deviation. After cleaning up the data, we decided to keep the state, city, mean, median, and standard deviation. We decided to keep both the median and mean because a difference between the two could entail skewed data. The standard deviation is a good indicator of income inequality within a city - another data point that we plan to investigate. We found this data set from https://www.kaggle.com/goldenoakresearch/us-household-income-stats-geo-locations/notebooks

Another dataset we will be analyzing is police shootings that occurred between 2015 and 2020. The data set contains the victim's name, the data, race, gender, city, and other information about the shooting. Since we are just looking at the correlation between a city's income and the number of police shootings, we decided to rid most of the data besides where the shooting occurred and the victims' names. We found this data set from https://www.kaggle.com/ahsen1330/us-police-shootings

Since the raw counts of police brutality isn't comparable, since some cities have larger population than others. To ensure our data is comparable, we imported the population of every US city and town and divided turned police brutality cases into per 10,000 people. We recived the data from https://simplemaps.com/data/us-cities.

The mean, median, and standard deviation of the city income data set will be the basis for our project. We plan to compare the number of police shootings in a city with the city's income statistics to test our hypotheses. After mapping each shooting to a city, we will determine if there is a correlation between the data points. Modifying the data to our liking will require utilizing SQL like commands made available in Pandas. We will also use visualization techniques to graph and visualize any trends.

In terms of collaboration, we have been in contact for a few weeks now. After researching what information was available online, we decided that these two datasets could work great together. The two overlap with the necessary attributes to make an analysis and correlation. We expect to find a general trend between the two and are both interested in the possible results. We will not need any more data sets to test our hypotheses.

We have been collaborating through Zoom, sharing our screens, and switching off coding while the other assists. Jupyter Notebook has been an excellent tool for debugging and working with the dataset. Furthermore, GitHub has served as an environment to hold and commit new changes to our files. We have also utilized our local editors and used excel to preview and make changes to data organization.

In [30]:
#installing packages for graphs
!pip3 install geopandas
!pip install descartes
Requirement already satisfied: geopandas in /opt/conda/lib/python3.8/site-packages (0.8.1)
Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.8/site-packages (from geopandas) (1.1.0)
Requirement already satisfied: shapely in /opt/conda/lib/python3.8/site-packages (from geopandas) (1.7.1)
Requirement already satisfied: fiona in /opt/conda/lib/python3.8/site-packages (from geopandas) (1.8.18)
Requirement already satisfied: pyproj>=2.2.0 in /opt/conda/lib/python3.8/site-packages (from geopandas) (3.0.0.post1)
Requirement already satisfied: numpy>=1.15.4 in /opt/conda/lib/python3.8/site-packages (from pandas>=0.23.0->geopandas) (1.19.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.8/site-packages (from pandas>=0.23.0->geopandas) (2020.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.8/site-packages (from pandas>=0.23.0->geopandas) (2.8.1)
Requirement already satisfied: click<8,>=4.0 in /opt/conda/lib/python3.8/site-packages (from fiona->geopandas) (7.1.2)
Requirement already satisfied: attrs>=17 in /opt/conda/lib/python3.8/site-packages (from fiona->geopandas) (19.3.0)
Requirement already satisfied: certifi in /opt/conda/lib/python3.8/site-packages (from fiona->geopandas) (2020.6.20)
Requirement already satisfied: click-plugins>=1.0 in /opt/conda/lib/python3.8/site-packages (from fiona->geopandas) (1.1.1)
Requirement already satisfied: cligj>=0.5 in /opt/conda/lib/python3.8/site-packages (from fiona->geopandas) (0.7.1)
Requirement already satisfied: munch in /opt/conda/lib/python3.8/site-packages (from fiona->geopandas) (2.5.0)
Requirement already satisfied: six>=1.7 in /opt/conda/lib/python3.8/site-packages (from fiona->geopandas) (1.15.0)
Requirement already satisfied: descartes in /opt/conda/lib/python3.8/site-packages (1.1.0)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.8/site-packages (from descartes) (3.2.2)
Requirement already satisfied: numpy>=1.11 in /opt/conda/lib/python3.8/site-packages (from matplotlib->descartes) (1.19.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib->descartes) (1.2.0)
Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib->descartes) (2.8.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.8/site-packages (from matplotlib->descartes) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.8/site-packages (from matplotlib->descartes) (0.10.0)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.8/site-packages (from python-dateutil>=2.1->matplotlib->descartes) (1.15.0)

DATA COLLECTION

In [2]:
#centers all graphs for orginizational purposes
from IPython.core.display import HTML as Center

Center(""" <style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style> """)
Out[2]:
In [3]:
#impoting pandas libarary
import pandas as pd
import numpy as np
import geopandas as gpd

from shapely.geometry import Point
from geopandas import GeoDataFrame
from scipy import ndimage
from scipy import stats

#imports used for heat map
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
In [4]:
#importing income data and turning it into data frame
income_df = pd.read_csv("./data/kaggle_income_edited.csv")
income_df.head()
Out[4]:
id State_Name State_ab County City Place Type Area_Code Lat Lon Mean Median Stdev sum_w
0 1011000 Alabama AL Mobile County Chickasaw Chickasaw city City 251 30.771450 -88.079697 38773 30506 33101 1638.260513
1 1011010 Alabama AL Barbour County Louisville Clio city City 334 31.708516 -85.611039 37725 19528 43789 258.017685
2 1011020 Alabama AL Shelby County Columbiana Columbiana city City 205 33.191452 -86.615618 54606 31930 57348 926.031000
3 1011030 Alabama AL Mobile County Satsuma Creola city City 251 30.874343 -88.009442 63919 52814 47707 378.114619
4 1011040 Alabama AL Mobile County Dauphin Island Dauphin Island Town 251 30.250913 -88.171268 77948 67225 54270 282.320328
In [5]:
#removed columns from income data unnessary for our analysis
income_df = income_df[['State_Name','City','Mean','Median','Stdev','Lat','Lon']]

#rename state column to match income name
income_df.rename(columns = {'State_Name':'State'}, inplace=True)

income_df
Out[5]:
State City Mean Median Stdev Lat Lon
0 Alabama Chickasaw 38773 30506 33101 30.771450 -88.079697
1 Alabama Louisville 37725 19528 43789 31.708516 -85.611039
2 Alabama Columbiana 54606 31930 57348 33.191452 -86.615618
3 Alabama Satsuma 63919 52814 47707 30.874343 -88.009442
4 Alabama Dauphin Island 77948 67225 54270 30.250913 -88.171268
... ... ... ... ... ... ... ...
32521 Puerto Rico Guaynabo 30649 13729 37977 18.397925 -66.130633
32522 Puerto Rico Aguada 15520 9923 15541 18.385424 -67.203310
32523 Puerto Rico Aguada 41933 34054 31539 18.356565 -67.180686
32524 Puerto Rico Aguada 0 0 0 18.412041 -67.213413
32525 Puerto Rico Aguadilla 28049 20229 33333 18.478094 -67.160453

32526 rows × 7 columns

For our income data, we kepts all information we though would be usefull for the analyis. We thought mean, median and standard deviation would be the most important statistics, and lat and lon for mapping purposes.

In [6]:
#importing shootings data and turning it into data frame
shootings_df = pd.read_csv("./data/shootings.csv")
shootings_df
Out[6]:
id name date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera arms_category
0 3 Tim Elliot 2015-01-02 shot gun 53.0 M Asian Shelton WA True attack Not fleeing False Guns
1 4 Lewis Lee Lembke 2015-01-02 shot gun 47.0 M White Aloha OR False attack Not fleeing False Guns
2 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23.0 M Hispanic Wichita KS False other Not fleeing False Unarmed
3 8 Matthew Hoffman 2015-01-04 shot toy weapon 32.0 M White San Francisco CA True attack Not fleeing False Other unusual objects
4 9 Michael Rodriguez 2015-01-04 shot nail gun 39.0 M Hispanic Evans CO False attack Not fleeing False Piercing objects
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4890 5916 Rayshard Brooks 2020-06-12 shot Taser 27.0 M Black Atlanta GA False attack Foot True Electrical devices
4891 5925 Caine Van Pelt 2020-06-12 shot gun 23.0 M Black Crown Point IN False attack Car False Guns
4892 5918 Hannah Fizer 2020-06-13 shot unarmed 25.0 F White Sedalia MO False other Not fleeing False Unarmed
4893 5921 William Slyter 2020-06-13 shot gun 22.0 M White Kansas City MO False other Other False Guns
4894 5924 Nicholas Hirsh 2020-06-15 shot gun 31.0 M White Lawrence KS False attack Car False Guns

4895 rows × 15 columns

In [7]:
#crating  specific data frames for future graph use 
shootings_demographics = shootings_df[['race','gender','age']]

shootings_demographics
Out[7]:
race gender age
0 Asian M 53.0
1 White M 47.0
2 Hispanic M 23.0
3 White M 32.0
4 Hispanic M 39.0
... ... ... ...
4890 Black M 27.0
4891 Black M 23.0
4892 White F 25.0
4893 White M 22.0
4894 White M 31.0

4895 rows × 3 columns

In [8]:
#removed columns from shooting data unnessary for our analysis
shootings_df = shootings_df[['city','state']]

#rename state column to match income name
shootings_df.rename(columns = {'state':'State', 'city':'City'}, inplace=True)

shootings_df
/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py:4290: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(
Out[8]:
City State
0 Shelton WA
1 Aloha OR
2 Wichita KS
3 San Francisco CA
4 Evans CO
... ... ...
4890 Atlanta GA
4891 Crown Point IN
4892 Sedalia MO
4893 Kansas City MO
4894 Lawrence KS

4895 rows × 2 columns

For our police brutality data frame, we decided to seperate it into two different frames. One contianing the city and state of the case, and the other containing the demographics of the victims.

In [9]:
#importing population per city data frame and turning it into data frame
population_df = pd.read_csv('./data/uscities.csv', skipinitialspace=True)

population_df.head()
Out[9]:
city city_ascii state_id state_name county_fips county_name lat lng population density source military incorporated timezone ranking zips id
0 New York New York NY New York 36061 New York 40.6943 -73.9249 18713220 10715.0 polygon False True America/New_York 1 11229 11226 11225 11224 11222 11221 11220 1138... 1840034016
1 Los Angeles Los Angeles CA California 6037 Los Angeles 34.1139 -118.4068 12750807 3276.0 polygon False True America/Los_Angeles 1 90291 90293 90292 91316 91311 90037 90031 9000... 1840020491
2 Chicago Chicago IL Illinois 17031 Cook 41.8373 -87.6862 8604203 4574.0 polygon False True America/Chicago 1 60018 60649 60641 60640 60643 60642 60645 6064... 1840000494
3 Miami Miami FL Florida 12086 Miami-Dade 25.7839 -80.2102 6445545 5019.0 polygon False True America/New_York 1 33129 33125 33126 33127 33128 33149 33144 3314... 1840015149
4 Dallas Dallas TX Texas 48113 Dallas 32.7936 -96.7662 5743938 1526.0 polygon False True America/Chicago 1 75287 75098 75233 75254 75251 75252 75253 7503... 1840019440
In [10]:
#only keep columns that were needed
population_df = population_df[['state_name','city','population']]

#rename columns for merging purposes
population_df.rename(columns = {'state_name':'State', 'city':'City'}, inplace=True)

population_df
Out[10]:
State City population
0 New York New York 18713220
1 California Los Angeles 12750807
2 Illinois Chicago 8604203
3 Florida Miami 6445545
4 Texas Dallas 5743938
... ... ... ...
28367 California Poso Park 2
28368 Oklahoma Lotsee 2
28369 Minnesota The Ranch 2
28370 South Dakota Roswell 2
28371 Nebraska Monowi 1

28372 rows × 3 columns

For our polulation data, we only kept the 2019 population since that is all we needed to standardize our police brutality cases

DATA PROCESSING

In [11]:
#converts all State names to State codes
shootings_df['State'] = shootings_df['State'].map({
    'AL':'Alabama',
    'AK':'Alaska',
    'AZ':'Arizona',
    'AR':'Arkansas',
    'CA':'California',
    'CO':'Colorado',
    'CT':'Connecticut',
    'DE':'Delaware',
    'FL':'Florida',
    'GA':'Georgia',
    'HI':'Hawaii',
    'ID':'Idaho',
    'IL':'Illinois',
    'IN':'Indiana',
    'IA':'Iowa',
    'KS':'Kansas',
    'KY':'Kentucky',
    'LA':'Louisiana',
    'MA':'Maine',
    'MD':'Maryland',
    'MA':'Massachusetts',
    'MI':'Michigan',
    "MN":'Minnesota',
    "MS":'Mississippi',
    'MO': 'Missouri',
    'MT':'Montana',
    'NE':'Nebraska',
    'NV':'Nevada',
    'NH':'New Hampshire',
    'NJ':'New Jersey',
    'NM':'New Mexico',
    'NY':'New York',
    'NC':'North Carolina',
    'ND':'North Dakota',
    'OH':'Ohio',
    'OK':'Oklahoma',
    'OR':'Oregon',
    'PA':'Pennsylvania',
    'RI':'Rhode Island',
    'SC':'South Carolina',
    'SD':'South Dakota',
    'TN':'Tennessee',
    'TX':'Texas',
    'UT':'Utah',
    'VT':'Vermont',
    'VA':'Virginia',
    'WA':'Washington',
    'WV':'West Virginia',
    'WI':'Wisconsin',
    'WY':'Whyoming'

})

shootings_df
<ipython-input-11-77f39a93a66a>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  shootings_df['State'] = shootings_df['State'].map({
Out[11]:
City State
0 Shelton Washington
1 Aloha Oregon
2 Wichita Kansas
3 San Francisco California
4 Evans Colorado
... ... ...
4890 Atlanta Georgia
4891 Crown Point Indiana
4892 Sedalia Missouri
4893 Kansas City Missouri
4894 Lawrence Kansas

4895 rows × 2 columns

We had to change the state names to their associated state code so that we could properly merge the two data sets. We want to merge both data sets on city and state since there are some city names that are used multiple times. To ensure that these duplicate city names were not confused, we had to merge on state as well. The income data set had each city using a state name, but the shootings data set had each city using the state code.

In [12]:
#count amount of shootings in each city and create data frame
shooting_counts = shootings_df['City'].value_counts().rename_axis('City').reset_index(name='counts')

#combines data frames to include shooting count
shootings_df = shootings_df.merge(shooting_counts, on=['City'], how='inner')

#creates data frame that includes shooting count per city (only counting cities with shootings)
income_shooting_df_inner = income_df.merge(shootings_df, on=['City', 'State'], how='inner')

#drop dulicate cities that came from income dataset
income_shooting_df_inner.drop_duplicates(subset=['City'], inplace=True)

#sorts by city alphabetically for orginazation purposes
income_shooting_df_inner.sort_values('City', inplace=True)

income_shooting_df_inner
Out[12]:
State City Mean Median Stdev Lat Lon counts
572 Alabama Abbeville 40518 25216 39126 31.564689 -85.259124 1
99488 North Carolina Aberdeen 71839 61786 54687 35.124230 -79.424387 2
137882 Texas Abilene 46111 24248 45339 32.483595 -99.752554 3
77691 Maryland Abingdon 132958 300000 58676 39.483132 -76.291320 2
55916 Georgia Acworth 99238 300000 52431 34.034515 -84.707349 1
... ... ... ... ... ... ... ... ...
42232 California Yucca Valley 45030 31620 40694 34.151790 -116.431300 2
51584 Florida Yulee 82905 70640 62755 30.620989 -81.522183 2
2754 Arizona Yuma 41087 39081 20446 32.709354 -114.678223 3
103208 Ohio Zanesville 28137 14971 32851 39.923188 -82.011638 1
68365 Illinois Zion 42749 35242 31203 42.440624 -87.834396 2

1522 rows × 8 columns

In [15]:
#creates data frame that includes shooting count per city (counting all cities shootings)
income_shooting_df = income_df.merge(shootings_df, on=['City', 'State'], how='left')

#replaces 0 with nans so mean is not skewed
income_shooting_df.replace(0, np.nan, inplace=True)

#groups by city/states because there are multiple city/state combos
income_shooting_df_all = income_shooting_df.groupby(['City','State']).agg({'Mean': 'mean', 'Median': 'mean', 'Stdev': 'mean', 'counts':'mean'})
income_shooting_df_all = income_shooting_df_all.reset_index()

#sorts by city alphabetically for orginazation purposes
income_shooting_df_all.sort_values('City', inplace=True)

#replaces all Nans with 0
income_shooting_df_all.replace(np.nan, 0, inplace=True)

income_shooting_df_all
Out[15]:
City State Mean Median Stdev counts
0 Abbeville Alabama 40518.000000 25216.0 39126.000000 1.0
1 Abbeville Louisiana 45970.500000 38442.5 31766.500000 0.0
2 Abbeville South Carolina 45585.666667 32245.0 41647.333333 0.0
3 Abbotsford Wisconsin 58254.000000 44919.0 48933.000000 0.0
4 Aberdeen Maryland 96480.000000 80370.0 64102.000000 0.0
... ... ... ... ... ... ...
11223 Zionsville Indiana 83345.000000 73710.0 51584.000000 0.0
11224 Zolfo Springs Florida 38331.000000 32041.0 29600.000000 0.0
11225 Zumbrota Minnesota 56065.000000 45387.0 44259.500000 0.0
11226 Zuni New Mexico 44114.000000 36914.0 41463.000000 0.0
11227 Zwolle Louisiana 34639.000000 26065.0 31901.000000 0.0

11228 rows × 6 columns

Another issue we had with our data was that the income data was separated by area code and not just cities. We originally just dropped duplicate names, and used the data for the first instance of each name. However, after doing some deep analysis, we realized that we should not be doing so because then our data wouldn’t be accurate. Instead, we grouped all instances of a city/state pair, and found the mean of the Mean, Median, and Standard Deviation.

In [16]:
#adds population to each city/state combo
income_shooting_df_all = income_shooting_df_all.merge(population_df, on=['City', 'State'], how='left')

#creates a rate column to make sure data is comparable 
income_shooting_df_all['rate'] = (income_shooting_df_all['counts']/ income_shooting_df_all['population']) * 10000

income_shooting_df_all
Out[16]:
City State Mean Median Stdev counts population rate
0 Abbeville Alabama 40518.000000 25216.0 39126.000000 1.0 2560.0 3.90625
1 Abbeville Louisiana 45970.500000 38442.5 31766.500000 0.0 19470.0 0.00000
2 Abbeville South Carolina 45585.666667 32245.0 41647.333333 0.0 5019.0 0.00000
3 Abbotsford Wisconsin 58254.000000 44919.0 48933.000000 0.0 3833.0 0.00000
4 Aberdeen Maryland 96480.000000 80370.0 64102.000000 0.0 16019.0 0.00000
... ... ... ... ... ... ... ... ...
11227 Zionsville Indiana 83345.000000 73710.0 51584.000000 0.0 28357.0 0.00000
11228 Zolfo Springs Florida 38331.000000 32041.0 29600.000000 0.0 1773.0 0.00000
11229 Zumbrota Minnesota 56065.000000 45387.0 44259.500000 0.0 3403.0 0.00000
11230 Zuni New Mexico 44114.000000 36914.0 41463.000000 0.0 NaN NaN
11231 Zwolle Louisiana 34639.000000 26065.0 31901.000000 0.0 1937.0 0.00000

11232 rows × 8 columns

After merging our income, police brutality cases, and population data frames, we had to get the rate of police brutality cases per 10,000 people. To do that, we divided the count of police brutality cases by the population of the city, and then multiplied it by 10,000. By having the rates, the data is more comparable, since some cities have such a small population but high police brutality numbers.

EXPLORATORY ANALYSIS & DATA VISUALIZATION

In [17]:
#plots the longitutate and lagititude of each shooting
geometry = [Point(xy) for xy in zip(income_shooting_df_inner['Lon'], income_shooting_df_inner['Lat'])]
gdf = GeoDataFrame(income_shooting_df_inner, geometry=geometry)

#this is a simple map that goes with geopandas
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
ax = gdf.plot(ax=world.plot(figsize=(50, 10)), marker='o', color='red', markersize=15);

#zooms into the united states
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)

#sets map title
ax.set_title("Police shootings in the United States", fontsize=25)
Out[17]:
Text(0.5, 1.0, 'Police shootings in the United States')

In the map above we plotted the shootings in the US. Each red dot represents the location of a shooting. Clusters represent multiple shootings. If you notice the more populated cities, tend to have more shootings which makes sence. Futermore, we do not see as many police shootings in the midwest, or southwestern part of the country. Most shootings occur in the east, with the excpetion of California

In [18]:
from scipy import ndimage
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt

#code come from: https://nbviewer.jupyter.org/gist/perrygeo/c426355e40037c452434#function to create heatmap comes from 
def heatmap(d, bins=(100,100), smoothing=1.3, cmap='jet'):
    def getx(pt):
        return pt.coords[0][0]

    def gety(pt):
        return pt.coords[0][1]

    x = list(d.geometry.apply(getx))
    y = list(d.geometry.apply(gety))
    heatmap, xedges, yedges = np.histogram2d(y, x, bins=bins)
    extent = [yedges[0], yedges[-1], xedges[-1], xedges[0]]

    logheatmap = np.log(heatmap)
    logheatmap[np.isneginf(logheatmap)] = 0
    logheatmap = ndimage.filters.gaussian_filter(logheatmap, smoothing, mode='nearest')

    plt.imshow(logheatmap, cmap=cmap, extent=extent)
    plt.colorbar()
    plt.gca().invert_yaxis()
    plt.show()

heatmap(income_shooting_df_inner, bins=70, smoothing=1)
<ipython-input-18-01ce90eeba5e>:18: RuntimeWarning: divide by zero encountered in log
  logheatmap = np.log(heatmap)

Since the map before did not properly incdicate the amount of shootings per area, we decided to create a heat map to show the amount of shooting given in an rea. The hotter the color ( so the closer to red) the most shootings there are in a location. If you notice as before, most shootings occur in the east, with the exception of California.

In [19]:
#to display charts on top of eachother
plt.figure(0)

#counts race of each shooting victim
race_counts = shootings_demographics['race'].value_counts()

#plots counts into pie graph
race_counts.plot.pie(figsize=(7,7), colors=['darkgreen', 'crimson', 'pink', 'yellow', 'orange', 'brown'])
plt.title('Race of Victims ', fontsize=25)

#to display charts on top of eachother
plt.figure(1)

#counts gender of each shooting victim
gender_counts = shootings_demographics['gender'].value_counts()

#plots counts into pie graph
gender_counts.plot.pie(figsize=(7,7), colors=['blue', 'fuchsia'])
plt.title('Gender of Victims', fontsize=25)

#to display charts on top of eachother
plt.figure(2)

#plots age distribution of victimes
shootings_demographics['age'].plot.hist(bins=25, figsize=(10,5), color=['black'])
plt.title('Age distribution of Victims', fontsize=25)

#shows all the plots
plt.show()

The first pie chart above shows the proportion of shootings by race. The major take away from this graph is that proportionately speaking white people are more likely to get shot by a police than black people. However, the argument about black people experiencing worse treatment by cops than there counterparts derives from the percentage of shooting compared to there population. According to governing.com, African Americas comprise of 12.5% of the american population, but according to our stats make up approximately 25% of police brutality cases. Where as white people make up about 60% of the US polulation but only account for half of police brutality case.

The second pie chart shows the proportion of victims by race. Clearly there the majority of police brutality victims are males.

The last chart is a histogram of police brutality victims by age. The marjoity of victimes fall in the late 20s early 30 age group, with the least falling past 65.

In [20]:
#creates pivot table
race_gender = (shootings_demographics.
                   groupby(["gender", "race"])['gender'].
                   count())

#turns table in Dateframe
race_gender.to_frame()

#plots pivot table into bar graph
race_gender.plot.bar()
plt.title('Shooting Counts by Gender and Race', fontsize=25)
Out[20]:
Text(0.5, 1.0, 'Shooting Counts by Gender and Race')

The graph above shows the gender/race combo of the shootings. We created this graph so you could visualize the comparison of shootings depending on those two factors. If you notice, across the board, white males are the most likely victim of police brutality cases, second being black males and third being hispanic males.

ANALYSIS & HYPOTHESIS TESTING

In [21]:
#standardized mean and standard deviation to make graph more readable 
income_shooting_df_all['Mean_std'] = (
    (income_shooting_df_all['Mean'] - income_shooting_df_all['Mean'].mean()) /
    income_shooting_df_all['Mean'].std())

income_shooting_df_all['Stdev_std'] = (
    (income_shooting_df_all['Stdev'] - income_shooting_df_all['Stdev'].mean()) /
    income_shooting_df_all['Stdev'].std())

#removes all cities without a population or rate to get more accurate results 
income_shooting_df_all.dropna(axis=0, inplace=True)

income_shooting_df_all
Out[21]:
City State Mean Median Stdev counts population rate Mean_std Stdev_std
0 Abbeville Alabama 40518.000000 25216.0 39126.000000 1.0 2560.0 3.906250 -0.999387 -0.583758
1 Abbeville Louisiana 45970.500000 38442.5 31766.500000 0.0 19470.0 0.000000 -0.787691 -1.121972
2 Abbeville South Carolina 45585.666667 32245.0 41647.333333 0.0 5019.0 0.000000 -0.802632 -0.399368
3 Abbotsford Wisconsin 58254.000000 44919.0 48933.000000 0.0 3833.0 0.000000 -0.310779 0.133446
4 Aberdeen Maryland 96480.000000 80370.0 64102.000000 0.0 16019.0 0.000000 1.173361 1.242784
... ... ... ... ... ... ... ... ... ... ...
11226 Zion Illinois 45042.000000 40427.0 32533.000000 2.0 23487.0 0.851535 -0.823740 -1.065916
11227 Zionsville Indiana 83345.000000 73710.0 51584.000000 0.0 28357.0 0.000000 0.663389 0.327319
11228 Zolfo Springs Florida 38331.000000 32041.0 29600.000000 0.0 1773.0 0.000000 -1.084298 -1.280412
11229 Zumbrota Minnesota 56065.000000 45387.0 44259.500000 0.0 3403.0 0.000000 -0.395768 -0.208335
11231 Zwolle Louisiana 34639.000000 26065.0 31901.000000 0.0 1937.0 0.000000 -1.227641 -1.112136

9174 rows × 10 columns

Since the mean and standard deviation numbers had a larger range, we thought it would be easier and more understandable to standardize the mean and standard deviation. We did this by normalizing the numbers, so if you look at the graphs below, it makes more sense how the rates correlate.

In [22]:
#fills NaN with 0 so correlation purposes
income_shooting_df_all.fillna(0, inplace=True)

income_shooting_df_all.corr()
Out[22]:
Mean Median Stdev counts population rate Mean_std Stdev_std
Mean 1.000000 0.531137 0.843848 -0.011974 0.022935 -0.038483 1.000000 0.843848
Median 0.531137 1.000000 0.315678 -0.008648 0.005874 -0.030387 0.531137 0.315678
Stdev 0.843848 0.315678 1.000000 -0.007687 0.028223 -0.015460 0.843848 1.000000
counts -0.011974 -0.008648 -0.007687 1.000000 0.569619 0.162311 -0.011974 -0.007687
population 0.022935 0.005874 0.028223 0.569619 1.000000 -0.008561 0.022935 0.028223
rate -0.038483 -0.030387 -0.015460 0.162311 -0.008561 1.000000 -0.038483 -0.015460
Mean_std 1.000000 0.531137 0.843848 -0.011974 0.022935 -0.038483 1.000000 0.843848
Stdev_std 0.843848 0.315678 1.000000 -0.007687 0.028223 -0.015460 0.843848 1.000000
In [23]:
#creates cooraltion matrix between desired variables
variables = ['Mean', 'Stdev', 'rate']

correlation = income_shooting_df_all[variables].corr()

correlation['rate']
Out[23]:
Mean    -0.038483
Stdev   -0.015460
rate     1.000000
Name: rate, dtype: float64

Here we are showing the correlation between the amount of shootings with mean, median, and standard deviation. The first thing to note is the correlation between Mean and Counts with a correlation of -0.025369. Although the correlation is small, a negative correlation suggests that as mean income rises, the amount of police shootings decrease. This makes sence as I add assume places with more money have less police brutlaity cases. Another interesting point is that police brutality and standard deviation have a negative correlation, which suggests that cities with lower income inequality tend to see more police brulaity cases.

In [24]:
#graphing relationship between mean and counts
plt.figure(0)
income_shooting_df_all.plot.scatter(x='Mean_std', y='rate', figsize=(10,5))
plt.title('Means vs. Rates', fontsize=25)

#graphing relationship between standard deviation and counts
plt.figure(1)
income_shooting_df_all.plot.scatter(x='Stdev_std', y='rate', figsize=(10,5))
plt.title('Standard Deviation vs. Rates', fontsize=25)

plt.show()
<Figure size 432x288 with 0 Axes>

The graphs above are plots of the counts of the shootings vs. the mean income of cities and states. If you notice the counts vs. mean graph is skewed more to the left, which makes sense assuming that the there is a negative correlation in shootings. If you notice, the cities with the highest mean income tend to have 0 police brutality cases. On the other hand, the standard deviation vs. counts graphs shows something a bit different. With a slight left skew, but more centered data, we can see that there is a range of income inequality where we see the most police brutalities.

Overall, our project does support our hypothesis that low income areas tend to have more police shootings. This would make sense, and lower income areas tend to have more crime, and therefore stricter policing. Similarly, places with large income dispareties also have a stricter police force, which results in more cases of police brutality.

In [26]:
# refine a data frame to cities with populations greater than 100,000
mask = ((income_shooting_df_all['population'] > 100000))

large_cities_df = income_shooting_df_all.loc[mask]
In [27]:
#graphing relationship between mean and counts for cities with populations greater than 100,000
plt.figure(0)
large_cities_df.plot.scatter(x='Mean_std', y='rate', figsize=(10,5))
plt.title('Means vs. Rates (City Populations: >100,000)', fontsize=25)

#graphing relationship between standard deviation and counts for cities with populations greater than 100,000
plt.figure(1)
large_cities_df.plot.scatter(x='Stdev_std', y='rate', figsize=(10,5))
plt.title('Standard Deviation vs. Rates (City Populations: >100,000)', fontsize=25)

plt.show()
<Figure size 432x288 with 0 Axes>
In [28]:
#fills NaN with 0 so correlation purposes
large_cities_df.fillna(0, inplace=True)

large_cities_df.corr()
/opt/conda/lib/python3.8/site-packages/pandas/core/frame.py:4311: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(
Out[28]:
Mean Median Stdev counts population rate Mean_std Stdev_std
Mean 1.000000 0.783759 0.909140 -0.101282 0.038191 -0.106275 1.000000 0.909140
Median 0.783759 1.000000 0.597668 -0.104958 -0.037624 -0.041939 0.783759 0.597668
Stdev 0.909140 0.597668 1.000000 -0.084034 0.107669 -0.126123 0.909140 1.000000
counts -0.101282 -0.104958 -0.084034 1.000000 0.568763 0.344576 -0.101282 -0.084034
population 0.038191 -0.037624 0.107669 0.568763 1.000000 -0.138984 0.038191 0.107669
rate -0.106275 -0.041939 -0.126123 0.344576 -0.138984 1.000000 -0.106275 -0.126123
Mean_std 1.000000 0.783759 0.909140 -0.101282 0.038191 -0.106275 1.000000 0.909140
Stdev_std 0.909140 0.597668 1.000000 -0.084034 0.107669 -0.126123 0.909140 1.000000
In [29]:
#creates cooraltion matrix between desired variables
variables = ['Mean', 'Stdev', 'rate']

correlation = large_cities_df[variables].corr()

correlation['rate']
Out[29]:
Mean    -0.106275
Stdev   -0.126123
rate     1.000000
Name: rate, dtype: float64

In the above cell you can find the correlations for the 100,000 city population data. We wanted to refine our data to large cities in order to analyze the shooting income correlations for high population areas. As compared to the total country data this more refined data provides a stronger correlation. A negative correlation here suggests that as mean income rises, the amount of police shootings decrease. Again, this makes sence as I add assume places with more money have less police brutlaity cases. Furthermore, cities with lower income inequality tend to see more police brulaity cases.

You can observe similar trends seen in the origial two graphs, however within a more refined an accurate range. The mean graph is mostly skewed to the left. This supports our original analysis that for areas with lower income, they experience higher levels of police brutality. Furthermore, for the standard deviation graph, again it is mostly centered. However, for this more refined scale we can observe a stronger skew to the left. These two observations support our original hypothesis that low income areas are subjected to high levels of police brutality.

INSIGHT & POLICY DECISION

Utilizing our analysis, we can effectively locate areas subjected to vulnerability. We can share our findings with the police force in order to more educate, prepare, and avoid for high volumes of shootings for a given city. From this project we have learned just how powerful the intersection and collaboration of a variety of python data tools can be in order to perform an in depth analytical study. We also learned that raw rates are not always effective in comparing certain statistics. Overall, people can use Data analysis to solve problems that face our everyday lives.

Further Resources

An article about the death of george floyd is attached here: https://www.nytimes.com/2020/05/31/us/george-floyd-investigation.html

A link to a non-profit group called the marshall project, which is speer heading the police reform movement is attached here: https://www.themarshallproject.org/records/110-police-reform

In [ ]: