Marc Ojalvo & Kyle Strougo
Website: https://mojalvo.github.io/
Since 2015, there have been almost 5,000 cases of police brutality in the United States. With the death of George Floyd this past summer, there has been a call to action to reform the policing system to prevent unnecessary deaths from occurring. (Further reading about these issues is attached at the end of the notebook.)
Our project hopes to find patterns in police brutality to inform and educate others about the groups most susceptible to becoming victims. We hope to use data to uncover patterns and valuable statistics.
Our project will look at the correlation between cities' incomes and where police shootings occur. We are attempting to answer the following questions: Are there more police shootings in low-income areas than in higher-income areas? Is there a relationship between income inequality and police shootings?
We hypothesize that there are more police shootings in low-income areas than in higher-income areas and that there is a correlation between income inequality and police shootings.
The primary data set we will be looking at contains the mean income of all U.S. cities and towns as of 2019. The data set includes more information than just the mean income, such as the county, type, longitude, latitude, median, and standard deviation. After cleaning up the data, we decided to keep the state, city, mean, median, and standard deviation. We kept both the median and the mean because a difference between the two can indicate skewed data. The standard deviation is a good indicator of income inequality within a city - another data point that we plan to investigate. We found this data set at https://www.kaggle.com/goldenoakresearch/us-household-income-stats-geo-locations/notebooks
Another data set we will be analyzing covers police shootings that occurred between 2015 and 2020. It contains the victim's name, the date, race, gender, city, and other information about the shooting. Since we are just looking at the correlation between a city's income and the number of police shootings, we decided to drop most of the columns, keeping only where the shooting occurred and the victims' names. We found this data set at https://www.kaggle.com/ahsen1330/us-police-shootings
Raw counts of police brutality cases are not directly comparable, since some cities have larger populations than others. To make our data comparable, we imported the population of every U.S. city and town and converted police brutality cases into a rate per 10,000 people. We retrieved the data from https://simplemaps.com/data/us-cities.
The mean, median, and standard deviation of the city income data set will be the basis for our project. We plan to compare the number of police shootings in a city with the city's income statistics to test our hypotheses. After mapping each shooting to a city, we will determine whether there is a correlation between the data points. Modifying the data to our liking will require the SQL-like operations made available in pandas, and we will use visualization techniques to graph any trends. A rough sketch of the core steps appears below.
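As a minimal sketch of the plan (the frames and values below are hypothetical stand-ins, not our real data; the actual cleaning and merging steps appear later in the notebook), the core operations are a SQL-like join of the income, shootings, and population tables followed by a per-10,000 rate calculation:
#hypothetical toy frames illustrating the planned join and rate calculation
import pandas as pd
income = pd.DataFrame({'City': ['Springfield'], 'State': ['Illinois'], 'Mean': [52000]})
shootings = pd.DataFrame({'City': ['Springfield'], 'State': ['Illinois'], 'counts': [3]})
population = pd.DataFrame({'City': ['Springfield'], 'State': ['Illinois'], 'population': [115000]})
#join on city and state, then standardize the raw count to a rate per 10,000 people
merged = income.merge(shootings, on=['City', 'State']).merge(population, on=['City', 'State'])
merged['rate'] = merged['counts'] / merged['population'] * 10000
print(merged)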
In terms of collaboration, we have been in contact for a few weeks now. After researching what information was available online, we decided that these two data sets would work well together: they overlap on the attributes necessary to perform the analysis and measure a correlation. We expect to find a general trend between the two and are both interested in the possible results. We will not need any more data sets to test our hypotheses.
We have been collaborating through Zoom, sharing our screens, and switching off coding while the other assists. Jupyter Notebook has been an excellent tool for debugging and working with the data sets. Furthermore, GitHub has served as an environment to hold and commit new changes to our files. We have also utilized our local editors and used Excel to preview and make changes to the data organization.
#installing packages for graphs
!pip3 install geopandas
!pip install descartes
#centers all graphs for organizational purposes
from IPython.core.display import HTML as Center
Center(""" <style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style> """)
#importing pandas and other libraries
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from geopandas import GeoDataFrame
from scipy import ndimage
from scipy import stats
#imports used for heat map
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
#importing income data and turning it into data frame
income_df = pd.read_csv("./data/kaggle_income_edited.csv")
income_df.head()
#remove columns from the income data that are unnecessary for our analysis
income_df = income_df[['State_Name','City','Mean','Median','Stdev','Lat','Lon']]
#rename state column for consistency when merging data sets
income_df.rename(columns = {'State_Name':'State'}, inplace=True)
income_df
For our income data, we kept all the information we thought would be useful for the analysis. We considered the mean, median, and standard deviation to be the most important statistics, and kept latitude and longitude for mapping purposes.
#importing shootings data and turning it into data frame
shootings_df = pd.read_csv("./data/shootings.csv")
shootings_df
#creating a demographics data frame for future graph use
shootings_demographics = shootings_df[['race','gender','age']]
shootings_demographics
#remove columns from the shootings data that are unnecessary for our analysis
shootings_df = shootings_df[['city','state']]
#rename the city and state columns to match the income data
shootings_df.rename(columns = {'state':'State', 'city':'City'}, inplace=True)
shootings_df
For our police brutality data, we decided to separate it into two different frames: one containing the city and state of each case, and the other containing the demographics of the victims.
#importing population-per-city data and turning it into a data frame
population_df = pd.read_csv('./data/uscities.csv', skipinitialspace=True)
population_df.head()
#only keep columns that were needed
population_df = population_df[['state_name','city','population']]
#rename columns for merging purposes
population_df.rename(columns = {'state_name':'State', 'city':'City'}, inplace=True)
population_df
For our population data, we only kept the 2019 population, since that is all we needed to standardize our police brutality counts.
#converts all state codes to full state names (to match the income data)
shootings_df['State'] = shootings_df['State'].map({
'AL':'Alabama',
'AK':'Alaska',
'AZ':'Arizona',
'AR':'Arkansas',
'CA':'California',
'CO':'Colorado',
'CT':'Connecticut',
'DE':'Delaware',
'FL':'Florida',
'GA':'Georgia',
'HI':'Hawaii',
'ID':'Idaho',
'IL':'Illinois',
'IN':'Indiana',
'IA':'Iowa',
'KS':'Kansas',
'KY':'Kentucky',
'LA':'Louisiana',
'ME':'Maine',
'MD':'Maryland',
'MA':'Massachusetts',
'MI':'Michigan',
"MN":'Minnesota',
"MS":'Mississippi',
'MO': 'Missouri',
'MT':'Montana',
'NE':'Nebraska',
'NV':'Nevada',
'NH':'New Hampshire',
'NJ':'New Jersey',
'NM':'New Mexico',
'NY':'New York',
'NC':'North Carolina',
'ND':'North Dakota',
'OH':'Ohio',
'OK':'Oklahoma',
'OR':'Oregon',
'PA':'Pennsylvania',
'RI':'Rhode Island',
'SC':'South Carolina',
'SD':'South Dakota',
'TN':'Tennessee',
'TX':'Texas',
'UT':'Utah',
'VT':'Vermont',
'VA':'Virginia',
'WA':'Washington',
'WV':'West Virginia',
'WI':'Wisconsin',
'WY':'Wyoming'
})
shootings_df
We had to change the state codes in the shootings data to their full state names so that we could properly merge the two data sets. We want to merge both data sets on city and state, since some city names are used in multiple states. To ensure that these duplicate city names were not confused, we had to merge on state as well. The income data set lists each city with a full state name, but the shootings data set uses state codes.
#count the number of shootings in each city/state pair and create a data frame
shooting_counts = shootings_df.groupby(['City', 'State']).size().reset_index(name='counts')
#combines data frames to include the shooting count
shootings_df = shootings_df.merge(shooting_counts, on=['City', 'State'], how='inner')
#creates data frame that includes shooting count per city (only counting cities with shootings)
income_shooting_df_inner = income_df.merge(shootings_df, on=['City', 'State'], how='inner')
#drop duplicate city/state pairs that came from the income dataset
income_shooting_df_inner.drop_duplicates(subset=['City', 'State'], inplace=True)
#sorts by city alphabetically for organizational purposes
income_shooting_df_inner.sort_values('City', inplace=True)
income_shooting_df_inner
#creates data frame that includes the shooting count per city (including cities with no shootings)
income_shooting_df = income_df.merge(shootings_df, on=['City', 'State'], how='left')
#replaces 0 with NaN so the mean is not skewed
income_shooting_df.replace(0, np.nan, inplace=True)
#groups by city/state because there are multiple rows per city/state combination
income_shooting_df_all = income_shooting_df.groupby(['City','State']).agg({'Mean': 'mean', 'Median': 'mean', 'Stdev': 'mean', 'counts':'mean'})
income_shooting_df_all = income_shooting_df_all.reset_index()
#sorts by city alphabetically for organizational purposes
income_shooting_df_all.sort_values('City', inplace=True)
#replaces all NaNs with 0
income_shooting_df_all.replace(np.nan, 0, inplace=True)
income_shooting_df_all
Another issue with our data was that the income data was broken down by area, not just by city, so a single city could appear in multiple rows. We originally just dropped duplicate names and used the data from the first instance of each name. However, after further analysis, we realized that doing so made our data inaccurate. Instead, we grouped all instances of a city/state pair and took the mean of the Mean, Median, and Standard Deviation columns.
#adds population to each city/state combo
income_shooting_df_all = income_shooting_df_all.merge(population_df, on=['City', 'State'], how='left')
#creates a rate column to make sure data is comparable
income_shooting_df_all['rate'] = (income_shooting_df_all['counts']/ income_shooting_df_all['population']) * 10000
income_shooting_df_all
After merging our income, police brutality, and population data frames, we computed the rate of police brutality cases per 10,000 people. To do that, we divided the count of police brutality cases by the population of the city and then multiplied by 10,000. For example, a city with 5 cases and a population of 200,000 would have a rate of (5 / 200,000) * 10,000 = 0.25 cases per 10,000 people. With rates, the data is more comparable, since some cities have a small population but a high number of police brutality cases.
#plots the longitude and latitude of each shooting
geometry = [Point(xy) for xy in zip(income_shooting_df_inner['Lon'], income_shooting_df_inner['Lat'])]
gdf = GeoDataFrame(income_shooting_df_inner, geometry=geometry)
#this is a simple map that goes with geopandas
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
ax = gdf.plot(ax=world.plot(figsize=(50, 10)), marker='o', color='red', markersize=15);
#zooms into the united states
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
#sets map title
ax.set_title("Police shootings in the United States", fontsize=25)
In the map above we plotted the shootings in the US. Each red dot represents the location of a shooting; clusters represent multiple shootings. Notice that the more populated cities tend to have more shootings, which makes sense. Furthermore, we do not see as many police shootings in the Midwest or the southwestern part of the country. Most shootings occur in the East, with the exception of California.
from scipy import ndimage
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
#heatmap function adapted from: https://nbviewer.jupyter.org/gist/perrygeo/c426355e40037c452434
def heatmap(d, bins=(100,100), smoothing=1.3, cmap='jet'):
    def getx(pt):
        return pt.coords[0][0]
    def gety(pt):
        return pt.coords[0][1]
    x = list(d.geometry.apply(getx))
    y = list(d.geometry.apply(gety))
    heatmap, xedges, yedges = np.histogram2d(y, x, bins=bins)
    extent = [yedges[0], yedges[-1], xedges[-1], xedges[0]]
    #log-scale the counts so a few dense cells do not wash out the rest of the map
    logheatmap = np.log(heatmap)
    logheatmap[np.isneginf(logheatmap)] = 0
    #smooth the log counts with a Gaussian filter
    logheatmap = ndimage.gaussian_filter(logheatmap, smoothing, mode='nearest')
    plt.imshow(logheatmap, cmap=cmap, extent=extent)
    plt.colorbar()
    plt.gca().invert_yaxis()
    plt.show()
heatmap(income_shooting_df_inner, bins=70, smoothing=1)
Since the previous map did not properly indicate the number of shootings per area, we created a heat map to show the number of shootings in a given area. The hotter the color (the closer to red), the more shootings there are in that location. As before, most shootings occur in the East, with the exception of California.
#to display charts on top of each other
plt.figure(0)
#counts race of each shooting victim
race_counts = shootings_demographics['race'].value_counts()
#plots counts into pie graph
race_counts.plot.pie(figsize=(7,7), colors=['darkgreen', 'crimson', 'pink', 'yellow', 'orange', 'brown'])
plt.title('Race of Victims ', fontsize=25)
#to display charts on top of each other
plt.figure(1)
#counts gender of each shooting victim
gender_counts = shootings_demographics['gender'].value_counts()
#plots counts into pie graph
gender_counts.plot.pie(figsize=(7,7), colors=['blue', 'fuchsia'])
plt.title('Gender of Victims', fontsize=25)
#to display charts on top of each other
plt.figure(2)
#plots age distribution of victims
shootings_demographics['age'].plot.hist(bins=25, figsize=(10,5), color=['black'])
plt.title('Age distribution of Victims', fontsize=25)
#shows all the plots
plt.show()
The first pie chart above shows the proportion of shootings by race. In raw counts, more white people are shot by police than black people. However, the argument that black people experience worse treatment by police than their counterparts comes from comparing each group's share of shootings to its share of the population. According to governing.com, African Americans comprise about 12.5% of the American population but, according to our statistics, make up approximately 25% of police brutality cases, whereas white people make up about 60% of the US population but account for only about half of police brutality cases.
The second pie chart shows the proportion of victims by gender. Clearly, the majority of police brutality victims are male.
The last chart is a histogram of police brutality victims by age. The majority of victims fall in the late-20s to early-30s age group, with the fewest older than 65.
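As a rough check on the population-share comparison above, here is a minimal sketch that computes each race's share of the shootings from the shootings_demographics frame and compares it to its share of the population. The race labels and population percentages below are our own assumed figures for illustration, not values taken from the data sets:
#share of shootings by race, as percentages
shooting_share = shootings_demographics['race'].value_counts(normalize=True) * 100
#approximate US population shares in percent (assumed figures, for illustration only)
population_share = {'White': 60.0, 'Black': 12.5, 'Hispanic': 18.5}
for race, pop_pct in population_share.items():
    if race in shooting_share:
        print(f"{race}: {shooting_share[race]:.1f}% of shootings vs. {pop_pct:.1f}% of the population")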
#counts shootings by gender and race
race_gender = (shootings_demographics.
groupby(["gender", "race"])['gender'].
count())
#turns the counts into a DataFrame
race_gender.to_frame()
#plots pivot table into bar graph
race_gender.plot.bar()
plt.title('Shooting Counts by Gender and Race', fontsize=25)
The graph above shows the counts of shootings by gender/race combination. We created this graph so you could compare shootings across those two factors at once. Notice that, across the board, white males are the most common victims of police brutality cases, with black males second and Hispanic males third.
#standardized mean and standard deviation to make graph more readable
income_shooting_df_all['Mean_std'] = (
(income_shooting_df_all['Mean'] - income_shooting_df_all['Mean'].mean()) /
income_shooting_df_all['Mean'].std())
income_shooting_df_all['Stdev_std'] = (
(income_shooting_df_all['Stdev'] - income_shooting_df_all['Stdev'].mean()) /
income_shooting_df_all['Stdev'].std())
#removes all cities without a population or rate to get more accurate results
income_shooting_df_all.dropna(axis=0, inplace=True)
income_shooting_df_all
Since the mean and standard deviation columns have a much larger range than the rate, we thought the graphs would be easier to read if we standardized them. We did this by converting each value to a z-score, so the graphs below show how the rates correlate on a common scale.
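In other words (restating the standardization the code above performs), each standardized value is a z-score:

$$z = \frac{x - \bar{x}}{s}$$

where $\bar{x}$ is the column mean and $s$ is the column standard deviation.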
#fills NaN with 0 for correlation purposes
income_shooting_df_all.fillna(0, inplace=True)
income_shooting_df_all.corr()
#creates correlation matrix between desired variables
variables = ['Mean', 'Stdev', 'rate']
correlation = income_shooting_df_all[variables].corr()
correlation['rate']
Here we are showing the correlation of the shooting rate with the mean income and the standard deviation. The first thing to note is the correlation between Mean and rate, which is -0.025369. Although the correlation is small, a negative correlation suggests that as mean income rises, the number of police shootings decreases. This makes sense, as we would assume places with more money have fewer police brutality cases. Another interesting point is that the rate and the standard deviation also have a negative correlation, which suggests that cities with lower income inequality tend to see more police brutality cases.
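To gauge how much weight these small correlations can bear, one option (not part of our original analysis) is to also look at a significance test. Here is a minimal sketch using scipy.stats.pearsonr on the income_shooting_df_all frame built above:
#stats was already imported above; repeated here so the cell stands alone
from scipy import stats
#Pearson correlation and p-value between mean income and shooting rate
r_mean, p_mean = stats.pearsonr(income_shooting_df_all['Mean'], income_shooting_df_all['rate'])
#Pearson correlation and p-value between income inequality (standard deviation) and shooting rate
r_std, p_std = stats.pearsonr(income_shooting_df_all['Stdev'], income_shooting_df_all['rate'])
print(f"Mean vs. rate:  r = {r_mean:.3f}, p = {p_mean:.3f}")
print(f"Stdev vs. rate: r = {r_std:.3f}, p = {p_std:.3f}")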
#graphing relationship between standardized mean income and shooting rate
plt.figure(0)
income_shooting_df_all.plot.scatter(x='Mean_std', y='rate', figsize=(10,5))
plt.title('Means vs. Rates', fontsize=25)
#graphing relationship between standardized income standard deviation and shooting rate
plt.figure(1)
income_shooting_df_all.plot.scatter(x='Stdev_std', y='rate', figsize=(10,5))
plt.title('Standard Deviation vs. Rates', fontsize=25)
plt.show()
The graphs above plot the shooting rate against the standardized mean income and standardized income standard deviation of each city. Notice that the rate vs. mean graph is skewed to the left, which makes sense given the negative correlation: the cities with the highest mean incomes tend to have rates near 0. On the other hand, the standard deviation vs. rate graph shows something a bit different. With a slight left skew but more centered data, we can see that there is a middle range of income inequality where we see the most police brutality cases.
Overall, our project does support our hypothesis that low-income areas tend to have more police shootings. This would make sense, as lower-income areas tend to have more crime and therefore stricter policing. Similarly, places with large income disparities also have stricter police forces, which results in more cases of police brutality.
# refine a data frame to cities with populations greater than 100,000
mask = income_shooting_df_all['population'] > 100000
large_cities_df = income_shooting_df_all.loc[mask].copy()
#graphing relationship between standardized mean income and shooting rate for cities with populations greater than 100,000
plt.figure(0)
large_cities_df.plot.scatter(x='Mean_std', y='rate', figsize=(10,5))
plt.title('Means vs. Rates (City Populations: >100,000)', fontsize=25)
#graphing relationship between standardized income standard deviation and shooting rate for cities with populations greater than 100,000
plt.figure(1)
large_cities_df.plot.scatter(x='Stdev_std', y='rate', figsize=(10,5))
plt.title('Standard Deviation vs. Rates (City Populations: >100,000)', fontsize=25)
plt.show()
#fills NaN with 0 for correlation purposes
large_cities_df.fillna(0, inplace=True)
large_cities_df.corr()
#creates correlation matrix between desired variables
variables = ['Mean', 'Stdev', 'rate']
correlation = large_cities_df[variables].corr()
correlation['rate']
In the above cell you can find the correlations for cities with populations over 100,000. We wanted to refine our data to large cities in order to analyze the income/shooting correlations for high-population areas. Compared to the data for the whole country, this more refined data provides a stronger correlation. A negative correlation here suggests that as mean income rises, the number of police shootings decreases. Again, this makes sense, as we would assume places with more money have fewer police brutality cases. Furthermore, cities with lower income inequality tend to see more police brutality cases.
You can observe trends similar to the original two graphs, but within a more refined and accurate range. The mean graph is mostly skewed to the left, which supports our original analysis that lower-income areas experience higher levels of police brutality. The standard deviation graph is again mostly centered; however, on this more refined scale we can observe a stronger skew to the left. These two observations support our original hypothesis that low-income areas are subjected to higher levels of police brutality.
Utilizing our analysis, we can locate the most vulnerable areas. We can share our findings with police departments to better educate and prepare them, and to help prevent high volumes of shootings in a given city. From this project we learned just how powerful the combination of a variety of Python data tools can be for performing an in-depth analytical study. We also learned that raw counts are not always effective for comparing certain statistics. Overall, data analysis can be used to solve problems that affect our everyday lives.
An article about the death of George Floyd is attached here: https://www.nytimes.com/2020/05/31/us/george-floyd-investigation.html
A link to the Marshall Project, a non-profit group that is spearheading the police reform movement, is attached here: https://www.themarshallproject.org/records/110-police-reform