Problem Statement
You’re working as a data scientist with a research firm that specializes in emergency management. In advance of client work, you’ve been asked to create and train a logistic regression model that can show off the firm’s capabilities in disaster analysis.
After a disaster, researchers and firms are often brought in to give an independent review of the incident. While your firm doesn’t have any current client data that it can share with you so that you may test and deploy your model, it does have data from the 1912 Titanic disaster stored in a remote database.
In this project, we’ll be using data on passengers from the Titanic disaster to show off your analytical capabilities. The data is stored in a remote database, so you’ll need to set up a connection and query the database (using Python!). Afterward, you’ll construct a logistic regression model and test and validate its results so that it’s ready to deploy with a client.
Acquiring Data and Exploring its Characteristics
First, I connected to the remote Postgres database through the terminal, then used the SQLAlchemy Python library to import the data into the notebook. Exploring the data through a variety of methods, including Seaborn heat maps, pair plots, .info(), and .describe(), gave a high-level view of the data. I also created a bar chart of the mean survival rate by age, where 1 means survived and 0 means died. This suggested which ages had a better probability of surviving; later steps in the process will test this hypothesis.
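A minimal sketch of that workflow is below; the connection string, the passengers table name, and the lowercase column names are assumptions for illustration, not the firm’s actual credentials or schema.

```python
# A minimal sketch of the data-loading step. The connection string,
# table name ("passengers"), and column names are placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/titanic")
df = pd.read_sql("SELECT * FROM passengers", engine)

# High-level view of the data
df.info()
print(df.describe())

# Missing values appear as bright bands in a null heatmap
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Mean survival rate (1 = survived, 0 = died) by age
df.groupby("age")["survived"].mean().plot(kind="bar")
plt.show()
```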
Data cleaning began with creating dummy variables for gender so the model could process this variable more easily. According to National Statistical Standards, the age distribution of a population for demographic purposes should be given in five-year age groups extending to 85 years and over, so I binned the passengers’ ages accordingly. Once again I dummied the binned variable along with the rest of the information that would go into the model. This brought my inputs to age, gender, class, and number of parents/children aboard the Titanic.
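A sketch of these cleaning steps, assuming the column names sex, age, pclass, parch, and survived:

```python
# A sketch of the cleaning steps; the column names are assumptions.
import pandas as pd

# Drop rows with no recorded age (see the assumptions below)
df = df.dropna(subset=["age"])

# Five-year age bins, with 85+ as the final open-ended group
bins = list(range(0, 86, 5)) + [float("inf")]
df["age_group"] = pd.cut(df["age"], bins=bins, right=False)

# One-hot encode the categorical inputs; drop_first avoids
# perfectly collinear dummy columns
X = pd.get_dummies(
    df[["sex", "age_group", "pclass", "parch"]],
    columns=["sex", "age_group", "pclass"],
    drop_first=True,
)
y = df["survived"]
```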
Assumptions and Risks
Before going into model creation, I must admit two assumptions: that rows missing an age can be dropped while leaving the remaining data representative of the whole, and, since this data was recorded over a century ago, that the survived labels themselves are accurate.
Model Overview
As always, once the data is cleaned, a train/test split must occur. Once the four pieces of data (X_train, X_test, y_train, y_test) were created, I ran a standard logistic regression.
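A minimal sketch of the split and fit, continuing from the cleaned X and y above (the test size and random state are illustrative choices):

```python
# Split the cleaned data, then fit a standard logistic regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```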
Model Performance
The strongest variable is gender, with a coefficient of -2.09. There are many ways to test the performance of this model; I used a cross-validation score, a classification report (including an F1 score), a confusion matrix, and the area under the ROC curve.
Let’s take these one at a time. The cross-validation score with 3 k-folds was 0.84. I scored it with “f1_weighted”, so, as you may guess, it is on a scale of 0 to 1, meaning the model performed pretty well. The classification report also returns an F1 score; this one does not use folds and instead compares the test labels against the predicted values. This F1 score was 0.82: also good, but not great. Both numbers reflect the balance between precision and recall that many other evaluation metrics attempt to capture, and comparing the cross-validated score against the held-out score is also a useful check against overfitting.
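Continuing from the fit above, a sketch of the coefficient inspection and these first two checks, with the same settings described here:

```python
# Inspect coefficients, then run 3-fold CV and a classification report
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Fitted coefficients by feature (gender was the strongest, at -2.09)
print(pd.Series(model.coef_[0], index=X_train.columns))

# 3-fold cross-validation, scored with weighted F1
scores = cross_val_score(model, X, y, cv=3, scoring="f1_weighted")
print(scores.mean())

# Precision, recall, and F1 on the held-out test set
print(classification_report(y_test, y_pred))
```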
The confusion matrix showed similar findings. The model misclassified about 18% of the test set: specifically, there were 15 false positives and 28 false negatives.
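For reference, the matrix behind those counts comes from a single call on the held-out predictions:

```python
# Confusion matrix: rows are actual classes, columns are predictions
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))
```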
The last way I evaluated this model is with the area under the ROC curve (AUC). This, too, is on a scale of 0 to 1 (0.5 is no better than random guessing), and it measures how well the model separates survivors from non-survivors.
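A sketch of that calculation, using predicted probabilities rather than hard class labels:

```python
# Area under the ROC curve, computed from predicted probabilities
from sklearn.metrics import roc_auc_score

probs = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))
```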
Model Improvements
As with any data science problem, models and data can always be tweaked to improve results, for example with a grid search or other types of models. I utilized KNN and a grid search in my Jupyter Notebook, which you can see in the link below; a sketch of that search follows.
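A rough sketch of what that KNN grid search might look like (the parameter grid here is illustrative, not the exact one from the notebook):

```python
# Grid search over KNN hyperparameters; the grid is illustrative
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=3, scoring="f1_weighted")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```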
Find the Jupyter Notebook for this project here