Problem Statement
“You’re working as a data scientist for a contracting firm that’s rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal thinks the best way to gauge salary amounts is to take a look at what industry factors influence the pay scale for these professionals.”
Data Scraping and Cleaning
To explore data science salaries, the team began by scraping Indeed.com job postings to gather information about salaries, titles and companies. The scraper was a modified version of one provided by a teammate. We scraped over 10,000 job postings from 14 major tech cities across the US. The data was then cleaned to make it easier to analyze: we removed job postings without a listed salary, simplified job titles into a set number of categories and converted all pay units to a yearly amount. For that conversion, we assumed a 50-week working year, as that tends to be the average.
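The pay-unit conversion can be sketched roughly as follows. This is a minimal illustration, not the code from our notebook: only the 50-week year comes from the description above, while the 40-hour week, the 5-day week, the unit labels and the helper name to_yearly are assumptions made for the example.

```python
# Only WEEKS_PER_YEAR comes from the write-up; the other factors are assumptions.
WEEKS_PER_YEAR = 50
HOURS_PER_WEEK = 40  # assumption: full-time schedule
DAYS_PER_WEEK = 5    # assumption: standard working week

# How many pay periods of each unit fit into one working year.
PERIODS_PER_YEAR = {
    "hour": HOURS_PER_WEEK * WEEKS_PER_YEAR,
    "day": DAYS_PER_WEEK * WEEKS_PER_YEAR,
    "week": WEEKS_PER_YEAR,
    "month": 12,
    "year": 1,
}


def to_yearly(amount, unit):
    """Convert a pay amount quoted per `unit` into a yearly amount."""
    if unit not in PERIODS_PER_YEAR:
        raise ValueError(f"unknown pay unit: {unit!r}")
    return amount * PERIODS_PER_YEAR[unit]


hourly_as_yearly = to_yearly(25, "hour")    # 25 * 40 * 50
weekly_as_yearly = to_yearly(1000, "week")  # 1000 * 50
```

Changing any of the assumed factors shifts every converted salary, so postings quoted hourly are the most sensitive to this choice.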
Assumptions and Risks
As with any project, our team identified a set of assumptions and risks in our data:
–The scraped data is a representative sample.
–The salaries posted online reflect actual salaries of those in the industry.
–The direction of causality cannot be determined.
–Few job postings include a salary.
Model Overview
Based on the data, we built a model to predict whether a salary would be above or below the mean. This binary outcome was the target variable for the model. Building a model that predicted exact amounts would not be very useful, as salaries vary so much depending on the person who is hired. The features used in the model were dummy variables for the title, state, city and company.
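To make the setup concrete, here is a small sketch of how a binary target and dummy variables can be derived. The four rows are made-up illustration data, not scraped postings, and the helper names make_target and make_dummies are hypothetical. In practice a library routine such as pandas.get_dummies does the same job.

```python
from statistics import mean

# Made-up rows for illustration only; not taken from the scraped data.
postings = [
    {"title": "data scientist", "city": "Austin", "salary": 120000},
    {"title": "data analyst", "city": "Austin", "salary": 70000},
    {"title": "data scientist", "city": "Seattle", "salary": 150000},
    {"title": "data analyst", "city": "Seattle", "salary": 80000},
]


def make_target(rows):
    """Return 1 for salaries above the mean salary, 0 otherwise."""
    avg = mean(r["salary"] for r in rows)
    return [1 if r["salary"] > avg else 0 for r in rows]


def make_dummies(rows, columns):
    """One-hot encode the given categorical columns.

    Returns the dummy column names and one 0/1 row per input row.
    """
    levels = sorted({(c, r[c]) for r in rows for c in columns})
    names = [f"{c}_{v}" for c, v in levels]
    matrix = [[1 if r[c] == v else 0 for c, v in levels] for r in rows]
    return names, matrix


y = make_target(postings)
feature_names, X = make_dummies(postings, ["title", "city"])
```

Each categorical value becomes its own 0/1 column, so a classifier can work with titles, cities and companies without treating them as ordered numbers.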
Model Performance
To check that the model performed well, we evaluated the ROC curve and the area under it (AUC). The AUC was 0.88. This is good: a score of 1 would mean the model separates the two classes perfectly, while 0.5 would be no better than random guessing.
We also constructed a confusion matrix for a similar evaluation, to visualize the true positives, false positives, true negatives and false negatives.
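Both evaluation measures can be illustrated with a short sketch. The labels and scores below are made up, the 0.5 threshold is an assumption, and the function names are hypothetical. The AUC is computed through its pairwise interpretation (the chance that a random positive is scored above a random negative) rather than by integrating the curve. In practice, scikit-learn's roc_auc_score and confusion_matrix cover both.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


# Made-up labels (1 = above mean salary) and model scores for illustration.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]

auc = roc_auc(y_true, scores)
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
counts = confusion_counts(y_true, y_pred)
```

The AUC does not depend on a threshold, while the confusion matrix does, which is why the two are useful together.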
Find the Jupyter Notebook for this project here.