Understanding the Impact of a PhD on Data Scientist Salaries
Chapter 1: Overview of Data Scientist Salaries
This analysis asks how much attributes such as educational background, organization size, and coding experience contribute to a Data Scientist's annual pay. It builds on earlier discussions of career paths in data professions.
We use a public Kaggle dataset that compiles salary data for Data Scientists, Analysts, and Engineers from 2017 to 2020, sourced from the Stack Overflow Annual Developer Survey. For the complete analysis, please refer to the linked Kaggle notebook.
Step 1: Data Preprocessing
The initial phase of our analysis involves several preprocessing steps:
- Selecting data from a single country (United States).
- Adjusting the compensation figures to thousands of USD per year.
- Excluding the top and bottom 5% of respondents based on compensation.
- Filtering for high cardinality in categorical features.
- Filling in missing values.
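The five steps above can be sketched with pandas. This is a minimal sketch rather than the notebook's actual code: the column names (Country, ConvertedComp, EdLevel, OrgSize), the max_levels threshold, the choice to group rare categories under "Other", and the fill strategy are all assumptions made for illustration.

```python
import pandas as pd


def preprocess(df, cat_cols=("EdLevel", "OrgSize"), comp_col="ConvertedComp",
               country_col="Country", country="United States", max_levels=20):
    """Sketch of the five preprocessing steps; all column names are assumed."""
    cat_cols = list(cat_cols)

    # 1. Keep respondents from a single country.
    out = df[df[country_col] == country].copy()

    # 2. Express compensation in thousands of USD per year.
    #    Rows without a compensation value cannot serve as training targets.
    out = out.dropna(subset=[comp_col])
    out["comp_k"] = out[comp_col] / 1000.0
    out = out.drop(columns=[comp_col, country_col])

    # 3. Drop the top and bottom 5% of respondents by compensation.
    low = out["comp_k"].quantile(0.05)
    high = out["comp_k"].quantile(0.95)
    out = out[out["comp_k"].between(low, high)].copy()

    # 4. Limit cardinality: keep the most frequent levels of each
    #    categorical feature and group the rest under "Other" (assumed).
    for col in cat_cols:
        top = out[col].value_counts().nlargest(max_levels).index
        keep = out[col].isin(top) | out[col].isna()
        out[col] = out[col].where(keep, "Other")

    # 5. Fill missing values: a "Missing" label for categorical features,
    #    the column median for numeric ones (assumed strategy).
    for col in out.columns:
        if col in cat_cols:
            out[col] = out[col].fillna("Missing")
        elif pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())

    return out.reset_index(drop=True)
```

Passing the categorical columns explicitly, instead of inferring them from dtypes, keeps the sketch independent of how a given pandas version represents string columns.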
Step 2: Developing a Predictive Model
In this step, we split the preprocessed data into training and test sets and fit a CatBoostRegressor, which handles categorical features natively. The resulting model achieves a root mean squared error (RMSE) of approximately $32,000/year, an improvement over the roughly $37,000/year RMSE of a baseline that predicts a uniform salary of $108,000/year for every respondent.
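To make the baseline comparison concrete, the following sketch computes the RMSE of a constant predictor. It assumes the constant is the mean of the training targets, which the text does not state explicitly, and the salaries are toy values rather than survey data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def baseline_rmse(y_train, y_test):
    """RMSE of a baseline that predicts the training mean for everyone."""
    mean_salary = sum(y_train) / len(y_train)
    return rmse(y_test, [mean_salary] * len(y_test))


# Toy salaries in thousands of USD per year (illustrative, not survey data).
y_train = [80, 100, 120, 132]   # mean is 108
y_test = [60, 108, 156]
print(round(baseline_rmse(y_train, y_test), 1))  # 39.2
```

Any trained model is worth keeping only if its test RMSE falls below this figure on the same test set.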
Step 3: Analyzing the Machine Learning Model
To explain our model's predictions, we apply the SHapley Additive exPlanations (SHAP) method, a widely used approach for interpreting the predictions of machine learning models. The SHAP values are reported in thousands of USD per year.
We begin by examining the distribution of SHAP values across the features of interest.
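The idea behind SHAP values can be illustrated with a brute-force Shapley computation on a toy model. This is a sketch only: the salary function, its coefficients, and the single background row are hypothetical, and it does not use the shap library, which estimates the same quantities far more efficiently for tree models and averages over a background dataset instead of one reference row.

```python
from itertools import permutations


def shapley_values(predict, x, background):
    """Exact Shapley values for row x against a single background row."""
    names = list(x)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(background)
        previous = predict(current)
        for name in order:
            # Switch one feature from its background value to its actual value
            # and credit the change in prediction to that feature.
            current[name] = x[name]
            updated = predict(current)
            phi[name] += (updated - previous) / len(orderings)
            previous = updated
    return phi


def toy_salary_model(row):
    """Hypothetical salary model in thousands of USD per year."""
    salary = 90.0 + 2.0 * row["YearsCodePro"]
    if row["PhD"]:
        salary += 8.0 + 0.5 * row["YearsCodePro"]
    return salary


background = {"YearsCodePro": 5, "PhD": 0}    # reference respondent
respondent = {"YearsCodePro": 15, "PhD": 1}   # respondent to explain

phi = shapley_values(toy_salary_model, respondent, background)
print(phi)  # {'YearsCodePro': 22.5, 'PhD': 13.0}
```

The contributions sum to the gap between the respondent's prediction (135.5) and the reference prediction (100.0), which is the additivity property that lets SHAP values be read directly in thousands of USD per year.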
The most influential factor is years of professional coding experience (the YearsCodePro variable). Respondents with extensive coding experience earn considerably more: the gap between the most and least experienced individuals is around $50,000/year.
Interestingly, respondents who also hold additional roles, such as Data Analyst, Business Analyst, or Database Administrator, tend to see lower expected yearly compensation, by as much as $10,000/year. This is likely because these positions generally pay less than Data Scientist roles. The effect does not apply to Data Engineers, whose salaries are comparable to those of Data Scientists.
The Influence of Educational Attainment
Unsurprisingly, among education levels, holding a PhD has the largest positive effect on salary. However, when we analyze SHAP values by year (2017 to 2020), the value attributed to a PhD declines. The average SHAP value for a PhD over the 2017-2020 period is $8,100/year, falling from $10,600/year in 2017 to only $5,300/year in 2020. This suggests a diminishing return on a doctoral degree for Data Scientist roles over time.
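The yearly comparison amounts to averaging the PhD SHAP values per survey year and comparing the endpoints. A minimal sketch, assuming the SHAP values are available as (year, value) pairs; the toy records are hypothetical, while 10.6 and 5.3 are the yearly averages quoted above:

```python
from collections import defaultdict


def mean_by_year(records):
    """Average (year, value) pairs per year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, value in records:
        totals[year] += value
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


def relative_change(start, end):
    """Fractional change from start to end."""
    return (end - start) / start


# Hypothetical per-respondent SHAP values for the PhD feature
# (thousands of USD per year).
toy_records = [(2017, 12.0), (2017, 9.0), (2020, 6.0), (2020, 4.0)]
print(mean_by_year(toy_records))  # {2017: 10.5, 2020: 5.0}

# Yearly averages quoted in the text, in thousands of USD per year.
print(f"{relative_change(10.6, 5.3):.0%}")  # -50%
```

By this measure, the value attributed to a PhD halved between 2017 and 2020.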
The Role of Company Size
The data reveal no significant trend in salary with organization size. The average difference between very large firms (over 5,000 employees) and very small companies (1-20 employees) is minimal, at most around $1,000/year.
Lastly, predicted salaries increase consistently by approximately $4,000 each year, a growth rate of roughly 4% per year from 2017 to 2020.
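As a rough check on the growth figure, a least-squares slope over yearly mean predictions gives the annual increase, and dividing by the average level gives the rate. The yearly means below are hypothetical values chosen to match the quoted $4,000 step and the $108,000/year baseline; they are not taken from the notebook.

```python
def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Hypothetical mean predicted salaries per year (thousands of USD per year).
years = [2017, 2018, 2019, 2020]
mean_predictions = [102.0, 106.0, 110.0, 114.0]

increase = linear_slope(years, mean_predictions)
average_level = sum(mean_predictions) / len(mean_predictions)
print(increase)                             # 4.0
print(f"{increase / average_level:.1%}")    # 3.7%
```

A $4,000 step on a level near $108,000 is about 3.7%, consistent with the rounded 4% figure.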
I hope this analysis proves beneficial. Should you have any questions or comments, feel free to reach out in the comments section below or connect with me on LinkedIn or Twitter.
Chapter 2: Salary Insights from Data Scientists
In this video, we discuss the realities of Data Scientist roles, including insights on salaries for entry-level positions.
This video compiles various salary data for Data Scientists, providing transparency on compensation across the industry.