- This document was last rendered on 2017-11-30
Authors
- To shower praise for ingenuity on the project, contact Melody Liu
- For criticism of avenues we couldn’t investigate in 4 weeks, contact Gage Sonntag
Executive Summary
- This project was produced for the Text Analytics Workshop for the Winter 2018 Masters of Management Analytics Cohort at Queen’s University.
- The goal from the outset was to apply text analytics techniques discussed in class to jobs that companies have posted on Indeed in Toronto, including some of: tokenization, data cleaning, document clustering, topic modelling, network analysis, and visualization.
Project Rationale
- An open-source project working with real-world data was desired.
- Other projects can be found scraping DS/Analytics jobs from Indeed. Typically word frequencies for keywords like Python or Hadoop are calculated
- Moving beyond that, we were interested in clustering and how the choice of words signals relationships between roles, as well as how skills relate, not just their frequency
- Job postings fit the ‘bag of words’ or n-gram approach taught in class. Not many employers say “We don’t want someone who knows Python”, so a mention of a skill almost always signals demand for it.
Gathering Data
- Beautiful Soup & Selenium were used in Python to access Indeed and scrape unsponsored job titles, companies, and postings
- 1800 jobs were scraped from 9 search terms we believed captured the jobs most MMA students are pursuing.
- Jobs were passed from Python to R using Feather.
- After removing duplicates, our search returned 636 unique jobs.
- Considerable data cleaning is required to get to something easy to analyze. This includes stripping remaining HTML from our text and removing both custom low-value words and words too common in job postings.
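The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code we ran: the word lists `COMMON_WORDS` and `JOB_BOILERPLATE` and the function names are hypothetical placeholders, and the real lists were longer.

```python
import re
from html.parser import HTMLParser

# Hypothetical word lists for illustration; the project's real lists differ.
COMMON_WORDS = {"the", "and", "a", "to", "of", "in", "for", "with", "is", "our"}
JOB_BOILERPLATE = {"apply", "description", "job", "position", "candidate"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text of an HTML fragment with the tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def clean_posting(raw):
    """Strip HTML, lowercase, tokenize, and drop low-value words."""
    text = strip_html(raw).lower()
    # Keep tokens such as "r", "sql", or "c++"; punctuation is discarded.
    tokens = re.findall(r"[a-z][a-z0-9+#]*", text)
    drop = COMMON_WORDS | JOB_BOILERPLATE
    return [t for t in tokens if t not in drop]
```

Keeping single-letter tokens matters here, since a cleaner that drops short words would silently remove every mention of R.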
Exploratory Data Analysis
- We expect 200 jobs for each search term, and we remove duplicate jobs in the order the terms were searched.
- Interestingly, searching 200 jobs in analytics returns only about half unique jobs, so by the time you reach page 10, you are seeing very few new postings.
- As we search overlapping terms, such as data scientist and data insights, fewer and fewer unique jobs are returned.
- Even so, each additional search term returns a surprising number of new jobs. A reasonable number are shown for machine learning that were not found for data scientist or analytics, even though these are overlapping fields.
- Business intelligence and marketing analytics seem to be orthogonal to the other search terms, returning relatively more unique jobs.
- The job landscape is currently dominated by data scientists, a title that has become a catch-all. But it’s encouraging to see machine learning engineer and developer roles begin to be fleshed out.
- Analytics is surprisingly absent, but is likely wrapped into titles like “Manager, Analytics”, which are titled less consistently. Let’s take a closer look at where our Analytics jobs are.
- The titles returned by these searches appear less consistent than titles like Data Scientist.
- This seems to resonate with what the Toronto job environment is as a whole: consulting, banking, telecom, and a smattering of retail.
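The order-dependent de-duplication described in this section can be sketched as below. This is an assumption-laden illustration: `job_id` is a hypothetical key (for example, title plus company), and the function name is ours, not from the project code.

```python
def new_jobs_per_term(results):
    """Count jobs first seen under each search term.

    `results` is a list of (search_term, job_id) pairs in the order they
    were scraped. A job is credited only to the first term that returned it.
    """
    seen = set()
    new_counts = {}
    for term, job_id in results:
        new_counts.setdefault(term, 0)
        if job_id not in seen:
            seen.add(job_id)
            new_counts[term] += 1
    return new_counts


# Toy data, invented for illustration only.
scraped = [
    ("analytics", "a1"),
    ("analytics", "a2"),
    ("analytics", "a1"),       # duplicate within the same search
    ("data scientist", "a2"),  # already seen under "analytics"
    ("data scientist", "d1"),
    ("machine learning", "d1"),
]
print(new_jobs_per_term(scraped))
```

Because duplicates are removed in search order, the count for a later term depends on which terms were searched before it, which is why overlapping terms searched late appear to add few new jobs.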
A Word Frequency Approach
- The boilerplate at the end of each job posting, which encourages people to apply and discusses company accolades and culture, distorts our analysis. Let’s spend some time cleaning up job-specific words and HTML-related language.
- We’ve removed most of the job-specific language, such as apply and description, along with words that don’t signal much about what the job is. From a frequency approach, there isn’t a lot to be gleaned.
- Some words are mentioned in every posting. Analytics as a search term appeared to have proportionally more management oriented positions.
- Let’s see if our bi-grams have more signal.
- This is more encouraging than our unigrams. We have some domain-specific phrases, like mental health and real estate, but also communication skills and problem solving, which straddle the hard and soft skills often critical to success in analytics and data science.
- Some of these phrases may be concentrated in a small number of job postings. For example, digital marketing may be mentioned many times in one posting, referring to the job title, department, and responsibilities. Let’s count each phrase at most once per posting and see more of the breadth of mentions.
- This begins to give a more accurate assessment of what employers mention. Some of these highlight more useful skills that were drowned out by more frequent mentions, such as project management or software engineering, both useful skills for data scientists and analysts.
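The difference between raw bi-gram counts and counting a phrase only once per posting can be sketched as follows. This is a minimal illustration with hypothetical function names, and it assumes each posting has already been cleaned into a list of tokens.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs from one posting."""
    return list(zip(tokens, tokens[1:]))


def bigram_counts(postings, once_per_posting=False):
    """Count bi-grams across postings.

    With once_per_posting=True, a phrase repeated inside one posting
    counts once, so the totals are numbers of postings, not mentions.
    """
    counts = Counter()
    for tokens in postings:
        grams = bigrams(tokens)
        if once_per_posting:
            grams = set(grams)
        counts.update(grams)
    return counts


# Toy postings, invented for illustration only.
postings = [
    ["digital", "marketing", "digital", "marketing", "team"],
    ["project", "management", "digital", "marketing"],
]
raw = bigram_counts(postings)
per_posting = bigram_counts(postings, once_per_posting=True)
print(raw[("digital", "marketing")], per_posting[("digital", "marketing")])
```

In the toy data, the repeated phrase falls from three mentions to two postings, while a phrase mentioned once per posting keeps its count.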
A Skills Based Approach
- Typically when you see projects like this done, people look for some analytics or data science skills and count the occurrences. We want to go beyond that, but let’s first examine the landscape for analytical skills in Toronto.
- Our list is a few dozen unigram skills that we believe capture the technologies used across analytics and data science. Broadly they’ll get classified as Big Data, Data Analysis and Visualization to capture the analysis and communication of results, as well as the unique tools for cloud & distributed computing.
- This seems to suggest Excel, R, and SQL are in high demand. Let’s examine how interrelated these skills are.
- Are the same jobs looking for R, Excel, and SQL?
- How many of these skills are required for different jobs?
- For the skills we have selected, analytics and data scientist postings have long tails. These are likely associated with the similarity between the big data tools we selected (Hive, Scala, Spark, etc.), but also suggest companies are casting a wide net in terms of people’s experience.
- For the words we selected, many jobs in marketing analysis and business intelligence don’t seem to leverage them as much as other positions.
- Let’s see how these skills get mentioned together.
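The skill counts in this section can be sketched as below. The skill names come from this report, but the assignment of each skill to a category in `SKILL_GROUPS` is our assumption for illustration, and the real list contained a few dozen skills.

```python
from collections import Counter

# Hypothetical grouping: the category of each skill is assumed here.
SKILL_GROUPS = {
    "Big Data": {"hadoop", "spark", "hive", "scala", "aws"},
    "Data Analysis": {"r", "python", "sql", "sas", "spss", "excel"},
    "Visualization": {"tableau", "qlik", "microstrategy", "powerpoint"},
}
ALL_SKILLS = set().union(*SKILL_GROUPS.values())


def skills_in_posting(tokens):
    """Return the set of listed skills mentioned in one tokenized posting."""
    return ALL_SKILLS & set(tokens)


def postings_per_skill(postings):
    """Count how many postings mention each skill at least once."""
    counts = Counter()
    for tokens in postings:
        counts.update(skills_in_posting(tokens))
    return counts


def skills_per_posting(postings):
    """Return the number of distinct listed skills in each posting."""
    return [len(skills_in_posting(tokens)) for tokens in postings]
```

`postings_per_skill` answers which skills are in demand, while `skills_per_posting` gives the distribution whose long tail is discussed above.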
A Network Diagram of Skills
- The network diagram shows a few interesting groupings, with darker lines representing more strongly correlated words. A line between two words represents a likelihood of being mentioned together in the same job.
- Excel and PowerPoint don’t seem correlated with the rest of our tech stack, despite the frequent mentions of Excel (which presumably are the noun and not the verb).
- Traditional Analytics - R, SAS, and SPSS seem inter-related.
- Big Data - Python leveraging Hadoop, AWS, Scala, and Spark. Interestingly, R is not the language of big data despite some support from Spark.
- BI/Data Viz - Tableau, MicroStrategy, and Qlik supported by SQL.
- The most frequent words, R, SQL, and Excel, no longer seem as interrelated.
- Let’s look at clustering our data set, to see if these groups are also represented when we cluster on all the words in the posting.
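One way to compute edge weights for a skill network like this is the phi coefficient between two skills across postings. The sketch below is an illustration of that idea and not necessarily the measure behind our diagram; the function names and the `min_phi` threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def phi(postings, a, b):
    """Phi coefficient between skills a and b.

    `postings` is a list of sets, each holding the skills in one posting.
    """
    n11 = sum(1 for s in postings if a in s and b in s)
    n10 = sum(1 for s in postings if a in s and b not in s)
    n01 = sum(1 for s in postings if a not in s and b in s)
    n00 = len(postings) - n11 - n10 - n01
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A skill present in all or none of the postings has no variance.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


def skill_edges(postings, min_phi=0.2):
    """Return (skill, skill, weight) edges at or above min_phi, strongest first."""
    skills = sorted(set().union(*postings))
    edges = []
    for a, b in combinations(skills, 2):
        weight = phi(postings, a, b)
        if weight >= min_phi:
            edges.append((a, b, weight))
    return sorted(edges, key=lambda e: -e[2])
```

A correlation measure like this explains why frequent skills can look unrelated: two skills that each appear in many postings, but rarely in the same ones, get a low or negative weight and no edge.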
Conclusion
- While employers demand a variety of technical skills, it’s notable that softer skills are also important. The role of analytics in an organization is not only to generate insight but also to communicate it.
- R, SQL, and Excel are the most in-demand tools in Toronto, but not in the same roles.
- Distinct groupings could be seen for skillsets in conventional analytics tools, data visualization & dashboarding, and the big data tech stack.