There are plenty of Python tutorials that show you how to write a Beautiful Soup script that pulls jobs from sites ranging from Indeed to LinkedIn, but not many for R. So I decided to show that R has the same ability as Python (and is, I think, a little more intuitive). I’ve chosen to scrape a UK job website, https://www.reed.co.uk/, based on this Towards Data Science article that shows how to do it in Python (You can find the script here). Reed.co.uk is a job vacancy website similar to Indeed.
Where to Begin
When I first began to scrape the web, I would blindly flail about pulling in this or that until I got the information that I needed. This was an inefficient way of scraping websites both from a time and a resource perspective. The best way, and I’m sure countless others will agree, is to start by making a plan. I start with minimal exploratory analysis and then, depending on the content of the site, come up with ideas for the data I want to collect.
EDA
Let’s do the most basic thing and look at the website in question. Here’s the link: https://www.reed.co.uk/jobs/. I used the ‘what’ search area to type in ‘data scientist’ and the ‘where’ search area to type in ‘London’. The URL now carries both search terms, combined into a single path segment, and it looks like ‘https://www.reed.co.uk/jobs/data-scientist-jobs-in-london’.
In the sidebar, I also clicked on ‘full-time positions’ and ‘within 50 miles’. This added two query parameters and changed the URL to ‘https://www.reed.co.uk/jobs/data-scientist-jobs-in-london?fulltime=True&proximity=50’.
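The URL pattern described above can also be assembled in code rather than copied from the browser. Here is a minimal sketch in base R. `build_search_url()` is a hypothetical helper of my own naming, not part of any package, and it assumes the site keeps the `<what>-jobs-in-<where>` path and the `pageno`, `fulltime`, and `proximity` parameters that appear in this article.

```r
# Hypothetical helper: builds a Reed search URL from the pieces seen in the browser.
# Assumes the path pattern "<what>-jobs-in-<where>" and the query parameters
# "pageno", "fulltime" and "proximity" used elsewhere in this article.
build_search_url <- function(what, where, page = 1, fulltime = TRUE, proximity = 50) {
  # Turn "Data Scientist" into "data-scientist"
  slug <- function(x) gsub("\\s+", "-", tolower(trimws(x)))
  path <- paste0(slug(what), "-jobs-in-", slug(where))

  # Only add the full-time filter when it is requested
  query <- c(
    paste0("pageno=", page),
    if (fulltime) "fulltime=True",
    paste0("proximity=", proximity)
  )

  paste0("https://www.reed.co.uk/jobs/", path, "?", paste(query, collapse = "&"))
}

search_url <- build_search_url("data scientist", "London", page = 2)
print(search_url)
```

Keeping the search terms as function arguments makes it easy to rerun the same scrape for another job title or city later on.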
Now that we have the URL we are going to pull information from, let’s come up with a plan for what is on the page and what we want to pull.
Planning
I ultimately want to analyze the positions and see which skills are needed for a Data Scientist position in the UK and I want to see which positions pay the most. I came up with a list of what I needed and I added why I wanted those items:
- Description: I want the job description so that I can use some basic NLP to show the skills that are needed for each position.
- Position: Sometimes the job is Data Science or Machine Learning Engineer, etc. I want to be able to see if there is a difference in the job title, skills, and pay.
- Posted: I want to see when the job was created so that I can get a rough idea of the job market. My idea is that a job that has been posted for longer has high demand and low supply.
- Salary: I am curious about how this changes with titles and skills.
- Location: I want to see whether the jobs are in London or the suburbs (I’ve set the distance for within 50 miles of London).
- Contract: I want to see if there is a difference in salary between permanent and temporary positions.
- Company: I want to know who is hiring and if that affects salary.
- Company Type: I want to know if the salary differs between a direct hiring company and a recruiter.
- Industry: I want to see if there is a difference in salary among different industries.
- URL: In case any issues arise, I will be able to plug in the URL and see if the page differs from any others.
Testing
I immediately noticed that there was going to be an issue with my dream list: on the main page that lists all the jobs, the descriptions are cut off. The only way to get the full description is to go to each job listing page. This is quite simple to handle, though. The first task is to scrape the URLs for all the individual job pages from the job listings.
Getting the URLs
Luckily, Reed.co.uk uses a URL pattern that can easily be scraped. As you can see in the screenshot above, each job page is identified by a unique number. On the job listing page, there is a link to all these pages, so we can just scrape the URL from each listing.
The first thing to do is open your browser’s developer tools, which differ from browser to browser. Personally, I use Chrome. Chrome has an inspect tool that lets you click on any element of the page. When I use that and click on the title, it looks like this:
On the left (highlighted in blue), it gives information about what you are seeing, while on the right (highlighted in grey) it gives the code. For the task of getting URLs, we can see that the link has an attribute called ‘data-id’. This is the URL parameter for each job, so we simply need to scrape that attribute. However, how will we know how many jobs/pages to scrape?
After looking around, I managed to find two pieces of information in the code that show us how many jobs there are and how to scrape them. First, the page number is dictated by a URL parameter, and the rest of the URL stays the same. Second, the job count is listed on the webpage (see screenshot below).
So what we need to do is pull that text in and create two variables: one for the job we are currently up to and one for the total number of jobs.
First, we need to load our packages. rvest is installed along with the Tidyverse, but library("tidyverse") does not attach it, so we load it explicitly.
library("tidyverse")
library("rvest")
Next, we want to pull the html webpage into R.
reed_data <- read_html("https://www.reed.co.uk/jobs/data-scientist-jobs-in-london?fulltime=True&proximity=50")
If you want to see what the website looks like in R, simply use the html_text() function.
reed_data %>% html_text()
Now we are going to get the job count and total jobs. Basically, we are going to use html_nodes() to identify the text that we want, then split it on spaces so it’s in a list. Then from that list, we are going to extract the two elements we need. I found their positions (27 and 29) by playing around and printing the list.
job <- reed_data %>%
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="page-counter"]') %>%
  html_text() %>%
  strsplit(" ")
current_job <- as.numeric(job[[1]][27])
total_job <- as.numeric(job[[1]][29])
paste('On this page there are', current_job, 'jobs out of a total of', total_job, "jobs")
After running this, the code will return “On this page there are 25 jobs out of a total of 581 jobs” (the total will depend on how many jobs are listed that day).
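The positions 27 and 29 depend on the exact whitespace in the counter text, so they can break if the page layout shifts. A less fragile option is to pull the numbers out with a regular expression. The sketch below uses base R only. The `counter_text` string is an assumed example of what html_text() might return, not a copy of the real page, so check the wording on the live site before relying on it.

```r
# Assumed example of the counter text after html_text(); the real string
# may be worded or spaced differently.
counter_text <- "\n      Showing 1 - 25 of 581 jobs\n    "

# Pull out every run of digits (allowing thousands separators such as 1,234)
numbers <- regmatches(counter_text, gregexpr("[0-9][0-9,]*", counter_text))[[1]]
numbers <- as.numeric(gsub(",", "", numbers))

# Under the assumed wording, the second number is the last job shown on the
# page and the final number is the total.
current_job <- numbers[2]
total_job <- numbers[length(numbers)]

paste("On this page there are", current_job, "jobs out of a total of", total_job, "jobs")
```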
The next thing to do is build a loop that will go through every page and extract the ‘data-id’ we identified above. For this, we will run a loop that checks to see if we have all the jobs, and if not, go to the next page.
job_url <- c()  # will hold the data-id of every job we find
n_page <- 1
while (current_job < total_job) {
  # This will concatenate the url depending on the page
  p <- str_c('https://www.reed.co.uk/jobs/data-scientist-jobs-in-london?pageno=', n_page, '&fulltime=True&proximity=50', sep = "")
  URL_p <- read_html(p)
  # This will get the data-id of every job link on the page
  url <- URL_p %>%
    html_nodes("div") %>%
    html_nodes(xpath = 'a') %>%
    html_attr('data-id')
  # Links without a data-id come back as NA, so drop them
  url <- url[!is.na(url)]
  # This appends the data together
  job_url <- append(job_url, url)
  # This gets the new job count and changes current job to that number
  job <- URL_p %>% html_nodes("div") %>% html_nodes(xpath = '//*[@class="page-counter"]') %>% html_text() %>% strsplit(" ")
  current_job <- as.numeric(job[[1]][27])
  # This tells us to go to the next page
  n_page <- n_page + 1
}
paste("There are now", current_job, "jobs out of a total of", total_job, "jobs")
Once run, it will let you know that “There are now 581 jobs out of a total of 581 jobs”, and ‘job_url’ will be a character vector of all the URL parameters. When this step is complete, the next task is to scrape the information that we need.
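Because the first page reported 25 jobs out of 581, the number of result pages can also be worked out up front instead of being discovered inside a while loop. This is only a sketch of that arithmetic in base R, assuming every page holds 25 jobs and that the URL parameters stay as shown in this article.

```r
total_job <- 581  # total reported by the page counter in the example run
per_page  <- 25   # jobs per results page, as seen on the first page

# Number of result pages needed to cover every job
n_pages <- ceiling(total_job / per_page)

# One URL per results page, using the same parameters as before
page_urls <- paste0(
  "https://www.reed.co.uk/jobs/data-scientist-jobs-in-london?pageno=",
  seq_len(n_pages),
  "&fulltime=True&proximity=50"
)

print(n_pages)
print(page_urls[n_pages])
```

With the page URLs in a vector, you can loop over them with a plain for loop and add a short Sys.sleep() between requests to go easy on the site.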
Gathering the Information
First off, we have to look at the code for a jobs page to see what we can even scrape. As an example, I’ll show the process I took to find what I needed to pull in the description.
When I went into the developer tools, I used the inspect tool to click on as large a section of the description as I could. If you are following along, it should look like this:
But I don’t want all that information. I only want the information below the box where it says ‘Data Scientist’. So in the developer view, I clicked through the ‘div’ elements until I found one in the general area that I wanted. In this case it was easy: it was ‘ div class=“description” ’.
Then all we need to do is test it out on a URL for one of the job pages (this might not exist by the time you are running it).
URL_test = read_html('https://www.reed.co.uk/jobs/data-scientist/42066794')
# Let's get the description
Desc <- URL_test %>% html_nodes("[itemprop='description']") %>%
html_text()
Desc
The results will look like:
“Data Scientist £45,000 — £70,000 London/ UK based Remote Are you a Data Science consultant with a passion for health care and having a positive impact? Do you want to work for a data-driven company that advises the NHS on strategy on various different aspects of the COVID-19 pandemic? THE COMPANY: As a Data Scientist, you will work in an established management consultancy in the health-care space. They use cutting-edge AI technology to understand client requirements and build models to provide solutions to help sustain change. The team of data scientists and data engineers is undergoing big expansion plans. THE ROLE: The role of Data Scientist will involve developing predictive models and software tools to advise on strategy. In specific, you can expect to be involved in the following: You will be building predictive models to help clients understand and tackle big problems in health careYou will be working with large data sets to train models and develop actionable insightsYou will be interacting with stakeholders and clients, with varying technical skills, to explain insights and analysisYou will be building machine learning and deep learning algorithms YOUR SKILLS AND EXPERIENCE: The successful Data Scientist will have the following skills and experience: Educated to PhD/ MSc level in a relevant disciple (Statistics, Computer Science, Mathematics etc)Commercial experience as a machine learning engineer/ data scientist in a consultancyExperience working in or a genuine interest in health care or bioinformatics is a bonusPython, AWS, AzureTHE BENEFITS: The successful Data Scientist will receive a salary, dependent on experience of up to £70,000 and benefits. HOW TO APPLY: Please register your interest by sending your CV to Francesca Curtis via the Apply link on this page.”
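If you want to experiment with selectors without hitting the live site (or after the example job page has expired), you can parse a small HTML string instead. The fragment below is made up for illustration and only imitates the attributes used in this article. It is not copied from a real Reed page.

```r
library(rvest)

# Made-up HTML fragment that imitates the attributes used in this article;
# it is not copied from the real page.
sample_html <- '<html><body>
  <div itemprop="description"><p>Build predictive models.</p></div>
  <span data-qa="salaryLbl">  45,000 to 70,000 per annum</span>
</body></html>'

page <- read_html(sample_html)

# CSS attribute selectors are written the same way as for the live page
desc <- page %>% html_nodes("[itemprop='description']") %>% html_text()
sal  <- page %>% html_nodes("[data-qa='salaryLbl']") %>% html_text() %>% trimws()

print(desc)
print(sal)
```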
It works! The next step is to identify the other points of data we want to scrape. I am only going to show you one more example because the rest are fairly easy to figure out. I’m going to show you how to get ‘Company Type’ because it is in Javascript and is not easy to get with the rvest tools. You will need to use regex to get it.
To find the hidden elements in Javascript, it helps to know a little backend development. All you really need to know is that some websites use Javascript to push information from a database to the browser. This makes it easy to create pages automatically, because one template can receive the information for every page.
In this particular place, you can see that the information lives in Javascript (it’s actually pushed to the webpage but for the sake of this tutorial I wanted to show how to pull in Javascript).
The code will look like this:
URL_test = read_html('https://www.reed.co.uk/jobs/data-scientist/42066794')
# Let's get the company type. Since it is in the Javascript, we need to use regex to extract the value
Compt <- URL_test %>% str_extract("(jobRecruiterType: )'(\\w+\\s\\w+\\s\\w+|\\w+\\s\\w+|\\w+|\\s)") %>%
  str_extract("(?<=\\')\\D+")
Compt
The results will look like:
“Recruitment consultancy”
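The same idea can be wrapped in a small function so that one pattern serves every Javascript key. This is a sketch in base R under a few assumptions: the value sits between single quotes right after the key and a colon, `extract_js_value()` is a hypothetical helper name, and `page_text` is an invented stand-in for the page source rather than the real thing.

```r
# Hypothetical helper: returns the single-quoted value that follows "key:"
# in a block of text, or NA when the key is not present.
extract_js_value <- function(page_text, key) {
  pattern <- paste0(key, ":\\s*'([^']*)'")
  match <- regmatches(page_text, regexec(pattern, page_text))[[1]]
  # match[1] is the whole match, match[2] is the captured value
  if (length(match) < 2) NA_character_ else match[2]
}

# Invented stand-in for the Javascript found in the page source
page_text <- "dataLayer.push({ jobRecruiterType: 'Recruitment consultancy', jobKnowledgeDomain: 'Health' });"

company_type <- extract_js_value(page_text, "jobRecruiterType")
industry <- extract_js_value(page_text, "jobKnowledgeDomain")
print(company_type)
print(industry)
```

Capturing everything up to the closing quote means the value can have any number of words, so there is no need to list one, two, and three word alternatives in the pattern.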
Now that you have learned how to pull in both Javascript and HTML, I’ll show you what my code looks like to scrape all the pages. Basically, we are creating a loop that goes through the list of job pages (the data-ids we extracted above) and, on each of those pages, scrapes the information we want and adds it to a dataframe.
For some computers, this will take upwards of 10 mins. I would make sure you are checking it on one URL before you run the loop on all the pages.
all_jobs <- NULL  # rows are added to this on every pass through the loop
for (i in unique(job_url)) {
  p <- str_c('https://www.reed.co.uk/jobs/data-scientist/', i, sep = "")
  URL_p <- read_html(p)
  # Let's get the description
  Desc <- URL_p %>% html_nodes("[itemprop='description']") %>%
    html_text()
  Desc <- str_trim(Desc, side = "left")
  # Let's get the position
  Pos <- URL_p %>% html_node("title") %>%
    html_text()
  # Let's get the posted date
  Post <- URL_p %>% html_nodes("[itemprop='datePosted']") %>%
    html_attr('content')
  # Let's get the salary
  Sal <- URL_p %>% html_nodes("[data-qa='salaryLbl']") %>%
    html_text()
  Sal <- str_trim(Sal, side = "left")
  # Let's get the location
  Loc <- URL_p %>% html_nodes("[data-qa='regionLbl']") %>%
    html_text()
  # Let's get the contract
  Cont <- URL_p %>% html_nodes("[data-qa='jobTypeMobileLbl']") %>%
    html_text()
  # Let's get the company name
  Comp <- URL_p %>% html_nodes(css = "[itemprop='hiringOrganization']") %>%
    html_nodes(css = "[itemprop='name']") %>%
    html_text()
  Comp <- str_trim(Comp, side = "left")
  # Let's get the company type. Since it is in the Javascript, we need to use regex to extract the value
  Compt <- URL_p %>% str_extract("(jobRecruiterType: )'(\\w+\\s\\w+\\s\\w+|\\w+\\s\\w+|\\w+|\\s)") %>%
    str_extract("(?<=\\')\\D+")
  # Let's get the Industry. Since it is in the Javascript, we need to use regex to extract the value
  Ind <- URL_p %>% str_extract("(jobKnowledgeDomain: )'(\\w+\\s\\w+\\s\\w+|\\w+\\s\\w+|\\w+|\\s)") %>%
    str_extract("(?<=\\')\\D+")
  url <- p
  # Combine the fields into one row and stack it on top of the rows so far
  temp <- c(Desc, Pos, Post, Sal, Loc, Cont, Comp, Compt, Ind, url)
  all_jobs <- rbind(temp, all_jobs)
}
paste("Your dataframe has been built")
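Once it is done running, ‘all_jobs’ holds one row per job. Because rbind() on plain vectors produces a character matrix, you will want to convert it with as.data.frame() and add column names, but otherwise, it is good to go!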
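One thing to watch for when naming the columns: if a selector matches nothing on a page, it returns an empty vector, and c() silently drops it, so that row ends up shorter than the others. A minimal sketch of one way to guard against this in base R is below. `first_or_na()` is a hypothetical helper, and the `scraped` list is invented data standing in for two scraped job pages.

```r
# Hypothetical helper: return the first scraped value, or NA when the
# selector matched nothing on the page.
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[1]

# Invented results for two jobs; the second page had no salary node.
scraped <- list(
  list(Pos = "Data Scientist", Sal = "45,000 - 70,000", Comp = "Company A"),
  list(Pos = "Machine Learning Engineer", Sal = character(0), Comp = "Company B")
)

# Build a one-row data frame per job, with the column names set up front
rows <- lapply(scraped, function(job) {
  data.frame(
    position = first_or_na(job$Pos),
    salary   = first_or_na(job$Sal),
    company  = first_or_na(job$Comp),
    stringsAsFactors = FALSE
  )
})

# Stack the rows into a single data frame
all_jobs <- do.call(rbind, rows)
print(all_jobs)
```

Collecting the rows in a list and binding them once at the end also avoids growing the table on every pass through the loop.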
N.B. There is an issue when pulling in jobs that are ‘featured’ that I cleaned up when looking at the data. You could challenge yourself and add a feature that does not pull in any featured pages or just do what I did. Either way, happy scraping!
-Josh