Shortest path to a functional webscraper using Selenium + Nokogiri + Rails + Heroku
February 27, 2022
1. Create a new Rails 7 application:
rails new webscraper --database=postgresql
2. In the Gemfile, move the following gems out of the test group, then run bundle install:
gem 'selenium-webdriver'
gem 'webdrivers'
3. Add Linux as a supported platform in your Gemfile.lock. This is necessary for the Heroku deployment:
bundle lock --add-platform x86_64-linux
4. Add a new file at ./app/scrapers/scraper.rb with the following code:
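require "selenium-webdriver"

class Scraper
  def scrape
    # In production, take the Chrome binary path from the GOOGLE_CHROME_BIN environment variable.
    Selenium::WebDriver::Chrome.path = ENV["GOOGLE_CHROME_BIN"] if Rails.env.production?

    # Start a headless Chrome session.
    arguments = %w[--headless --no-sandbox --disable-gpu]
    options = Selenium::WebDriver::Chrome::Options.new(args: arguments)
    driver = Selenium::WebDriver.for(:chrome, options: options)

    # Load the JavaScript-rendered page, then hand the resulting HTML to Nokogiri.
    driver.get("https://quotes.toscrape.com/js/")
    doc = Nokogiri::HTML(driver.page_source)
    doc.css(".quote").each do |quote|
      puts quote.content
    end
  ensure
    # Always close the browser, even if scraping raises an error.
    driver&.quit
  end
end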
5. Create a rake task at ./lib/tasks/scraper.rake (Rails only loads task files with the .rake extension) with the following code:
namespace :scraper do
  desc "Scrape"
  task scrape: :environment do
    scraper = Scraper.new
    scraper.scrape
  end
end
6. Test the scraper locally:
rake scraper:scrape
7. Commit your changes:
git add -A
git commit -m "initial"
8. Next, create a new Heroku app and add the following buildpacks:
heroku create
heroku buildpacks:add --index 1 heroku/ruby
heroku buildpacks:add --index 2 heroku/chromedriver
heroku buildpacks:add --index 3 heroku/google-chrome
9. After deploying, test the scraper in production:
heroku run rake scraper:scrape
10. To run the scraper as a recurring job, set up a Heroku Scheduler job that runs rake scraper:scrape.