Shortest path to a functional webscraper using Selenium + Nokogiri + Rails + Heroku
February 27, 2022
- Create a new Rails 7 application:
$ rails new webscraper --database=postgresql
- Move the following gems outside of the test group and bundle:
gem 'selenium-webdriver' gem 'webdrivers'
- Run the following to add Linux as a supported platform within your Gemfile.lock, this is necessary to support the Heroku deployment:
$ bundle lock --add-platform x86_64-linux
- Add a new file under
./app/scrapers/scraper.rb
with the following code:class Scraper def scrape require "selenium-webdriver" Selenium::WebDriver::Chrome.path = ENV["GOOGLE_CHROME_BIN"] if Rails.env.production? arguments = %w[--headless --no-sandbox --disable-gpu] options = Selenium::WebDriver::Chrome::Options.new(args: arguments) driver = Selenium::WebDriver.for(:chrome, options: options) driver.get("https://quotes.toscrape.com/js/") doc = Nokogiri::HTML(driver.page_source) doc.css('.quote').each do |link| puts link.content end driver.quit end end
- Next create a rake task under
./lib/tasks/scraper.rb
with the following code:namespace :scraper do desc "Scrape" task scrape: :environment do scraper = Scraper.new scraper.scrape end end
-
Run
$ rake scraper:scrape
and test that the scraper is functioning locally. - Commit your changes:
git add -A git commit -m "initial"
- Next create a new Heroku app and add the following build packs:
$ heroku create $ heroku buildpacks:add --index 1 heroku/ruby $ heroku buildpacks:add --index 2 heroku/chromedriver $ heroku buildpacks:add --index 3 heroku/google-chrome
- After deploying, run the following to test that the scraper is functioning in production:
heroku run rake scraper:scrape
- If running the scraper as a recurring job, set up a new job using Heroku Scheduler: https://devcenter.heroku.com/articles/scheduler