Kenneth Transier

Shortest path to a functional webscraper using Selenium + Nokogiri + Rails + Heroku

February 27, 2022

Create a new Rails 7 application:

 $ rails new webscraper --database=postgresql

In the Gemfile, move the following gems out of the test group so they load in every environment, then run bundle install:
```
gem 'selenium-webdriver'
gem 'webdrivers'
```
Run the following to add Linux as a supported platform in your Gemfile.lock. This is necessary for the Heroku deployment:
```
$ bundle lock --add-platform x86_64-linux
```

Add a new file at ./app/scrapers/scraper.rb with the following code. Rails autoloads every directory under app/, so no extra configuration is needed:

require "nokogiri"
require "selenium-webdriver"

class Scraper
  def scrape
    # In production, use the Chrome binary path from GOOGLE_CHROME_BIN.
    Selenium::WebDriver::Chrome.path = ENV["GOOGLE_CHROME_BIN"] if Rails.env.production?

    arguments = %w[--headless --no-sandbox --disable-gpu]
    options = Selenium::WebDriver::Chrome::Options.new(args: arguments)
    driver = Selenium::WebDriver.for(:chrome, options: options)

    begin
      driver.get("https://quotes.toscrape.com/js/")

      # Parse the JavaScript-rendered HTML and print each quote block.
      doc = Nokogiri::HTML(driver.page_source)
      doc.css(".quote").each do |quote|
        puts quote.content
      end
    ensure
      # Always close the browser, even if scraping raises an error.
      driver.quit
    end
  end
end

Next, create a rake task at ./lib/tasks/scraper.rake (Rails only loads task files with the .rake extension) with the following code:

namespace :scraper do
  desc "Scrape"
  task scrape: :environment do
    scraper = Scraper.new
    scraper.scrape
  end
end

Run $ rake scraper:scrape to confirm that the scraper works locally. It should print the text of each quote to the terminal.
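
If the task prints nothing, one possible cause is that the page had not finished rendering when page_source was read. selenium-webdriver provides Selenium::WebDriver::Wait for this situation. The sketch below is only a dependency-free illustration of the same polling idea: wait_until, its parameters, and the stand-in condition are hypothetical and not part of the original scraper.

```ruby
require "timeout"

# Hypothetical helper: calls the block repeatedly until it returns a truthy
# value, or raises Timeout::Error once `timeout` seconds have passed.
def wait_until(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise Timeout::Error, "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for a browser check: becomes true on the third poll.
attempts = 0
ready = wait_until(timeout: 2, interval: 0.05) do
  attempts += 1
  attempts >= 3
end
puts "ready=#{ready} after #{attempts} attempts"
```

In the scraper, the block could be a check such as driver.find_elements(css: ".quote").any?, called after driver.get and before reading driver.page_source.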
Commit your changes:
```
git add -A
git commit -m "initial"
```

Next, create a new Heroku app and add the following buildpacks:

$ heroku create
$ heroku buildpacks:add --index 1 heroku/ruby
$ heroku buildpacks:add --index 2 heroku/chromedriver
$ heroku buildpacks:add --index 3 heroku/google-chrome

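Deploy by pushing your commit to the heroku remote that heroku create added (for example, $ git push heroku main, assuming your local branch is named main). After deploying, run the following to test that the scraper is functioning in production: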
```
heroku run rake scraper:scrape
```
To run the scraper on a recurring schedule, set up a job with Heroku Scheduler that runs rake scraper:scrape: https://devcenter.heroku.com/articles/scheduler