📕
SEIRFX
  • Introduction
  • About These Notes
  • Schedule
  • Unit 2
    • Node
      • Internet Fundamentals
      • Full-Stack Fundamentals
      • Intro to Node
      • Node Modules
      • Node Packages
    • Express
      • Intro to Express
      • Routes
      • Routes Lab
      • Views
      • Templates
      • Layouts & Controllers
    • CRUD & REST
      • GET & POST
      • GET & POST Lab
      • PUT & DELETE
    • API Calls in Express
      • Axios
      • Request (no longer maintained)
    • Sequelize
      • Terminology
      • Setup
      • Using Models
      • Seeding Data
      • Validations and Migrations
      • Resources
      • 1:M Relationships
      • N:M Relationships
    • Express Authentication
      • Research Components
      • Code Components
      • Auth in Theory
        • Sessions
        • Passwords
        • Middleware
        • Hooks
      • Auth in Practice
        • Create the User
        • User Signup
        • Sessions
        • User Login
        • Authorization and Flash messages
  • Development Workflow
    • Command Line
      • The Terminal
      • Filesystem Navigation
      • File Manipulation
      • Additional Topics
    • Intro to Git
      • Version Control
      • Local Git
      • Remote Git
      • Git Recipes
    • Group Collaboration
      • Git Workflows
      • Project Roles and Tools
    • VS Code Tips & Tricks
  • HTML/CSS
    • HTML
    • CSS Selectors
    • CSS Box Model and Positioning
      • Box Model
      • Display and Positioning
      • Flexbox
      • Grid
      • Flexbox & Grid Games
      • Floats and Clears
      • Additional Topics
    • Advanced CSS
      • Responsive Design
      • Pseudo-Classes/Elements
      • Vendor Prefixes
      • Custom Properties
      • Additional Topics
    • Bootstrap
    • CSS Frameworks
    • Accessibility
  • JavaScript
    • Primitives
    • Arrays
    • Objects
      • Objects Lesson
      • Objects quick guide
      • Object-ception
    • Control Flow
      • Boolean Expressions
      • Conditionals
      • Loops
      • Promises
    • Functions
      • Scope
      • Callbacks
      • Higher Order Functions
      • Callbacks Review Lab
      • Timing Functions
      • Iterators
      • Combining Data Types
      • Combining Data Types Lab
    • Javascript in the browser
      • DOM and Events
      • DOM Manipulation
      • DOM Review
      • DOM Review Lab
      • HP DOM Lab
      • Programmatic DOM Manipulation
      • Grids & Pyramids
      • DOM & Data
      • DOM Events
      • Color Palette Picker
      • Sketchpad
    • HTML5 Canvas
    • How To Reduce Redundancy
    • OOP
      • Westworld Lab
      • OOP Factories
      • OOP Inheritance
      • OOP Inheritance Lab
      • Tomagotchi Lab
      • OOP Space Battle
      • OOP Snowman
      • (2019) JavaScript OOP
      • (2016) OOP with Classes
      • (1995) OOP with Prototypes
      • Constructors
      • Prototypes
    • Intro to TDD
    • Scoping
    • Inheritance
      • Prototypal Inheritance
      • Call, Apply, and other Functions
      • ES6 Inheritance
      • Resources
    • Custom Node Modules
    • Additional Topics
      • AJAX, Fetch, and Async/Await
      • AJAX w/JSON and Localstorage
        • AJAX w/JSON
        • Local Storage
      • Async module
      • Data Scraping
  • jQuery
    • Intro
      • DOM Manipulation
      • Reddit Practice
      • Styling
      • Events
    • Plugins
    • AJAX
  • APIs
    • Fetch
    • AJAX w/jQuery
    • AJAX w/Fetch
  • Databases
    • Intro to SQL
    • Advanced SQL
    • MongoDB
      • Intro to NoSQL
      • CRUD in MongoDB
      • Data Modeling
      • Intermediate Mongo
  • Left over Node/Express
    • Testing with Mocha and Chai
    • Mongoose
      • Mongoose Associations
    • JSON Web Tokens
      • Codealong
    • Additional Topics
      • oAuth
      • Geocoding with Mapbox
      • Geocoding and Google Maps
      • Cloudinary
      • Websockets with Socket.io
      • SASS
  • Ruby
    • Intro to Ruby
    • Ruby Exercises
    • Ruby Classes
    • Ruby Testing with Rspec
    • Ruby Inheritance
    • Ruby Data Scraping
  • Ruby on Rails
    • Intro to Rails
    • APIs with Rails
    • Asset Pipeline
    • Rails Auth and 1-M
      • Auth Components
    • Rails N:M
    • ActiveRecord Polymorphism
    • Additional Topics
      • oAuth
      • SASS
      • Rails Mailers
      • Cloudinary
      • Jekyll
  • React (Updated 2019)
    • ES6+/ESNext
      • Const and Let
      • Arrow Functions
      • Object Literals and String Interpolation
      • ES6 Recap
      • ES6 Activity
    • Intro to React
      • Create React App
      • Components and JSX
      • Virtual DOM
      • Props
      • Dino Blog Activity
      • Nested Components
      • Lab: LotR
    • React State
      • Code-Along: Edit Dino Blog
      • Lab: Simple Calc
      • Lifting State
    • React Router
      • Browser History/SPAs
      • React Router (lesson and full codealong)
      • Router Lab
    • Fetch and APIs
      • APIs with Fetch and Axios
      • Fetch the Weather
    • React Hooks
    • React LifeCycle
      • Lab: Component LifeCycle
    • React Deployment
    • Additional Topics
      • React Frameworks
        • Material UI Theming
      • Typescript
        • More Types and Syntax
        • Tsconfig and Declaration Files
        • Generics with Linked List
      • Redux
      • TypeScript
      • Context API
      • React Native
  • Meteor
  • Deployment and Config
    • Installfest
      • Mac OSX
      • Linux
      • Git Configuration
      • Sublime Packages
    • Deploy - Github Pages
    • Deploy - Node/Sequelize
    • Deploy - Node/MongoDB
    • Deploy React
    • Deploy - Rails
      • Foreman (Environment Variables)
    • Deploy - AWS Elastic Beanstalk
    • Deploy - S3 Static Sites
    • Deploy - Django
    • Deploy - Flask
  • Data Structures and Algorithms
    • Recursion
    • Problem Solving - Array Flatten
    • Binary Search
    • Algorithm Complexity
    • Stacks and Queues
    • Bracket Matching
    • Ruby Linked Lists
      • Sample Code
      • Beginner Exercises
      • Advanced Exercises
    • JS Linked Lists
      • Sample Code
      • Beginner Exercises
      • Beginner Solutions
    • Hash Tables
    • Intro to Sorting
    • Insertion Sort
    • Bucket Sort
    • Bubble Sort
    • Merge Sort
    • Quick Sort
    • Heap Sort
    • Sorting Wrapup
    • Hashmaps
    • Trees and Other Topics
  • Python
    • Python Installation
    • Intro to Python
    • Python Lists
    • Python Loops
    • Python Dictionaries
    • Python Sets and Tuples
    • Python Cheatsheet
    • Python Functions
    • Python Classes
    • Python Class Inheritance
    • Intro to Flask
    • Intro to SQLAlchemy
      • Flask and SQLAlchemy
    • Using PyMongo
    • Intro to Django
    • CatCollector CodeAlong
      • URLs, Views, Templates
      • Models, Migrations
      • Model Form CRUD
      • One-to-Many Relations
      • Many-to-Many Relations
      • Django Auth
    • Django Cheatsheet
    • Django Auth
    • Django Polls App Tutorial
    • Django School Tool Tutorial
    • Django 1:M Relationships
    • Custom Admin Views
    • Data Structures and Algorithms
      • Recursion
      • Binary Search
      • Stacks and Queues
      • Linked Lists
      • Binary Trees
      • Bubble Sort
      • TensorFlow & Neural Networks
    • Adjacent Topics
      • Raspberry Pi
      • Scripting
  • Assorted Topics
    • History of Computer Science
    • Regular Expressions
    • Being Successful in SEI
    • Internet Fundamentals
      • Internet Lab
    • Adjacent Workflow
      • UX/UI
      • Wireframing Exercise: Build an Idea
      • Agile
    • Post SEI
      • Learning Resources
      • Deliverables -> Portfolio
      • FAQ
  • Projects
    • Project 1
    • Project 2
    • Project 3
      • Project 3 Pitch Guidelines
    • Project 4
    • Past Projects
      • Project 1
      • Project 2
      • Project 3
      • Project 4
      • Portfolios
    • Post Project 2
    • MEAN Hackathon
      • Part 1: APIs
      • Part 2: Angular
    • Portfolio
  • Web Development Trends
  • Resources
    • APIs and Data
    • Tech Websites
    • PostgreSQL Cheat Sheet
    • Sequelize Cheat Sheet
    • Database Administration
  • Archived Section
    • (Archived) ReactJS
      • Intro to React
        • Todo List Codealong
        • Additional Topics
      • Deploy React
      • React with Gulp and Browserify
        • Setting up Gulp
        • Additional Gulp Tasks
      • React Router
        • OMDB Router
        • OMDB Search
        • Additional Resources
      • React Animations
        • CSS Animations
    • AngularJS
      • Intro to AngularJS
        • Components and SPA
        • Create an Angular App
      • Angular Directives and Filters
      • Angular Animation
      • Angular Bootstrap Directives
        • Bootstrap Modals
      • Angular $http
      • Angular Services
        • Service Recipes
        • ngResource
        • Star Wars Codealong
      • Angular Routing
      • Angular + Express
      • Angular Authentication
        • Additional Topics
      • Angular Components
      • Angular Custom Filters
      • Angular Custom Directives
Powered by GitBook
On this page
  • Objectives
  • Nokogiri
  • Using local data
  • Using cURL to download the data
  • Loading a local file

Was this helpful?

  1. Ruby

Ruby Data Scraping

PreviousRuby InheritanceNextRuby on Rails

Last updated 4 years ago

Was this helpful?

Objectives

  • Review rest-client and how it can be used to make HTTP requests

  • Discuss the process and need for data scraping

  • Use the Nokogiri parser and RestClient to obtain and parse data

Data scraping is the process of extracting data from a web site. This is used primarily when the data is not available from an existing API. Data is extracted by taking the HTML file and trying to extract certain blocks of text or data. We'll be using a Ruby gem called Nokogiri in order to scrape a page.

Nokogiri

Nokogiri is a parser for HTML and XML. It allows us to process a HTML file (locally or on a remote server) and extract the data from it using CSS selectors. This is handy because the process is analagous to selecting data using jQuery or querySelectorAll.

In a new folder, let's install the Nokogiri and rest-client gems and try scraping a site.

Gem Installation (or place into your Gemfile, if using Rails)

gem install nokogiri
gem install rest-client

In a new file, we need to do the following:

  • Get a HTML file (hint: 'rest-client')

  • Give Nokogiri the HTML file

  • Select data using CSS selectors

In a file

# require the gems we need
require 'rest-client'
require 'nokogiri'

#request data
response = RestClient.get("http://reddit.com/r/aww")
html = response.body

#give Nokogiri your file
data = Nokogiri::HTML(html)

#select data using CSS selectors
data.css('p.title a.title').each do |title|
  puts title.text
end

Let's analyze a few things we do here with Nokogiri

  • We pass our HTML string into Nokogiri::HTML. HTML is a class inside of the Nokogiri module, and this class parses HTML files.

  • To select specific DOM elements, we can take the parsed file and call the .css method. It's also possible to iterate through the results.

  • Can you only get text? Nope! There's other methods associated with Nokogiri.

Using local data

When we are figuring out how to scrape a site we usually have to do a lot of trial and error. Which means reloading the data over and over. This could raise red flags with the site administrator and could possibly get your IP banned.

Luckily, we can avoid this by downloading the data for a page and using that file while we figure out how to scrape the data we need.

Using cURL to download the data

We can use cURL to download HTML files and save them locally on our computers. By using the -o flag, we can specify the local file name and the URL we want to download.

curl -o reddit_aww.html https://www.reddit.com/r/aww

This example downloads data from https://www.reddit.com/r/aww and stores it in reddit_aww.html. Once the data has been downloaded, we'll want to include it in a different manner inside our Ruby file.

Loading a local file

require 'nokogiri'

# open the file
reddit_html = File.open('reddit_aww.html')

# give Nokogiri the file
data = Nokogiri::HTML(reddit_html)

This opens the HTML file and stores the new file object to reddit_html. Then, we pass the file to Nokogiri, similar to before.

http://www.nokogiri.org/
http://www.nokogiri.org/tutorials/searching_a_xml_html_document.html