X-Ray: A Scraper by the Author of Cheerio

2015-02-05 00:00:00 +0000 by Alex R. Young


My favourite general-purpose web scraping utility has to be cheerio, a module that provides the simplicity of CSS selectors with fast and forgiving HTML parsing. The author, Matthew Mueller, has now released x-ray (GitHub: lapwinglabs/x-ray, License: MIT, npm: x-ray), a module specifically built for scraping.

The x-ray module depends on Cheerio, but includes some methods to automate common scraping tasks: it can convert collections of items into JavaScript objects, select objects based on flexible schemas, and it can even paginate based on CSS selectors.

Each of these features is well thought out: pagination supports a delay between requests, and you can set a limit to avoid following too many pages. It also integrates well with Node: you can stream to files, for example.

The data sources are flexible as well. If you need to scrape dynamic pages, then you can use x-ray's PhantomJS driver, otherwise superagent is used.

Matthew's example fetches a user's "stars" page from GitHub and then extracts metadata like the repository link, description, and date:

    $root: '.repo-list-item',
    title: '.repo-list-name',
    link: '.repo-list-name a[href]',
    description: '.repo-list-description',
    meta: {
      $root: '.repo-list-meta',
      starredOn: 'time'
  .paginate('.pagination a:last-child[href]')

One other cool feature of x-ray is it supports output formats. That means you can easily download a page, structure it, then generate JSON, XML, RSS, Atom, CSV, etc.

When compared to using a chaotic mix of HTTP and DOM-parsing modules, x-ray gives scraping a more rigorous structure while remaining flexible. That means it would be possible to share x-ray schemas as Node modules, so the community could collaborate on wrapping useful websites with poor or non-existent APIs. It's true that some people regard scraping as a dark art that borders on copyright theft (if used without permission), but Matthew carefully calms this notion by using the slogan "structure any website".