DailyJS

Let's Make a Framework: Selectors Part 2

Alex R. Young

tutorials frameworks web lmaf

Welcome to part 7 of Let's Make a Framework, the ongoing series about
building a JavaScript framework. This part continues looking at CSS
selectors.

If you haven't been following along, these articles are tagged with
lmaf. The project we're creating is called Turing and is available on GitHub:
turing.js.

In this part I'll demonstrate implementing a selector engine. It's
harder than it looks, but I'll try to keep focused so it doesn't get too
complicated.

I've based this selector engine on basic parser concepts, which means
you should learn some basic computer science principles here as well as
getting down and dirty with the DOM, CSS grammar, and how browsers
search using selectors.

CSS Selectors

Before trying to parse anything, it's a good idea to become familiar
with the input. CSS selectors aren't trivial, but to mitigate
unnecessary complications I'll limit our library to a subset of CSS2 for
now.

CSS2 selectors are explained in detail in the spec: Selectors: Pattern
Matching and Appendix G. Grammar of CSS 2.1. These documents are
readable; don't be afraid to skim them!

I want to initially focus on the following syntaxes:

  • E - Matches any element named E
  • E F - Matches any element F that descends from E
  • .class - Matches any element with the class class
  • E.class - Matches any element named E with the class class
  • #id - Matches any element with the id id
  • E#id - Matches any element named E with the id id

Any one of these rules forms a simple selector. Simple selectors can
be chained with combinators: white space, '>', and '+'. For example,
in div p.note, the simple selectors div and p.note are joined by the
descendant combinator (white space).

Parsing and Searching Strategy

A good place to look for parsing strategies (other than existing
selector engines) is browsers. The Mozilla developer site has an
article entitled Writing Efficient CSS that actually explains how the
style system matches rules.

It breaks up selectors into four categories:

  1. ID Rules
  2. Class Rules
  3. Tag Rules
  4. Universal Rules

The last part of a selector is called the key selector. The ancestors
of the key selector are analysed to determine which elements match the
entire selector. This allows the engine to ignore rules it doesn't need.
The upshot of this is that element#idName performs more slowly
than #idName.
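
To make the comparison concrete, here's a minimal sketch of an
ID-based lookup (findById is a hypothetical helper, not code from
turing.js): '#intro' is answered by a single getElementById call,
while 'p#intro' needs the same call plus a tag name check.

function findById(id, tagName) {
  // A single fast lookup answers '#id' selectors
  var element = document.getElementById(id);
  if (!element) return [];
  // 'name#id' selectors need an extra check against the tag name
  if (tagName && element.nodeName.toLowerCase() !== tagName) return [];
  return [element];
}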

This algorithm isn't necessarily the fastest -- many rules would rely on
getElementsByTagName returning a lot of elements. However, it's a
pragmatic approach that's incredibly easy to understand.

Rather than branching off rule categories, we could just put the
categories in an object:

  findMap = {
    // '#intro'
    'id': function(root, selector) {
    },

    // 'p#intro'
    'name and id': function(root, selector) {
    },

    // 'p'
    'name': function(root, selector) {
    },

    // '.panel'
    'class': function(root, selector) {
    },

    // 'p.panel'
    'name and class': function(root, selector) {
    }
  };
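
To give a feel for what goes in those functions, here's a sketch of
two of the finders; these implementations are illustrative rather than
turing.js's actual code:

findMap['name'] = function(root, selector) {
  // 'p' -> every element under root with that tag name
  return root.getElementsByTagName(selector);
};

findMap['class'] = function(root, selector) {
  // '.panel' -> walk every element under root and check its className
  var className = selector.slice(1),
      elements  = root.getElementsByTagName('*'),
      results   = [],
      i;
  for (i = 0; i < elements.length; i++) {
    if ((' ' + elements[i].className + ' ').indexOf(' ' + className + ' ') !== -1) {
      results.push(elements[i]);
    }
  }
  return results;
};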

Tokenizer

Tokens are just categorised strings of characters. This stage of
parsing is called lexical analysis. It sounds more complicated than
it is. Given a selector, we need to:

  • Normalise it to remove any extra white space (a sketch of this step follows the list)
  • Process it by some means to transform it into a sequence of instructions (tokens) for our parser
  • Run the tokens through the parser against the DOM
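
Normalisation can be as simple as collapsing runs of white space and
trimming the ends. This is a sketch rather than turing.js's exact code
(String.prototype.trim isn't available in older browsers, hence the
regular expressions):

function normalise(selector) {
  // Collapse internal white space, then strip leading/trailing space
  return selector.replace(/\s+/g, ' ').replace(/^\s+|\s+$/g, '');
}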

We can model a token as a class like this:

function Token(identity, finder) {
  this.identity = identity;
  this.finder   = finder;
}

Token.prototype.toString = function() {
  return 'identity: ' + this.identity + ', finder: ' + this.finder;
};

The finder property is one of those keys in
findMap. The identity is the original rule
from the selector.
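
For example, tokenizing the selector '#intro' would produce a token
like this:

var token = new Token('#intro', 'id');
token.toString();
// => 'identity: #intro, finder: id'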

Scanner

Both Sly and
Sizzle use a giant regular expression to break apart and process a selector. Sizzle calls this the
Chunker.

Given the performance and flexibility of regular expressions in
JavaScript, this is a good approach. However, I don't want to confuse
readers with a giant regular expression. What we need is an intermediate
approach.

Most programming languages have tools that generate tokenizers based on
lexical patterns. Typically, such a tool takes a set of lexical rules
and outputs the source code for a lexical analyser, which is then used
alongside a generated parser.

Lexers were traditionally used for building programming language
parsers, but we now live in a world full of computer-generated data.
This means that you'll even find these tools cropping up in projects
like nokogiri, a Ruby HTML and XML parser.

The power of lexers lies in the fact that there's a layer of abstraction
between the programmer and the parser. Working on simple rules is easier
than figuring out the finer implementation details.

Let's use an incredibly simplified version of a lexer to create the
regular expression that drives the tokenizer. These rules can be based
on the lexical scanner description in the CSS grammar specification.

It will be useful to embed regular expressions within other ones, and
not worry about escaping them. Objects like this will drive the process:

macros = {
  'nl':        '\n|\r\n|\r|\f',
  'nonascii':  '[^\0-\177]',
  'unicode':   '\\[0-9A-Fa-f]{1,6}(\r\n|[\s\n\r\t\f])?',
  'escape':    '#{unicode}|\\[^\n\r\f0-9A-Fa-f]',
  'nmchar':    '[_A-Za-z0-9-]|#{nonascii}|#{escape}',
  'nmstart':   '[_A-Za-z]|#{nonascii}|#{escape}',
  'ident':     '[-@]?(#{nmstart})(#{nmchar})*',
  'name':      '(#{nmchar})+'
};

rules = {
  'id and name':    '(#{ident}##{ident})',
  'id':             '(##{ident})',
  'class':          '(\\.#{ident})',
  'name and class': '(#{ident}\\.#{ident})',
  'element':        '(#{ident})',
  'pseudo class':   '(:#{ident})'
};

The scanner will work by:

  • Expanding #{} in the macros
  • Expanding #{} in the rules based on the expanded macros
  • Escaping the backslashes
  • Joining each of the patterns with |
  • Building a global regular expression with the RegExp class
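
Here's a minimal sketch of that process, assuming the macros and
rules objects above; the expand and buildScanner names are mine, and
the "escaping the backslashes" step is elided here (see turing.js for
the full details):

function expand(pattern, definitions) {
  // Keep substituting #{name} until no references remain, so macros
  // can refer to other macros (escape uses unicode, nmchar uses
  // escape, and so on)
  while (pattern.indexOf('#{') !== -1) {
    pattern = pattern.replace(/#\{(\w+)\}/g, function(reference, name) {
      return definitions[name];
    });
  }
  return pattern;
}

function buildScanner(macros, rules) {
  var patterns = [], key;
  for (key in macros) {
    macros[key] = expand(macros[key], macros);
  }
  for (key in rules) {
    // Each rule keeps its capturing parentheses; their positions are
    // what the parser uses later to work out which rule matched
    patterns.push(expand(rules[key], macros));
  }
  // Join all of the rules with | into one global regular expression
  return new RegExp(patterns.join('|'), 'g');
}

r = buildScanner(macros, rules);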

This will output a giant regular expression much like the ones used by
Sizzle and Sly. The advantage of this approach is you can see the
relationship between the tokenized output and the DOM
matchers/searchers.

Processing the Giant Regular Expression

After a selector is normalised, it will be broken apart using the
regular expression from the scanner. This works based on the indexes
of the matched capture groups:

// Each branch of the giant regular expression has its own capture
// groups, so the index of the first defined group tells us which
// rule matched
while ((match = r.exec(this.selector))) {
  finder = null;

  if (match[10]) {
    finder = 'id';
  } else if (match[1]) {
    finder = 'name and id';
  } else if (match[29]) {
    finder = 'name';
  } else if (match[15]) {
    finder = 'class';
  } else if (match[20]) {
    finder = 'name and class';
  }
  this.tokens.push(new Token(match[0], finder));
}

Despite being obtuse, this is more efficient than testing
match[0] against each of the regexes in the
rules object.
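
Those index numbers have to be recounted by hand whenever the rules
change, because each rule contributes a different number of capturing
groups. Here's a sketch of one way to compute them automatically when
the scanner is built, using the expand helper from earlier; this is an
alternative idea, not what turing.js does:

function groupOffsets(rules, macros) {
  // Count the capturing parentheses in each expanded rule to find the
  // index of the first group that rule contributes. Assumes every
  // unescaped '(' opens a capturing group (no '(?:' groups, no parens
  // inside character classes) and insertion-ordered keys.
  var offsets = {}, index = 1, key, groups;
  for (key in rules) {
    offsets[key] = index;
    groups = expand(rules[key], macros).replace(/\\\(/g, '').match(/\(/g);
    index += groups ? groups.length : 0;
  }
  return offsets;
}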

Next Week

The Searcher class implements the algorithm that Firefox
uses. It's simple enough that turing.dom.js currently passes tests in
IE6, IE7, IE8, Firefox, Safari, Chrome and Opera. I'll explain this next
week, and also demonstrate how I used test-driven development to develop
the parser and tokenizer.

To see the most recent code, have a look at
turing.dom.js on GitHub.