Natural Language Parsing with Retext

31 Jul 2014 | By Alex Young | Tags parsing text

Retext (GitHub: wooorm / retext, License: MIT, npm: retext) by Titus Wormer is an extensible module for analysing and manipulating natural language text. It’s built on two other modules by the same author. One is TextOM, which provides an object system for manipulating text, and the other is ParseLatin.

Given some text, ParseLatin returns syntax trees:

parseLatin.parse('A simple sentence.');
/*
 * ˅ Object
 *    ˃ children: Array[1]
 *      type: "RootNode"
 *    ˃ __proto__: Object
 */

These trees can then be processed as required. You can iterate over nodes or search them for values, it’s a bit like a DOM for plain text (or syntax/grammar).

The Retext module has lots of plugins. One example is an implementation of the Metaphone algorithmretext-double-metaphone. There’s also a short-code emoji parser, so you can actually build tightly focused text-processing modules with Retext. Another similar plugin is a typographic parsing library, which converts ASCII to HTML entities.

One cool use of Retext would be natural language date parsing, which is something that in my experience always ends up in a horrible mess of regular expressions. The author is still looking for a “retext-date” implementation, so it would be interesting to see what that looks like in Retext.


blog comments powered by Disqus