DailyJS

Natural Language Parsing with Retext

Alex R. Young

Subscribe

@dailyjs

Facebook

Google+

parsing text

Natural Language Parsing with Retext

Posted by Alex R. Young on .
Featured

parsing text

Natural Language Parsing with Retext

Posted by Alex R. Young on .

Retext (GitHub: wooorm / retext, License: MIT, npm: retext) by Titus Wormer is an extensible module for analysing and manipulating natural language text. It's built on two other modules by the same author. One is TextOM, which provides an object system for manipulating text, and the other is ParseLatin.

Given some text, ParseLatin returns syntax trees:

parseLatin.parse('A simple sentence.');  
/*
 * ˅ Object
 *    ˃ children: Array[1]
 *      type: "RootNode"
 *    ˃ __proto__: Object
 */

These trees can then be processed as required. You can iterate over nodes or search them for values, it's a bit like a DOM for plain text (or syntax/grammar).

The Retext module has lots of plugins. One example is an implementation of the Metaphone algorithm -- retext-double-metaphone. There's also a short-code emoji parser, so you can actually build tightly focused text-processing modules with Retext. Another similar plugin is a typographic parsing library, which converts ASCII to HTML entities.

One cool use of Retext would be natural language date parsing, which is something that in my experience always ends up in a horrible mess of regular expressions. The author is still looking for a "retext-date" implementation, so it would be interesting to see what that looks like in Retext.