Finding and replacing strings of text in a DOM document is very tricky, especially if you’re looking to do it properly and entirely unobtrusively.

UPDATE (4 July 2012): I finally made a solution which deals with the caveats mentioned in this post.

In this context, “unobtrusive” means affecting the page in a minimally invasive manner — minimal DOM destruction, no un-called-for normalizing of text nodes (i.e. joining separate text-nodes together) etc.

Essentially, what I’m talking about is finding a specific piece of text in a page or a piece of text that matches a pattern and then doing something with it — possibly wrapping it in an element or changing it in some other way.

Imagine that we have the following paragraph:

<p>This order's reference number is RF83297.</p>

We want to locate the reference number and wrap it in an anchor that leads to the order’s page (pretend that we’re making a bookmarklet to enhance the admin panel of some online store). Given what we know about the HTML structure this can be quite easy:

(Using jQuery)

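jQuery('p').each(function(){

    var p = jQuery(this);

    // Replace the paragraph's content with its own text,
    // wrapping each reference number in a link
    p.html(
        p.text().replace(
            /\bRF\d{5}/g,
            '<a href="/order/$&">$&</a>'
        )
    );

});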

The problem with this approach is that it breaks once you add any elements to any of the paragraphs on the page. It sets the content of the element to the (replaced) text of that very element, by definition destroying any non-text nodes in the process, or, to put it poetically, it flattens the DOM structure of any given paragraph.

Additionally, the above technique does not care to avoid paragraphs that do not contain a match for the pattern — it’ll blindly flatten all elements within the paragraphs…

Sure, we could fix it up a bit by making sure that it only messes with paragraphs that contain a match for the pattern (/\bRF\d{5}/g) but we’re still faced with the obtrusive destruction of each matching paragraph’s inner-DOM structure.

If there happened to be another link in the paragraph, it’d be wiped:

Before replacement:

<p>
    <a href="/admin">Go back to admin!</a><br/>
    This order's reference number is RF83297.
</p>

After replacement:

<p>
    Go back to admin!
    This order's reference number is <a href="/order/RF83297">RF83297</a>.
</p>

So, to be appropriately blunt, this is obviously a terrible technique and should be avoided at all costs.

EDIT: It would also be a bad idea to do something like p.html( p.html().replace(...) ); for two reasons. First, you’d need some way of avoiding text inside HTML tags and attributes, which in itself would require a messy regular expression. Second, you’d be destroying the elements that currently exist within the paragraph. Yes, the HTML would look the same following the replacement, but the elements themselves would be new ones, meaning that event handlers and other bound data would be lost.
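
To make the attribute problem concrete, here’s a minimal sketch that runs the same replacement over a raw HTML string, which is effectively what the html()-based approach does. The markup and the invoice URL are made up for illustration; the pattern is the one used throughout this post.

```javascript
// Hypothetical markup: the reference number appears inside an href
// attribute as well as in the visible text.
var html = '<a href="/invoices/RF83297.pdf">Invoice</a> for order RF83297.';

var pattern = /\bRF\d{5}/g;

// Naive replacement over the serialized HTML
var replaced = html.replace(pattern, '<a href="/order/$&">$&</a>');

// The match inside the attribute gets wrapped too, so the href now
// contains markup and the original link is broken.
console.log(replaced);
```

The regular expression has no idea which parts of the string are tags and which are text, so both occurrences are treated alike.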

The right way

Whether we like it or not, the correct way to handle text in a DOM structure is to get intimate with the text nodes themselves. Text nodes aren’t that different from element nodes (the ones you’re used to dealing with) — the key differences are:

  • Text nodes don’t have descendants (their childNodes list is always empty).
  • Almost everything you would want to know about a text node will be contained in its data (or nodeValue) property, which contains whatever text the node encapsulates.
  • Text nodes don’t fire events, can’t have any styles applied, and are pretty useless at everything except holding a piece of text!

These text nodes will appear in the DOM structure just like element nodes. If we consider this paragraph again:

<p>
    <a href="/admin">Go back to admin!</a><br/>
    This order's reference number is RF83297.
</p>

Its DOM structure is as follows:

-> P ELEMENT
    -> TEXT NODE (data: "\n    ")
    -> A ELEMENT (href: "/admin")
        -> TEXT NODE (data: "Go back to admin!")
    -> BR ELEMENT
    -> TEXT NODE (data: "\n    This order's reference number is RF83297.\n")
Firebug’s your friend for inspecting this kind of structure.

We don’t know which text node we’ll find the reference number in so our only choice is to traverse and test all text nodes, including the nested ones. This in itself is quite simple:

// Pretending we have more than one paragraph to look through
jQuery('p').each(function(){
    traverseChildNodes(this);
});
 
function traverseChildNodes(node) {
 
    var next;
 
    if (node.nodeType === 1) {
 
        // (Element node)
 
        if (node = node.firstChild) {
            do {
                // Recursively call traverseChildNodes
                // on each child node
                next = node.nextSibling;
                traverseChildNodes(node);
            } while(node = next);
        }
 
    } else if (node.nodeType === 3) {
 
        // (Text node)
 
        if (/\bRF\d{5}/.test(node.data)) {
            // Do something interesting here
            alert('FOUND A MATCH!');
        }
 
    }
 
}

Finding a match isn’t too tricky, but replacing the matched portion of a text node with an element (wrapping it with the anchor) can be troublesome. Fortunately we can call on the useful innerHTML trick to create an out-of-DOM collection of nodes and add them one by one.

First, let’s adjust our traverseChildNodes so that it calls wrapMatchesInNode whenever it finds a node that matches:

// ....
} else if (node.nodeType === 3) {
 
    // (Text node)
 
    if (/\bRF\d{5}/.test(node.data)) {
        wrapMatchesInNode(node);
    }
 
}
// ....

wrapMatchesInNode:

function wrapMatchesInNode(textNode) {

    var temp = document.createElement('div');

    temp.innerHTML = textNode.data.replace(/\bRF\d{5}/g, '<a href="/order/$&">$&</a>');

    // temp.innerHTML is now:
    // "\n    This order's reference number is <a href="/order/RF83297">RF83297</a>.\n"
    // |________________________________________|___________________________________|____|
    //                     |                                      |                   |
    //                 TEXT NODE                             ELEMENT NODE         TEXT NODE

    // Extract produced nodes and insert them
    // before original textNode. insertBefore takes
    // (newNode, referenceNode) and moves the node out of temp:
    while (temp.firstChild) {
        console.log(temp.firstChild.nodeType);
        textNode.parentNode.insertBefore(temp.firstChild, textNode);
    }
    // Logged: 3,1,3

    // Remove original text-node:
    textNode.parentNode.removeChild(textNode);

}

That’ll work quite nicely. The result:

<p>
    <a href="/admin">Go back to admin!</a><br/>
    This order's reference number is <a href="/order/RF83297">RF83297</a>.
</p>
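
One thing to watch with the innerHTML trick: the text node’s data is parsed as HTML, so something like a literal <b> or an entity-looking sequence in the text would be interpreted as markup rather than text. A more defensive variant, sketched below, first splits the text into plain and matched segments. splitOnMatches is a hypothetical helper name, not something from the code above, and this is only one way to go about it.

```javascript
// Split a string into plain and matched segments, in order.
// `pattern` must be a global regex (created with the g flag).
function splitOnMatches(text, pattern) {
    if (!pattern.global) {
        throw new Error('splitOnMatches expects a global regex');
    }

    var segments = [];
    var lastIndex = 0;
    var match;

    pattern.lastIndex = 0;

    while ((match = pattern.exec(text)) !== null) {
        // Plain text between the previous match and this one
        if (match.index > lastIndex) {
            segments.push({ text: text.slice(lastIndex, match.index), isMatch: false });
        }
        segments.push({ text: match[0], isMatch: true });
        lastIndex = match.index + match[0].length;

        // Guard against zero-length matches looping forever
        if (match[0].length === 0) {
            pattern.lastIndex++;
        }
    }

    // Trailing plain text after the last match
    if (lastIndex < text.length) {
        segments.push({ text: text.slice(lastIndex), isMatch: false });
    }

    return segments;
}

var segments = splitOnMatches(
    "This order's reference number is RF83297.",
    /\bRF\d{5}/g
);
console.log(segments);
```

In the browser, each plain segment would become a document.createTextNode(segment.text) and each match an anchor built with document.createElement('a'), inserted before the original text node in the same way as above. Nothing is ever parsed as HTML, so the text stays text.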

But now we face another problem…

A given chunk of text in a DOM snippet is not guaranteed to be encapsulated within a single text node. The HTML parsed by the browser will produce single DOM nodes from uninterrupted strings of text, such as what’s within this DIV:

<div>
    Nothing but text.
    No elements.
    No comments.
    Nada...
</div>

… But, any extra text appended to that element will not merge into the already-existing text node, unless, of course, you’re using innerHTML/innerText (/textContent) to set your content.

So it’s possible (not likely, but possible) that we will encounter our reference number split across two or more text nodes. For example:

-> TEXT NODE (data: RF)
-> TEXT NODE (data: 832)
-> TEXT NODE (data: 97)

This occurrence will not be noticed by our current solution, since it treats each text node as its own “haystack”.

Before we discuss solutions to this, we must first cover other potential issues. Consider this:

<span>RF</span>83297

A similar problem occurs here — the text is separated, both parts in separate text-nodes — one nested in an element. It may seem as though I’m stretching the reference-number example but please remember that it’s just an example: what I’m trying to cover is the complexities involved in getting what is seemingly a simple operation to work correctly.

If we are to match a string of text, even when part of it is wrapped in arbitrary elements (like above), what elements should we allow? Surely not block-level elements? Of course not, but this means that we need to test the computed style of every traversed element to see if it’s “inline” or otherwise.
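
Such a check could look roughly like this. It’s a sketch under a few assumptions: isInlineElement is a made-up helper name, the getStyle parameter exists only so the function can be tried outside a browser, and treating only display: inline as “inline” is a choice you may want to loosen (for inline-block, say).

```javascript
function isInlineElement(element, getStyle) {
    // Default to the browser's computed-style lookup
    getStyle = getStyle || function (el) {
        return window.getComputedStyle(el);
    };
    return getStyle(element).display === 'inline';
}

// In a browser:
//   isInlineElement(document.querySelector('em'));
```

Calling this for every element encountered during traversal is exactly the kind of overhead that makes a fully general solution expensive.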

Let’s consider an abstraction that solves all the covered issues. It lets you search a specified root node for occurrences of a string and lets you replace them with whatever HTML you want, allowing you to wrap each one in an anchor element, for example. But what happens in cases like this:

<p>
    ...
    This order's reference <em>number is RF</em>83297.
</p>

Wrapping RF</em>83297 in <a>...</a> would produce invalid HTML…

So it seems that, while possible, a solution that protects against all of the potential problems is likely to be slow and, in the end, simply not worth it.

The only viable solution, in my eyes, is to forget about matching strings that are interrupted with HTML elements (like <em>RF</em>83297). Only test text-node data. To handle neighbouring text nodes, either merge them together or, to be truly “unobtrusive”, collect the data from neighbouring text nodes and test the collected text at the end.
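
Here’s a rough sketch of that second option: walk an element’s direct children, collect runs of neighbouring text nodes, and test each run’s combined text without touching the DOM. findInTextRuns is a hypothetical name. The function relies only on the standard nodeType, firstChild, nextSibling and data properties, so the demo uses plain objects standing in for real nodes; recursing into child elements is left out to keep it short.

```javascript
// Collect runs of adjacent text nodes under `container` and test the
// combined text of each run against `pattern`. Nothing is modified.
function findInTextRuns(container, pattern) {
    var results = [];
    var run = [];

    function flush() {
        if (run.length === 0) return;
        var text = '';
        for (var i = 0; i < run.length; i++) {
            text += run[i].data;
        }
        var matches = text.match(pattern);
        if (matches) {
            results.push({ nodes: run, text: text, matches: matches });
        }
        run = [];
    }

    for (var node = container.firstChild; node; node = node.nextSibling) {
        if (node.nodeType === 3) {
            run.push(node);   // text node: extend the current run
        } else {
            flush();          // anything else ends the run
        }
    }
    flush();

    return results;
}

// Stand-in nodes (plain objects, not a real DOM) for: "RF" "832" "97"
var third = { nodeType: 3, data: '97', nextSibling: null };
var second = { nodeType: 3, data: '832', nextSibling: third };
var first = { nodeType: 3, data: 'RF', nextSibling: second };
var container = { nodeType: 1, firstChild: first };

var found = findInTextRuns(container, /\bRF\d{5}/g);
console.log(found[0].text, found[0].nodes.length);
```

Each result records which text nodes the run spans, which is what you’d need in order to replace the match afterwards without normalizing anything you didn’t have to.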

That’s really all I have to say on the matter. I hope I didn’t coerce you down the rabbit hole with the false promise of an all-encompassing solution… I simply don’t have one!
