Finding and replacing strings of text in a DOM document is very tricky, especially if you’re looking to do it properly and entirely unobtrusively.
UPDATE: 4JUL/2012: I finally made a solution which deals with the caveats mentioned in this post.
In this context, “unobtrusive” means affecting the page in a minimally invasive manner — minimal DOM destruction, no un-called-for normalizing of text nodes (i.e. joining separate text-nodes together) etc.
Essentially, what I’m talking about is finding a specific piece of text in a page or a piece of text that matches a pattern and then doing something with it — possibly wrapping it in an element or changing it in some other way.
Imagine that we have the following paragraph:
    <p>This order's reference number is RF83297.</p>
We want to locate the reference number and wrap it in an anchor that leads to the order’s page (pretend that we’re making a bookmarklet to enhance the admin panel of some online store). Given what we know about the HTML structure this can be quite easy:
(Using jQuery)
    jQuery('p').each(function(){

        var p = jQuery(this);

        p.html(
            p.text().replace(
                /\bRF\d{5}/g,
                '<a href="/order/$&">$&</a>'
            )
        );

    });
The problem with this approach is that it breaks once you add any elements to any of the paragraphs on the page. It sets the content of the element to the (replaced) text of that very element, by definition destroying any non-text nodes in the process, or, to put it poetically, it flattens the DOM structure of any given paragraph.
Additionally, the above technique does not care to avoid paragraphs that do not contain a match for the pattern — it’ll blindly flatten all elements within the paragraphs…
Sure, we could fix it up a bit by making sure that it only messes with paragraphs that contain a match for the pattern (/\bRF\d{5}/g) but we’re still faced with the obtrusive destruction of each matching paragraph’s inner-DOM structure.
If there happened to be another link in the paragraph, it’d be wiped:
Before replacement:
    <p>
        <a href="/admin">Go back to admin!</a><br/>
        This order's reference number is RF83297.
    </p>
After replacement:
    <p>
        Go back to admin!
        This order's reference number is <a href="/order/RF83297">RF83297</a>.
    </p>
So, to be appropriately blunt, this is obviously a terrible technique and should be avoided at all costs.
EDIT: It would also be a bad idea to do something like p.html( p.html().replace(...) ); because you’d need some way of avoiding text in HTML attributes and tags (which, in itself, would require a messy regular expression), and you’d be destroying the elements that currently exist within the paragraph. Yes, the HTML would be the same following the replacement, but the elements themselves would have been replaced, meaning that event handlers and other bound data would be lost.
The right way
Whether we like it or not, the correct way to handle text in a DOM structure is to get intimate with the text nodes themselves. Text nodes aren’t that different from element nodes (the ones you’re used to dealing with) — the key differences are:
- Text nodes don’t have descendants (or childNodes).
- Almost everything you would want to know about a text node will be contained in its data (or nodeValue) property, which contains whatever text the node encapsulates.
- Text nodes don’t fire events, can’t have any styles applied, and are pretty useless at everything except holding a piece of text!
These text nodes will appear in the DOM structure just like element nodes. If we consider this paragraph again:
    <p>
        <a href="/admin">Go back to admin!</a><br/>
        This order's reference number is RF83297.
    </p>
Its DOM structure is as follows:
    -> P ELEMENT
        -> TEXT NODE (data: "\n ")
        -> A ELEMENT (href: "/admin")
            -> TEXT NODE (data: "Go back to admin!")
        -> BR ELEMENT
        -> TEXT NODE (data: "\n This order's reference number is RF83297.\n")
We don’t know which text node we’ll find the reference number in so our only choice is to traverse and test all text nodes, including the nested ones. This in itself is quite simple:
    // Pretending we have more than one paragraph to look through
    jQuery('p').each(function(){
        traverseChildNodes(this);
    });

    function traverseChildNodes(node) {

        var next;

        if (node.nodeType === 1) {

            // (Element node)

            if (node = node.firstChild) {
                do {
                    // Recursively call traverseChildNodes
                    // on each child node
                    next = node.nextSibling;
                    traverseChildNodes(node);
                } while(node = next);
            }

        } else if (node.nodeType === 3) {

            // (Text node)

            if (/\bRF\d{5}/.test(node.data)) {
                // Do something interesting here
                alert('FOUND A MATCH!');
            }

        }

    }
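One variation on the traversal above, offered as a minimal sketch rather than as the approach this post settles on: collect the matching text nodes into an array first and only modify them afterwards, so the walk never runs over nodes that a replacement has just inserted. The name collectMatchingTextNodes is hypothetical. The function relies only on nodeType, data, firstChild and nextSibling.

```javascript
// Hypothetical helper: returns every text node under `root` whose data
// matches `pattern`, in document order. Nothing in the tree is modified.
function collectMatchingTextNodes(root, pattern) {
  var found = [];

  (function walk(node) {
    if (node.nodeType === 3) {
      // Text node. Reset lastIndex so that a pattern with the g flag
      // does not carry state over from the previous test() call.
      pattern.lastIndex = 0;
      if (pattern.test(node.data)) {
        found.push(node);
      }
    } else if (node.nodeType === 1) {
      // Element node: visit each child in turn.
      for (var child = node.firstChild; child; child = child.nextSibling) {
        walk(child);
      }
    }
  })(root);

  return found;
}
```

In a browser this could be called as collectMatchingTextNodes(document.body, /\bRF\d{5}/), and the returned array then looped over to do something interesting with each node.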
Finding a match isn’t too tricky, but replacing the matched portion of a text node with an element (wrapping it with the anchor) can be troublesome. Fortunately we can call on the useful innerHTML trick to create an out-of-DOM collection of nodes and add them one by one.
First, let’s adjust our traverseChildNodes so that it calls wrapMatchesInNode whenever it finds a node that matches:
    // ....
    } else if (node.nodeType === 3) {
        // (Text node)
        if (/\bRF\d{5}/.test(node.data)) {
            wrapMatchesInNode(node);
        }
    }
    // ....
wrapMatchesInNode:
    function wrapMatchesInNode(textNode) {

        var temp = document.createElement('div');

        temp.innerHTML = textNode.data.replace(/\bRF\d{5}/g, '<a href="/order/$&">$&</a>');

        // temp.innerHTML is now:
        // "\n This order's reference number is <a href="/order/RF83297">RF83297</a>.\n"
        // |_______________________________________|__________________________________|___|
        //                    |                                     |                   |
        //                TEXT NODE                           ELEMENT NODE          TEXT NODE

        // Extract produced nodes and insert them
        // before original textNode:
        while (temp.firstChild) {
            console.log(temp.firstChild.nodeType);
            // insertBefore takes the new node first, then the reference node
            textNode.parentNode.insertBefore(temp.firstChild, textNode);
        }
        // Logged: 3,1,3

        // Remove original text-node:
        textNode.parentNode.removeChild(textNode);

    }
That’ll work quite nicely. The result:
    <p>
        <a href="/admin">Go back to admin!</a><br/>
        This order's reference number is <a href="/order/RF83297">RF83297</a>.
    </p>
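One caveat with the innerHTML trick: a text node's data is plain text, so any "<" or "&" it happens to contain would be re-parsed as markup once assigned to innerHTML. A possible alternative, sketched here under assumptions, is to split the text into plain and matched segments and then build real nodes from them. Only the splitting step is shown. The name splitIntoSegments is hypothetical, and the pattern is assumed to carry the g flag.

```javascript
// Hypothetical helper: splits `text` into an ordered list of segments,
// each flagged as either a match for `pattern` or the plain text between
// matches. Assumes `pattern` is a RegExp with the global (g) flag.
function splitIntoSegments(text, pattern) {
  if (!pattern.global) {
    throw new Error('splitIntoSegments expects a pattern with the g flag');
  }

  var segments = [];
  var lastEnd = 0;
  var match;

  pattern.lastIndex = 0;
  while ((match = pattern.exec(text)) !== null) {
    if (match[0] === '') {
      // Skip zero-length matches so the loop always advances.
      pattern.lastIndex++;
      continue;
    }
    if (match.index > lastEnd) {
      segments.push({ isMatch: false, text: text.slice(lastEnd, match.index) });
    }
    segments.push({ isMatch: true, text: match[0] });
    lastEnd = match.index + match[0].length;
  }

  if (lastEnd < text.length) {
    segments.push({ isMatch: false, text: text.slice(lastEnd) });
  }
  return segments;
}
```

In a browser, each plain segment could become document.createTextNode(segment.text) and each match an anchor built with document.createElement('a'), all inserted before the original text node with insertBefore before that node is removed. Because nothing passes through an HTML parser, stray angle brackets in the text stay as text.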
But now we face another problem…
A given chunk of text in a DOM snippet is not guaranteed to be encapsulated within a single text node. The HTML parsed by the browser will produce single DOM nodes from uninterrupted strings of text, such as what’s within this DIV:

    <div>
        Nothing but text. No elements. No comments. Nada...
    </div>
… But, any extra text appended to that element will not merge into the already-existing text node, unless, of course, you’re using innerHTML/innerText (/textContent) to set your content.
So, it’s possible — not likely, but possible that we will encounter our reference number in two or more text nodes. For example:
    -> TEXT NODE (data: RF)
    -> TEXT NODE (data: 832)
    -> TEXT NODE (data: 97)
This occurrence will not be noticed by our current solution, since it treats each text-node as its own “haystack”.
Before we discuss solutions to this, we must first cover other potential issues. Consider this:
    <span>RF</span>83297
A similar problem occurs here — the text is separated, both parts in separate text-nodes — one nested in an element. It may seem as though I’m stretching the reference-number example but please remember that it’s just an example: what I’m trying to cover is the complexities involved in getting what is seemingly a simple operation to work correctly.
If we are to match a string of text, even when part of it is wrapped in arbitrary elements (like above), what elements should we allow? Surely not block-level elements? Well, of course, but this means that we need to test the computed style of every traversed element to see if it’s “inline” or otherwise.
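The computed-style check could look roughly like the sketch below. The name isInlineElement is hypothetical, the style lookup is passed in as an optional argument so the function can be exercised outside a browser, and treating only a display value of "inline" as acceptable is an assumption (inline-block, for instance, is excluded here).

```javascript
// Hypothetical helper: reports whether an element is rendered inline.
// `getStyle` is optional; in a browser it falls back to
// window.getComputedStyle.
function isInlineElement(element, getStyle) {
  getStyle = getStyle || function (el) {
    return window.getComputedStyle(el, null);
  };
  // Assumption: only display "inline" counts as inline for our purposes.
  return getStyle(element).display === 'inline';
}
```

Running this for every traversed element means one computed-style lookup per element, which is part of why a catch-all solution is likely to be slow.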
Let’s consider an abstraction that solves all the covered issues. It lets you search a specified root node for occurrences of a string and lets you replace them with whatever HTML you want, allowing you to wrap each one in an anchor element, for example. But, what happens in cases like this:
    <p>
        ...
        This order's reference <em>number is RF</em>83297.
    </p>
Wrapping RF</em>83297 in <a>...</a> would produce invalid HTML…
So it seems that, while possible, a solution that protects against all of the potential problems is likely to be slow and, in the end, simply not worth it.
The only viable solution, in my eyes, is to forget about matching strings that are interrupted with HTML elements (like <em>RF</em>83297). Only test text-node data and, to handle neighbouring text nodes, either merge them together or, to be truly “unobtrusive”, collect up the data in neighbouring text nodes and test the collected text at the end.
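The "collect and test" idea can be sketched as follows: walk a parent's direct children, gather each run of adjacent text nodes, and test the combined text without touching the DOM. This is a sketch under assumptions, not a finished solution. The name findMatchingTextRuns is hypothetical, and the function relies only on nodeType, data, firstChild and nextSibling.

```javascript
// Hypothetical helper: returns the runs of adjacent text nodes (direct
// children of `parent`) whose combined text matches `pattern`. Nothing is
// merged or modified; each run is an array of the original nodes.
function findMatchingTextRuns(parent, pattern) {
  var runs = [];
  var current = [];

  function flush() {
    if (current.length) {
      var combined = '';
      for (var i = 0; i < current.length; i++) {
        combined += current[i].data;
      }
      pattern.lastIndex = 0; // keep patterns with the g flag stateless
      if (pattern.test(combined)) {
        runs.push(current);
      }
    }
    current = [];
  }

  for (var node = parent.firstChild; node; node = node.nextSibling) {
    if (node.nodeType === 3) {
      current.push(node); // extend the current run of text nodes
    } else {
      flush();            // any other node type ends the run
    }
  }
  flush();

  return runs;
}
```

With the three neighbouring text nodes from earlier (RF, 832, 97) and the pattern /\bRF\d{5}/, this returns a single run holding all three nodes. With <span>RF</span>83297 it returns nothing, which is in line with the decision to ignore strings interrupted by elements.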
That’s really all I have to say on the matter. I hope I didn’t coerce you down the rabbit hole with the false promise of an all-encompassing solution… I simply don’t have one!
Interesting read, James!
I created a search-in-page search bar bookmarklet for iOS (yeah, I know…) devices (and potentially Android) the other day, and ran into many of the problems you describe. I did come up with a similar solution, though not as well-reasoned as yours. If you’d be interested, it’s available here: http://github.com/krawaller/prettySearch.js as an early release. Please feel free to bash it – I could definitely use your help 🙂
I saw that manually creating text nodes may lead to neighbouring text nodes, but when does this happen in the wild? Also: currently, I normalize the elements only when I revert the search – how bad is this?
A nice warning to show that libraries can’t cover all the bases.
Thanks for the article, I found it very interesting as I’ve tried to solve your last case, alas unsuccessfully. Agreed it’s not possible to solve your last case ‘unobtrusively’, without a heck-load of DOM scripting. A good working example of the type of logic involved is spell-checking in WYSIWYG editors. TinyMCE spellchecker does not bother (probably for the same reasons you mentioned), but CKEditor manages to solve this, it’s fascinating to see the final markup:
initial html is: <em>number is RF</em>83297
final html is: <em>number is </em><em><a href="/order/RF83297">RF</a></em><a href="/order/RF83297">83297</a>
I’m glad someone discussed this. I tried solving the same problem in a cross browser way as a jQuery plugin.
You can look at my code here: http://github.com/garyharan/jquery-replace-utilities/
Great article. Harsh problems.. I still like to add that pre-planning may reduce some of these issues – but of course there is also “the client wish” factor; whether the client is your boss, your customer or yourself….
@Jacob, Neighbouring text nodes aren’t common, but they are something I feel would need to be included in a catch-all solution. The fact that you’re reverting the elements to their original state at all is definitely a good thing! Most of what I covered were edge cases, and I think that you should only consider each one on its own. IMO it may not be necessary to accommodate all eventualities.
@Rich, I hadn’t considered that this would be a common problem, but you reminded me that there are quite a few WYSIWYG editors that would have to build in some kind of mechanism for these edge-cases. CKEditor’s solution is interesting, although I’d probably use just one anchor and nest the em within it instead of using two separate anchors.
@Gary, I was working on something similar when I encountered all of these problems and decided to write the post; I never ended up publishing the plugin though. Liking your usage of the awesome document fragment btw!
@Can, yup, well, I think each problem has to be considered on its own — it simply may not be worth covering all bases…
James, great article! I actually created a jQuery plugin last year called jQuery replaceText that does pretty much what you’ve covered here, along with an explanation of the general process and caveats, so please check it out when you get a chance!
http://benalman.com/projects/jquery-replacetext-plugin/
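The line:
    textNode.parentNode.insertBefore(textNode, temp.firstChild);
should rather be:
    textNode.parentNode.insertBefore(temp.firstChild, textNode);
Because according to https://developer.mozilla.org/En/DOM/Node.insertBefore the first parameter has to be the node that should be inserted.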
Thanks!