javascript - Extract text from HTML while preserving block-level element newlines

Question

Welcome To Ask or Share your Answers For Others

javascript - Extract text from HTML while preserving block-level element newlines

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

javascript - Extract text from HTML while preserving block-level element newlines

Background

Most questions about extracting text from HTML (i.e., stripping the tags) use:

jQuery( htmlString ).text();

While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).

Problem

Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.

A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.

I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.

Question

What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?

Example

Consider:

Select and copy this entire question.
Open the textarea example page.
Paste the content into the textarea.

The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.

To further clarify, given any HTML content, such as:

   <h1>Header</h1>
   <p>Paragraph</p>
   <ul>
     <li>First</li>
     <li>Second</li>
   </ul>
   <dl>
     <dt>Term</dt>
       <dd>Definition</dd>
   </dl>
   <div>Div with <span>span</span>.<br />After the <a href="...">break</a>.</div>

How would you produce:

  Header
  Paragraph

    First
    Second

  Term
    Definition

  Div with span.
  After the break.

Note: Neither indentation nor non-normalized whitespace are relevant.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:34:48+0000

Consider:

/**
 * Returns the style for a node.
 *
 * @param n The node to check.
 * @param p The property to retrieve (usually 'display').
 * @link http://www.quirksmode.org/dom/getstyles.html
 */
this.getStyle = function( n, p ) {
  return n.currentStyle ?
    n.currentStyle[p] :
    document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}

/**
 * Converts HTML to text, preserving semantic newlines for block-level
 * elements.
 *
 * @param node - The HTML node to perform text extraction.
 */
this.toText = function( node ) {
  var result = '';

  if( node.nodeType == document.TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    result = node.nodeValue.replace( /s+/g, ' ' );
  }
  else {
    for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
      result += _this.toText( node.childNodes[i] );
    }

    var d = _this.getStyle( node, 'display' );

    if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
        node.tagName == 'BR' || node.tagName == 'HR' ) {
      result += '
';
    }
  }

  return result;
}

http://jsfiddle.net/3mzrV/2/

That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.

Categories

javascript - Extract text from HTML while preserving block-level element newlines

javascript - Extract text from HTML while preserving block-level element newlines

Background

Problem

Question

Example

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags