Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
346 views
in Technique[技术] by (71.8m points)

javascript - Extract text from HTML while preserving block-level element newlines

Background

Most questions about extracting text from HTML (i.e., stripping the tags) use:

jQuery( htmlString ).text();

While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).

Problem

Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.

A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.

I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.

Question

What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?

Example

Consider:

  1. Select and copy this entire question.
  2. Open the textarea example page.
  3. Paste the content into the textarea.

The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.

To further clarify, given any HTML content, such as:

   <h1>Header</h1>
   <p>Paragraph</p>
   <ul>
     <li>First</li>
     <li>Second</li>
   </ul>
   <dl>
     <dt>Term</dt>
       <dd>Definition</dd>
   </dl>
   <div>Div with <span>span</span>.<br />After the <a href="...">break</a>.</div>

How would you produce:

  Header
  Paragraph

    First
    Second

  Term
    Definition

  Div with span.
  After the break.

Note: Neither indentation nor non-normalized whitespace are relevant.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Consider:

/**
 * Returns the style for a node.
 *
 * @param n The node to check.
 * @param p The property to retrieve (usually 'display').
 * @link http://www.quirksmode.org/dom/getstyles.html
 */
this.getStyle = function( n, p ) {
  return n.currentStyle ?
    n.currentStyle[p] :
    document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}

/**
 * Converts HTML to text, preserving semantic newlines for block-level
 * elements.
 *
 * @param node - The HTML node to perform text extraction.
 */
this.toText = function( node ) {
  var result = '';

  if( node.nodeType == document.TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    result = node.nodeValue.replace( /s+/g, ' ' );
  }
  else {
    for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
      result += _this.toText( node.childNodes[i] );
    }

    var d = _this.getStyle( node, 'display' );

    if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
        node.tagName == 'BR' || node.tagName == 'HR' ) {
      result += '
';
    }
  }

  return result;
}

http://jsfiddle.net/3mzrV/2/

That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...