html5plus

This is a fork of html5lib that can also parse simple XML documents. For a pure HTML5 parser, please use html5lib instead.

Differences from html5lib

html5plus is almost exactly the same as html5lib, except that it can also parse simple XML documents:

  • As in XML, self-closing tags such as <div/> are treated as leaf nodes (this is the main reason this fork exists).

For example,

<div/>
<div>foo</div>

will be interpreted by html5plus as follows:

<div></div>
<div>foo</div>

On the other hand, html5lib and many browsers will interpret it as follows:

<div>
  <div>foo</div>
</div>
  • Processing instructions are supported (a pull request was sent to html5lib).
  • HtmlParser has an additional flag called cdataOK. It controls whether CDATA is always accepted, including in the http://www.w3.org/1999/xhtml namespace.
  • Line number information is available through Node.lineNumber.
  • Note that the line number is not available on Text nodes, and that this addition breaks compatibility with dart:html.
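
The self-closing and line-number behavior described above can be sketched roughly as follows. This is an illustration, not a tested program: it assumes that `document.body`, `nodes` and `outerHtml` behave as they do in html5lib's DOM, and the input string is made up for the example.

```dart
import 'package:html5plus/parser.dart' show parse;
import 'package:html5plus/dom.dart';

main() {
  // In the source text, <div/> is on the first line and <div>foo</div>
  // on the second. html5plus should keep them as siblings.
  var document = parse('<div/>\n<div>foo</div>');

  // Assumption: body and nodes work as in html5lib's DOM.
  for (final node in document.body.nodes) {
    // Text nodes carry no line number, so only elements are reported.
    if (node is Element) {
      print('line ${node.lineNumber}: ${node.outerHtml}');
    }
  }
}
```

If the fork behaves as described, the two `div` elements are printed separately rather than nested, each with the line on which it was found.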

Installation

Add this to your pubspec.yaml (or create it):

dependencies:
  html5plus: any

Usage

### Parsing HTML is easy!

import 'package:html5plus/parser.dart' show parse;
import 'package:html5plus/dom.dart';

main() {
  var document = parse(
      '<body>Hello world! <a href="www.html5rocks.com">HTML5 rocks!');
  print(document.outerHtml);
}

### Parsing XML

// HtmlParser must be imported explicitly; `show parse` alone would hide it.
import 'package:html5plus/parser.dart' show HtmlParser;
import 'package:html5plus/dom.dart';

main() {
  var document = new HtmlParser(lowercaseElementName: false, 
    lowercaseAttrName: false, cdataOK: true)
    .parse("""
      <?process this?>
      <foo>Hello world! <important>XML rocks!</important>
        <![CDATA[ here & there ]]>
      </foo>
      """);

  for (final node in document.nodes)
    print("$node");
}

Libraries

dom

A simple tree API for the result of parsing HTML. It is intended to be compatible with dart:html, but right now it resembles the classic JS DOM.

dom_parsing

This library contains extra APIs that aren't in the DOM, but are useful when interacting with the parse tree.

parser

This library has a parser for HTML5 documents that lets you parse HTML easily from a script or server-side application.

parser_console

This library adds dart:io support to the HTML5 parser. Call initDartIOSupport before calling the parse methods and they will accept a RandomAccessFile as input, in addition to the other input types.
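
A minimal sketch of the call order described above, assuming that `initDartIOSupport` is exported by `parser_console.dart` and takes no arguments. The file name `index.html` is a placeholder.

```dart
import 'dart:io';
import 'package:html5plus/parser.dart' show parse;
// Assumption: initDartIOSupport is exported by this library.
import 'package:html5plus/parser_console.dart' show initDartIOSupport;

main() {
  // Must run before any parse call that receives a RandomAccessFile.
  initDartIOSupport();

  // File.openSync() returns a RandomAccessFile opened for reading.
  var file = new File('index.html').openSync();
  try {
    var document = parse(file);
    print(document.outerHtml);
  } finally {
    // Release the file handle even if parsing throws.
    file.closeSync();
  }
}
```

Without the `initDartIOSupport` call, the parse methods accept only the other input types.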