myst_parser.parsers.parse_html

A simple but complete HTML to Abstract Syntax Tree (AST) parser.

The AST can also reproduce the HTML text.

Example:

>>> from myst_parser.parsers.parse_html import tokenize_html
>>> text = '<div class="note"><p>text</p></div>'
>>> ast = tokenize_html(text)
>>> list(ast.walk(include_self=True))
[Root(''), Tag('div', {'class': 'note'}), Tag('p'), Data('text')]
>>> str(ast)
'<div class="note"><p>text</p></div>'
>>> str(ast[0][0])
'<p>text</p>'

Note: optional tags are not accounted for (see https://html.spec.whatwg.org/multipage/syntax.html#optional-tags)
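
The note on optional tags can be illustrated with the standard library alone. HtmlToAst is documented on this page as deriving from html.parser.HTMLParser, which reports start and end tags as they appear in the text and does not synthesize omitted ones. In the sketch below, EventLogger is a hypothetical helper and not part of myst_parser:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record tag events exactly as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
# The closing </li> tags are optional in HTML and are omitted here.
logger.feed("<ul><li>a<li>b</ul>")
logger.close()

# No ("end", "li") event is reported, so a tree builder driven by these
# events cannot tell where each <li> was meant to close.
print(logger.events)
# [('start', 'ul'), ('start', 'li'), ('start', 'li'), ('end', 'ul')]
```

A parser built on these events would therefore nest the second li inside the first unless the end tags are written out.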

1.  Module Contents

1.1.  Classes

Attribute

This class holds the tag’s attributes.

Element

An Element of the xml/html document.

Root

The root of the AST tree.

Tag

Represent xml/html tags of the form: <name key="value" …> … </name>.

XTag

Represent XHTML-style tags with no children, like <img src="t.gif" />

VoidTag

Represent tags with no children and only a start tag, like <img src="t.gif">

TerminalElement

Data

Represent data inside xml/html documents, like raw text.

Declaration

Represent declarations, like <!DOCTYPE html>

Comment

Represent HTML comments

Pi

Represent processing instructions like <?xml-stylesheet ?>

Char

Represent character codes like: &#0

Entity

Represent entities like &amp

Tree

The engine class to generate the AST tree.

HtmlToAst

The tokenizer class.

1.2.  Functions

tokenize_html

1.3.  API

class myst_parser.parsers.parse_html.Attribute[source]#

Bases: dict

This class holds the tag’s attributes.

Initialization

Initialize self. See help(type(self)) for accurate signature.

property classes: list[str]#

Return the ‘class’ attribute as a list.
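
A minimal sketch of the idea behind the classes property, assuming class names are whitespace-separated as in HTML. AttrSketch is a hypothetical stand-in written for illustration; it is not the library's Attribute class and may differ from it in detail:

```python
class AttrSketch(dict):
    """Hypothetical stand-in for Attribute: a dict with a `classes` view."""

    @property
    def classes(self) -> list[str]:
        # Assumption: a missing 'class' key is treated as an empty string.
        return self.get("class", "").split()


attrs = AttrSketch({"class": "note  warning", "id": "n1"})
print(attrs.classes)         # ['note', 'warning']
print(AttrSketch().classes)  # []
```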

class myst_parser.parsers.parse_html.Element(name: str = '', attr: dict | None = None)[source]#

Bases: collections.abc.MutableSequence

An Element of the xml/html document.

All xml/html entities inherit from this class.

Initialization

Initialise the element.

property parent: myst_parser.parsers.parse_html.Element | None#

Return parent.

property children: list[myst_parser.parsers.parse_html.Element]#

Return copy of children.

reset_children(children: list[myst_parser.parsers.parse_html.Element], deepcopy: bool = False)[source]#

insert(index: int, item: myst_parser.parsers.parse_html.Element)[source]#

deepcopy() myst_parser.parsers.parse_html.Element[source]#

Recursively copy and remove parent.

abstractmethod render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) str[source]#

Return an HTML string representation of the element.

Parameters:

tag_overrides – A dictionary of render functions keyed by tag name, used to override the normal render format for those tags.
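
The override mechanism can be sketched with a simplified element type. Node is hypothetical, and passing the overrides mapping as the callback's second argument is an assumption made here to match the documented Callable[[Element, dict], str] shape; the library's own behavior may differ:

```python
from __future__ import annotations

from collections.abc import Callable


class Node:
    """Hypothetical, simplified element: a name plus children (Node or str)."""

    def __init__(self, name: str, children: list | None = None):
        self.name = name
        self.children = children or []

    def render(self, tag_overrides: dict[str, Callable[[Node, dict], str]] | None = None) -> str:
        # An override registered for this tag name replaces the default output.
        if tag_overrides and self.name in tag_overrides:
            return tag_overrides[self.name](self, tag_overrides)
        inner = "".join(
            child if isinstance(child, str) else child.render(tag_overrides)
            for child in self.children
        )
        return f"<{self.name}>{inner}</{self.name}>"


tree = Node("div", [Node("b", ["bold"]), Node("p", ["text"])])

# Render <b> as Markdown-style emphasis; every other tag keeps the default form.
overrides = {"b": lambda node, _overrides: "**" + "".join(node.children) + "**"}

default_html = tree.render()
custom_html = tree.render(overrides)
print(default_html)  # <div><b>bold</b><p>text</p></div>
print(custom_html)   # <div>**bold**<p>text</p></div>
```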

walk(include_self: bool = False) collections.abc.Iterator[myst_parser.parsers.parse_html.Element][source]#

Walk through the xml/html AST.
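
The module example at the top of this page lists nodes in document order (Root, then div, then p, then the text), which is consistent with a depth-first, pre-order traversal. A minimal sketch of such a walk, using a hypothetical Node class rather than the library's Element:

```python
from collections.abc import Iterator


class Node:
    """Hypothetical, simplified element: a name and a list of child nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def walk(self, include_self: bool = False) -> Iterator["Node"]:
        if include_self:
            yield self
        for child in self.children:
            # Visit each child before descending into its own children.
            yield child
            yield from child.walk()


tree = Node("root", [Node("div", [Node("p", [Node("text")])]), Node("hr")])
with_self = [node.name for node in tree.walk(include_self=True)]
without_self = [node.name for node in tree.walk()]
print(with_self)     # ['root', 'div', 'p', 'text', 'hr']
print(without_self)  # ['div', 'p', 'text', 'hr']
```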

strip(inplace: bool = False, recurse: bool = False) myst_parser.parsers.parse_html.Element[source]#

Return copy with all Data tokens that only contain whitespace / newlines removed.

find(identifier: str | type[myst_parser.parsers.parse_html.Element], attrs: dict | None = None, classes: collections.abc.Iterable[str] | None = None, include_self: bool = False, recurse: bool = True) collections.abc.Iterator[myst_parser.parsers.parse_html.Element][source]#

Find all elements that match name and specific attributes.
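
A rough sketch of how a search over name, attributes and classes could work. Node is a hypothetical simplified element, and the matching rules used here (exact attribute equality, requested classes as a subset of the element's classes) are assumptions for illustration, not a statement of the library's behavior:

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator


class Node:
    """Hypothetical, simplified element with a name, attributes and children."""

    def __init__(self, name, attrs=None, children=()):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = list(children)

    def walk(self) -> Iterator[Node]:
        for child in self.children:
            yield child
            yield from child.walk()

    def find(self, identifier, attrs: dict | None = None,
             classes: Iterable[str] | None = None,
             recurse: bool = True) -> Iterator[Node]:
        candidates = self.walk() if recurse else iter(self.children)
        wanted = set(classes or ())
        for node in candidates:
            # The identifier is either a tag name or an element type.
            if isinstance(identifier, type):
                matched = isinstance(node, identifier)
            else:
                matched = node.name == identifier
            if not matched:
                continue
            if not wanted.issubset(node.attrs.get("class", "").split()):
                continue
            if all(node.attrs.get(k) == v for k, v in (attrs or {}).items()):
                yield node


root = Node("root", children=[
    Node("div", {"class": "note big"}, [Node("p", {"id": "a"})]),
    Node("div", {"class": "tip"}),
    Node("p", {"id": "b"}),
])

all_divs = list(root.find("div"))
note_divs = list(root.find("div", classes=["note"]))
p_b = list(root.find("p", attrs={"id": "b"}))
top_level_p = list(root.find("p", recurse=False))
every_node = list(root.find(Node))
```

Here all_divs holds both div nodes, note_divs only the first, and top_level_p only the p that is a direct child of the root.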

class myst_parser.parsers.parse_html.Root(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

The root of the AST tree.

Initialization

Initialise the element.

render(**kwargs) str[source]#

Returns a string HTML representation of the structure.

class myst_parser.parsers.parse_html.Tag(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

Represent xml/html tags of the form: <name key="value" …> … </name>.

Initialization

Initialise the element.

render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) str[source]#

class myst_parser.parsers.parse_html.XTag(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

Represent XHTML-style tags with no children, like <img src="t.gif" />

Initialization

Initialise the element.

render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) str[source]#

class myst_parser.parsers.parse_html.VoidTag(name: str = '', attr: dict | None = None)[source]#

Bases: myst_parser.parsers.parse_html.Element

Represent tags with no children and only a start tag, like <img src="t.gif">

Initialization

Initialise the element.

render(**kwargs) str[source]#

class myst_parser.parsers.parse_html.TerminalElement(data: str)[source]#

Bases: myst_parser.parsers.parse_html.Element

deepcopy() myst_parser.parsers.parse_html.TerminalElement[source]#

Copy and remove parent.

class myst_parser.parsers.parse_html.Data(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent data inside xml/html documents, like raw text.

Initialization

Initialise the element.

render(**kwargs) str[source]#

class myst_parser.parsers.parse_html.Declaration(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent declarations, like <!DOCTYPE html>

Initialization

Initialise the element.

render(**kwargs) str[source]#

class myst_parser.parsers.parse_html.Comment(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent HTML comments

Initialization

Initialise the element.

render(**kwargs) str[source]#

class myst_parser.parsers.parse_html.Pi(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent processing instructions like <?xml-stylesheet ?>

Initialization

Initialise the element.

render(**kwargs) str[source]#

class myst_parser.parsers.parse_html.Char(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent character codes like: &#0

Initialization

Initialise the element.

render(**kwargs) str[source]#

class myst_parser.parsers.parse_html.Entity(data: str)[source]#

Bases: myst_parser.parsers.parse_html.TerminalElement

Represent entities like &amp

Initialization

Initialise the element.

render(**kwargs) str[source]#

class myst_parser.parsers.parse_html.Tree(name: str = '')[source]#

The engine class to generate the AST tree.

Initialization

Initialise Tree

clear()[source]#

Reset the outermost (root) element and the stack for a new parse.

last() myst_parser.parsers.parse_html.Element[source]#

Return the last pointer on the stack, which points to the current tag scope.

nest_tag(name: str, attrs: dict)[source]#

Nest a given tag at the bottom of the tree, using the stack’s last pointer.

nest_xtag(name: str, attrs: dict)[source]#

Nest an XTag onto the tree.

nest_vtag(name: str, attrs: dict)[source]#

Nest a VoidTag onto the tree.

nest_terminal(klass: type[myst_parser.parsers.parse_html.TerminalElement], data: str)[source]#

Nest the data onto the tree.

enclose(name: str)[source]#

When a closing tag is found, pop the current scope from the stack so that the pointer refers to the enclosing scope’s tag again.
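
The nesting and enclosing steps described for Tree amount to a stack of open scopes. The sketch below shows that general technique on top of html.parser.HTMLParser; StackBuilder and its tuple-based nodes are hypothetical, and unlike enclose(name) it ignores the tag name when closing a scope:

```python
from html.parser import HTMLParser


class StackBuilder(HTMLParser):
    """Hypothetical sketch: build nested (name, children) tuples with a scope stack."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.root = ("", [])
        self.stack = [self.root]  # the last item is the currently open scope

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)  # nest under the current scope
        self.stack.append(node)         # the new tag becomes the current scope

    def handle_endtag(self, tag):
        # Pop back to the enclosing scope; the root itself is never popped.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Terminal nodes are nested but never pushed onto the stack.
        self.stack[-1][1].append(data)


builder = StackBuilder()
builder.feed('<div class="note"><p>text</p>tail</div>')
builder.close()
print(builder.root)
# ('', [('div', [('p', ['text']), 'tail'])])
```

Because "tail" arrives after the p scope has been popped, it lands in the div's children rather than inside the paragraph.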

class myst_parser.parsers.parse_html.HtmlToAst(name: str = '', convert_charrefs: bool = False)[source]#

Bases: html.parser.HTMLParser

The tokenizer class.

Initialization

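Initialize and reset this instance.

If convert_charrefs is true, all character references are automatically converted to the corresponding Unicode characters. In html.parser.HTMLParser true is the default; HtmlToAst, per its signature, defaults convert_charrefs to False, in which case character references are passed to handle_charref and handle_entityref instead.

If scripting is false (the default in html.parser.HTMLParser), the content of the noscript element is parsed normally; if it’s true, it’s returned as is without being parsed. The HtmlToAst signature does not expose a scripting parameter.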
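
The effect of convert_charrefs can be observed with the standard-library base class directly. RefLogger and tokenize below are hypothetical helpers for illustration and are not part of myst_parser:

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Hypothetical helper: collect text and character/entity reference events."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.data = []
        self.refs = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_entityref(self, name):
        self.refs.append(("entity", name))

    def handle_charref(self, name):
        self.refs.append(("char", name))


def tokenize(text, convert_charrefs):
    parser = RefLogger(convert_charrefs)
    parser.feed(text)
    parser.close()
    return "".join(parser.data), parser.refs


# References are reported as separate events and left out of the text.
kept = tokenize("a &amp; b &#65;", convert_charrefs=False)
# References are decoded into the text and no reference events fire.
converted = tokenize("a &amp; b &#65;", convert_charrefs=True)
print(kept)       # ('a  b ', [('entity', 'amp'), ('char', '65')])
print(converted)  # ('a & b A', [])
```

Keeping references as separate events is what allows an AST to represent them as distinct nodes and reproduce the original text.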

void_elements = None#

feed(source: str) myst_parser.parsers.parse_html.Root[source]#

Parse the source string.

handle_starttag(name: str, attr)[source]#

When an opening tag is found, nest it onto the tree.

handle_startendtag(name: str, attr)[source]#

When an XHTML-style tag is found, nest it onto the tree.

handle_endtag(name: str)[source]#

When a closing tag is found, move the pointer back to the enclosing scope.

handle_data(data: str)[source]#

Nest data onto the tree.

handle_decl(decl: str)[source]#

unknown_decl(decl: str)[source]#

handle_charref(data: str)[source]#

handle_entityref(data: str)[source]#

handle_pi(data: str)[source]#

handle_comment(data: str)[source]#

myst_parser.parsers.parse_html.tokenize_html(text: str, name: str = '', convert_charrefs: bool = False) myst_parser.parsers.parse_html.Root[source]#