`myst_parser.parsers.parse_html`

A simple but complete HTML to Abstract Syntax Tree (AST) parser.

The AST can also reproduce the HTML text.

Example:

>>> from myst_parser.parsers.parse_html import tokenize_html
>>> text = '<div class="note"><p>text</p></div>'
>>> ast = tokenize_html(text)
>>> list(ast.walk(include_self=True))
[Root(''), Tag('div', {'class': 'note'}), Tag('p'), Data('text')]
>>> str(ast)
'<div class="note"><p>text</p></div>'
>>> str(ast[0][0])
'<p>text</p>'

Note: optional tags are not accounted for, so start or end tags that the HTML specification allows to be omitted are not inferred (see https://html.spec.whatwg.org/multipage/syntax.html#optional-tags).
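To illustrate what the note about optional tags means in practice, here is a minimal sketch that uses only the standard-library html.parser.HTMLParser (the base class listed for HtmlToAst in the API section). The EventLog class is a hypothetical helper written for this example and is not part of myst_parser.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: records the tag events the stdlib parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


log = EventLog()
# The HTML specification allows the </p> end tags to be omitted here.
log.feed("<p>one<p>two")
log.close()

# Only two start events are reported and no end events, so a tree builder
# that relies on these events alone never sees the first <p> being closed.
print(log.events)  # [('start', 'p'), ('start', 'p')]
```

A parser built directly on these events would therefore treat the second paragraph as nested inside the first, unless it adds its own handling for optional tags.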

1. Module Contents

1.1. Classes

`Attribute`	Holds a tag's attributes.
`Element`	An element of the XML/HTML document.
`Root`	The root of the AST.
`Tag`	Represents XML/HTML tags of the form <name key="value" …> … </name>.
`XTag`	Represents XHTML-style tags with no children, like <img src="t.gif" />.
`VoidTag`	Represents tags with no children and only a start tag, like <img src="t.gif">.
`TerminalElement`
`Data`	Represents data inside XML/HTML documents, like raw text.
`Declaration`	Represents declarations, like <!DOCTYPE html>.
`Comment`	Represents HTML comments.
`Pi`	Represents processing instructions, like <?xml-stylesheet ?>.
`Char`	Represents character codes, like &#0;.
`Entity`	Represents entities, like &amp;.
`Tree`	The engine class that generates the AST.
`HtmlToAst`	The tokenizer class.

1.2. Functions

tokenize_html

1.3. API

class myst_parser.parsers.parse_html.Attribute

Bases: dict

Holds a tag's attributes.

Initialization

Initialize self. See help(type(self)) for accurate signature.

property classes: list[str]: Return the 'class' attribute as a list.
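As a rough illustration of a dict subclass with a classes helper, the sketch below defines a hypothetical AttributeSketch stand-in. It assumes the 'class' value is split on whitespace and that a missing 'class' key gives an empty list; the real Attribute class may differ in those details.

```python
class AttributeSketch(dict):
    """Hypothetical stand-in for Attribute: a dict with a `classes` helper."""

    @property
    def classes(self) -> list[str]:
        # Assumption: split the 'class' value on whitespace;
        # a missing 'class' key yields an empty list.
        return self.get("class", "").split()


attrs = AttributeSketch({"class": "note  warning", "id": "n1"})
print(attrs.classes)              # ['note', 'warning']
print(AttributeSketch().classes)  # []
```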

class myst_parser.parsers.parse_html.Element(name: str = '', attr: dict | None = None)

Bases: collections.abc.MutableSequence

An element of the XML/HTML document.

All XML/HTML entities inherit from this class.

Initialization

Initialise the element.

property parent: myst_parser.parsers.parse_html.Element | None: Return the parent.

property children: list[myst_parser.parsers.parse_html.Element]: Return a copy of the children.

reset_children(children: list[myst_parser.parsers.parse_html.Element], deepcopy: bool = False)

insert(index: int, item: myst_parser.parsers.parse_html.Element)

deepcopy() → myst_parser.parsers.parse_html.Element: Recursively copy the element; the copy has no parent.

abstractmethod render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str

Return an HTML string representation of the element.

Parameters: tag_overrides – A dictionary of render functions keyed by tag name, used to override the normal render format for those tags.
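The tag_overrides idea can be sketched with a small stand-in class. TagSketch and render_b_as_strong are hypothetical names used only for this example. The sketch assumes that an override replaces the whole rendering of a matching tag and that its second argument is the overrides mapping itself; the signature above only says that the callable receives an Element and a dict.

```python
from __future__ import annotations

from collections.abc import Callable


class TagSketch:
    """Hypothetical stand-in for a tag node: a name plus child nodes or strings."""

    def __init__(self, name: str, children: list | None = None):
        self.name = name
        self.children = children or []

    def render(
        self,
        tag_overrides: dict[str, Callable[[TagSketch, dict], str]] | None = None,
    ) -> str:
        # Assumption: a matching override takes over rendering of this tag
        # and is given the element plus the overrides mapping.
        if tag_overrides and self.name in tag_overrides:
            return tag_overrides[self.name](self, tag_overrides)
        inner = "".join(
            child.render(tag_overrides) if isinstance(child, TagSketch) else child
            for child in self.children
        )
        return f"<{self.name}>{inner}</{self.name}>"


def render_b_as_strong(element: TagSketch, overrides: dict) -> str:
    """Example override: render <b> elements as <strong>."""
    inner = "".join(
        child.render(overrides) if isinstance(child, TagSketch) else child
        for child in element.children
    )
    return f"<strong>{inner}</strong>"


tree = TagSketch("p", ["some ", TagSketch("b", ["bold"]), " text"])
print(tree.render())                           # <p>some <b>bold</b> text</p>
print(tree.render({"b": render_b_as_strong}))  # <p>some <strong>bold</strong> text</p>
```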

walk(include_self: bool = False) → collections.abc.Iterator[myst_parser.parsers.parse_html.Element]: Walk through the XML/HTML AST.
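The module-level example lists Root, then the div, then the p, then the text, which is consistent with a pre-order, depth-first traversal. The sketch below shows that traversal on a hypothetical Node class; it is an assumption about the ordering, not the library's code.

```python
from __future__ import annotations

from collections.abc import Iterator


class Node:
    """Hypothetical stand-in for an element: a label and a list of children."""

    def __init__(self, label: str, children: list[Node] | None = None):
        self.label = label
        self.children = children or []

    def walk(self, include_self: bool = False) -> Iterator[Node]:
        # Pre-order, depth-first: each node is yielded before its descendants.
        if include_self:
            yield self
        for child in self.children:
            yield child
            yield from child.walk()


root = Node("root", [Node("div", [Node("p", [Node("text")])]), Node("hr")])
print([n.label for n in root.walk(include_self=True)])
# ['root', 'div', 'p', 'text', 'hr']
print([n.label for n in root.walk()])
# ['div', 'p', 'text', 'hr']
```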

strip(inplace: bool = False, recurse: bool = False) → myst_parser.parsers.parse_html.Element: Return a copy with all Data tokens that contain only whitespace or newlines removed.

find(identifier: str | type[myst_parser.parsers.parse_html.Element], attrs: dict | None = None, classes: collections.abc.Iterable[str] | None = None, include_self: bool = False, recurse: bool = True) → collections.abc.Iterator[myst_parser.parsers.parse_html.Element]: Find all elements that match the identifier (a name or an element type) and the specified attributes.

class myst_parser.parsers.parse_html.Root(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

The root of the AST.

Initialization

Initialise the element.

render(**kwargs) → str: Return an HTML string representation of the structure.

class myst_parser.parsers.parse_html.Tag(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

Represents XML/HTML tags of the form <name key="value" …> … </name>.

Initialization

Initialise the element.

render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str

class myst_parser.parsers.parse_html.XTag(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

Represents XHTML-style tags with no children, like <img src="t.gif" />.

Initialization

Initialise the element.

render(tag_overrides: dict[str, collections.abc.Callable[[myst_parser.parsers.parse_html.Element, dict], str]] | None = None, **kwargs) → str

class myst_parser.parsers.parse_html.VoidTag(name: str = '', attr: dict | None = None)

Bases: myst_parser.parsers.parse_html.Element

Represents tags with no children and only a start tag, like <img src="t.gif">.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.TerminalElement(data: str)

Bases: myst_parser.parsers.parse_html.Element

deepcopy() → myst_parser.parsers.parse_html.TerminalElement: Copy the element and remove the parent from the copy.

class myst_parser.parsers.parse_html.Data(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represents data inside XML/HTML documents, like raw text.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Declaration(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represents declarations, like <!DOCTYPE html>.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Comment(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represents HTML comments.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Pi(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represents processing instructions, like <?xml-stylesheet ?>.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Char(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represents character codes, like &#0;.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Entity(data: str)

Bases: myst_parser.parsers.parse_html.TerminalElement

Represents entities, like &amp;.

Initialization

Initialise the element.

render(**kwargs) → str

class myst_parser.parsers.parse_html.Tree(name: str = '')

The engine class that generates the AST.

Initialization

Initialise the tree.

clear(): Reset the outermost element and the stack for a new parse.

last() → myst_parser.parsers.parse_html.Element: Return the last pointer on the stack, which points to the current tag scope.

nest_tag(name: str, attrs: dict): Nest the given tag at the bottom of the tree, using the last pointer on the stack.

nest_xtag(name: str, attrs: dict): Nest an XTag onto the tree.

nest_vtag(name: str, attrs: dict): Nest a VoidTag onto the tree.

nest_terminal(klass: type[myst_parser.parsers.parse_html.TerminalElement], data: str): Nest the data onto the tree.

enclose(name: str): When a closing tag is found, pop its scope from the stack so that the pointer returns to the enclosing tag's scope.
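The nest and enclose operations describe a scope stack: an opening tag is nested under the current scope and becomes the new scope, and a closing tag pops back to the enclosing one. Below is a minimal sketch of that idea using only the standard library. StackTreeBuilder and its (name, children) tuples are hypothetical and much simpler than the Tree and Element classes documented here; ignoring unmatched closing tags is an assumption of the sketch.

```python
from html.parser import HTMLParser


class StackTreeBuilder(HTMLParser):
    """Hypothetical sketch: build nested (name, children) tuples with a scope stack."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.root = ("", [])
        self.stack = [self.root]  # the last item is the current scope

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)  # nest under the current scope
        self.stack.append(node)         # the new tag becomes the current scope

    def handle_endtag(self, tag):
        # Pop back to the scope enclosing the nearest matching open tag.
        # Assumption: a closing tag with no matching open tag is ignored.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth][0] == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        self.stack[-1][1].append(data)  # terminal: nested, but never a scope


builder = StackTreeBuilder()
builder.feed("<div><p>text</p><p>more</p></div>")
builder.close()
print(builder.root)
# ('', [('div', [('p', ['text']), ('p', ['more'])])])
```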

class myst_parser.parsers.parse_html.HtmlToAst(name: str = '', convert_charrefs: bool = False)

Bases: html.parser.HTMLParser

The tokenizer class.

Initialization

Initialize and reset this instance.

If convert_charrefs is true, all character references are automatically converted to the corresponding Unicode characters. This is the default for html.parser.HTMLParser, but the signature above shows that HtmlToAst defaults to False.

If scripting is false (the default), the content of the noscript element is parsed normally; if it’s true, it’s returned as is without being parsed.
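The effect of convert_charrefs can be observed on the standard-library base class alone. In the sketch below, RefLog and parse are hypothetical helpers for this example; only html.parser.HTMLParser and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class RefLog(HTMLParser):
    """Hypothetical helper: logs text chunks and character/entity reference events."""

    def __init__(self, convert_charrefs: bool):
        super().__init__(convert_charrefs=convert_charrefs)
        self.text_chunks = []
        self.refs = []

    def handle_data(self, data):
        self.text_chunks.append(data)

    def handle_charref(self, name):
        self.refs.append(("charref", name))

    def handle_entityref(self, name):
        self.refs.append(("entityref", name))


def parse(text: str, convert_charrefs: bool) -> RefLog:
    log = RefLog(convert_charrefs)
    log.feed(text)
    log.close()  # flush any text still buffered by the parser
    return log


raw = parse("a &amp; &#62; b", convert_charrefs=False)
print(raw.refs)  # [('entityref', 'amp'), ('charref', '62')]

converted = parse("a &amp; &#62; b", convert_charrefs=True)
print(converted.refs)                  # []
print("".join(converted.text_chunks))  # a & > b
```

With convert_charrefs set to False the references arrive as separate events, which is what allows a tokenizer to keep them as distinct nodes and reproduce the original text.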

void_elements = None

feed(source: str) → myst_parser.parsers.parse_html.Root: Parse the source string.

handle_starttag(name: str, attr): When an opening tag is found, nest it onto the tree.

handle_startendtag(name: str, attr): When an XHTML-style tag is found, nest it onto the tree.

handle_endtag(name: str): When a closing tag is found, move the pointer back to the right scope.

handle_data(data: str): Nest data onto the tree.

handle_decl(decl: str)

unknown_decl(decl: str)

handle_charref(data: str)

handle_entityref(data: str)

handle_pi(data: str)

handle_comment(data: str)

myst_parser.parsers.parse_html.tokenize_html(text: str, name: str = '', convert_charrefs: bool = False) → myst_parser.parsers.parse_html.Root
