File: //usr/lib/python3/dist-packages/lxml/html/__pycache__/clean.cpython-38.pyc
U
�>�afl � @ s� d Z ddlmZ ddlZddlZzddlmZ ddlmZ W n$ e k
r` ddl
mZmZ Y nX ddlmZ ddl
mZ dd l
mZmZ dd
l
mZmZ ze W n ek
r� eZY nX ze W n ek
r� eZY nX ze W n ek
�r eefZY nX ddd
ddddgZe�dejejB �jZe�dej�jZ ze�dej!�j"Z#W n" e$k
�rt e�d�j"Z#Y nX e�dej�j%Z&e�dej�j%Z'e�dej�j%Z(dd� Z)e�d�jZ*e�dejejB �Z+e�,d�Z-ej,ddeid�Z.G d d
� d
e/�Z0e0� Z1e1j2Z2e�d!ej�e�d"ej�gZ3d#d$d%d&d'd(gZ4e�d)ej�e�d*ej�e�d+�gZ5d,gZ6e3e4e5e6fd-d�Z7d.d/� Z8d0d� Z9e7j e9_ d$d#d%gZ:d1gZ;d2e:e;ed3�fd4d�Z<d5d� Z=d6d7� Z>e�d8ej�Z?d9d:� Z@dS );zcA cleanup tool for HTML.
Removes unwanted tags and content. See the `Cleaner` class for
details.
� )�absolute_importN)�urlsplit)�unquote_plus)r r )�etree)�defs)�
fromstring�XHTML_NAMESPACE)�
xhtml_to_html�_transform_result�
clean_html�clean�Cleaner�autolink�
autolink_html�
word_break�word_break_htmlzexpression\s*\(.*?\)z
@\s*importz</?[a-zA-Z]+|\son[a-zA-Z]+\s*=z^:(javascript|jscript|livescript|vbscript|data|about|mocha):z (xml|svg)c C s8 d}t | �D ]}d}t|�r dS q|r,dS tt| ��S )NFT)�_find_image_dataurls�_is_unsafe_image_type�bool�_is_possibly_malicious_scheme)�sZis_image_urlZ
image_type� r �1/usr/lib/python3/dist-packages/lxml/html/clean.py�_is_javascript_schemeX s r z[\s\x00-\x08\x0B\x0C\x0E-\x19]+z\[if[\s\n\r]+.*?][\s\n\r]*>zdescendant-or-self::*[@style]z�descendant-or-self::a [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']�x)Z
namespacesc @ s� e Zd ZdZdZdZdZdZdZdZ dZ
dZdZdZ
dZdZdZdZdZdZdZdZejZdZdZddhZdd � Zed
ddd
gd
d
d
dd�Zdd� Zdd� Zdd� Z dd� Z!dd� Z"d"dd�Z#dd� Z$e%�&de%j'�j(Z)dd� Z*d d!� Z+dS )#r
a
Instances clean the document of each of the possible offending
elements. The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.
``scripts``:
Removes any ``<script>`` tags.
``javascript``:
Removes any JavaScript, like an ``onclick`` attribute. Also removes stylesheets
as they could contain JavaScript.
``comments``:
Removes any comments.
``style``:
Removes any style tags.
``inline_style``:
Removes any style attributes. Defaults to the value of the ``style`` option.
``links``:
Removes any ``<link>`` tags.
``meta``:
Removes any ``<meta>`` tags.
``page_structure``:
Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
``processing_instructions``:
Removes any processing instructions.
``embedded``:
Removes any embedded objects (Flash, iframes).
``frames``:
Removes any frame-related tags.
``forms``:
Removes any form tags.
``annoying_tags``:
Removes tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.
``remove_tags``:
A list of tags to remove. Only the tags will be removed,
their content will get pulled up into the parent tag.
``kill_tags``:
A list of tags to kill. Killing also removes the tag's content,
i.e. the whole subtree, not just the tag itself.
``allow_tags``:
A list of tags to include (default: include all).
``remove_unknown_tags``:
Remove any tags that aren't standard parts of HTML.
``safe_attrs_only``:
If true, only include 'safe' attributes (specifically the list
from the feedparser HTML sanitisation web site).
``safe_attrs``:
A set of attribute names to override the default list of attributes
considered 'safe' (when safe_attrs_only=True).
``add_nofollow``:
If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.
``host_whitelist``:
A list or set of hosts that you can use for embedded content
(for content like ``<object>``, ``<link rel="stylesheet">``, etc).
You can also implement/override the method
``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
implement more complex rules for what can be embedded.
Anything that passes this test will be shown, regardless of
the value of (for instance) ``embedded``.
Note that this parameter might not work as intended if you do not
make the links absolute before doing the cleaning.
Note that you may also need to set ``whitelist_tags``.
``whitelist_tags``:
A set of tags that can be included with ``host_whitelist``.
The default is ``iframe`` and ``embed``; you may wish to
include other tags like ``script``, or you may want to
implement ``allow_embedded_url`` for more control. Set to None to
include all tags.
This modifies the document *in place*.
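The difference between ``remove_tags`` (the tag goes away, its content is pulled up into the parent) and ``kill_tags`` (the whole subtree goes away) can be illustrated with a minimal sketch. It uses only the standard library's ``xml.etree.ElementTree`` and is not lxml's own implementation; the helper names ``remove``, ``kill``, ``_append_text``, and ``demo`` are hypothetical and exist only for this illustration. The sketch assumes the text that follows a killed element is kept.

```python
import xml.etree.ElementTree as ET

SOURCE = "<div>a<b>bold <i>it</i> x</b> tail</div>"


def _append_text(parent, index, text):
    """Attach `text` just before child position `index` inside `parent`."""
    if not text:
        return
    if index == 0:
        # No earlier sibling: the text belongs to the parent's leading text.
        parent.text = (parent.text or "") + text
    else:
        # Otherwise it follows the previous sibling, i.e. that sibling's tail.
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def kill(parent, child):
    """Drop `child` and its whole subtree; keep the text that followed it."""
    index = list(parent).index(child)
    tail = child.tail
    parent.remove(child)
    _append_text(parent, index, tail)


def remove(parent, child):
    """Drop only the tag of `child`; pull its text and children up into `parent`."""
    index = list(parent).index(child)
    grandchildren = list(child)
    lead, tail = child.text, child.tail
    parent.remove(child)
    _append_text(parent, index, lead)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(index + offset, grandchild)
    _append_text(parent, index + len(grandchildren), tail)


def demo(operation):
    """Apply `operation` to the <b> element of SOURCE and serialize the result."""
    root = ET.fromstring(SOURCE)
    operation(root, root[0])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(demo(remove))  # <div>abold <i>it</i> x tail</div>
    print(demo(kill))    # <div>a tail</div>
```

With removal, the ``<i>`` child and the surrounding text survive inside ``<div>``; with killing, nothing that was inside ``<b>`` survives.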
TFNr �iframe�embedc K sV |� � D ].\}}t| |�s*td||f ��t| ||� q| jd krRd|krR| j| _d S )NzUnknown parameter: %s=%r�inline_style)�items�hasattr� TypeError�setattrr �style)�self�kw�name�valuer r r �__init__� s
�zCleaner.__init__�src�href�code�object)�script�link�appletr r �layer�ac C s� t |d�r|�� }t|� |�d�D ]
}d|_q$| js@| �|� t| jpJd�}t| j pXd�}t| j
pfd�}| jrz|�d� | j
r�t| j�}|�tj�D ]&}|j}|�� D ]}||kr�||= q�q�| j�r
| j
r�| jtjk�s|�tj�D ](}|j}|�� D ]}|�d�r�||= q�q�|j| jdd� | j�s�t|�D ]P}|�d �} td
| �}
td
|
�}
| �|
��rh|jd = n|
| k�r0|�d |
� �q0| j�s
t|�d ��D ]p}|�dd
�� � �!� dk�r�|�"� �q�|j#�p�d
} td
| �}
td
|
�}
| �|
��r�d
|_#n|
| k�r�|
|_#�q�| j�s| j$�r&|�tj%� | j$�r:|�tj&� | j�rL|�d � | j�r`t�'|d � | j(�rt|�d� nP| j�s�| j�r�t|�d��D ]0}d|�dd
�� � k�r�| �)|��s�|�"� �q�| j*�r�|�d� | j+�r�|�,d� | j-�rZt|�d��D ]F}d}|�.� }|dk �r0|jdk�r0|�.� }�q|dk�r�|�"� �q�|�,d� |�,d� | j/�rn|�,tj0� | j1�r�|�d� |�,d� | j2�r�|�,d� g }
g }|�� D ]T}|j|k�r�| �)|��r̐q�|�3|� n&|j|k�r�| �)|��r�q�|
�3|� �q�|
�r2|
d |k�r2|
�4d�}d|_|j�5� n8|�rj|d |k�rj|�4d�}|jdk�rbd|_|�5� |�6� |D ]}|�"� �qv|
D ]}|�7� �q�| j8�r�|�r�t9d��ttj:�}|�r,g }|�� D ]}|j|k�r�|�3|� �q�|�r,|d |k�r|�4d�}d|_|j�5� |D ]}|�7� �q| j;�r�t<|�D ]X}| �=|��s<|�d�}|�r�d|k�rxd d!| k�rx�q<d"| }nd}|�d|� �q<dS )#z&
Cleans the document.
�getrootZimageZimgr r, ZonF)Zresolve_base_hrefr"