File: //lib/python3/dist-packages/ocrmypdf/__pycache__/hocrtransform.cpython-38.pyc
U
��Z^�6 � @ s6 d dl Z d dlZd dlmZ d dlmZmZmZ d dlm Z d dl
mZ d dlm
Z
eddd d
dg�ZG dd
� d
e�ZG dd� d�Zedk�r2e jdd�Zejdddddd� ejddeddd� ejdddd d!� ejd"ddd#d� ejd$d%d&� ejd'd(d&� e�� Zeejej�Zejejejejej d)� dS )*� N)�
namedtuple)�atan�cos�sin)�ElementTree)�inch)�Canvas�Rect�x1�y1�x2�y2c @ s e Zd ZdS )�HocrTransformErrorN)�__name__�
__module__�__qualname__� r r �8/usr/lib/python3/dist-packages/ocrmypdf/hocrtransform.pyr + s r c @ s� e Zd ZdZe�d�Ze�dej�Ze �
dddddd ��Zd
d� Zdd
� Z
dd� Zedd� �Zedd� �Zdd� Zedd� �Zd!dd�Zedd� �Zdd � ZdS )"�
HocrTransformz�
A class for converting documents from the hOCR format.
For details of the hOCR format, see:
http://kba.cloud/hocr-spec/
zbbox((\s+\d+){4})zs
baseline \s+
([\-\+]?\d*\.?\d*) \s+ # +/- decimal float
([\-\+]?\d+) # +/- intZffu ffiu fflZfiZfl)u ffu ffiu fflu fiu flc C s� || _ t�|�| _t�d| j�� j�}d| _|r<|� d�| _d\| _
| _| j�d| j �D ]8}| �
|�}| �|�}|j|j | _
|j|j | _ q�qZ| j
d ks�| jd kr�td��d S )Nz
({.*})html� � )NNz.//%sdiv[@class='ocr_page']z$hocr file is missing page dimensions)�dpir �parse�hocr�re�matchZgetroot�tag�xmlns�group�width�height�findall�element_coordinates�
pt_from_pixelr r
r
r r )�selfZhocrFileNamer �matchesZdiv�coordsZ pt_coordsr r r �__init__C s
zHocrTransform.__init__c C s6 | j dkrdS | j �d| j �}|r.| �|�S dS dS )z=
Return the textual content of the HTML body
Nr z .//%sbody)r �findr �_get_element_text)r$ Zbodyr r r �__str__[ s
zHocrTransform.__str__c C sL d}|j dk r||j 7 }|�� D ]}|| �|�7 }q |jdk rH||j7 }|S )zL
Return the textual content of the element and its children
r N)�textZgetchildrenr) �tail)r$ �elementr+ Zchildr r r r) g s
zHocrTransform._get_element_textc sR d}d|j krN| j�|j d �}|rN|�d��� � t�� fdd�td�D ��}|S )zj
Returns a tuple containing the coordinates of the bounding box around
an element
)r r r r �titler c 3 s | ]}t � | �V qd S �N)�int)�.0�n�r&