File: //usr/lib/python3/dist-packages/ocrmypdf/pdfinfo/__pycache__/ghosttext.cpython-38.pyc
Compiled CPython 3.8 bytecode (binary). Strings recoverable from the dump:

Source path: /usr/lib/python3/dist-packages/ocrmypdf/pdfinfo/ghosttext.py

Imported names: logging, re, xml.etree.ElementTree (bound as ET), exec, ghostscript
Module-level names: gslog, regex_remove_char_tags, page_get_textblocks, extract_text_xml

regex_remove_char_tags pattern (re.compile with re.VERBOSE):
    <char\b
    (?:   [^>]   # anything single character but >
        | \">\"  # special case: trap ">"
    )*
    />           # terminate with '/>'

page_get_textblocks
    Docstring: Get text boxes out of Ghostscript txtwrite xml
    Nested functions: blocks, joined_blocks
    Names referenced: infile, pageno, xmltext, height, root, hasattr, findall,
        attrib, split, int, float, tuple, abs, span, bbox_str, font_size, pts,
        bbox_topdown, bb, bbox_bottomup, prev, block, gap
    String constants: findall, .//span, bbox, size

extract_text_xml
    Names referenced: infile, pdf, pageno, log, existing_text, root, page_xml,
        e, page_count_difference, extract_text, sub, fromstringlist, findall,
        ParseError, error, len, pages, extend
    String constants: <document>, </document>, page
    Log messages:
        An error occurred while attempting to retrieve existing text in the
        input file. Will attempt to continue assuming that there is no
        existing text in the file. The error was:

        The number of pages in the input file is inconsistent.

        Expected <count>, txtwrite says <count>
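
The verbose pattern embedded in the dump (the one commented "terminate with '/>'") can be tried out on its own. The following is a minimal sketch, not the module's actual source: it assumes the pattern is compiled as a bytes pattern with re.VERBOSE and that each match is replaced by a single space, which is what the names and constants in the dump suggest. The helper name strip_char_tags and the sample input are hypothetical.

```python
import re

# Pattern text copied from the dump. Compiling it as a bytes pattern with
# re.VERBOSE is an assumption based on the constants visible in the dump.
regex_remove_char_tags = re.compile(
    rb"""
    <char\b
    (?:   [^>]   # anything single character but >
        | \">\"  # special case: trap ">"
    )*
    />           # terminate with '/>'
    """,
    re.VERBOSE,
)


def strip_char_tags(xml_bytes: bytes) -> bytes:
    """Replace every <char .../> element with one space (hypothetical helper)."""
    return regex_remove_char_tags.sub(b" ", xml_bytes)


# Made-up sample input, not real Ghostscript txtwrite output.
sample = b'<span bbox="0 0 10 10"><char c="a"/><char c=">"/></span>'
cleaned = strip_char_tags(sample)
print(cleaned)
```

The second alternative in the group matters for the last element in the sample: an attribute value of ">" would otherwise end the match early, because [^>] alone cannot step over it.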