File: //lib/python3/dist-packages/ocrmypdf/__pycache__/_pipeline.cpython-38.pyc
U

��Z^Av�@sddlZddlZddlZddlmZmZddlmZddlmZddl	Z	ddl
Z
ddlmZddl
mZddlmZdd	lmZdd
lmZddlmZmZmZmZmZddlmZmZdd
lmZddl m!Z!ddl"m"Z"ddl#m$Z$ddl%m&Z&m'Z'm(Z(dZ)dd�Z*dTdd�Z+dd�Z,dUdd�Z-dd�Z.dd �Z/d!d"�Z0d#d$�Z1d%d&�Z2d'd(�Z3d)d*�Z4d+d,�Z5dVd.d/�Z6d0d1�Z7d2d3�Z8d4d5�Z9d6d7�Z:d8d9�Z;d:d;�Z<d<d=�Z=d>d?�Z>d@dA�Z?dBdC�Z@dDdE�ZAdFdG�ZBdHdI�ZCdJdK�ZDdLdM�ZEdNdO�ZFdPdQ�ZGdRdS�ZHdS)W�N)�datetime�timezone)�Path)�copyfileobj)�encode_pdf_date)�Image�)�	leptonica)�PROGRAM_NAME)�__version__)�DpiError�EncryptedPdfError�InputFileError�PriorOcrFoundError�UnsupportedImageFormatError)�ghostscript�	tesseract)�safe_symlink)�
HocrTransform)�optimize)�generate_pdfa_ps)�
Colorspace�Encoding�PdfInfoi�c
Cs�|�d�zt�|�}WnBtk
rZ}z$|�t|��||j��t�|�W5d}~XYnX|��|�d�d|jkr�|jddkr�|j	s�|�d|j
�|�d|jd�|�d�t��n&|j	s�|�d|j
�|�d�t��|jd	k�r|�d
�t��d|jk�rB|jdk�r&|�d
�n|jdk�rB|�d�t��W5QRXz`|�d�t
j}|j	�rxt
�|j	|j	f�}t|d��}t
j||d|d�W5QRX|�d�Wn8t
jk
�r�}z|�|�t�|�W5d}~XYnXdS)Nz6Input file is not a PDF, checking if it is an image...zInput file is an image�dpi)�`rzImage size: (%d, %d)zImage resolution: (%d, %d)z�Input file is an image, but the resolution (DPI) is not credible.  Estimate the resolution at which the image was scanned and specify it using --image-dpi.z�Input file is an image, but has no resolution (DPI) in its metadata.  Estimate the resolution at which image was scanned and specify it using --image-dpi.)ZRGBAZLAzEThe input image has an alpha channel. Remove the alpha channel first.Z
iccprofileZRGBz-Input image has no ICC profile, assuming sRGBZCMYKz/Input CMYK image has no ICC profile, not usablez+Image seems valid. Try converting to PDF...�wbF)�
layout_fun�
with_pdfrw�outputstreamz,Successfully converted to PDF, processing...)�infor�open�EnvironmentError�error�str�replace�
input_filer�	image_dpi�sizer�mode�img2pdfZdefault_layout_fun�get_fixed_dpi_layout_fun�convertZImageOpenError)r&�output_file�options�log�im�erZoutf�r2�4/usr/lib/python3/dist-packages/ocrmypdf/_pipeline.py�triage_image_file2sf


���


��
r4�c	Cs>t|d��}|�|�}W5QRXt�d|�}|r:|�d�SdS)z�Try to find version signature at start of file.

    Not robust enough to deal with appended files.

    Returns empty string if not found, indicating file is probably not PDF.
    �rbs
%PDF-(\d\.\d)r�)r!�read�re�search�group)r&Z
search_window�fZ	signature�mr2r2r3�_pdf_guess_versionos
r>c
Cs�z,t|�r*|jr|�d�t||�|WSWnLtk
rx}z.|�d|���t|��||�}t|�|�W5d}~XYnXt	||||�|S)NzTArgument --image-dpi is being ignored because the input file is a PDF, not an image.zTemporary file was at: )
r>r'�warningrr"�debugr$r%rr4)Zoriginal_filenamer&r-r.r/r1�msgr2r2r3�triages�

rBFcCsLzt|||d�WStjk
r,t��Yntjk
rFt��YnXdS)N)�detailed_page_analysis�progbar)r�pikepdfZ
PasswordErrorr
ZPdfErrorr)r&rCrDr2r2r3�get_pdfinfo�s�
rFcCs�|j}|j}|j}|jr(|�d�t��|jrJ|j�d�rJ|�d�t��|j	r�|j
rh|�d�t��n|�d�|js�|�
d�dS)Nz~This PDF contains dynamic XFA forms created by Adobe LiveCycle Designer and can only be read by Adobe Acrobat or Adobe Reader.�pdfaa(This input file uses a PDF feature that is not supported by Ghostscript, so you cannot use --output-type=pdfa for this file. (Specifically, it uses the PDF-1.6 /UserUnit feature to support very large or small page sizes, and Ghostscript cannot output these files.)  Use --output-type=pdf instead.zVThis PDF has a user fillable form. --redo-ocr is not currently possible on such files.z_This PDF has a fillable form. Chances are it is a pure digital document that does not need OCR.z�Use the option --force-ocr to produce an image of the form and all filled form fields. The output PDF will be 'flattened' and will no longer be fillable.)r/�pdfinfor.Zneeds_renderingr#rZhas_userunit�output_type�
startswithZhas_acroform�redo_ocrr?�	force_ocrr )�contextr/rHr.r2r2r3�validate_pdfinfo_options�s4�����rNcCsTt|jp
t|jpd|jrtnd�}t|jp,t|jp4d|jr>tnd�}t|�t|�fS)z+Get the DPI when nonsquare DPI is tolerabler)�max�xres�VECTOR_PAGE_DPI�
oversample�
has_vector�yres�float)�pageinfor.rPrTr2r2r3�get_page_dpi�s��rWcCsP|jpd}|jpd}|jpd}tt||p,t||p6t|jr@tnd|jpJd��S)zBGet the DPI when we require xres == yres, scaled to physical unitsrr)rPrT�userunitrUrOrQrSrR)rVr.rPrTrXr2r2r3�get_page_square_dpi�s




��rYcCs.tt|jpt|jpt|jrtnd|jp(d��S)z=Get the DPI when we require xres == yres, in Postscript unitsr)rUrOrPrQrTrSrR)rVr.r2r2r3�get_canvas_square_dpi�s��rZcCsv|j}|j}|j}d}|jrH|j|jkrH|�d|j�d|j���d}n�|jr�|jsj|jsj|j	sjt
d��nR|jr�|�d�d}n<|j	r�|jr�|�
d�n
|�d�d}n|jr�|�d	�d}n\|j�s|j�s|jr�|jr�|�d
|j�d��n*|j�r|�
dt�d
��n|�d�d}|�rr|j�rr|j�rr|j|j}||jdk�rrd}|�
d|dd�d|jd�d��|S)NTzskipped z as requested by --pages Fz@page already has text! - aborting (use --force-ocr to force OCR)z@page already has text! - rasterizing text and running OCR anywayzYsome text on this page cannot be mapped to characters: consider using --force-ocr insteadzredoing OCRz$skipping all processing on this pagez$page has no images - rasterizing at z3 DPI because --force-ocr --oversample was specifiedz>page has no images - all vector content will be rasterized at za DPI, losing some resolution and likely increasing file size. Use --oversample to adjust the DPI.z�page has no images - skipping all processing on this page to avoid losing detail. Use --force-ocr if you wish to perform OCR on pages that have vector content.�@Bzpage too big, skipping OCR (z.1fz MPixels > z MPixels --skip-big))rVr.r/Zpages�pagenor@Zhas_textrLZ	skip_textrKrr Zhas_corrupt_text�warn�imagesZlossless_reconstructionrRrQZskip_bigZwidth_pixelsZ
height_pixels)�page_contextrVr.r/Zocr_requiredZpixel_countr2r2r3�is_ocr_required�s\�
�

�
���r`c
CsR|�d�}t|j|j�}t|j|j�}tj||||d|j||f|jjdd�|S)Nzrasterize_preview.jpgZjpeggrayr)rPrT�
raster_devicer/�page_dpir\)	�get_pathrZrVr.rYr�
rasterize_pdfr/r\)r&r_r-�
canvas_dpirbr2r2r3�rasterize_preview;s

�
rfcCs�ddddd�}dddd	d�}|jj}d
}|j|jjkrR|dkrLd||}qdd
}n|dkr`d}nd}d
}|dkr�d|�|d��d�}|d|�|jd���7}|�d|jd�d|��S)z=
    Describe the page rotation we are going to perform.
    u⇧u⇨u⇩u⇦)r�Z�i� u⬏u↻u⬑r7rzwill rotate zrotation appears correctzconfidence too low to rotatez	no changezwith existing rotation �?z, zpage is facing z
, confidence z.2fz - )rV�rotation�
confidencer.�rotate_pages_threshold�get�angle)r_�orient_conf�
correction�	directionZturnsZexisting_rotation�actionZfacingr2r2r3�describe_rotationLs rtcCs^tj||jj|jj|j|jjd�}|jd}|j�t	|||��|j
|jjkrZ|dkrZ|SdS)a�
    Work out the orientation correction for each page.

    We ask Ghostscript to draw a preview page, which will rasterize with the
    current /Rotate applied, and then ask Tesseract which way the page is
    oriented. If the value of /Rotate is correct (e.g., a user already
    manually fixed rotation), then Tesseract will say the page is pointing
    up and the correction is zero. Otherwise, the orientation found by
    Tesseract represents the clockwise rotation, or the counterclockwise
    correction to rotation.

    When we draw the real page for OCR, we rotate it by the CCW correction,
    which points it (hopefully) upright. _graft.py takes care of the orienting
    the image and text layers.

    )�engine_mode�timeoutr/�
tesseract_envihr)rZget_orientationr.�
tesseract_oem�tesseract_timeoutr/rwror rtrlrm)Zpreviewr_rprqr2r2r3�get_orientation_correctionis�
��rzr7cs
ddddg�d�|dkr |jj}|�d|�d��}|j}��fdd	�}|jD]N}|jd
kr\qL|jdkrL|jtj	kr||d��qL|jtj
kr�|d��qL|d��qL|jr�|d����}	|j�
d|	���t||j�}
t||j�}tj|||
|
|	|j||f|jd||d
�
|S)NZpngmonoZpnggrayZpng256Zpng16mr�	rasterizez.pngcst���|��S�N)rO�index)Zcs�ZcolorspacesZ
device_idxr2r3�at_least�szrasterize.<locals>.at_least�imagerzRasterize with )rPrTrar/rbr\rkZ
filter_vector)r.�remove_vectorsrcrVr^Ztype_�bpcZcolorrr}ZgrayrSr/r@rZrYrrdr\)r&r_rqZ
output_tagr�r-rVrr�Zdevicererbr2r~r3r{�sF





�r{cCsDtdd�|jjD��r0|�d�}t�||�|S|j�d�|SdS)Ncss|]}|jdkVqdS)rN)r�)�.0r�r2r2r3�	<genexpr>�sz/preprocess_remove_background.<locals>.<genexpr>zpp_rm_bg.pngz'background removal skipped on mono page)�anyrVr^rcr	Zremove_backgroundr/r )r&r_r-r2r2r3�preprocess_remove_background�s
r�cCs*|�d�}t|j|j�}t�|||�|S)Nz
pp_deskew.png)rcrYrVr.r	Zdeskew)r&r_r-rr2r2r3�preprocess_deskew�s
r�cCs@ddlm}|�d�}t|j|j�}|�||||j|jj�|S)Nr)�unpaperzpp_clean.png)	�execr�rcrYrVr.Zcleanr/Zunpaper_args)r&r_r�r-rr2r2r3�preprocess_clean�s
�r�c	Csx|�d�}|j}t�|���R}ddlm}ddlm}|�d|j�}|�|�}|j	d\}	}
|j
�d|	|
f�|j�s&d}|j
r�d	}|jj|dd
�D]�}dd�|D�}
t|	�d
t|
�d
}}|
d||j|
d||
d||j|
d|g}dd�|D�}|j
�d|�|j||d�q�|j�rJtj�|�}|��}|��}~t|	�t|
�f}|j||d�W5QRX|S)z�Create the image we send for OCR. May not be the same as the display
    image depending on preprocessing. This image will never be shown to the
    user.zocr.pngr)�
ImageColor)�	ImageDrawz#ffffffrzresolution %r %rNT)ZvisibleZcorruptcSsg|]}t|��qSr2)rU)r��vr2r2r3�
<listcomp>sz$create_ocr_image.<locals>.<listcomp>gR@��rcSsg|]}tt|���qSr2)�int�round)r��cr2r2r3r�
szblanking %r)Zfill)r)rcr.rr!�PILr�r�Zgetcolorr)r r/r@rLrKrVZ
get_textareasrUZheightZ	rectangleZ	thresholdr	ZPixZfrompilZ#masked_threshold_on_background_normZtopilr��save)r�r_r-r.r0r�r�ZwhiteZdrawrPrT�maskZtextareaZbboxZxscaleZyscaleZ	pixcoordsZpixrr2r2r3�create_ocr_image�sF

�


�r�c
CsX|�d�}|�d�}|j}tj|||g|j|j|j|j|j|j	|j
|j|jd�||fS)Nz
ocr_hocr.hocrzocr_hocr.txt)r&Zoutput_files�languageru�
tessconfigrv�pagesegmode�
user_words�
user_patternsrwr/)
rcr.rZ
generate_hocrr�rx�tesseract_configry�tesseract_pagesegmoder�r�rwr/)r&r_Zhocr_outZ
hocr_text_outr.r2r2r3�ocr_tesseract_hocrs"

�
r�cCs|jotdd�|jD��S)Ncss|]}|jtjkVqdSr|)�encrZjpeg)r�r0r2r2r3r�1sz4should_visible_page_image_use_jpg.<locals>.<genexpr>)r^�all)rVr2r2r3�!should_visible_page_image_use_jpg/sr�c	Csl|�d�}t�|��N}t|j|j�}|j�d||f�}t|d�t|d�f}|j	|d|d�W5QRX|S)Nzvisible.jpgrrrZJPEG)�formatr)
rcrr!rYrVr.r rnr�r�)r�r_r-r0Zfallback_dpirr2r2r3�create_visible_page_jpg4s
r�c
Cs�|�d�}t|j|j�}t�||f�}t|d��F}t|d��0}|j�d�tj	|d||d�|j�d�W5QRXW5QRX|S)Nzvisible.pdfr6rr,F)rrrzconvert done)
rcrYrVr.r*r+r!r/r@r,)r�r_r-rrZimfile�pdfr2r2r3�create_pdf_page_from_imageEs
� r�cCs:|�d�}t|j|j�}t||�}|j|ddddd�|S)Nzocr_hocr.pdfFT)Z
imageFileNameZshowBoundingboxesZ
invisibleTextZinterwordSpaces)rcrYrVr.rZto_pdf)Zhocrr_r-r�
hocrtransformr2r2r3�render_hocr_pageZs

�r�cCsZ|�d�}|�d�}|j}tj|d|||j|jd|j|j|j|j	|j
|j|jd�||fS)Nzocr_tess.pdfzocr_tess.txtT)�input_imageZskip_pdf�
output_pdf�output_textr�ruZ	text_onlyr�rvr�r�r�rwr/)
rcr.rZgenerate_pdfr�rxr�ryr�r�r�rwr/)r�r_r�r�r.r2r2r3�ocr_tesseract_textonly_pdfhs(

�r�cs��fdd���fdd�dD�}d}|dk	rx|jr:|j|d<|jrJ|j|d<|jrZ|j|d	<|jrj|j|d
<|jdkrxd}t�d
t�d|�d
t����|d<dt	j
��|d<dtjkr�tjd|d<dtjkr�tjd|d<t
t�tj��|d<|S)Nc	s4z�j|}t|�WSttfk
r.YdSXdS)Nr7)�docinfor$�KeyError�	TypeError)�key�s)�base_pdfr2r3�from_document_info�s


z'get_docinfo.<locals>.from_document_infocsi|]}|�|��qSr2r2)r��k)r�r2r3�
<dictcomp>�s�zget_docinfo.<locals>.<dictcomp>)�/Title�/Author�	/Keywords�/Subjectz
/CreationDateZOCRr�r�r�r�ZsandwichzOCR-PDFriz
 / Tesseract z/Creatorzpikepdf z	/ProducerZOCRMYPDF_CREATORZOCRMYPDF_PRODUCERz/ModDate)�titleZauthor�keywordsZsubjectZpdf_rendererr
�VERSIONr�versionrEr�os�environrrZnowrZutc)r�r.ZpdfmarkZrenderer_tagr2)r�r�r3�get_docinfos2
�




�

r�cCs|�d�}t|�|S)Nzpdfa.ps)rcr)rMr-r2r2r3�generate_postscript_stub�s
r�c	Cs�|j}|j}|�d�}|�d�}d}t�|��`}|jrp|j��D].\}	}
dt|
�kr@t|
��dd�|j|	<d}q@|r�|�	|�n
t
||�W5QRXtj|j
||g||j|j|jdd�|S)	Nzfix_docinfo.pdfzpdfa.pdfF��T���)Zpdf_versionZ	pdf_pagesr-Zcompressionr/Z	pdfa_part)r.rHrcrEr!r��items�bytesr%r�rrZ
generate_pdfaZmin_versionZpdfa_image_compressionr/rI)Z	input_pdfZ
input_ps_stubrMr.Z
input_pdfinfoZfix_docinfo_filer-ZmodifiedZpdf_filer�r�r2r2r3�convert_to_pdfa�s.

�	r�cCs$t�|�j}||jjdkr dSdS)Nr[TF)r��stat�st_sizer.Z
fast_web_view)�working_filerMZfilesizer2r2r3�should_linearize�sr�c
s���d�}�j���fdd�}t��j���}t�|���}t|��}|���V}|j|ddd�d|krt|�dd�|d<|��}t	|�
��t	|�
��}	||	�W5QRX|j|d	d	tjj
�jd
kr�t|��ndd�W5QRXW5QRX|S)Nzmetafix.pdfcsN|sdS�j�d�r0�j�d��j�d|�n�j�d��j�d|�dS)NrGz�Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.z1The following metadata fields were not copied: %rz^Some input metadata could not be copied.You may wish to examine the output PDF's XMP metadata.)rIrJr/r?r@r#r )�missing�rMr.r2r3�report_on_metadata�s"����z*metadata_fixup.<locals>.report_on_metadataF)Zdelete_missingZ
raise_failurezxmp:CreateDatezxmp:ModifyDater7Tr�Zcompress_streamsZ
preserve_pdfaZobject_stream_modeZ	linearize)rcr.rEr!�originr�Z
open_metadataZload_from_docinforn�set�keysr��ObjectStreamMode�generaterr�)
r�rMr-r�Zoriginalr�r��metaZ
meta_originalr�r2r�r3�metadata_fixup�s,


��r�cCs6|�d�}tddtjjt||�d�}t||||�|S)Nzoptimize.pdfTr�)rc�dictrEr�r�r�r)r&rMr-Z
save_settingsr2r2r3�optimize_pdfs
�r�cCs�|�d�}t|ddd���}t|�D]�\}}|dkr<|�d�|r�t|ddd��6}|��}|�d�rv|�|dd��n
|�|�W5QRXq"|�d	|d
�d��q"W5QRX|S)Nzsidecar.txt�wzutf-8)�encodingr��rr�z[OCR skipped on page r�])rcr!�	enumerate�writer8�endswith)Z	txt_filesrMr-�streamZpage_numZtxt_fileZin_Ztxtr2r2r3�merge_sidecarss


"r�c
Csl|j�d||�t|d��H}|dkr>t|tjj�tj��n t|d��}t||�W5QRXW5QRXdS)Nz%s -> %sr6�-r)r/r@r!r�sys�stdout�buffer�flush)r&r-rMZinput_streamZ
output_streamr2r2r3�
copy_final.sr�)r5)FF)rr7N)Ir�r9r�rr�pathlibrZshutilrr*rEZpikepdf.models.metadatarr�rr7r	Z_versionr
rr��
exceptionsrr
rrrr�rrZhelpersrr�rrrGrrHrrrrQr4r>rBrFrNrWrYrZr`rfrtrzr{r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r�r2r2r2r3�<module>sl=

)J&�
5
8'#5