Converting a PDF file

The poppler-utils set of command-line utilities for working with PDF files is included in most Linux distributions.

Let's check the status of this package by running apt show poppler-utils or

dpkg -s poppler-utils
Package: poppler-utils
Status: install ok installed
Priority: optional
Section: utils
Installed-Size: 599
Maintainer: Ubuntu Developers
Architecture: amd64
Multi-Arch: foreign
Source: poppler
Version: 0.62.0-2ubuntu2.10
Replaces: pdftohtml, xpdf-reader, xpdf-utils (<< 3.02-2~)
Provides: pdftohtml, xpdf-utils
Depends: libpoppler73 (= 0.62.0-2ubuntu2.10), libc6 (>= 2.14), libcairo2 (>= 1.12.0), libfreetype6 (>= 2.2.1), liblcms2-2 (>= 2.2+git20110628), libstdc++6 (>= 5.2)
Breaks: xpdf-common, xpdf-utils (<< 1:0)
Conflicts: pdftohtml
Description: PDF utilities (based on Poppler)
 Poppler is a PDF rendering library based on Xpdf PDF viewer.
 .
 This package contains command line utilities (based on Poppler) for getting
 information of PDF documents, convert them to other formats, or manipulate
 them:
  * pdfdetach -- lists or extracts embedded files (attachments)
  * pdffonts -- font analyzer
  * pdfimages -- image extractor
  * pdfinfo -- document information
  * pdfseparate -- page extraction tool
  * pdfsig -- verifies digital signatures
  * pdftocairo -- PDF to PNG/JPEG/PDF/PS/EPS/SVG converter using Cairo
  * pdftohtml -- PDF to HTML converter
  * pdftoppm -- PDF to PPM/PNG/JPEG image converter
  * pdftops -- PDF to PostScript (PS) converter
  * pdftotext -- text extraction
  * pdfunite -- document merging tool
Homepage: http://poppler.freedesktop.org/
Original-Maintainer: Debian freedesktop.org maintainers

The package turns out to be installed, and it contains 12 utilities for working with PDF files.
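The same check can be scripted. Below is a minimal sketch, assuming a POSIX shell; the function name missing_tools is made up for illustration and is not part of poppler-utils. It only looks for the commands on PATH and does not query the package database.

```shell
# Print every command name from the arguments that is NOT found on PATH.
# missing_tools is a hypothetical helper, not part of poppler-utils.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Check the two utilities used later in this article.
missing=$(missing_tools pdfseparate pdftohtml)
if [ -n "$missing" ]; then
    echo "Not found:"
    echo "$missing"
    echo "On Debian/Ubuntu try: sudo apt install poppler-utils"
fi
```

If both utilities are present, the snippet prints nothing.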

For example, I need to extract one page from 74HC_HCT595.pdf, the datasheet for the 74HC595 chip: the page that contains the functional diagram of the shift register. For this task we need the page extraction tool (pdfseparate) from the poppler-utils package. First, let's see how this program works.

pdfseparate -h
pdfseparate version 0.62.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfseparate [options] <PDF-sourcefile> <PDF-pattern-destfile>
  -f <int>       : first page to extract
  -l <int>       : last page to extract
  -v             : print copyright and version info
  -h             : print usage information

Now extract the page we need (page 2). The %d in the output file pattern is replaced by the page number:

pdfseparate -f 2 -l 2 74HC_HCT595.pdf page%d.pdf
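If single pages have to be pulled out often, the command above can be wrapped in a small function. This is only a sketch: extract_page and the output naming scheme are assumptions made for illustration, while the pdfseparate options are the ones shown in the help output.

```shell
# extract_page is a hypothetical wrapper around pdfseparate.
# Usage: extract_page <pdf-file> <page-number>
# The page is written to <name>-page<N>.pdf in the current directory.
extract_page() {
    src=$1
    page=$2
    name=$(basename "$src" .pdf)   # strip directory and the .pdf suffix
    # First and last page are the same, so exactly one page is extracted.
    # pdfseparate replaces %d in the pattern with the page number.
    pdfseparate -f "$page" -l "$page" "$src" "${name}-page%d.pdf"
}

# Example call (commented out so the snippet can be sourced safely):
# extract_page 74HC_HCT595.pdf 2
```

With the example call, the extracted page should end up in 74HC_HCT595-page2.pdf.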

Or suppose I need to convert ATtiny48-88.pdf, the datasheet for the ATtiny88 microcontroller, to HTML so that it can then be machine-translated in the browser. For this task we need the pdftohtml converter from the poppler-utils package. First, let's see how this program works.

pdftohtml -h
pdftohtml version 0.62.0
Copyright 2005-2017 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2011 Glyph & Cog, LLC

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
  -f <int>              : first page to convert
  -l <int>              : last page to convert
  -q                    : don't print any messages or errors
  -h                    : print usage information
  -?                    : print usage information
  -help                 : print usage information
  --help                : print usage information
  -p                    : exchange .pdf links by .html
  -c                    : generate complex document
  -s                    : generate single document that includes all pages
  -i                    : ignore images
  -noframes             : generate no frames
  -stdout               : use standard output
  -zoom <fp>            : zoom the pdf document (default 1.5)
  -xml                  : output for XML post-processing
  -hidden               : output hidden text
  -nomerge              : do not merge paragraphs
  -enc <string>         : output text encoding name
  -fmt <string>         : image file format for Splash output (png or jpg)
  -v                    : print copyright and version info
  -opw <string>         : owner password (for encrypted files)
  -upw <string>         : user password (for encrypted files)
  -nodrm                : override document DRM settings
  -wbt <fp>             : word break threshold (default 10 percent)
  -fontfullname         : outputs font full name

Now convert. We want a single HTML file without frames (-noframes), with each image saved to a separate PNG file (-fmt png). The -p option replaces .pdf links with .html ones, -c generates a complex document, and -zoom 2 sets the zoom factor to 2 (the default is 1.5).

pdftohtml -p -c -noframes -fmt png -zoom 2 ATtiny48-88.pdf pict.html
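To reuse these settings for other datasheets, the command above can be wrapped in a function. A minimal sketch, assuming a POSIX shell: pdf_to_single_html and the rule for naming the output file are made up for illustration, and the pdftohtml options are exactly the ones used above.

```shell
# pdf_to_single_html is a hypothetical wrapper around pdftohtml.
# Usage: pdf_to_single_html <pdf-file> [zoom]
pdf_to_single_html() {
    src=$1
    zoom=${2:-2}                         # zoom factor, 2 if not given
    out="$(basename "$src" .pdf).html"   # ATtiny48-88.pdf -> ATtiny48-88.html
    pdftohtml -p -c -noframes -fmt png -zoom "$zoom" "$src" "$out"
}

# Example: convert every PDF in the current directory
# (commented out so the snippet can be sourced safely).
# for f in *.pdf; do pdf_to_single_html "$f"; done
```

The output file is written to the current directory, whatever directory the source PDF is in.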