Elastic Advanced Tutorial: Generating Offline PDF Documentation

Elastic China Community Official Blog

Published 2022-09-08 17:22:31

License: CC 4.0 BY-SA

Original link: https://cloud.tencent.com/developer/article/2073442

Author: 点火三周

Original: Elastic进阶教程：生成离线pdf文档 - 腾讯云开发者社区-腾讯云

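Preface

I previously wrote an article on how to generate an offline copy of the official documentation, and some community members then asked whether the documentation could also be exported as an offline PDF.

There are plenty of tools for converting HTML to PDF. The real difficulty is that the official documentation is organized as a book: it consists of many subpages connected through a table of contents and links, while the existing tools can only convert a single HTML page to PDF.

Take the official elasticsearch documentation as an example: it contains more than 7,000 subpages, which readers move between by URL according to the table of contents.

To export everything into one PDF file, the core problem is therefore to turn the book structure into one "big wide table": all content has to sit on a single page before a tool can convert it.

The whole task breaks down into three parts:

Generate a single-page version of the official documentation
Make sure the formatting and content of the single-page document are correct
Convert the single-page document to PDF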

Generating a single-page version of the official documentation

The Elastic documentation team builds the docs with the build_docs tool:

git clone https://github.com/elastic/docs.git

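The README.md explains how to use the tool.

Building the HTML documentation:

Within the Elastic Stack, the documentation paths and the resources they depend on differ between products and versions, so the command to run differs as well.

For how each product's documentation is built, see the build script shipped with the project: docs/doc_build_aliases.sh at master · elastic/docs · GitHub. Taking elasticsearch as an example: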

#    export GIT_HOME="/<fullPathToYourRepos>"
#    source $GIT_HOME/docs/doc_build_aliases.sh
#

# Elasticsearch
alias docbldesx='$GIT_HOME/docs/build_docs --doc $GIT_HOME/elasticsearch/docs/reference/index.asciidoc --resource=$GIT_HOME/elasticsearch/x-pack/docs/ --chunk 1'

alias docbldes=docbldesx

# Elasticsearch 6.2 and earlier

alias docbldesold='$GIT_HOME/docs/build_docs --doc $GIT_HOME/elasticsearch/docs/reference/index.x.asciidoc --resource=$GIT_HOME/elasticsearch-extra/x-pack-elasticsearch/docs/ --chunk 1'

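To build the elasticsearch docs correctly, the --resource option must point to resources stored in other directories, such as the x-pack directory here. For 6.2 and earlier, the directory layout and files are different, so the command differs slightly. The --chunk option controls how finely the HTML pages are split: per the help text, --chunk 1 also splits sections into separate files, in addition to the per-chapter files.

The help output lists the available options: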

INFO:build_docs:    Build local docs:
INFO:build_docs:
INFO:build_docs:        build_docs --doc path/to/index.asciidoc [opts]
INFO:build_docs:
INFO:build_docs:        Opts:
INFO:build_docs:          --chunk 1         Also chunk sections into separate files
INFO:build_docs:          --alternatives <source_lang>:<alternative_lang>:<dir>
INFO:build_docs:                            Examples in alternative languages.
INFO:build_docs:          --lang            Defaults to 'en'
INFO:build_docs:          --lenient         Ignore linking errors
INFO:build_docs:          --out dest/dir/   Defaults to ./html_docs.
INFO:build_docs:          --resource        Path to image dir - may be repeated
INFO:build_docs:          --respect_edit_url_overrides
INFO:build_docs:                            Respects `:edit_url:` overrides in the book.
INFO:build_docs:          --single          Generate a single HTML page, instead of
INFO:build_docs:                            a chunking into a file per chapter
INFO:build_docs:          --suppress_migration_warnings
INFO:build_docs:                            Suppress warnings about Asciidoctor migration
INFO:build_docs:                            issues. Use this when building "old" branches.
INFO:build_docs:          --toc             Include a TOC at the beginning of the page.
INFO:build_docs:          --private         Indicate that the github repo is private.
INFO:build_docs:        WARNING: Anything in the `out` dir will be deleted!
INFO:build_docs:
INFO:build_docs:    Build docs from all repos in conf.yaml:
INFO:build_docs:
INFO:build_docs:        build_docs --all [opts]
INFO:build_docs:
INFO:build_docs:        Opts:
INFO:build_docs:          --keep_hash       Build docs from the same commit hash as last time
INFO:build_docs:          --linkcheckonly   Skips the documentation builds. Checks links only.
INFO:build_docs:          --push            Commit the updated docs and push to origin
INFO:build_docs:          --announce_preview <host>
INFO:build_docs:                            Causes the build to log a line about where to find
INFO:build_docs:                            a preview of the build if anything is pushed.
INFO:build_docs:          --rebuild         Rebuild all branches of every book regardless of
INFO:build_docs:                            what has changed
INFO:build_docs:          --reference       Directory of `--mirror` clones to use as a
INFO:build_docs:                            local cache
INFO:build_docs:          --repos_cache     Directory to which working repositories are cloned.
INFO:build_docs:                            Defaults to `<script_dir>/.repos`.
INFO:build_docs:          --skiplinkcheck   Omit the step that checks for broken links
INFO:build_docs:          --sub_dir         Use a directory as a branch of some repo
INFO:build_docs:                            (eg --sub_dir elasticsearch:master:~/Code/elasticsearch)
INFO:build_docs:          --target_repo     Repository to which to commit docs
INFO:build_docs:          --target_branch   Branch to which to commit docs
INFO:build_docs:          --user            Specify which GitHub user to use, if not your own
INFO:build_docs:
INFO:build_docs:    General Opts:
INFO:build_docs:          --asciidoctor     Emit a happy message.
INFO:build_docs:          --conf <ymlfile>  Use your own configuration file, defaults to the
INFO:build_docs:                            bundled conf.yaml
INFO:build_docs:          --direct_html     Emit a happy message.
INFO:build_docs:          --in_standard_docker
INFO:build_docs:                            Specified by build_docs when running in
INFO:build_docs:                            its container
INFO:build_docs:          --open            Open the docs in a browser once built.
INFO:build_docs:          --procs           Number of processes to run in parallel, defaults
INFO:build_docs:                            to 3
INFO:build_docs:          --verbose         Output more logs

With the --single option, the whole book is packed into a single HTML file. Next, taking the elasticsearch documentation as an example, we build the 7.10 version.

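Getting the documentation source

The documentation we need to build lives in each product's own repository.

Taking elasticsearch as an example:

Repository: GitHub - elastic/elasticsearch: Free and Open, Distributed, RESTful Search Engine
Path: elasticsearch/docs/reference/

Getting a specific version of the documentation

Use the following commands to fetch the elasticsearch source: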

git clone https://github.com/elastic/elasticsearch.git
cd elasticsearch
# List all branches to find the right name (7.10 here is a remote branch)
git branch -a
git checkout -b test remotes/origin/7.10

Building the single-page document

Build it with the following command:

./build_docs --doc /apps/elasticsearch/docs/reference/index.asciidoc --resource=/apps/elasticsearch/x-pack/docs/ --single --open
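If the build needs to be repeated for several versions, the command can be assembled in a script. The following is a minimal sketch, assuming the directory layout used in this article (/apps/docs for the docs repo, /apps/elasticsearch for the source). The function name build_docs_command and its parameters are illustrative and not part of build_docs; only flags shown in the help output above are used.

```python
from pathlib import PurePosixPath


def build_docs_command(docs_repo, es_repo, single=True, open_browser=False):
    """Return the argv list for a local build_docs run of the Elasticsearch guide.

    docs_repo: checkout of elastic/docs (contains the build_docs script)
    es_repo:   checkout of elastic/elasticsearch at the wanted branch
    """
    docs_repo = PurePosixPath(docs_repo)
    es_repo = PurePosixPath(es_repo)
    cmd = [
        str(docs_repo / "build_docs"),
        "--doc", str(es_repo / "docs" / "reference" / "index.asciidoc"),
        # x-pack resources live outside docs/reference, hence --resource
        "--resource=" + str(es_repo / "x-pack" / "docs") + "/",
    ]
    if single:
        # one HTML page instead of one file per chapter
        cmd.append("--single")
    if open_browser:
        cmd.append("--open")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_docs_command("/apps/docs", "/apps/elasticsearch")))
```

The returned list can be passed to subprocess.run(cmd, check=True) to run the build. Calling the function with single=False yields the multi-page build.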

When the build finishes, the HTML is written to the html_docs directory by default. As shown below, there is only one index.html file:

/apps/docs/html_docs$ tree -L 2
.
└── raw
    ├── images
    ├── index.html
    ├── monitoring
    ├── security
    ├── setup
    └── snippets

This index.html is 13 MB and contains every page, while the images and code snippets it references are stored in the other directories:

drwxr-xr-x 15 lex lex 4.0K Aug 16 00:26 images
-rw-r--r--  1 lex lex  13M Aug 16 00:26 index.html
drwxr-xr-x  3 lex lex 4.0K Aug 16 00:26 monitoring
drwxr-xr-x  4 lex lex 4.0K Aug 16 00:26 security
drwxr-xr-x  3 lex lex 4.0K Aug 16 00:26 setup
drwxr-xr-x  2 lex lex  68K Aug 16 00:26 snippets

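Open the file directly in a browser and you will see that the documentation has been merged but has lost its styling.

So before converting to PDF, we still need to fix the formatting.

Making sure the formatting and content of the single-page document are correct

The source of the single-page HTML generated by build_docs looks like this: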

<!DOCTYPE html>
<html>
  <head>    
    <meta charset="UTF-8">
    <title>Elasticsearch Guide [7.10] | Elastic</title>
    <link rel="home" href="index.html" title="Elasticsearch Guide [7.10]"/>
    <link rel="next" href="elasticsearch-intro.html" title="What is Elasticsearch?"/>
    <meta name="DC.type" content="Learn/Docs/Elasticsearch/Reference/7.10"/>
    <meta name="DC.subject" content="Elasticsearch"/>
    <meta name="DC.identifier" content="7.10"/>
    <meta name="robots" content="noindex,nofollow"/>
  </head>
<body>
<div class="book" lang="en" id="content">
<div class="titlepage">
<div class="breadcrumbs" id="title-page-breadcrumb">
<span class="breadcrumb-link"><a href="/guide/">Elastic Docs</a></span>
</div>
<div>
<div><h1 class="title"><a id="elasticsearch-reference"></a>Elasticsearch Guide</h1></div>
</div>
<hr>
<!--EXTRA-->
</div>
<div id="content">
<div class="chapter">
<div class="titlepage"><div><div>
<h1 class="title"><a id="elasticsearch-intro"></a>What is Elasticsearch?</h1>

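As you can see, the <head> contains no CSS. The first step is to add the corresponding CSS file. To see what to add, rerun the original command without --single to generate the multi-page documentation: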

./build_docs --doc /apps/elasticsearch/docs/reference/index.asciidoc --resource=/apps/elasticsearch/x-pack/docs/ --open

Following the content of those generated files, add the CSS link:

<link rel="stylesheet" type="text/css" href="/guide/static/styles.css" />

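The CSS itself can be taken from the site opened by the build, or from https://github.com/elastic/built-docs/tree/master/html/static.

Adding the CSS alone is not enough, though. The markup also has to match the structure the stylesheet expects. The body currently starts like this: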
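Editing a 13 MB file by hand is awkward, so this step can be scripted. The following is a minimal sketch that inserts the stylesheet link just before the closing </head> tag. The function name add_stylesheet is illustrative, and plain string handling is an assumption that holds only because the generated file has exactly one <head>.

```python
def add_stylesheet(html, href="/guide/static/styles.css"):
    """Insert a stylesheet <link> just before </head>, unless it is already there."""
    link = '<link rel="stylesheet" type="text/css" href="%s" />' % href
    if link in html:
        # already patched, leave the document untouched
        return html
    head_end = html.find("</head>")
    if head_end == -1:
        raise ValueError("no </head> tag found")
    return html[:head_end] + link + "\n" + html[head_end:]


if __name__ == "__main__":
    sample = "<html>\n  <head>\n    <title>demo</title>\n  </head>\n<body></body>\n</html>"
    print(add_stylesheet(sample))
```

For the real file, read index.html with encoding="utf-8", pass the text through add_stylesheet, and write it back.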

  <body>
<div class="book" lang="en" id="content">
<div class="titlepage">
<div class="breadcrumbs" id="title-page-breadcrumb">
<span class="breadcrumb-link"><a href="/guide/">Elastic Docs</a></span>
</div>
<div>
<div><h1 class="title"><a id="elasticsearch-reference"></a>Elasticsearch Guide</h1></div>
</div>
<hr>
<!--EXTRA-->
</div>
<div id="content">
<div class="chapter">
<div class="titlepage"><div><div>

For the body to render correctly, add the following code before <div id="content">:

    <div class="main-container">
      <section id="content" >
        <div class="content-wrapper">

          <section id="guide" lang="en">
            <div class="container">
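This insertion can also be scripted. The sketch below mirrors the snippet above: it inserts only the opening tags before the literal <div id="content"> and, like the snippet, leaves the wrappers unclosed, relying on the browser's handling of unclosed elements. The function name insert_wrapper and the marker argument are illustrative assumptions.

```python
WRAPPER_OPEN = (
    '<div class="main-container">\n'
    '<section id="content">\n'
    '<div class="content-wrapper">\n'
    '<section id="guide" lang="en">\n'
    '<div class="container">\n'
)


def insert_wrapper(html, marker='<div id="content">'):
    """Insert the wrapper's opening tags right before the first `marker`."""
    if '<div class="main-container">' in html:
        # already wrapped, do not insert a second time
        return html
    start = html.find(marker)
    if start == -1:
        raise ValueError("marker not found: " + marker)
    return html[:start] + WRAPPER_OPEN + html[start:]


if __name__ == "__main__":
    sample = '<body>\n<div class="book" lang="en" id="content">\n<div id="content">\n</div>\n</div>\n</body>'
    print(insert_wrapper(sample))
```

Matching the exact string <div id="content"> skips the outer <div class="book" lang="en" id="content">, which carries the same id but is written differently.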

The single-page HTML now renders with the same styling as the official site.

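Converting the single-page document to PDF

At this point roughly 80% of the work is done. Many ready-made tools can convert a single HTML page to PDF, but the document is large (over 10 MB), which makes online converters impractical. Online tools also only load HTML from a URL, which would mean deploying a website to host the page. So we need an offline tool.

The tool recommended here is wkhtmltopdf, which can be downloaded from the wkhtmltopdf site.

It is simple to use: just pass a source and a destination: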

wkhtmltopdf http://google.com google.pdf

In the directory containing the single-page HTML, start a local web server (python3 -m http.server 8080, or python -m SimpleHTTPServer 8080 on Python 2).

Then run the conversion:
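The same server can be started from a script, which is convenient when the conversion is automated. The following is a minimal sketch using Python 3's standard http.server module. The function name serve_directory is illustrative, and binding to 127.0.0.1 is an assumption made to keep the server local.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=8080):
    """Serve `directory` over HTTP on 127.0.0.1 in a background thread.

    Returns the server object; call shutdown() and server_close() when done.
    Passing port=0 lets the OS pick a free port (see server.server_address).
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(directory))
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


if __name__ == "__main__":
    srv = serve_directory(".", port=8080)
    print("serving on http://localhost:%d" % srv.server_address[1])
    try:
        threading.Event().wait()  # block until interrupted with Ctrl+C
    except KeyboardInterrupt:
        srv.shutdown()
        srv.server_close()
```

Running the server in a background thread lets the same script go on to call the converter while the page is being served.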

wkhtmltopdf http://localhost:8080 elasticsearch-guide.pdf

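At this point you may run into a ContentNotFoundError.

The main cause is that wkhtmltopdf cannot download a resource linked from the HTML. Here the culprit is the stylesheet link:

<link rel="stylesheet" type="text/css" href="/guide/static/styles.css" />

wkhtmltopdf cannot locate the resource directory it points to.

So the link needs to be changed to: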

<link rel="stylesheet" type="text/css" href="http://localhost:8080/static/styles.css" />
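This rewrite can be sketched as a small helper. It assumes, as the URL above implies, that the static directory with styles.css sits next to index.html in the directory being served. The function name absolutize_stylesheet and its parameters are illustrative.

```python
def absolutize_stylesheet(html,
                          base_url="http://localhost:8080",
                          old_href="/guide/static/styles.css",
                          new_path="/static/styles.css"):
    """Point the stylesheet link at the local web server.

    Only the exact href="<old_href>" attribute is rewritten; other links
    such as <a href="/guide/"> are left as they are.
    """
    old_attr = 'href="%s"' % old_href
    new_attr = 'href="%s%s"' % (base_url.rstrip("/"), new_path)
    return html.replace(old_attr, new_attr)


if __name__ == "__main__":
    line = '<link rel="stylesheet" type="text/css" href="/guide/static/styles.css" />'
    print(absolutize_stylesheet(line))
```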

The final command and its output:

wkhtmltopdf http://localhost:8080 elasticsearch-guide.pdf
Loading pages (1/6)
Counting pages (2/6)                                               
Resolving links (4/6)                                                       
Loading headers and footers (5/6)                                           
Printing pages (6/6)
Done
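To run the conversion from a script, the command can be wrapped as below. This is a minimal sketch, assuming wkhtmltopdf is installed and on the PATH. It uses only the source and destination arguments shown above, and the function name convert_to_pdf is illustrative.

```python
import shutil
import subprocess


def convert_to_pdf(source, dest, binary="wkhtmltopdf"):
    """Run `wkhtmltopdf <source> <dest>` and return the destination path."""
    exe = shutil.which(binary)
    if exe is None:
        raise FileNotFoundError("%s was not found on PATH" % binary)
    # check=True raises CalledProcessError if the tool exits with a non-zero status
    subprocess.run([exe, source, dest], check=True)
    return dest


if __name__ == "__main__":
    convert_to_pdf("http://localhost:8080", "elasticsearch-guide.pdf")
```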

The conversion produces the PDF file elasticsearch-guide.pdf.