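I often find that books generated with gitbook are of high quality,
so I wanted to save them for offline reading.
But the PDFs gitbook generates don't allow copying text and are very large,
and some sites don't even offer a download option.
So a friend and I built a tool
that crawls sites generated by gitbook,
parses them, and uses weasyprint to generate the file.
Asynchronous crawling
Uses aiohttp for crawling
Crawling a site's content basically finishes in seconds
Copyable text

Keeps the original table-of-contents structure

Keeps the original links

Project: gitbook2pdf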
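The asynchronous crawl described above can be sketched roughly as follows. This is not the code from gitbook2pdf: `crawl`, `fetch_with_aiohttp`, `fake_fetch` and the `limit` parameter are hypothetical names, and the fetcher is passed in so that the ordering logic can be run without network access.

```python
import asyncio


async def fetch_with_aiohttp(session, url):
    # Assumes `session` is an aiohttp.ClientSession created by the caller.
    async with session.get(url) as resp:
        return await resp.text()


async def crawl(urls, fetch, limit=10):
    """Fetch all urls concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather() returns results in the order the awaitables were passed in,
    # so chapters stay in table-of-contents order even if they finish
    # downloading out of order.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url):
    # Stand-in for a real HTTP request, used only for this demonstration.
    await asyncio.sleep(0)
    return f"<h1>{url}</h1>"


pages = asyncio.run(crawl(["a.html", "b.html", "c.html"], fake_fetch))
```

For a real run, one would open an `aiohttp.ClientSession` and pass `lambda u: fetch_with_aiohttp(session, u)` as `fetch`.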
|  |      1 fuergaosi OP Stars appreciated | 
|  |      2 magicZ      2019-03-07 10:28:11 +08:00 Give us a link! | 
|  |      3 fuergaosi OP Forgot to include the link.  gitbook2pdf: https://github.com/fuergaosi233/gitbook2pdf | 
|  |      4 22k      2019-03-07 10:32:00 +08:00 Just yesterday I was wondering whether there was a way to download gitbook books. Marking this; if the OP can share it, please update the original post. Thanks! | 
|  |      6 changjiangzzZ      2019-03-07 11:22:48 +08:00 Starred :) | 
|      7 newmind      2019-03-07 11:27:17 +08:00 Works really well, upvoted | 
|      8 newmind      2019-03-07 11:28:13 +08:00 It would be even better with an online version | 
|  |      9 jasonslyvia      2019-03-07 11:55:25 +08:00 Nice, I've always wanted a tool like this. Hope you keep polishing it! | 
|  |      10 FakeLeung      2019-03-07 11:59:18 +08:00 Is there no usage section? From the code, it looks like you directly edit the url passed to run in main? PS: you can append the GitHub address to the post. | 
|      11 fffflyfish      2019-03-07 12:19:46 +08:00 Upvoted! Finally someone built this | 
|  |      12 mseasons      2019-03-07 14:31:23 +08:00 aiohttp.client_exceptions.ClientConnectorCertificateError: Cannot connect to host wizardforcel.gitbooks.io:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)')] | 
|  |      13 d5      2019-03-07 14:34:09 +08:00 The OP could consider making an online version, with the backend hosted on a server in another region~ | 
|  |      14 privil      2019-03-07 16:32:32 +08:00 ...It seems fairly memory-hungry; the process got killed | 
|      15 tongdongdong      2019-03-07 18:59:15 +08:00 C:\Users\TDD\Desktop>python -m weasyprint https://ts.xcatliu.com ts.pdf WARNING: Ignored `text-rendering:auto` at 4:620, unknown property. WARNING: Ignored `filter:none` at 4:2882, unknown property. WARNING: Expected a media type, got (max-width:600px) WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:83. WARNING: Expected a media type, got (max-width:600px) WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:669. WARNING: Ignored `box-shadow:none` at 9:1092, unknown property. WARNING: Ignored `text-overflow:ellipsis` at 9:1686, unknown property. WARNING: Expected a media type, got (max-width:1000px) WARNING: Invalid media type " (max-width:1000px)" the whole @media rule was ignored at 9:1805. WARNING: Ignored `box-shadow:0 6px 12px rgba(0,0,0,.175)` at 9:2336, unknown property. WARNING: Ignored `overflow-y:auto` at 9:3908, unknown property. WARNING: Ignored `text-overflow:ellipsis` at 9:4934, unknown property. WARNING: Expected a media type, got (max-width:600px) WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5254. WARNING: Expected a media type, got (min-width:600px) WARNING: Invalid media type " (min-width:600px)" the whole @media rule was ignored at 9:5583. WARNING: Expected a media type, got (max-width:600px) WARNING: Invalid media type " (max-width:600px)" the whole @media rule was ignored at 9:5650. WARNING: Ignored `overflow-y:auto` at 9:6180, unknown property. WARNING: Ignored `overflow-y:auto` at 9:6418, unknown property. WARNING: Expected a media type, got (max-width:1240px) WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:6434. WARNING: Ignored `text-size-adjust:100%` at 9:7377, unknown property. WARNING: Expected a media type, got (max-width:1240px) WARNING: Invalid media type " (max-width:1240px)" the whole @media rule was ignored at 9:11595. 
WARNING: Ignored `box-shadow:none` at 9:12111, unknown property. WARNING: Ignored `text-size-adjust:100%` at 9:12512, unknown property. WARNING: Ignored `text-rendering:optimizeLegibility` at 9:20972, unknown property. WARNING: Ignored `font-smoothing:antialiased` at 9:21006, unknown property. WARNING: Ignored `text-size-adjust:100%` at 9:21124, unknown property. WARNING: Ignored `box-shadow: none` at 235:3, unknown property. WARNING: Ignored `box-shadow: none` at 272:3, unknown property. After that, only the home page was converted!!! | 
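|  |      16 changjiangzzZ      2019-03-07 19:02:54 +08:00 @tongdongdong Please take a look at the docs first~ | 
|  |      17 changjiangzzZ      2019-03-07 19:04:38 +08:00 @mseasons Network conditions in mainland China aren't great, so the connection timed out; try adding a proxy | 
|  |      18 fuergaosi OP @privil The memory usage is a `weasyprint` problem; I'm trying chunked output. @tongdongdong Please head over to the `weasyprint` issues section. @mseasons I can't access that url and don't know how you reached it; please post the problem and the url you were crawling in the `issues` section. @FakeLeung Thanks for the reminder, I couldn't find the append button before ╮(╯_╰)╭ Also, for now the tool is used by editing the url; I'll change the usage shortly. I had always tested it this way, so I didn't pay attention to that side of things | 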
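One way the "chunked output" idea from #18 might look is to render a few chapters at a time instead of the whole book in one pass. This is only a sketch under assumptions, not the OP's implementation: `chunked`, `write_parts`, the `prefix` argument and the file naming are all hypothetical, and whether this actually lowers peak memory would need measuring.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_parts(chapters_html, size, prefix="part"):
    """Render each group of chapters to its own PDF file."""
    # Imported here so the chunking helper can be tried without weasyprint.
    from weasyprint import HTML

    paths = []
    for index, group in enumerate(chunked(chapters_html, size), start=1):
        path = f"{prefix}-{index:03d}.pdf"
        # Each call lays out only the chapters in this group.
        HTML(string="\n".join(group)).write_pdf(path)
        paths.append(path)
    return paths


groups = list(chunked(["ch1", "ch2", "ch3", "ch4", "ch5"], 2))
```

The resulting part files would still have to be merged by a separate tool if a single PDF is wanted.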
|      19 Ahs      2019-03-07 19:14:26 +08:00 via Android Starred | 
|  |      21 aWangami      2019-03-07 19:27:16 +08:00 (Python3) ➜  gitbook2pdf python gitbook.py Traceback (most recent call last): File "gitbook.py", line 5, in <module> import weasyprint File "/Users/Python3/lib/python3.7/site-packages/weasyprint/__init__.py", line 393, in <module> from .css import preprocess_stylesheet # noqa File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/__init__.py", line 26, in <module> from . import computed_values File "/Users/Python3/lib/python3.7/site-packages/weasyprint/css/computed_values.py", line 17, in <module> from .. import text File "/Users/Python3/lib/python3.7/site-packages/weasyprint/text.py", line 14, in <module> import cairocffi as cairo File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 39, in <module> cairo = dlopen(ffi, 'cairo', 'cairo-2', 'cairo-gobject-2', 'cairo.so.2') File "/Users/Python3/lib/python3.7/site-packages/cairocffi/__init__.py", line 36, in dlopen raise OSError("dlopen() failed to load a library: %s" % ' / '.join(names)) OSError: dlopen() failed to load a library: cairo / cairo-2 / cairo-gobject-2 / cairo.so.2 What's going on here? | 
|  |      22 privil      2019-03-07 19:28:17 +08:00 @fuergaosi #18 I also got an error while crawling. My VPS really has very little memory though, only 512 MB, so crawling the k8s handbook I tried originally doesn't work. https://funhacks.gitbooks.io/explore-python crawling : https://funhacks.gitbooks.io/explore-python/Conclusion/reference_material.html Traceback (most recent call last): File "gitbook.py", line 298, in <module> Gitbook2PDF("https://funhacks.gitbooks.io/explore-python/").run() File "gitbook.py", line 190, in run loop.run_until_complete(self.crawl_main_content(content_urls)) File "/usr/local/python3.7.2/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete return future.result() File "gitbook.py", line 212, in crawl_main_content await asyncio.gather(*tasks) File "gitbook.py", line 233, in gettext text = ChapterParser(metatext, level).parser() File "gitbook.py", line 95, in parser if len(context.find('footer')): TypeError: object of type 'NoneType' has no len() | 
|  |      23 privil      2019-03-07 19:30:23 +08:00 | 
|  |      24 hooych      2019-03-07 19:38:27 +08:00 | 
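|  |      26 fuergaosi OP @privil Can't reproduce. This error is the fault of an official recommendation: I originally hadn't written len, but when I ran it today an official warning told me that a direct if None check might not be allowed in the future and recommended writing it this way, and that turned into a bug. I'll go fix it now | 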
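The traceback in #22 ends in `len(context.find('footer'))`, which fails when the page has no footer because `find` returns `None`. A small sketch of the explicit `is not None` test follows, using the standard library's `xml.etree.ElementTree` as a stand-in; the assumption is that the project's parser follows the same `find` convention (lxml does). `strip_footer` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_footer(page):
    """Remove a direct <footer> child if the page has one."""
    footer = page.find("footer")
    # find() returns None when nothing matches, so len(footer) would raise
    # TypeError here; compare against None explicitly instead.
    if footer is not None:
        page.remove(footer)
    return page


with_footer = ET.fromstring("<section><p>text</p><footer>nav</footer></section>")
without_footer = ET.fromstring("<section><p>text</p></section>")

cleaned = strip_footer(with_footer)
untouched = strip_footer(without_footer)
```

The explicit comparison also avoids relying on the truth value of an element, which is what the library warning is about.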
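|  |      27 mseasons      2019-03-07 22:15:15 +08:00 @changjiangzzZ It's not a timeout problem; it seems to be an https verification problem. I added verify=False to the arguments of every get request and it worked. | 
|  |      28 mseasons      2019-03-07 22:18:23 +08:00 @fuergaosi I didn't change the url; I ran the source straight from git clone. I checked the docs afterwards, and adding the verify=False argument to every get request got it through. | 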
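For the certificate error in #12, an alternative to switching verification off is to give the client an SSL context that trusts a valid CA bundle. A minimal sketch, not taken from gitbook2pdf: `make_ssl_context` and `fetch` are hypothetical helpers, and `cafile` would be whichever bundle is valid on your machine. In aiohttp the per-request keyword is `ssl`, which accepts an `ssl.SSLContext` (or `False` to skip verification).

```python
import ssl


def make_ssl_context(cafile=None):
    """Build a verifying context; cafile=None uses the system's default CAs."""
    return ssl.create_default_context(cafile=cafile)


async def fetch(url, ctx):
    # Imported here so the context helper can be tried without aiohttp.
    import aiohttp

    async with aiohttp.ClientSession() as session:
        # Passing ssl=ctx keeps certificate verification on for this request.
        async with session.get(url, ssl=ctx) as resp:
            return await resp.text()


ctx = make_ssl_context()
```

Disabling verification works around the error, but it also accepts any certificate, so a proper CA bundle is usually the safer fix.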
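|  |      29 dyxang      2019-03-07 22:24:18 +08:00 via Android I'd love to just use it directly; why not py2exe? | 
|      30 leesymbol      2019-03-08 08:22:04 +08:00 via iPhone Bumping | 
|  |      31 cye3s      2019-03-08 11:25:50 +08:00 I tried one and the table-of-contents structure wasn't kept, for example this one https://go.tanglei.name/content | 
|  |      32 fuergaosi OP @cye3s I tested it and the structure was kept, but two pages returned 404, so two chapters are missing.  Also, if you run into problems, please post them directly in the issues section. @dyxang Because I don't have Windows ┑( ̄Д  ̄)┍ | 
|  |      33 soulteary      2019-05-07 23:51:07 +08:00 @fuergaosi Your little tool is really handy, but I saw that some people couldn't get the environment set up, so I packaged a container image; the code is here: https://github.com/soulteary/docker-gitbook-pdf-generator If you're willing to adjust the project directory structure a little & tag releases, later upgrades and maintenance would be easier, e.g. customizing the e-book style, etc... |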