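【Background】
While working on:
【Unresolved】Exporting table data from a copy-protected PDF and converting it to XML
I decided to try PDFMiner on the PDF, which is encrypted and does not allow copying, to see whether it could be converted to text or HTML.
【Process】
1. Find the project page.
Go to:
https://pypi.python.org/pypi/pdfminer/
and download the package.
2. After extracting it, run the install: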
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pdfminer
copying pdfminer\arcfour.py -> build\lib\pdfminer
copying pdfminer\ascii85.py -> build\lib\pdfminer
copying pdfminer\ccitt.py -> build\lib\pdfminer
copying pdfminer\cmapdb.py -> build\lib\pdfminer
copying pdfminer\converter.py -> build\lib\pdfminer
copying pdfminer\encodingdb.py -> build\lib\pdfminer
copying pdfminer\fontmetrics.py -> build\lib\pdfminer
copying pdfminer\glyphlist.py -> build\lib\pdfminer
copying pdfminer\image.py -> build\lib\pdfminer
copying pdfminer\latin_enc.py -> build\lib\pdfminer
copying pdfminer\layout.py -> build\lib\pdfminer
copying pdfminer\lzw.py -> build\lib\pdfminer
copying pdfminer\pdfcolor.py -> build\lib\pdfminer
copying pdfminer\pdfdevice.py -> build\lib\pdfminer
copying pdfminer\pdfdocument.py -> build\lib\pdfminer
copying pdfminer\pdffont.py -> build\lib\pdfminer
copying pdfminer\pdfinterp.py -> build\lib\pdfminer
copying pdfminer\pdfpage.py -> build\lib\pdfminer
copying pdfminer\pdfparser.py -> build\lib\pdfminer
copying pdfminer\pdftypes.py -> build\lib\pdfminer
copying pdfminer\psparser.py -> build\lib\pdfminer
copying pdfminer\rijndael.py -> build\lib\pdfminer
copying pdfminer\runlength.py -> build\lib\pdfminer
copying pdfminer\utils.py -> build\lib\pdfminer
copying pdfminer\__init__.py -> build\lib\pdfminer
running build_scripts
creating build\scripts-2.7
copying and adjusting tools\pdf2txt.py -> build\scripts-2.7
copying and adjusting tools\dumppdf.py -> build\scripts-2.7
copying and adjusting tools\latin2ascii.py -> build\scripts-2.7
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\arcfour.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\ascii85.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\ccitt.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\cmapdb.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\converter.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\encodingdb.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\fontmetrics.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\glyphlist.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\image.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\latin_enc.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\layout.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\lzw.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfcolor.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfdevice.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfdocument.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdffont.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfinterp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfpage.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfparser.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdftypes.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\psparser.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\rijndael.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\runlength.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\arcfour.py to arcfour.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\ascii85.py to ascii85.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\ccitt.py to ccitt.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\cmapdb.py to cmapdb.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\converter.py to converter.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\encodingdb.py to encodingdb.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\fontmetrics.py to fontmetrics.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\glyphlist.py to glyphlist.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\image.py to image.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\latin_enc.py to latin_enc.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\layout.py to layout.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\lzw.py to lzw.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfcolor.py to pdfcolor.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfdevice.py to pdfdevice.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfdocument.py to pdfdocument.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdffont.py to pdffont.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfinterp.py to pdfinterp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfpage.py to pdfpage.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfparser.py to pdfparser.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdftypes.py to pdftypes.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\psparser.py to psparser.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\rijndael.py to rijndael.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\runlength.py to runlength.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\__init__.py to __init__.pyc
running install_scripts
copying build\scripts-2.7\dumppdf.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
copying build\scripts-2.7\latin2ascii.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
copying build\scripts-2.7\pdf2txt.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer-20131113-py2.7.egg-info
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>
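A quick way to confirm that the install actually landed in site-packages is to try importing the package. The following is only a minimal sketch: the helper name is_importable is made up for illustration and is not part of PDFMiner.

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported by this interpreter.

    Hypothetical helper for illustration; it only wraps the standard
    importlib.import_module call.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "pdfminer" is the package name seen in the install log above.
    print("pdfminer importable:", is_importable("pdfminer"))
```

If this reports False, the interpreter being run is probably not the one that setup.py installed into.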
Next, try it out.
In:
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools
I found:
pdf2txt.py
and ran it:
3. Unexpectedly, it failed with an error:
D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>ls
spec183r21.0.pdf  xml
D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>cd -
The system cannot find the path specified.
D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>cd D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>ls
Makefile  PKG-INFO  build     cmaprsrc  docs      pdfminer  samples   setup.py  tools
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>cd tools
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>ls
Makefile           conv_afm.py        conv_cmap.py       conv_glyphlist.py  dumppdf.py         latin2ascii.py     pdf2html.cgi       pdf2txt.py         prof.py            runapp.py
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>pdf2txt.py -o hart183.html D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf\spec183r21.0.pdf
Traceback (most recent call last):
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 110, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 103, in main
    caching=caching, check_extractable=True):
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfpage.py", line 123, in get_pages
    doc = PDFDocument(parser, caching=caching)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 309, in __init__
    xref.load(parser)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 194, in load
    objid1 = objs[index*2]
IndexError: list index out of range
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>
4. Adding the -t option did not help either:
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>pdf2txt.py -t html -o hart183.html D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf\spec183r21.0.pdf
Traceback (most recent call last):
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 110, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 103, in main
    caching=caching, check_extractable=True):
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfpage.py", line 123, in get_pages
    doc = PDFDocument(parser, caching=caching)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 309, in __init__
    xref.load(parser)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 194, in load
    objid1 = objs[index*2]
IndexError: list index out of range
5. Then I looked into whether PDFMiner itself could decrypt the file, but did not find any option for that.
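Before hunting for decryption options, it can help to check whether the file carries an encryption dictionary at all. The sketch below is a rough heuristic and not a PDFMiner API: it assumes that an encrypted PDF references its encryption dictionary through an /Encrypt key that appears unencoded in the file, and the function names are made up for illustration. A false positive is possible if those bytes happen to appear elsewhere in the file.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: report whether the raw PDF bytes contain an /Encrypt key.

    Hypothetical helper; it does not parse the PDF structure, it only
    searches the raw bytes.
    """
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path):
    """Apply the same heuristic to a file on disk."""
    with open(path, "rb") as f:
        return looks_encrypted(f.read())
```

For example, file_looks_encrypted("spec183r21.0.pdf") could be run against the PDF used in this post; the result only says whether the key is present, not which permissions are restricted.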
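【Summary】
In the end I gave up on PDFMiner for now. The program appears to have a bug: as the tracebacks show, it raises an IndexError in xref.load before any conversion starts, so it cannot be used to convert this PDF to HTML or text.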