【Solved】Exporting table data from a non-copyable PDF and converting it to XML

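【Background】

I needed to take the table data in a PDF document that does not allow copying:

export it, and save it in an XML format like the following: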

	<HartCompany Code="3004" Name="Flowserve" Description="Logix 3200-IQ"/>
	<HartCompany Code="3601" Name="Yamatake" Description="MagneW"/>
	<HartCompany Code="3602" Name="Yamatake" Description="ST3000"/>
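A line in this format can be built with the standard library so that characters such as & in a company name are escaped correctly. This is only a minimal sketch, written for Python 3 (the script later in this post is Python 2); the helper name hart_company_line is made up for illustration and is not part of the original script.

```python
from xml.sax.saxutils import quoteattr


def hart_company_line(code, name, description):
    """Build one <HartCompany .../> line.

    quoteattr() adds the surrounding quotes and escapes &, < and >,
    so plain (unescaped) text can be passed in safely.
    """
    return "\t<HartCompany Code=%s Name=%s Description=%s/>\n" % (
        quoteattr(code), quoteattr(name), quoteattr(description))


# Values taken from the sample lines above.
print(hart_company_line("3004", "Flowserve", "Logix 3200-IQ"), end="")
```

Note that quoteattr() expects unescaped input; text that already contains &amp; would be escaped a second time.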

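【Process】

1. The PDF does not allow copying, so the data cannot simply be copied and pasted out.

2. I do not currently have at hand the powerful tool I used before for converting PDF to Word files.

I cannot remember its name, but I have used it before and it worked very well.

3. The only approach I can think of right now is to write a Python script that processes the PDF, extracts the data, and saves it as XML text.

4. I consulted a number of references: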

working on tables in pdf using python – Stack Overflow

Python module for converting PDF to text – Stack Overflow

slate 0.3 : Python Package Index

pdftables – a Python library for getting tables out of PDF files | ScraperWiki

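Then I tried, in order:

【Record】Trying PDFMiner to convert a non-copyable PDF to text or HTML

5. After that:

【Record】Trying pyPdf to convert a non-copyable PDF to text or HTML

6. Then:

【Record】Trying xpdf to convert a non-copyable PDF to text or HTML

7. And what finally worked:

【Record】Trying pdftohtml to convert a non-copyable PDF to HTML while keeping the table formatting

8. With that, I could go ahead and write a Python script to process the HTML, extract the content, and export it as XML.

The script uses BeautifulSoup; if you are not familiar with it, see:

Python tutorial series: BeautifulSoup in detail

In the end it worked: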

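It takes the set of HTML files for pages 9 through 34:

whose HTML source looks like this:

class ft05 on page 9:

then class ft03 from page 10 onward:

through to the last page, page 34, which is also ft03:

and processes them with the following code: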

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
【Unsolved】Exporting table data from a non-copyable PDF and converting it to XML
【Solved】Exporting table data from a non-copyable PDF and converting it to XML


Author:     Crifan Li
Version:    2014-01-27
Contact:    https://www.crifan.org/about/me
"""

import os
import sys
import codecs
from BeautifulSoup import BeautifulSoup

def pdf_table_to_xml():
    """Extract data from HTML which is generated from PDF using pdftohtml, then saved to xml"""
    srcHtmlFolder = "D:\\tmp\\tmp_dev_root\\virutalbox\\ubuntu\\win7_to_ubuntu\\pdf_to_html_withTable"
    htmlFilenameList = []
    baseFilename = "hart183WithTable-"
    fileSuffix = ".html"

    #create output file
    outputXmlFilename = "GeneratedHartIdCompanyXml.xml"
    # 'a+': read,write,append
    # 'w' : clear before, then write
    #outputXmlFp = codecs.open(outputXmlFilename, 'w')
    outputXmlFp = codecs.open(outputXmlFilename, 'w', "UTF-8")

    #generate html file list to process
    # hart183WithTable-9.html to hart183WithTable-34.html
    for pageNum in range(9, 35):
        fullFilename = baseFilename + str(pageNum) + fileSuffix
        #print "fullFilename=",fullFilename;
        # fullFilename= hart183WithTable-9.html
        # fullFilename= hart183WithTable-10.html
        #fullFilename= hart183WithTable-34.html
        fullFile = os.path.join(srcHtmlFolder, fullFilename)
        #print "fullFile=",fullFile
        
        srcHtmlFp = open(fullFile)
        #print "srcHtmlFp=",srcHtmlFp
        srcHtml = srcHtmlFp.read()
        # close the page file once its content has been read
        srcHtmlFp.close()
        #print "srcHtml=",srcHtml

        foundAllFt = []
        paraLineNum = 0
        
        soup = BeautifulSoup(srcHtml, fromEncoding="UTF-8")
        #hart183WithTable-9.html
        # <P style="position:absolute;top:744px;left:108px;white-space:nowrap" class="ft05">0304&#160;</P>
        # <P style="position:absolute;top:744px;left:245px;white-space:nowrap" class="ft05">NEWTHERMOX&#160;</P>
        # <P style="position:absolute;top:744px;left:535px;white-space:nowrap" class="ft05">Ametek&#160;</P>
        # <P style="position:absolute;top:766px;left:108px;white-space:nowrap" class="ft05">0A01&#160;</P>
        # <P style="position:absolute;top:766px;left:245px;white-space:nowrap" class="ft05">TRI20&#160;</P>
        # <P style="position:absolute;top:766px;left:535px;white-space:nowrap" class="ft05">Brooks&#160;Instrument&#160;</P>
        foundAllFt05 = soup.findAll(name="p", attrs={"class":"ft05"})
        #print "foundAllFt05=",foundAllFt05
        ft05Len = len(foundAllFt05)
        print "ft05Len=",ft05Len
        
        #hart183WithTable-10.html
        # <P style="position:absolute;top:181px;left:81px;white-space:nowrap" class="ft03">1109&#160;</P>
        # <P style="position:absolute;top:181px;left:218px;white-space:nowrap" class="ft03">DELTBS/Deltabar&#160;S&#160;</P>
        # <P style="position:absolute;top:181px;left:508px;white-space:nowrap" class="ft03">Endress&#160;&amp;&#160;Hauser&#160;</P>
        # <P style="position:absolute;top:204px;left:81px;white-space:nowrap" class="ft03">110A&#160;</P>
        # <P style="position:absolute;top:204px;left:218px;white-space:nowrap" class="ft03">FMU231/FMU13x&#160;</P>
        # <P style="position:absolute;top:204px;left:508px;white-space:nowrap" class="ft03">Endress&#160;&amp;&#160;Hauser&#160;</P>
        
        #hart183WithTable-34.html
        # <P style="position:absolute;top:181px;left:81px;white-space:nowrap" class="ft03">E183&#160;</P>
        # <P style="position:absolute;top:181px;left:218px;white-space:nowrap" class="ft03">Radar&#160;Lvl&#160;Transmitter&#160;</P>
        # <P style="position:absolute;top:181px;left:508px;white-space:nowrap" class="ft03">FUTURE&#160;INSTRUMENT&#160;</P>
        # <P style="position:absolute;top:204px;left:81px;white-space:nowrap" class="ft03">E184&#160;</P>
        # <P style="position:absolute;top:204px;left:218px;white-space:nowrap" class="ft03">EA10S&#160;</P>
        # <P style="position:absolute;top:204px;left:508px;white-space:nowrap" class="ft03">MOTOYAMA&#160;</P>
        foundAllFt03 = soup.findAll(name="p", attrs={"class":"ft03"})
        #print "foundAllFt03=",foundAllFt03
        ft03Len = len(foundAllFt03)
        print "ft03Len=",ft03Len

        if((ft05Len > 1) and (0 == (ft05Len % 3))):
            print "+++ ft05 is real table data for ",fullFile
            paraLineNum = ft05Len
            foundAllFt = foundAllFt05
        elif((ft03Len > 1) and (0 == (ft03Len % 3))):
            print "+++ ft03 real table data for ",fullFile
            paraLineNum = ft03Len
            foundAllFt = foundAllFt03
        else:
            print "--- Not found valid table data for ",fullFile
            sys.exit(-2)
        
        # start extracting the data: every 3 consecutive cells form one table row
        totalRowNum = paraLineNum // 3
        print "totalRowNum=",totalRowNum
        for rowIdx in range(totalRowNum):
            def postProcessStr(origStr):
                """do some post process for input str"""
                processedStr = origStr.replace("&#160;", " ")
                #processedStr = processedStr.replace("&amp;", "&")
                processedStr = processedStr.strip()
                return processedStr
                
            hartCodeSoup = foundAllFt[rowIdx*3 + 0]
            hartCodeUni = unicode(hartCodeSoup.string)
            hartCodeUni = postProcessStr(hartCodeUni)

            hartDescSoup = foundAllFt[rowIdx*3 + 1]
            hartDescUni = unicode(hartDescSoup.string)
            hartDescUni = postProcessStr(hartDescUni)

            hartNameSoup = foundAllFt[rowIdx*3 + 2]
            hartNameUni = unicode(hartNameSoup.string)
            hartNameUni = postProcessStr(hartNameUni)

            #	<HartCompany Code="3701" Name="Yokogawa" Description="YEWFLO"/>
            xmlLineStr = '	<HartCompany Code="' + hartCodeUni + '" Name="' + hartNameUni + '" Description="' + hartDescUni + '"/>' + '\n'
            #print "xmlLineStr=",xmlLineStr

            #save data
            outputXmlFp.write(xmlLineStr)

    #save and close output file
    outputXmlFp.flush()
    outputXmlFp.close()

if __name__ == "__main__":
    pdf_table_to_xml()

Running it:

The resulting XML file contents:

【Summary】

In the end:

pdftohtml exports the non-copyable PDF to HTML;

then a Python script processes the resulting HTML files, extracts the data, and exports it as XML.
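The conversion step could be driven from Python as well. The sketch below only assembles the pdftohtml command line. It is not what I actually ran: the -c (complex output) flag is an assumption based on the absolutely positioned <P> elements and the per-page file names, the input name hart183.pdf is hypothetical, and the function names are made up. It is written for Python 3.

```python
import subprocess


def build_pdftohtml_command(pdf_path, output_base, first_page=None, last_page=None):
    """Return the pdftohtml argument list; nothing is executed here."""
    # -nodrm: override the document's DRM settings (the option noted in this post).
    # -c: complex output with positioned elements (assumed, see the text above).
    cmd = ["pdftohtml", "-c", "-nodrm"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, output_base]
    return cmd


def convert(pdf_path, output_base, first_page=None, last_page=None):
    """Run pdftohtml; requires the tool to be installed and on PATH."""
    subprocess.check_call(
        build_pdftohtml_command(pdf_path, output_base, first_page, last_page))


if __name__ == "__main__":
    # Hypothetical input file name; pages 9 to 34 hold the table in this post.
    print(" ".join(build_pdftohtml_command("hart183.pdf", "hart183WithTable", 9, 34)))
```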

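Notes:

1. When running pdftohtml, add the -nodrm option (it overrides the document's DRM settings); only then did I get HTML that keeps the table layout.

2. A few table cells in the generated HTML come out with a different class and need manual fixing: change the occasional ft06 to ft03.

3. The script uses BeautifulSoup to process the HTML. If you are comfortable with regular expressions, you can skip BeautifulSoup and match and extract the data with regexes directly.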
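Notes 2 and 3 can be combined in a rough sketch: a regular expression pulls out the positioned cells, and the cells are grouped into rows by their top coordinate rather than by position in the list, so an occasional ft06 cell does not need manual editing. This is a different technique from the script above, not what I actually ran. It assumes the three cells of a row always share the same top value, as they do in the samples quoted in the script. The names CELL_RE and extract_rows are made up, the ft06 cell in the sample is hypothetical, and the code targets Python 3.

```python
import re
from collections import defaultdict

# One positioned cell, in the shape of the HTML samples quoted in the script.
CELL_RE = re.compile(
    r'<P style="position:absolute;top:(\d+)px;left:(\d+)px;[^"]*"\s+'
    r'class="(ft\d+)">(.*?)</P>',
    re.IGNORECASE | re.DOTALL)


def extract_rows(html, classes=("ft03", "ft06")):
    """Return [code, description, name] lists, one per complete table row."""
    rows = defaultdict(list)
    for top, left, css_class, text in CELL_RE.findall(html):
        if css_class.lower() not in classes:
            continue
        # Same clean-up as the script: &#160; becomes a space; &amp; is left
        # as-is because it is already valid inside an XML attribute.
        text = text.replace("&#160;", " ").strip()
        rows[int(top)].append((int(left), text))
    result = []
    for top in sorted(rows):
        cells = [text for _, text in sorted(rows[top])]
        if len(cells) == 3:  # skip anything that is not a full 3-column row
            result.append(cells)
    return result


# First row of page 10, with the middle cell changed to ft06 for illustration.
sample = (
    '<P style="position:absolute;top:181px;left:81px;white-space:nowrap" class="ft03">1109&#160;</P>\n'
    '<P style="position:absolute;top:181px;left:218px;white-space:nowrap" class="ft06">DELTBS/Deltabar&#160;S&#160;</P>\n'
    '<P style="position:absolute;top:181px;left:508px;white-space:nowrap" class="ft03">Endress&#160;&amp;&#160;Hauser&#160;</P>\n'
)
print(extract_rows(sample))
```

For page 9 the class list would need to be ("ft05",) instead, since that page uses ft05 for the table cells.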
