
[Solved] Indentation lost after syncing Evernote posts to WordPress with Python

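WordPress crifan 312 views 0 comments
While working on:
[Unsolved] Writing my own Python script to sync Evernote to WordPress
I found that posts containing indented lists lose their indentation once published, so I first looked into:
[Solved] Why Evernote posts lose indentation after being published to WordPress
Next comes the code to handle it. That post was already published, though, so I looked for other posts with indentation to test against.
Found two.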
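  • In the second one, the indented list has:
    • bold text
      • need to make sure this is handled and supported
    • text in title+link form
      • supporting the indentation is enough
      • the earlier title+link merging does not need to be supported here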
Time to debug.
Here is what it looks like at this point:
Evernote's HTML (the note's own text is Chinese):
  <div>是对的</div>
  <ul>
    <li>
      <div>0 为集成显卡</div>
    </li>
    <li>
      <div>1 为独立显卡</div>
    </li>
    <li>
      <div>2 为自动切换</div>
    </li>
  </ul>
  <div><br /></div>
  <div>mac 插电/电池 自动切换显卡/独显 - 知乎</div>
  <div><a href="https://zhuanlan.zhihu.com/p/132679059">https://zhuanlan.zhihu.com/p/132679059</a></div>
  <div>也可以根据情况设置模式</div>
  <ul>
    <li>
      <div>sudo pmset -b GPUSwitch 0</div>
    </li>
    <ul>
      <li>
        <div>-b 表示 battery 为电池模式</div>
      </li>
      <ul>
        <li>
          <div>电池模式时:用过0 集成显卡 integrated graphics</div>
        </li>
      </ul>
    </ul>
    <li>
      <div>sudo pmset -c GPUSwitch 1</div>
    </li>
    <ul>
      <li>
        <div>-c 表示 charger 为电源模式</div>
      </li>
      <ul>
        <li>
          <div>电源模式时:用1=独立显卡 high performance graphic cards</div>
        </li>
      </ul>
    </ul>
  </ul>
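Now to process the ul elements.
First, find the top-level ul elements, i.e. the ul directly under <en-note>.
This involved:
[Solved] BeautifulSoup: search only direct children, not other descendants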
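To illustrate the direct-children search, here is a minimal sketch. The sample HTML and the helper name `find_top_level_lists` are made up for this illustration; the only BeautifulSoup call it relies on is `find_all(..., recursive=False)`, which restricts the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one top-level <ul> that contains one nested <ul>
SAMPLE_HTML = (
    "<en-note>"
    "<div>intro</div>"
    "<ul><li><div>a</div></li><ul><li><div>b</div></li></ul></ul>"
    "</en-note>"
)

def find_top_level_lists(html):
    """Return (all ul descendants, ul that are direct children) of <en-note>."""
    soup = BeautifulSoup(html, "html.parser")
    en_note = soup.find("en-note")
    # default recursive=True: walks every descendant
    all_uls = en_note.find_all("ul")
    # recursive=False: only looks at direct children of <en-note>
    direct_uls = en_note.find_all("ul", recursive=False)
    return all_uls, direct_uls

all_uls, direct_uls = find_top_level_lists(SAMPLE_HTML)
print(len(all_uls), len(direct_uls))
```

Starting from the direct children only means each nested ul is visited once, through its top-level ancestor, rather than once per nesting level.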
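Continuing with the code.
The processing logic seems to be:
Find the bottom-level elements, i.e. those with no children.
Then check whether the current element is a div whose string is a plain string
and whose parent is an li; if so, drop the div and give the string to the li directly.
Also, to be handled later when it actually comes up:
the case where the bottom-level div is not a plain string.
Let's try: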
    for curUlIdx, eachUlSoup in enumerate(directUlSoupList):
        logging.info("%s %s %s", "-"*20, curUlIdx, "-"*20)
        logging.info("before eachUlSoup=%s", eachUlSoup)
        removeDivInUl(eachUlSoup)
        logging.info("after  eachUlSoup=%s", eachUlSoup)

def removeDivInUl(curSoup):
    """Remove useless / redundant div node inside ul->li

    First draft; the final version is in the summary below.
    """
    logging.info("curSoup=%s", curSoup)
    if curSoup.children:
        # has child
        # NOTE: Tag.children is an iterator, so this check is always true for a Tag
        for eachChild in curSoup.children:
            removeDivInUl(eachChild)
    else:
        # no child
        isSelfDiv = curSoup.name == "div"
        # current is div ?
        if isSelfDiv:
            parentSoup = curSoup.parent
            logging.info("parentSoup=%s", parentSoup)
            if parentSoup:
                isParentLi = parentSoup.name == "li"
                if isParentLi:
                    parentSoup.string = curSoup.string
                    parentSoup.children = []
                    logging.info("parentSoup=%s", parentSoup)

    return
Debugging.
Along the way:
[Solved] BeautifulSoup: how to replace the content of a child node
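As a rough sketch of the child-replacement idea, the snippet below uses BeautifulSoup's `Tag.unwrap()`, which removes a tag but keeps its contents in place. It is a simplified variant, not the code of this post: it uses a flat `find_all` loop instead of recursion, and the function name `unwrap_li_divs` is hypothetical.

```python
from bs4 import BeautifulSoup

def unwrap_li_divs(html):
    """Return html where every <div> that is a direct child of <li> is unwrapped."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so modifying the tree inside the loop is safe
    for div in soup.find_all("div"):
        if div.parent is not None and div.parent.name == "li":
            # unwrap() drops the <div> tag itself and keeps its children
            div.unwrap()
    return str(soup)

print(unwrap_li_divs("<ul><li><div>a <b>bold</b></div></li></ul>"))
```

Because `unwrap()` keeps all children rather than only a plain string, a div that contains bold text or a link keeps that markup inside the li.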
During debugging, this:
20201201 05:07:00 EvernoteToWordpress.py:542  INFO    before eachUlSoup=<ul><li><div>0 为集成显卡</div></li><li><div>1 为独立显卡</div></li><li><div>2 为自动切换</div></li></ul>
became:
20201201 05:09:57 EvernoteToWordpress.py:544  INFO    after  eachUlSoup=<ul><li>0 为集成显卡</li><li>1 为独立显卡</li><li>2 为自动切换</li></ul>
which is what we want.
Continuing to debug:

20201201 05:10:49 EvernoteToWordpress.py:541  INFO    -------------------- 1 --------------------
20201201 05:10:51 EvernoteToWordpress.py:542  INFO    before eachUlSoup=<ul><li><div>sudo pmset -b GPUSwitch 0</div></li><ul><li><div>-b 表示 battery 为电池模式</div></li><ul><li><div>电池模式时:用过0 集成显卡 integrated graphics</div></li></ul></ul><li><div>sudo pmset -c GPUSwitch 1</div></li><ul><li><div>-c 表示 charger 为电源模式</div></li><ul><li><div>电源模式时:用1=独立显卡 high performance graphic cards</div></li></ul></ul></ul>
...
20201201 05:11:07 EvernoteToWordpress.py:544  INFO    after  eachUlSoup=<ul><li>sudo pmset -b GPUSwitch 0</li><ul><li>-b 表示 battery 为电池模式</li><ul><li>电池模式时:用过0 集成显卡 integrated graphics</li></ul></ul><li>sudo pmset -c GPUSwitch 1</li><ul><li>-c 表示 charger 为电源模式</li><ul><li>电源模式时:用1=独立显卡 high performance graphic cards</li></ul></ul></ul>
After formatting the HTML and comparing before and after, this is the desired result.
[Summary]
Now for the final result.
The final code:
    # process list (ul/ol/...) indent
    noteDetail = processListIndent(noteDetail)

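# imports needed by the functions below
import logging
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

def processListIndent(curNote):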
    """process list (ul/ol/...) indent

    Args:
        curNote (Note): evernote Note
    Returns:
        Note
    Raises:
    """
    soup = utils.htmlToSoup(curNote.content)
    enNoteSoup = soup.find("en-note")

    # allSubUlSoupList = enNoteSoup.find_all("ul")
    # allSubUlSoupNum = len(allSubUlSoupList)
    # logging.info("Found %d all sub level ul list", allSubUlSoupNum)

    directUlSoupList = enNoteSoup.find_all("ul", recursive=False)
    directUlSoupNum = len(directUlSoupList)
    logging.info("Found %d top level ul list", directUlSoupNum)

    for curUlIdx, eachUlSoup in enumerate(directUlSoupList):
        logging.info("%s %s %s", "-"*20, curUlIdx, "-"*20)
        logging.info("before eachUlSoup=%s", eachUlSoup)
        removeDivInUl(eachUlSoup)
        logging.info("after  eachUlSoup=%s", eachUlSoup)

    # soup changed, write back to note content
    updatedNoteHtml = crifanEvernote.soupToNoteContent(enNoteSoup)
    curNote.content = updatedNoteHtml

    return curNote

def htmlToSoup(curHtml):
    """convert html to soup

    Args:
        curHtml (str): html str
    Returns:
        soup
    Raises:
    """
    soup = BeautifulSoup(curHtml, 'html.parser')
    return soup

def removeDivInUl(curSoup):
    """Remove useless / redundant div node inside ul->li
    """
    curSoupType = type(curSoup)
    logging.info("curSoupType=%s, curSoup=%s", curSoupType, curSoup)
    # curSoupType=<class 'bs4.element.Tag'>, curSoup=<ul><li><div>0 为集成显卡</div></li><li><div>1 为独立显卡</div></li><li><div>2 为自动切换</div></li></ul>

    if not isinstance(curSoup, Tag):
        # curSoupType=<class 'bs4.element.NavigableString'>, curSoup=0 为集成显卡
        return

    childSoupList = list(curSoup.children)

    if curSoup.name != "div":
        # self is not div, just process each child
        for eachChildSoup in childSoupList:
            removeDivInUl(eachChildSoup)
        return

    parentSoup = curSoup.parent
    logging.info("parentSoup=%s", parentSoup)
    if parentSoup:
        if parentSoup.name != "li":
            # has parent, but it is not li
            return
    else:
        # no parent: nothing to unwrap into, so stop here
        logging.warning("div has no parent, not supported yet")
        return

    # isOnlyChildNotSoup = False
    # if childSoupList:
    #     childNum = len(childSoupList)
    #     if childNum == 1:
    #         onliyChildSoup = childSoupList[0]
    #         isOnlyChildIsSoup = isinstance(onliyChildSoup, Tag)
    #         if isOnlyChildIsSoup:
    #             # process it
    #             parentSoup.string = curSoup.string
    #             parentSoup.children = []
    #             logging.info("parentSoup=%s", parentSoup)
    #         else:
    #             logging.info("type(onliyChildSoup)=%s", type(onliyChildSoup))
    #             # type(onliyChildSoup)=<class 'bs4.element.NavigableString'>
    #             return
    #     else:
    #         for eachChildSoup in childSoupList:
    #             removeDivInUl(eachChildSoup)
    # else:
    #     # no child
    #     logging.warning("to support")

    # process it
    # parentSoup.contents = curSoup.contents
    # parentSoup.children = []

    # curSoup.name = "li"
    logging.info("before replace: curSoup=%s", curSoup)
    logging.info("before replace: parentSoup=%s", parentSoup)
    # curSoupCopy = copy.deepcopy(curSoup)
    # parentSoup.replace_with(curSoupCopy)
    # parentSoup.children = curSoup.children

    # Prerequisite: each li has only one div child!
    parentSoup.div.unwrap()
    logging.info("after  replace: parentSoup=%s", parentSoup)
    logging.info("after  replace: curSoup=%s", curSoup)

    return

    @staticmethod
    def soupToNoteContent(soup):
        """Convert BeautifulSoup Soup to Evernote Note content

        Args:
            soup (Soup): BeautifulSoup Soup
        Returns:
            Evernote Note content html(str)
        Raises:
        """
        noteContentHtml = utils.soupToHtml(soup)

        noteContentHtml = crifanEvernote.convertToClosedEnMediaTag(noteContentHtml)

        # add first line
        # <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
        noteContentHtml = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n' + noteContentHtml

        return noteContentHtml

def soupToHtml(soup):
    """Convert soup to html string

    Args:
        soup (Soup): BeautifulSoup soup
    Returns:
        html (str)
    Raises:
    """
    curHtml = soup.prettify()
    # curHtml = str(soup)
    return curHtml

    @staticmethod
    def convertToClosedEnMediaTag(noteHtml):
        """Process note content html: an explicit </en-media> closing tag causes an error, so convert:
                <en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png"></en-media>
            to a self-closed en-media tag:
                <en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png" />
        Args:
            noteHtml (str): Note content html
        Returns:
            note content html with closed en-media tag (str)
        Raises:
        """
        # raw strings, so that \s and \g reach the regex engine unchanged
        noteHtml = re.sub(r"(?P<enMedia><en-media\s+[^<>]+)>\s*</en-media>", r"\g<enMedia> />", noteHtml, flags=re.S)
        return noteHtml
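The en-media conversion can be tried on its own with nothing but the standard library. This is a small sketch using the same pattern as above; the function name `close_en_media` and the shortened `hash` values are placeholders for illustration.

```python
import re

# Same pattern as convertToClosedEnMediaTag: capture the opening tag without
# its final ">", then require optional whitespace and an explicit </en-media>
EN_MEDIA_PATTERN = re.compile(r"(?P<enMedia><en-media\s+[^<>]+)>\s*</en-media>", flags=re.S)

def close_en_media(note_html):
    """Rewrite <en-media ...></en-media> as self-closed <en-media ... />."""
    return EN_MEDIA_PATTERN.sub(r"\g<enMedia> />", note_html)

# "abc" is a placeholder hash, not a real resource hash
print(close_en_media('<en-media hash="abc" type="image/png"></en-media>'))
```

The `\s*` between the opening and closing tags matters when the HTML comes from `prettify()`, which puts the two tags on separate lines.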
This turns:
    <ul>
        <li>
            <div>sudo pmset -b GPUSwitch 0</div>
        </li>
        <ul>
            <li>
                <div>-b = battery =电池模式</div>
            </li>
            <ul>
                <li>
                    <div>电池模式 用 0 集成显卡 integrated graphics</div>
                </li>
            </ul>
        </ul>
        <li>
            <div>sudo pmset -c GPUSwitch 1</div>
        </li>
        <ul>
            <li>
                <div>-c = charger = 电源模式</div>
            </li>
            <ul>
                <li>
                    <div>电源模式 用 1 独立显卡 high performance graphic cards</div>
                </li>
            </ul>
        </ul>
    </ul>
into:
    <ul>
        <li>
            sudo pmset -b GPUSwitch 0
        </li>
        <ul>
            <li>
                -b = battery =电池模式
            </li>
            <ul>
                <li>
                    电池模式 用 0 集成显卡 integrated graphics
                </li>
            </ul>
        </ul>
        <li>
            sudo pmset -c GPUSwitch 1
        </li>
        <ul>
            <li>
                -c = charger = 电源模式
            </li>
            <ul>
                <li>
                    电源模式 用 1 独立显卡 high performance graphic cards
                </li>
            </ul>
        </ul>
    </ul>
which is as expected.
With that, the indentation is preserved.
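[Postscript 20201205]
Later, however, the indentation was still being lost. It turned out to be caused by a WordPress setting.
Fix: Settings -> Writing -> Formatting -> uncheck "WordPress should correct invalidly nested XHTML automatically"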

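See:
[Solved] HTML with indentation is altered and loses its indentation after being published to WordPress with Python
[Solved] Why WordPress alters the HTML so that ul+li indentation is lost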
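One possible reason this setting matters, offered as an assumption and not confirmed in this post: in the Evernote HTML above, nested ul elements sit directly inside their parent ul instead of inside an li, which the HTML content model for ul does not allow. The sketch below (function name `find_misnested_lists` is hypothetical) lists such elements, so a note can be checked before publishing.

```python
from bs4 import BeautifulSoup

def find_misnested_lists(html):
    """Return <ul>/<ol> tags whose direct parent is itself a <ul>/<ol>."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        tag for tag in soup.find_all(["ul", "ol"])
        if tag.parent is not None and tag.parent.name in ("ul", "ol")
    ]

# Evernote-style nesting: inner <ul> is a sibling of <li>, not a child of it
evernote_style = "<ul><li>a</li><ul><li>b</li></ul></ul>"
# Conventional nesting: inner <ul> lives inside the <li>
conventional = "<ul><li>a<ul><li>b</li></ul></li></ul>"

print(len(find_misnested_lists(evernote_style)), len(find_misnested_lists(conventional)))
```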
[Postscript 20201206]
After updating one post, its indentation was kept and no longer lost.

When reposting, please credit: 在路上 » [Solved] Indentation lost after syncing Evernote posts to WordPress with Python
