
[Solved] Use Python to Merge Titles and Links in Evernote Notes

While working on:
[Unsolved] Use Python to convert the article content of Evernote notes into html for later upload to WordPress
I needed to take the corresponding html, such as:
 <div>
  taobao Mirrors
 </div>
 <div>
  <a href="https://npm.taobao.org/mirrors/">
   https://npm.taobao.org/mirrors/
  </a>
 </div>

 <div>
  Python Mirror
 </div>
 <div>
  <a href="https://npm.taobao.org/mirrors/python/">
   https://npm.taobao.org/mirrors/python/
  </a>
 </div>
and merge the title and link in it together, i.e. attach the link to the title text.
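For the Python Mirror example above, the desired result would presumably look something like:
<div><a href="https://npm.taobao.org/mirrors/python/">Python Mirror</a></div>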
Also, false positives must be avoided; do not merge something like:
<div>
  即可-》后续pip下载时,就会从
 </div>
 <div>
  <a href="http://mirrors.aliyun.com/">
   mirrors.aliyun.com
  </a>
 </div>
 
For now, the detection logic that looks workable is:
merge only if the title contains a space;
otherwise do not merge.
And the merge rule is:
title + url
where the url side is:
a div -> a structure, with the a's string equal to its href value.
So the search is for:
a div that contains one and only one a, whose href and string are the same;
then check that the div right before it is also a div,
which contains only a string and no other child nodes,
and whose string contains a space;
only then do the merge.
As for what the merged result itself should be, let's work that out:
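In BeautifulSoup terms the merge boils down to two operations on a matched pair of divs; a minimal sketch of the idea (titleDivNode and aNode are placeholder names here; the full code follows later in this post):

# titleDivNode: the <div> holding only the title text
# aNode: the <a> inside the following <div>, whose string equals its href
titleStr = titleDivNode.string   # e.g. "Python Mirror"
titleDivNode.decompose()         # remove the title-only <div>
aNode.string = titleStr          # make the saved title the link text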
There is also another case, which originally was:
 <div>
  MacOS 下利用 pyenv 管理Python 版本和虚拟环境 - 掘金
 </div>
 <div>
  <a href="https://juejin.im/post/5c739c86e51d45699514ee0c">
   https://juejin.im/post/5c739c86e51d45699514ee0c
  </a>
 </div>
which was then deliberately merged together by hand;
debugging to see what that html looks like.
First, search for a nodes that carry an href.
Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation
But it was not clear how to express: has href, but not empty.
The href value could be matched with a regex, but that felt unnecessary;
I remembered there is a built-in way to express this.
For now, this can serve as a reference:
from bs4 import NavigableString

def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# p
# a
# a
# a
# p
Now to implement it: search out the wanted nodes in one pass.
That is,
write each condition as a filter function and search directly for:
a nodes that carry an href, whose parent must be a div, and that have no sibling nodes (a sketch of that idea follows).
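As a sketch of that filter-function idea (assuming soup is the parsed note html, as elsewhere in this post; the final code below ends up filtering step by step inside a loop instead):

def isMergeableLinkNode(tag):
    # filter for find_all: an <a href=...> that is the only child of a <div>
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and tag.parent is not None
        and tag.parent.name == "div"
        and not list(tag.previous_siblings)
        and not list(tag.next_siblings)
    )

candidateANodeList = soup.find_all(isMergeableLinkNode)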
Reference:
soup.find_all(id=True)
It seems this can be written as:
aNodeList = soup.find_all("a", attrs={"href": True})
Let's try it.
It works.
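Note that the attrs dict is not strictly required here; find_all also accepts attributes as keyword arguments, so the shorthand form should be equivalent:

aNodeList = soup.find_all("a", href=True)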
Also, for
<div><a href="http://mirrors.aliyun.com/">mirrors.aliyun.com</a></div>
I wanted to check that the a node has no children.
But using
        childrenGenerator = eachANode.children
        childList = list(childrenGenerator)
        if childList:
it turned out that childList does have a value:
['mirrors.aliyun.com']
So I went looking for how to get the child nodes and check whether they are empty.
It seems this is the only way to check it;
though perhaps descendants should be used instead:
        descendantGenerator = eachANode.descendants
        descendantList = list(descendantGenerator)
        if descendantList:
Here it is likewise:
['mirrors.aliyun.com']
So the only remaining check is:
there is one and only one child,
and it is directly the str value, i.e. the aStr from before.
After that comes the rest of the (somewhat involved) condition logic; the helper for the child check is sketched below.
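The code below calls that helper isNoMoreChildren, whose implementation is not shown in this post; an assumed minimal version, matching how it is used (True when the node contains exactly one child and that child is a plain text node):

from bs4 import NavigableString

def isNoMoreChildren(curNode):
    # assumed helper: node has exactly one child and it is a plain string, no nested tags
    childList = list(curNode.children)
    return len(childList) == 1 and isinstance(childList[0], NavigableString)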
Along the way, also see:
[Solved] How to delete a node in BeautifulSoup
[Summary]
In the end the code used here is:
import re

def mergePostTitleAndUrl(soup):
    """Merge post title and url


    Args:
        soup (BeautifulSoup soup): soup of evernote post
    Returns:
        processed soup
    Raises:
    """


    """
    <div>
        Python Mirror
    </div>
    <div>
        <a href="https://npm.taobao.org/mirrors/python/">
            https://npm.taobao.org/mirrors/python/
        </a>
    </div>
    """


    aNodeList = soup.find_all("a", attrs={"href": True})
    aNodeListLen = len(aNodeList)
    
    for eachANode in aNodeList:
        # prevSiblingList = eachANode.find_previous_siblings()
        # nextSiblingList = eachANode.find_next_siblings()
        prevSiblingGenerator = eachANode.previous_siblings
        prevSiblingList = list(prevSiblingGenerator)
        nextSiblingGenerator = eachANode.next_siblings
        nextSiblingList = list(nextSiblingGenerator)
        if prevSiblingList or nextSiblingList:
            continue


        aStr = eachANode.string
        if not aStr:
            continue
        aStr = aStr.strip()
        if not aStr:
            continue


        hrefValue = eachANode["href"]
        if not hrefValue:
            continue


        # <div><a href="https://npm.taobao.org/mirrors/">https://npm.taobao.org/mirrors/</a></div>
        # <div><a href="https://npm.taobao.org/mirrors/python/">https://npm.taobao.org/mirrors/python/</a></div>
        # if hrefValue != aStr:
        hrefP = "(https?://)?%s/?" % aStr
        isSameUrl = re.match(hrefP, hrefValue, re.I)
        isNotSameUrl = not isSameUrl
        if isNotSameUrl:
            # (1) <div><a href="https://juejin.im/post/5c739c86e51d45699514ee0c">MacOS 下利用 pyenv 管理Python 版本和虚拟环境 - 掘金</a></div>
            # (2) <div><a href="http://mirrors.aliyun.com/">mirrors.aliyun.com</a></div>
            #       -> '(https?://)?mirrors.aliyun.com/?' == 'http://mirrors.aliyun.com/'
            continue


        isCurNoChild = isNoMoreChildren(eachANode)
        isCurHasChild = not isCurNoChild
        if isCurHasChild:
            continue


        # only one parent: div
        # parentDivNode = eachANode.find_parent("div")
        parentDivNode = eachANode.parent
        if not parentDivNode:
            continue


        # parent prev is div
        # parentPrevSibling = parentDivNode.find_previous_sibling()
        parentPrevSibling = parentDivNode.previous_sibling
        isParentPrevSiblingNotExist = not parentPrevSibling
        if isParentPrevSiblingNotExist:
            continue


        isParentPrevSiblingNameNotDiv = parentPrevSibling.name != "div"
        if isParentPrevSiblingNameNotDiv:
            continue


        parentPrevSiblingStr = parentPrevSibling.string
        isParentPrevSiblingStrEmpty = not parentPrevSiblingStr
        if isParentPrevSiblingStrEmpty:
            continue


        isParentPrevSiblingNoChild = isNoMoreChildren(parentPrevSibling)
        isParentPrevSiblingHasChild = not isParentPrevSiblingNoChild
        if isParentPrevSiblingHasChild:
            continue


        # other possible logic check
        # (1) title best contain some char: ' ' or '|' or '-'
        foundSpecialCharInTitle = re.search("[ \|\—]", parentPrevSiblingStr)
        isTitleNoSpecialChar = not foundSpecialCharInTitle
        if isTitleNoSpecialChar:
            continue


        # match all condition -> merge title and url
        # delete div
        parentPrevSibling.decompose()
        # replace new a node
        eachANode.string = parentPrevSiblingStr


    return soup
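A quick self-contained usage sketch to see the merge happen (the html fragment mirrors the Python Mirror example above, written compactly so there are no whitespace-only siblings; mergePostTitleAndUrl and the assumed isNoMoreChildren helper are as defined above):

from bs4 import BeautifulSoup

demoHtml = '<div>Python Mirror</div><div><a href="https://npm.taobao.org/mirrors/python/">https://npm.taobao.org/mirrors/python/</a></div>'
soup = BeautifulSoup(demoHtml, "html.parser")
mergedSoup = mergePostTitleAndUrl(soup)
print(mergedSoup.prettify())
# expect something like:
# <div>
#  <a href="https://npm.taobao.org/mirrors/python/">
#   Python Mirror
#  </a>
# </div>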
After this processing, the ordinary title-and-url merge works. Opening the processed html in a browser to check the effect:
it is what we hoped to see:
for the first link, the title before it contains no space or other special character, so it is not merged;
the following 3 meet the conditions and are merged.
In addition,
the one that was already merged before processing does not match either (isNotSameUrl),
so it is left untouched.
[Postscript 20201125]
While working on:
[Unsolved] Use Python to publish Evernote note content to a WordPress site
the code needed to be updated again, several times:
import re
import logging

from bs4 import BeautifulSoup

def mergeNoteTitleAndUrl(noteDetail):
    """Merge title and url in note content


    Args:
        noteDetail (Note): evernote note with detail
    Returns:
        updated Note
    Raises:
    """
    curContent = noteDetail.content
    logging.debug("curContent=%s", curContent)
    soup = BeautifulSoup(curContent, 'html.parser')


    aNodeList = soup.find_all("a", attrs={"href": True})
    for eachANode in aNodeList:
        # prevSiblingList = eachANode.find_previous_siblings()
        # nextSiblingList = eachANode.find_next_siblings()
        prevSiblingGenerator = eachANode.previous_siblings
        prevSiblingList = list(prevSiblingGenerator)
        nextSiblingGenerator = eachANode.next_siblings
        nextSiblingList = list(nextSiblingGenerator)
        if prevSiblingList or nextSiblingList:
            # the <a> here is just the link; its title should live in the previous <div>
            # if the <a> has siblings inside its own <div>, it does not fit that pattern, so skip it
            continue


        aStr = eachANode.string
        if not aStr:
            continue
        aStr = aStr.strip()
        # '怎么关闭QQ浏览器的手势快捷键-百度经验'
        if not aStr:
            continue


        hrefValue = eachANode["href"]
        # 'https://jingyan.baidu.com/article/7f766daf6eeea40000e1d026.html'
        if not hrefValue:
            continue


        """
        <div>
            Python Mirror
        </div>
        <div>
            <a href="https://npm.taobao.org/mirrors/python/">
                https://npm.taobao.org/mirrors/python/
            </a>
        </div>
        """
        # <div><a href="https://npm.taobao.org/mirrors/">https://npm.taobao.org/mirrors/</a></div>
        # <div><a href="https://npm.taobao.org/mirrors/python/">https://npm.taobao.org/mirrors/python/</a></div>
        # if hrefValue != aStr:
        hrefP = "(https?://)?%s/?" % aStr
        isSameUrl = re.match(hrefP, hrefValue, re.I)
        isNotSameUrl = not isSameUrl
        if isNotSameUrl:
            # (1) has add link into title:
            #   <a href="https://jingyan.baidu.com/article/7f766daf6eeea40000e1d026.html">怎么关闭QQ浏览器的手势快捷键-百度经验</a>
            #   <div><a href="https://juejin.im/post/5c739c86e51d45699514ee0c">MacOS 下利用 pyenv 管理Python 版本和虚拟环境 - 掘金</a></div>
            # (2) <div><a href="http://mirrors.aliyun.com/">mirrors.aliyun.com</a></div>
            #   -> '(https?://)?mirrors.aliyun.com/?' == 'http://mirrors.aliyun.com/'
            continue


        isCurNoChild = isNoMoreChildren(eachANode)
        isCurHasChild = not isCurNoChild
        if isCurHasChild:
            continue


        # only one parent: div
        # parentDivNode = eachANode.find_parent("div")
        parentDivNode = eachANode.parent
        if not parentDivNode:
            continue


        # parent prev is div
        # parentPrevSibling = parentDivNode.find_previous_sibling()
        parentPrevSibling = parentDivNode.previous_sibling
        isParentPrevSiblingNotExist = not parentPrevSibling
        if isParentPrevSiblingNotExist:
            continue


        isParentPrevSiblingNameNotDiv = parentPrevSibling.name != "div"
        if isParentPrevSiblingNameNotDiv:
            continue


        parentPrevSiblingStr = parentPrevSibling.string
        isParentPrevSiblingStrEmpty = not parentPrevSiblingStr
        if isParentPrevSiblingStrEmpty:
            continue


        isParentPrevSiblingNoChild = isNoMoreChildren(parentPrevSibling)
        isParentPrevSiblingHasChild = not isParentPrevSiblingNoChild
        if isParentPrevSiblingHasChild:
            continue


        # other possible logic check
        # (1) title best contain some char: ' ' or '|' or '-'
        foundSpecialCharInTitle = re.search("[ \|\—]", parentPrevSiblingStr)
        isTitleNoSpecialChar = not foundSpecialCharInTitle
        if isTitleNoSpecialChar:
            continue


        # match all condition -> merge title and url
        # delete div
        parentPrevSibling.decompose()
        # replace new a node
        eachANode.string = parentPrevSiblingStr


    updatedContent = soup.prettify()
    # updatedContent = str(soup)
    logging.info("updatedContent=%s", updatedContent)


    noteDetail.content = updatedContent


    return noteDetail
But then it raised an error:
[Solved] Python re.match error: exception "error: multiple repeat at position"
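Most likely that happens because aStr is substituted into the regex pattern unescaped, so a title or url containing regex metacharacters (e.g. + or *) yields an invalid pattern; a fix along these lines, escaping it first, should avoid it:

        # escape any regex metacharacters in aStr before building the pattern
        hrefP = "(https?://)?%s/?" % re.escape(aStr)
        isSameUrl = re.match(hrefP, hrefValue, re.I)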
