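While working on:
[Unsolved] Writing my own Python script to sync Evernote notes to WordPress
I found that posts containing indented lists lose their indentation after being published, so I first looked into:
[Solved] Why Evernote posts lose indentation after being published to WordPress
Then I set out to write code to handle it. Since that post had already been published, I looked for other posts with indentation to test with.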
I found two:
and:
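- In the indented list of the second post:
- Some text is also bold
- Need to work out how to handle and support this
- Some text is also of the title+link type
- Supporting the indentation is enough
- The earlier title+link merging does not need to be supported here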
Start debugging.
Here is what the content looks like at this point:
The Evernote HTML:
<div>是对的</div>
<ul>
  <li>
    <div>0 为集成显卡</div>
  </li>
  <li>
    <div>1 为独立显卡</div>
  </li>
  <li>
    <div>2 为自动切换</div>
  </li>
</ul>
<div><br /></div>
<div>mac 插电/电池 自动切换显卡/独显 - 知乎</div>
<div><a href="https://zhuanlan.zhihu.com/p/132679059">https://zhuanlan.zhihu.com/p/132679059</a></div>
<div>也可以根据情况设置模式</div>
<ul>
  <li>
    <div>sudo pmset -b GPUSwitch 0</div>
  </li>
  <ul>
    <li>
      <div>-b 表示 battery 为电池模式</div>
    </li>
    <ul>
      <li>
        <div>电池模式时:用过0 集成显卡 integrated graphics</div>
      </li>
    </ul>
  </ul>
  <li>
    <div>sudo pmset -c GPUSwitch 1</div>
  </li>
  <ul>
    <li>
      <div>-c 表示 charger 为电源模式</div>
    </li>
    <ul>
      <li>
        <div>电源模式时:用1=独立显卡 high performance graphic cards</div>
      </li>
    </ul>
  </ul>
</ul>
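Now to process the ul elements.
First, find the top-level ul elements, i.e. the ul elements directly under <en-note>.
This involves:
[Solved] Searching only direct child nodes, not other descendants, in BeautifulSoup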
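A minimal sketch of that direct-child search, assuming the note body is parsed with BeautifulSoup's built-in `html.parser`. The sample HTML and variable names here are made up for illustration; the key part is `recursive=False`, which restricts `find_all` to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one top-level ul that contains a nested ul
sampleHtml = (
    "<en-note>"
    "<div>intro</div>"
    "<ul><li><div>top</div></li><ul><li><div>nested</div></li></ul></ul>"
    "</en-note>"
)
soup = BeautifulSoup(sampleHtml, "html.parser")
enNoteSoup = soup.find("en-note")

# Default search: every ul anywhere below en-note
allUlList = enNoteSoup.find_all("ul")
# recursive=False: only ul elements that are direct children of en-note
directUlList = enNoteSoup.find_all("ul", recursive=False)

print(len(allUlList), len(directUlList))
```

Processing only the top-level lists and recursing from there avoids visiting each nested ul twice.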
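Continue writing the code.
The processing logic seems to be:
Find the bottom-level elements, i.e. those with no children.
Then check whether the current element is a div whose string is plain text,
and whether the div's parent is an li; if so, remove the div and keep only the string for the li.
One more case, to be handled later when it actually comes up:
the bottom-level div is not a plain string.
Let's try it: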
for curUlIdx, eachUlSoup in enumerate(directUlSoupList):
    logging.info("%s %s %s", "-"*20, curUlIdx, "-"*20)
    logging.info("before eachUlSoup=%s", eachUlSoup)
    removeDivInUl(eachUlSoup)
    logging.info("after eachUlSoup=%s", eachUlSoup)

def removeDivInUl(curSoup):
    """Remove unuseful / redundant div node inside ul->li
    """
    logging.info("parentSoup=%s", curSoup)
    if curSoup.children:
        # has child
        for eachChild in curSoup.children:
            removeDivInUl(eachChild)
    else:
        # no child
        isSelfDiv = curSoup.name == "div"
        # current is li ?
        if isSelfDiv:
            parentSoup = curSoup.parent
            logging.info("parentSoup=%s", parentSoup)
            if parentSoup:
                isParentLi = parentSoup.name == "li"
                if isParentLi:
                    parentSoup.string = curSoup.string
                    parentSoup.children = []
                    logging.info("parentSoup=%s", parentSoup)
    return
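Three BeautifulSoup behaviours are worth checking when debugging a first draft like this one. The snippet below is a small illustration on made-up HTML, not part of the original script: `.children` is an iterator (so it is always truthy), text nodes are `NavigableString` rather than `Tag`, and `.string` is `None` once a tag holds more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample: a plain div, a div with mixed content, and an empty tag
soup = BeautifulSoup(
    "<li><div>plain</div></li><li><div>mixed <b>bold</b></div></li><p></p>",
    "html.parser",
)
plainDiv, mixedDiv = soup.find_all("div")
emptyP = soup.find("p")

# .children is an iterator, so it is truthy even when the tag has no children;
# check .contents (a list) to test for emptiness instead
hasChildrenNaive = bool(emptyP.children)
hasChildrenReal = len(emptyP.contents) > 0

# Text nodes are NavigableString, not Tag, so a recursive walk should skip them
textNode = plainDiv.contents[0]
isTextNodeTag = isinstance(textNode, Tag)

# .string is None when a tag has more than one child
plainString = plainDiv.string
mixedString = mixedDiv.string
```

These points match the `isinstance(curSoup, Tag)` check and the `list(curSoup.children)` call that appear in the later version of the function.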
Debug it and see.
Along the way:
[Solved] How to replace the content of a child node in BeautifulSoup
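One way to replace a node with its own content is `Tag.unwrap()`, which removes the tag itself but leaves its children in place. A minimal sketch on made-up HTML follows; it assumes each li wraps its content in a single div, and it also shows that the bold-text case survives because the children are kept as they are.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one plain item and one item containing bold text
soup = BeautifulSoup(
    "<ul><li><div>plain</div></li><li><div>with <b>bold</b> text</div></li></ul>",
    "html.parser",
)

# find_all returns a list up front, so modifying the tree in the loop is safe
for eachDiv in soup.find_all("div"):
    parentSoup = eachDiv.parent
    if parentSoup is not None and parentSoup.name == "li":
        # Drop the div wrapper, keep whatever it contained
        eachDiv.unwrap()

resultHtml = str(soup)
print(resultHtml)
```

Unlike assigning `parentSoup.string`, this does not depend on the div holding exactly one plain string.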
During this, the following:
20201201 05:07:00 EvernoteToWordpress.py:542 INFO before eachUlSoup=<ul><li><div>0 为集成显卡</div></li><li><div>1 为独立显卡</div></li><li><div>2 为自动切换</div></li></ul>
became:
20201201 05:09:57 EvernoteToWordpress.py:544 INFO after eachUlSoup=<ul><li>0 为集成显卡</li><li>1 为独立显卡</li><li>2 为自动切换</li></ul>
which is what we want.
Continuing to debug:
20201201 05:10:49 EvernoteToWordpress.py:541 INFO -------------------- 1 --------------------
20201201 05:10:51 EvernoteToWordpress.py:542 INFO before eachUlSoup=<ul><li><div>sudo pmset -b GPUSwitch 0</div></li><ul><li><div>-b 表示 battery 为电池模式</div></li><ul><li><div>电池模式时:用过0 集成显卡 integrated graphics</div></li></ul></ul><li><div>sudo pmset -c GPUSwitch 1</div></li><ul><li><div>-c 表示 charger 为电源模式</div></li><ul><li><div>电源模式时:用1=独立显卡 high performance graphic cards</div></li></ul></ul></ul>
。。。
20201201 05:11:07 EvernoteToWordpress.py:544 INFO after eachUlSoup=<ul><li>sudo pmset -b GPUSwitch 0</li><ul><li>-b 表示 battery 为电池模式</li><ul><li>电池模式时:用过0 集成显卡 integrated graphics</li></ul></ul><li>sudo pmset -c GPUSwitch 1</li><ul><li>-c 表示 charger 为电源模式</li><ul><li>电源模式时:用1=独立显卡 high performance graphic cards</li></ul></ul></ul>
Comparing the two after formatting the HTML:
This is the desired result.
[Summary]
Then look at the final processing result.
The final code:
# imports needed by the excerpts below
import logging
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

# process list (ul/ol/...) indent
noteDetail = processListIndent(noteDetail)

def processListIndent(curNote):
    """process list (ul/ol/...) indent

    Args:
        curNote (Note): evernote Note
    Returns:
        Note
    Raises:
    """
    soup = utils.htmlToSoup(curNote.content)
    enNoteSoup = soup.find("en-note")

    # allSubUlSoupList = enNoteSoup.find_all("ul")
    # allSubUlSoupNum = len(allSubUlSoupList)
    # logging.info("Found %d all sub level ul list", allSubUlSoupNum)

    # only the ul directly under en-note; nested ul are reached by recursion
    directUlSoupList = enNoteSoup.find_all("ul", recursive=False)
    directUlSoupNum = len(directUlSoupList)
    logging.info("Found %d top level ul list", directUlSoupNum)
    for curUlIdx, eachUlSoup in enumerate(directUlSoupList):
        logging.info("%s %s %s", "-"*20, curUlIdx, "-"*20)
        logging.info("before eachUlSoup=%s", eachUlSoup)
        removeDivInUl(eachUlSoup)
        logging.info("after eachUlSoup=%s", eachUlSoup)

    # soup changed, write back to note content
    updatedNoteHtml = crifanEvernote.soupToNoteContent(enNoteSoup)
    curNote.content = updatedNoteHtml
    return curNote

# called above as utils.htmlToSoup
def htmlToSoup(curHtml):
    """convert html to soup

    Args:
        curHtml (str): html str
    Returns:
        soup
    Raises:
    """
    soup = BeautifulSoup(curHtml, 'html.parser')
    return soup

def removeDivInUl(curSoup):
    """Remove unuseful / redundant div node inside ul->li
    """
    curSoupType = type(curSoup)
    logging.info("curSoupType=%s, curSoup=%s", curSoupType, curSoup)
    # curSoupType=<class 'bs4.element.Tag'>, curSoup=<ul><li><div>0 为集成显卡</div></li><li><div>1 为独立显卡</div></li><li><div>2 为自动切换</div></li></ul>
    if not isinstance(curSoup, Tag):
        # curSoupType=<class 'bs4.element.NavigableString'>, curSoup=0 为集成显卡
        return

    # copy to a list first, because the tree is modified during the walk
    childSoupList = list(curSoup.children)
    if curSoup.name != "div":
        # self is not div, just process each child
        for eachChildSoup in childSoupList:
            removeDivInUl(eachChildSoup)
        return

    parentSoup = curSoup.parent
    logging.info("parentSoup=%s", parentSoup)
    if parentSoup:
        if parentSoup.name != "li":
            # has parent, but it is not li
            return
    else:
        # no parent?
        logging.warning("to support")
        # nothing to unwrap into, so stop here instead of failing below
        return

    # isOnlyChildNotSoup = False
    # if childSoupList:
    #     childNum = len(childSoupList)
    #     if childNum == 1:
    #         onlyChildSoup = childSoupList[0]
    #         isOnlyChildIsSoup = isinstance(onlyChildSoup, Tag)
    #         if isOnlyChildIsSoup:
    #             # process it
    #             parentSoup.string = curSoup.string
    #             parentSoup.children = []
    #             logging.info("parentSoup=%s", parentSoup)
    #         else:
    #             logging.info("type(onlyChildSoup)=%s", type(onlyChildSoup))
    #             # type(onlyChildSoup)=<class 'bs4.element.NavigableString'>
    #             return
    #     else:
    #         for eachChildSoup in childSoupList:
    #             removeDivInUl(eachChildSoup)
    # else:
    #     # no child
    #     logging.warning("to support")

    # process it
    # parentSoup.contents = curSoup.contents
    # parentSoup.children = []
    # curSoup.name = "li"
    logging.info("before replace: curSoup=%s", curSoup)
    logging.info("before replace: parentSoup=%s", parentSoup)
    # curSoupCopy = copy.deepcopy(curSoup)
    # parentSoup.replace_with(curSoupCopy)
    # parentSoup.children = curSoup.children

    # Prerequisite: li only have one div child !
    parentSoup.div.unwrap()
    logging.info("after replace: parentSoup=%s", parentSoup)
    logging.info("after replace: curSoup=%s", curSoup)
    return

# called above as crifanEvernote.soupToNoteContent
@staticmethod
def soupToNoteContent(soup):
    """Convert BeautifulSoup Soup to Evernote Note content

    Args:
        soup (Soup): BeautifulSoup Soup
    Returns:
        Evernote Note content html(str)
    Raises:
    """
    noteContentHtml = utils.soupToHtml(soup)
    noteContentHtml = crifanEvernote.convertToClosedEnMediaTag(noteContentHtml)
    # add first line
    # <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
    noteContentHtml = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n' + noteContentHtml
    return noteContentHtml

# called above as utils.soupToHtml
def soupToHtml(soup):
    """Convert soup to html string

    Args:
        soup (Soup): BeautifulSoup soup
    Returns:
        html (str)
    Raises:
    """
    curHtml = soup.prettify()
    # curHtml = str(soup)
    return curHtml

# called above as crifanEvernote.convertToClosedEnMediaTag
@staticmethod
def convertToClosedEnMediaTag(noteHtml):
    """Process note content html, for special </en-media> will cause error, so need convert:
        <en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png"></en-media>
    to closed en-media tag:
        <en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png" />

    Args:
        noteHtml (str): Note content html
    Returns:
        note content html with closed en-media tag (str)
    Raises:
    """
    # raw strings, so \s and \g reach the regex engine unchanged
    noteHtml = re.sub(r"(?P<enMedia><en-media\s+[^<>]+)>\s*</en-media>", r"\g<enMedia> />", noteHtml, flags=re.S)
    return noteHtml
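The en-media conversion described in the docstring can be tried on its own. The sketch below uses a hypothetical standalone function name, `closeEnMediaTag`, with the same pattern written as raw strings; the sample inputs are made up, apart from the hash taken from the docstring.

```python
import re

def closeEnMediaTag(noteHtml):
    """Hypothetical standalone version: turn <en-media ...></en-media> into <en-media ... />"""
    return re.sub(
        r"(?P<enMedia><en-media\s+[^<>]+)>\s*</en-media>",  # opening tag, optional whitespace, closing tag
        r"\g<enMedia> />",                                   # keep the opening part, self-close it
        noteHtml,
        flags=re.S,
    )

inlineSample = '<div><en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png"></en-media></div>'
# prettify() puts the closing tag on its own line, which \s* is meant to absorb
prettifiedSample = '<en-media hash="abc" type="image/png">\n</en-media>'

inlineResult = closeEnMediaTag(inlineSample)
prettifiedResult = closeEnMediaTag(prettifiedSample)
print(inlineResult)
print(prettifiedResult)
```

The `\s*` between the two tags matters here because `soupToHtml` uses `prettify()`, which inserts line breaks between tags.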
This turns:
<ul>
  <li>
    <div>sudo pmset -b GPUSwitch 0</div>
  </li>
  <ul>
    <li>
      <div>-b = battery =电池模式</div>
    </li>
    <ul>
      <li>
        <div>电池模式 用 0 集成显卡 integrated graphics</div>
      </li>
    </ul>
  </ul>
  <li>
    <div>sudo pmset -c GPUSwitch 1</div>
  </li>
  <ul>
    <li>
      <div>-c = charger = 电源模式</div>
    </li>
    <ul>
      <li>
        <div>电源模式 用 1 独立显卡 high performance graphic cards</div>
      </li>
    </ul>
  </ul>
</ul>
into:
<ul>
  <li>
    sudo pmset -b GPUSwitch 0
  </li>
  <ul>
    <li>
      -b = battery =电池模式
    </li>
    <ul>
      <li>
        电池模式 用 0 集成显卡 integrated graphics
      </li>
    </ul>
  </ul>
  <li>
    sudo pmset -c GPUSwitch 1
  </li>
  <ul>
    <li>
      -c = charger = 电源模式
    </li>
    <ul>
      <li>
        电源模式 用 1 独立显卡 high performance graphic cards
      </li>
    </ul>
  </ul>
</ul>
which is as expected.
With this, the indentation can be preserved.
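[Postscript 20201205]
Later, however, the indentation was still lost, and the cause finally turned out to be a WordPress setting.
Solution: Settings -> Writing -> Formatting -> uncheck: WordPress should correct invalidly nested XHTML automatically
For details, see:
[Solved] HTML with indentation published to WordPress from Python gets changed and loses its indentation
[Solved] Why the HTML in WordPress gets changed so that the ul+li indentation is lost
[Postscript 20201206]
The result for one post after updating, with the indentation preserved and not lost: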