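Working on:
[Unsolved] Writing my own Python script to sync Evernote notes to WordPress
Along the way I found that posts containing indented lists lose their indentation once published, so I first looked into:
[Solved] Why Evernote posts lose indentation after being published to WordPress
Next comes the code to handle it. That post has already been published, though, so I looked for other notes with indentation to test with.
I found two: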

and:

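- In the second one, inside the indented list:
- some text is also bold
- pay attention to how to handle and support this
- some text also uses the title+link form
- supporting the indentation is enough
- the earlier title+link merging does not need to be supported here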
Time to debug.
Here is what the note currently looks like:

The Evernote HTML:
<div>是对的</div> <ul> <li> <div>0 为集成显卡</div> </li> <li> <div>1 为独立显卡</div> </li> <li> <div>2 为自动切换</div> </li> </ul> <div><br /></div> <div>mac 插电/电池 自动切换显卡/独显 - 知乎</div> <div><a href="https://zhuanlan.zhihu.com/p/132679059">https://zhuanlan.zhihu.com/p/132679059</a></div> <div>也可以根据情况设置模式</div> <ul> <li> <div>sudo pmset -b GPUSwitch 0</div> </li> <ul> <li> <div>-b 表示 battery 为电池模式</div> </li> <ul> <li> <div>电池模式时:用过0 集成显卡 integrated graphics</div> </li> </ul> </ul> <li> <div>sudo pmset -c GPUSwitch 1</div> </li> <ul> <li> <div>-c 表示 charger 为电源模式</div> </li> <ul> <li> <div>电源模式时:用1=独立显卡 high performance graphic cards</div> </li> </ul> </ul> </ul>
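Now process the ul.
First, find the top-level ul elements, i.e. the ul directly under <en-note>.
This involves:
[Solved] In BeautifulSoup, search only direct children, not other descendants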
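As a minimal sketch of that direct-children search (the sample HTML and the variable names below are made up for illustration, not taken from a real note), passing `recursive=False` to `find_all` limits the search to direct children of `<en-note>`:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified note body: two top-level ul, one of them with a nested ul
html = (
    "<en-note><div>intro</div>"
    "<ul><li>a</li><ul><li>a1</li></ul></ul>"
    "<ul><li>b</li></ul></en-note>"
)
soup = BeautifulSoup(html, "html.parser")
en_note = soup.find("en-note")

# Default search walks all descendants, so the nested ul is included
all_uls = en_note.find_all("ul")
# recursive=False only looks at direct children of en-note
top_uls = en_note.find_all("ul", recursive=False)

print(len(all_uls), len(top_uls))
```

Only the top-level lists need to be collected, since the nested ones are reached while walking each top-level ul.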
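Then continue with the code.
The processing logic seems to be:
Find the bottom-level elements, i.e. those with no children.
Then check whether the current element is a div whose string is plain text
and whose parent is an li; if so, drop the div and give the string to the li.
Also, to be handled later when it actually comes up:
the case where the bottom-level div is not a plain string.
Let's try it with a first draft: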
    for curUlIdx, eachUlSoup in enumerate(directUlSoupList):
        logging.info("%s %s %s", "-"*20, curUlIdx, "-"*20)
        logging.info("before eachUlSoup=%s", eachUlSoup)
        removeDivInUl(eachUlSoup)
        logging.info("after eachUlSoup=%s", eachUlSoup)

    def removeDivInUl(curSoup):
        """Remove unuseful / redundant div node inside ul->li
        """
        logging.info("curSoup=%s", curSoup)
        # NOTE (first draft): for a Tag, .children is an iterator and is always
        # truthy, so the else branch below is never reached as written
        if curSoup.children:
            # has child
            for eachChild in curSoup.children:
                removeDivInUl(eachChild)
        else:
            # no child
            isSelfDiv = curSoup.name == "div"
            # parent is li ?
            if isSelfDiv:
                parentSoup = curSoup.parent
                logging.info("parentSoup=%s", parentSoup)
                if parentSoup:
                    isParentLi = parentSoup.name == "li"
                    if isParentLi:
                        parentSoup.string = curSoup.string
                        parentSoup.children = []
                        logging.info("parentSoup=%s", parentSoup)
        return

Debug and see.
Along the way:
[Solved] How to replace the content of a child node in BeautifulSoup
During debugging, this:
20201201 05:07:00 EvernoteToWordpress.py:542 INFO before eachUlSoup=<ul><li><div>0 为集成显卡</div></li><li><div>1 为独立显卡</div></li><li><div>2 为自动切换</div></li></ul>
became:
20201201 05:09:57 EvernoteToWordpress.py:544 INFO after eachUlSoup=<ul><li>0 为集成显卡</li><li>1 为独立显卡</li><li>2 为自动切换</li></ul>
which is what we want.
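A small sketch of why `unwrap()` is a convenient way to do this replacement (the sample HTML is invented for illustration): when the div holds more than plain text, such as bold text plus a link, `div.string` is `None`, whereas `unwrap()` removes the div and leaves all of its children in place.

```python
from bs4 import BeautifulSoup

# Hypothetical list: one plain-text item, one item with bold text and a link
html = (
    "<ul><li><div>plain text</div></li>"
    '<li><div><b>bold</b> and <a href="https://example.com">a link</a></div></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# A div with several children has no single string to copy over
mixed_div = soup.find_all("div")[1]
mixed_string = mixed_div.string  # None, because the div has more than one child

# unwrap() drops the div tag itself but keeps its children inside the li
for div in soup.find_all("div"):
    if div.parent is not None and div.parent.name == "li":
        div.unwrap()

result = str(soup)
print(result)
```

This is only an illustration of the two BeautifulSoup calls. It does not handle an li that mixes a div with other content.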
Continue debugging:
20201201 05:10:49 EvernoteToWordpress.py:541 INFO -------------------- 1 --------------------
20201201 05:10:51 EvernoteToWordpress.py:542 INFO before eachUlSoup=<ul><li><div>sudo pmset -b GPUSwitch 0</div></li><ul><li><div>-b 表示 battery 为电池模式</div></li><ul><li><div>电池模式时:用过0 集成显卡 integrated graphics</div></li></ul></ul><li><div>sudo pmset -c GPUSwitch 1</div></li><ul><li><div>-c 表示 charger 为电源模式</div></li><ul><li><div>电源模式时:用1=独立显卡 high performance graphic cards</div></li></ul></ul></ul>
...
20201201 05:11:07 EvernoteToWordpress.py:544 INFO after eachUlSoup=<ul><li>sudo pmset -b GPUSwitch 0</li><ul><li>-b 表示 battery 为电池模式</li><ul><li>电池模式时:用过0 集成显卡 integrated graphics</li></ul></ul><li>sudo pmset -c GPUSwitch 1</li><ul><li>-c 表示 charger 为电源模式</li><ul><li>电源模式时:用1=独立显卡 high performance graphic cards</li></ul></ul></ul>
Comparison after formatting the HTML:

That is the desired result.
[Summary]
Now for the final result.
The final code:
    # process list (ul/ol/...) indent
    noteDetail = processListIndent(noteDetail)

    def processListIndent(curNote):
        """process list (ul/ol/...) indent

        Args:
            curNote (Note): evernote Note
        Returns:
            Note
        Raises:
        """
        soup = utils.htmlToSoup(curNote.content)
        enNoteSoup = soup.find("en-note")

        # allSubUlSoupList = enNoteSoup.find_all("ul")
        # allSubUlSoupNum = len(allSubUlSoupList)
        # logging.info("Found %d all sub level ul list", allSubUlSoupNum)

        # only the ul directly under en-note; nested ul are reached by recursion
        directUlSoupList = enNoteSoup.find_all("ul", recursive=False)
        directUlSoupNum = len(directUlSoupList)
        logging.info("Found %d top level ul list", directUlSoupNum)
        for curUlIdx, eachUlSoup in enumerate(directUlSoupList):
            logging.info("%s %s %s", "-"*20, curUlIdx, "-"*20)
            logging.info("before eachUlSoup=%s", eachUlSoup)
            removeDivInUl(eachUlSoup)
            logging.info("after eachUlSoup=%s", eachUlSoup)

        # soup changed, write back to note content
        updatedNoteHtml = crifanEvernote.soupToNoteContent(enNoteSoup)
        curNote.content = updatedNoteHtml
        return curNote

    from bs4 import BeautifulSoup

    def htmlToSoup(curHtml):
        """convert html to soup

        Args:
            curHtml (str): html str
        Returns:
            soup
        Raises:
        """
        soup = BeautifulSoup(curHtml, 'html.parser')
        return soup

    # needs: import logging; from bs4 import Tag
    def removeDivInUl(curSoup):
        """Remove unuseful / redundant div node inside ul->li
        """
        curSoupType = type(curSoup)
        logging.info("curSoupType=%s, curSoup=%s", curSoupType, curSoup)
        # curSoupType=<class 'bs4.element.Tag'>, curSoup=<ul><li><div>0 为集成显卡</div></li><li><div>1 为独立显卡</div></li><li><div>2 为自动切换</div></li></ul>
        if not isinstance(curSoup, Tag):
            # curSoupType=<class 'bs4.element.NavigableString'>, curSoup=0 为集成显卡
            return

        # snapshot the children, since the tree is modified while recursing
        childSoupList = list(curSoup.children)
        if curSoup.name != "div":
            # self is not div, just process each child
            for eachChildSoup in childSoupList:
                removeDivInUl(eachChildSoup)
            return

        parentSoup = curSoup.parent
        logging.info("parentSoup=%s", parentSoup)
        if parentSoup:
            if parentSoup.name != "li":
                # has parent, but not li
                return
        else:
            # no parent?
            logging.warning("to support")
            # nothing to unwrap into, so stop here instead of failing below
            return

        # isOnlyChildNotSoup = False
        # if childSoupList:
        #     childNum = len(childSoupList)
        #     if childNum == 1:
        #         onlyChildSoup = childSoupList[0]
        #         isOnlyChildIsSoup = isinstance(onlyChildSoup, Tag)
        #         if isOnlyChildIsSoup:
        #             # process it
        #             parentSoup.string = curSoup.string
        #             parentSoup.children = []
        #             logging.info("parentSoup=%s", parentSoup)
        #         else:
        #             logging.info("type(onlyChildSoup)=%s", type(onlyChildSoup))
        #             # type(onlyChildSoup)=<class 'bs4.element.NavigableString'>
        #             return
        #     else:
        #         for eachChildSoup in childSoupList:
        #             removeDivInUl(eachChildSoup)
        # else:
        #     # no child
        #     logging.warning("to support")

        # process it
        # parentSoup.contents = curSoup.contents
        # parentSoup.children = []
        # curSoup.name = "li"
        logging.info("before replace: curSoup=%s", curSoup)
        logging.info("before replace: parentSoup=%s", parentSoup)
        # curSoupCopy = copy.deepcopy(curSoup)
        # parentSoup.replace_with(curSoupCopy)
        # parentSoup.children = curSoup.children

        # Prerequisite: li only have one div child !
        parentSoup.div.unwrap()
        logging.info("after replace: parentSoup=%s", parentSoup)
        logging.info("after replace: curSoup=%s", curSoup)
        return

    @staticmethod
    def soupToNoteContent(soup):
        """Convert BeautifulSoup Soup to Evernote Note content

        Args:
            soup (Soup): BeautifulSoup Soup
        Returns:
            Evernote Note content html(str)
        Raises:
        """
        noteContentHtml = utils.soupToHtml(soup)
        noteContentHtml = crifanEvernote.convertToClosedEnMediaTag(noteContentHtml)
        # add first line
        # <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">
        noteContentHtml = '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n' + noteContentHtml
        return noteContentHtml

    def soupToHtml(soup):
        """Convert soup to html string

        Args:
            soup (Soup): BeautifulSoup soup
        Returns:
            html (str)
        Raises:
        """
        # prettify() re-indents the output and adds whitespace around text
        curHtml = soup.prettify()
        # curHtml = str(soup)
        return curHtml

    # needs: import re
    @staticmethod
    def convertToClosedEnMediaTag(noteHtml):
        """Process note content html, for special </en-media> will cause error, so need convert:
            <en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png"></en-media>
        to closed en-media tag:
            <en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png" />

        Args:
            noteHtml (str): Note content html
        Returns:
            note content html with closed en-media tag (str)
        Raises:
        """
        # raw strings, so \s and \g are passed to re as written
        noteHtml = re.sub(r"(?P<enMedia><en-media\s+[^<>]+)>\s*</en-media>", r"\g<enMedia> />", noteHtml, flags=re.S)
        return noteHtml
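
As a quick standalone check of that en-media regex (the helper name `close_en_media` and the sample strings are made up here; only the pattern comes from the code above), the following sketch also covers the case where `prettify()` has put a newline between the opening and closing tag:

```python
import re

def close_en_media(note_html):
    """Hypothetical standalone copy of the en-media regex, for experimenting."""
    return re.sub(
        r"(?P<enMedia><en-media\s+[^<>]+)>\s*</en-media>",
        r"\g<enMedia> />",
        note_html,
        flags=re.S,
    )

# Open/close pair on one line
sample_inline = '<div><en-media hash="7c54d8d29cccfcfe2b48dd9f952b715b" type="image/png"></en-media></div>'
# Same pair split over two lines, as prettify() may produce
sample_multiline = '<en-media hash="abc" type="image/png">\n</en-media>'

converted_inline = close_en_media(sample_inline)
converted_multiline = close_en_media(sample_multiline)
print(converted_inline)
print(converted_multiline)
```

The `\s*` between `>` and `</en-media>` is what lets the pattern survive the re-indented output.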

This turns:
<ul> <li> <div>sudo pmset -b GPUSwitch 0</div> </li> <ul> <li> <div>-b = battery =电池模式</div> </li> <ul> <li> <div>电池模式 用 0 集成显卡 integrated graphics</div> </li> </ul> </ul> <li> <div>sudo pmset -c GPUSwitch 1</div> </li> <ul> <li> <div>-c = charger = 电源模式</div> </li> <ul> <li> <div>电源模式 用 1 独立显卡 high performance graphic cards</div> </li> </ul> </ul> </ul>
into:
<ul> <li> sudo pmset -b GPUSwitch 0 </li> <ul> <li> -b = battery =电池模式 </li> <ul> <li> 电池模式 用 0 集成显卡 integrated graphics </li> </ul> </ul> <li> sudo pmset -c GPUSwitch 1 </li> <ul> <li> -c = charger = 电源模式 </li> <ul> <li> 电源模式 用 1 独立显卡 high performance graphic cards </li> </ul> </ul> </ul>
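which is as expected.
With that, the indentation can be preserved.
[Postscript 20201205]
Later, however, the indentation was still lost. The cause finally turned out to be a WordPress setting.
Fix: Settings -> Writing -> Formatting -> uncheck "WordPress should correct invalidly nested XHTML automatically"
See:
[Solved] HTML with indentation published to WordPress via Python gets altered and loses its indentation
[Solved] Why WordPress alters HTML so that the ul+li indentation is lost
[Postscript 20201206]
One post after being updated, with the indentation preserved rather than lost: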
