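【Solved】WordPress Importer fails to recognize the author when importing an xml file
【Background】
After exporting a WXR-format xml file with a Python script, I imported it into WordPress using WordPress Importer. After choosing the file, the second step showed no recognizable author information, and in the import that followed, several of the few dozen posts failed to import.
The odd part is that the author fields in this xml file were the same as in several other xml files, and those other files all imported fine.
So something about this particular xml file had to be preventing the import.
【Solution process】
1. I noted which posts failed to import: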
Failed to import 文章 “USB基础知识概论 v0.5.pdf”
Failed to import 文章 “【转】各种嵌入式软硬件公司的简介”
Failed to import 文章 “【转】UNIX/LINUX 平台可执行文件格式分析”
Failed to import 文章 “键盘Keyboard中的扫描码Scan Code 通码Make code 断码Break Code”
Failed to import 文章 “【转】字符编码笔记:ASCII,Unicode和UTF-8”
Failed to import 文章 “字符编码标准及存储交换标准”
Failed to import 文章 “【转】怎样花两年时间去面试一个人”
Failed to import 文章 “【to learn】关于实时自动化相关学习知识的网站:Real Time Automation”
Failed to import 文章 “【已解决】运行wp-admin/install.php去安装wordpress,出错:您的 PHP 似乎没有安装运行 WordPress 所必需的 MySQL 扩展。”
Failed to import 文章 “记录 wordpress折腾过程”
Failed to import 文章 “【已解决】python中文字符乱码(GB2312,GBK,GB18030相关的问题)”
Failed to import 文章 “【已解决】Python中,将一个字符串eval或ast.literal_eval变成字典后,unicode的字符变成了\x格式”
Failed to import 文章 “【已解决】wordpress中,url地址包含的中文,虽然已经过urllib.quote解析过了,但是却还是访问出错:403 Forbidden”
Failed to import 文章 “【整理】wordpress中的代码/语法高亮插件:WP SyntaxHighlighter”
Failed to import 文章 “【已解决】想要通过分析网易博客的html源码,以得知网易是如何获得一个帖子的评论的”
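Looking at them, most of these posts were very long, so I guessed that WordPress might not support very long single posts, just as many blog services (Baidu Space, NetEase Blog, etc.) refuse to publish an entry once it exceeds a certain length.
So I removed most of the content of these posts in the xml, leaving very little, and imported again. It still failed.
The failure messages were: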
Import WordPress
Failed to import author . Their posts will be attributed to the current user.
and:
Import WordPress
Failed to import author . Their posts will be attributed to the current user.
Failed to import “”: Invalid post type
All done. Have fun!
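2. After a few more attempts, I copied those posts out one by one, added the corresponding WXR header so that each was a valid WXR file, and imported them on their own. That seemed to work, which suggests the posts themselves are valid. But I still did not know why the author in blog_163_[againinput4]_20120102_1812-2.xml could not be recognized, or why the import kept failing.
3. I searched online for cases of WordPress Importer failing to recognize the author, but nobody else seemed to have run into this problem.
However, in the wordpress-importer forum on wordpress.org: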
http://wordpress.org/tags/wordpress-importer?forum_id=10
I found this thread:
[Plugin: WordPress Importer] wordpress importer problem: all posts’ author become admin
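In it, someone mentioned enabling WordPress Importer's debug mode, i.e.
on line 17 of wp-content\plugins\wordpress-importer\wordpress-importer.php, change
define( 'IMPORT_DEBUG', false );
to:
define( 'IMPORT_DEBUG', true );
Then restart Apache, log out of WordPress, and log back in.
Importing again now shows detailed information about the import process: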
88:1615 CData section not finished 88:1615 PCDATA invalid Char value 16 88:1624 Entity ‘nbsp’ not defined 88:1630 Entity ‘nbsp’ not defined 88:1643 PCDATA invalid Char value 16 88:1672 PCDATA invalid Char value 16 88:1725 PCDATA invalid Char value 5 88:1886 PCDATA invalid Char value 3 88:1887 PCDATA invalid Char value 4 88:2092 PCDATA invalid Char value 4 88:2102 Entity ‘nbsp’ not defined 88:2108 Entity ‘nbsp’ not defined 88:2114 Entity ‘nbsp’ not defined 88:2126 PCDATA invalid Char value 3 88:2156 PCDATA invalid Char value 16 88:2175 Entity ‘nbsp’ not defined 88:2176 PCDATA invalid Char value 4 88:2229 PCDATA invalid Char value 5 88:2235 PCDATA invalid Char value 17 88:2287 PCDATA invalid Char value 25 88:2311 PCDATA invalid Char value 3 88:2312 PCDATA invalid Char value 4 88:2603 PCDATA invalid Char value 3 88:2605 PCDATA invalid Char value 16 88:2613 PCDATA invalid Char value 5 88:2658 PCDATA invalid Char value 6 88:2666 PCDATA invalid Char value 25 88:2686 PCDATA invalid Char value 3 88:2687 PCDATA invalid Char value 4 88:2696 PCDATA invalid Char value 24 88:2714 Entity ‘nbsp’ not defined 88:2720 Entity ‘nbsp’ not defined 88:2726 Entity ‘nbsp’ not defined 88:2732 Entity ‘nbsp’ not defined 88:2738 Entity ‘nbsp’ not defined 88:2744 Entity ‘nbsp’ not defined 88:2750 Entity ‘nbsp’ not defined 88:3071 PCDATA invalid Char value 4 88:3085 PCDATA invalid Char value 3 88:3086 PCDATA invalid Char value 4 88:3212 PCDATA invalid Char value 16 88:3306 PCDATA invalid Char value 24 88:3388 PCDATA invalid Char value 3 88:3389 PCDATA invalid Char value 4 88:3395 Opening and ending tag mismatch: encoded line 88 and p 88:3401 Opening and ending tag mismatch: item line 81 and pre 88:4590 Entity ‘nbsp’ not defined 88:4596 Entity ‘nbsp’ not defined 88:4602 Entity ‘nbsp’ not defined 88:4608 Entity ‘nbsp’ not defined 88:4614 Entity ‘nbsp’ not defined 88:4620 Entity ‘nbsp’ not defined 88:4626 Entity ‘nbsp’ not defined 88:4686 Entity ‘nbsp’ not defined 88:4692 Entity ‘nbsp’ not defined 
88:4698 Entity ‘nbsp’ not defined 88:4704 Entity ‘nbsp’ not defined 88:4710 Entity ‘nbsp’ not defined 88:4716 Entity ‘nbsp’ not defined 88:4722 Entity ‘nbsp’ not defined 88:4795 Entity ‘nbsp’ not defined 88:4801 Entity ‘nbsp’ not defined 88:4807 Entity ‘nbsp’ not defined 88:4813 Entity ‘nbsp’ not defined 88:4819 Entity ‘nbsp’ not defined 88:4825 Entity ‘nbsp’ not defined 88:4831 Entity ‘nbsp’ not defined 88:4895 Entity ‘nbsp’ not defined 88:4901 Entity ‘nbsp’ not defined 88:4907 Entity ‘nbsp’ not defined 88:4913 Entity ‘nbsp’ not defined 88:4919 Entity ‘nbsp’ not defined 88:4925 Entity ‘nbsp’ not defined 88:4931 Entity ‘nbsp’ not defined 88:4993 Entity ‘nbsp’ not defined 88:4999 Entity ‘nbsp’ not defined 88:5005 Entity ‘nbsp’ not defined 88:5011 Entity ‘nbsp’ not defined 88:5017 Entity ‘nbsp’ not defined 88:5023 Entity ‘nbsp’ not defined 88:5029 Entity ‘nbsp’ not defined 88:5102 Entity ‘nbsp’ not defined 88:5108 Entity ‘nbsp’ not defined 88:5114 Entity ‘nbsp’ not defined 88:5120 Entity ‘nbsp’ not defined 88:5126 Entity ‘nbsp’ not defined 88:5132 Entity ‘nbsp’ not defined 88:5138 Entity ‘nbsp’ not defined 88:5267 Entity ‘nbsp’ not defined 88:6279 Entity ‘nbsp’ not defined 88:7389 Entity ‘nbsp’ not defined 88:8733 Entity ‘nbsp’ not defined 88:10637 Entity ‘nbsp’ not defined 88:10699 Entity ‘nbsp’ not defined 88:10769 Entity ‘nbsp’ not defined 88:10835 Entity ‘nbsp’ not defined 88:10901 Entity ‘nbsp’ not defined 88:10949 Entity ‘nbsp’ not defined 88:10978 Entity ‘nbsp’ not defined 88:11071 Entity ‘nbsp’ not defined 88:11077 Entity ‘nbsp’ not defined 88:11089 Entity ‘nbsp’ not defined 88:11205 Entity ‘nbsp’ not defined 88:11282 Entity ‘nbsp’ not defined 88:11364 Entity ‘nbsp’ not defined 88:11888 Entity ‘nbsp’ not defined 88:12237 Entity ‘nbsp’ not defined 88:12708 Entity ‘nbsp’ not defined 88:12777 Entity ‘nbsp’ not defined 88:13282 Entity ‘nbsp’ not defined 88:13288 Entity ‘nbsp’ not defined 88:13294 Entity ‘nbsp’ not defined 88:13300 Entity ‘nbsp’ not defined 
88:13306 Entity ‘nbsp’ not defined 88:13312 Entity ‘nbsp’ not defined 88:13318 Entity ‘nbsp’ not defined 88:13324 Entity ‘nbsp’ not defined 88:13330 Entity ‘nbsp’ not defined 88:13336 Entity ‘nbsp’ not defined 88:13424 Entity ‘nbsp’ not defined 88:13512 Entity ‘nbsp’ not defined 88:13518 Entity ‘nbsp’ not defined 88:13524 Entity ‘nbsp’ not defined 88:13530 Entity ‘nbsp’ not defined 88:13536 Entity ‘nbsp’ not defined 88:13542 Entity ‘nbsp’ not defined 88:13548 Entity ‘nbsp’ not defined 88:13554 Entity ‘nbsp’ not defined 88:13560 Entity ‘nbsp’ not defined 88:13566 Entity ‘nbsp’ not defined 88:13572 Entity ‘nbsp’ not defined 88:13578 Entity ‘nbsp’ not defined 88:13584 Entity ‘nbsp’ not defined 88:13590 Entity ‘nbsp’ not defined 88:13596 Entity ‘nbsp’ not defined 88:13602 Entity ‘nbsp’ not defined 88:13608 Entity ‘nbsp’ not defined 88:13614 Entity ‘nbsp’ not defined 88:15521 Sequence ‘]]>’ not allowed in content 88:15521 internal error There was an error when reading this WXR file
Assign Authors If a new user is created by WordPress, a new password will be randomly generated and the new user’s role will be set as subscriber. Manually changing the new user’s details will be necessary. Import author: () Download and import file attachments |
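As you can see, the messages indicate that the post “【记录】DocBook开发过程 – 2” contains characters that cannot be parsed.
So I went to see what that post actually contained:
The original NetEase post:
http://againinput4.blog.163.com/blog/static/172799491201110111145259/
contains:

The corresponding content in the xml is:

The xml contains ASCII control characters such as DLE=16 and EOT=4, which is exactly what WordPress Importer reported during the import:
1213:2156 PCDATA invalid Char value 16, 1213:2176 PCDATA invalid Char value 4.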
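To see at a glance which control characters a debug log complains about, the "PCDATA invalid Char value N" messages can be tallied with a small script. This is only a minimal sketch: `summarize_invalid_chars` and `sample_log` are hypothetical names, and the sample string is a shortened, hand-made excerpt in the same format as the log above, not the real log.

```python
import re
from collections import Counter

# Standard abbreviations for the ASCII control characters 0..31.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]

# Matches entries such as "88:2156 PCDATA invalid Char value 16".
INVALID_CHAR = re.compile(r"(\d+):(\d+) PCDATA invalid Char value (\d+)")

def summarize_invalid_chars(log_text):
    """Count 'PCDATA invalid Char value N' messages per control character."""
    counts = Counter()
    for _line, _col, value in INVALID_CHAR.findall(log_text):
        code = int(value)
        name = CONTROL_NAMES[code] if code < len(CONTROL_NAMES) else "?"
        counts[(code, name)] += 1
    return counts

# Hand-made excerpt in the importer's log format (an assumption, not the full log).
sample_log = (
    "88:1615 PCDATA invalid Char value 16 88:1624 Entity 'nbsp' not defined "
    "88:1725 PCDATA invalid Char value 5 88:1887 PCDATA invalid Char value 4 "
    "88:2092 PCDATA invalid Char value 4"
)
summary = summarize_invalid_chars(sample_log)
for (code, name), n in sorted(summary.items()):
    print(f"{code:>3} {name:<3} x{n}")
```

Other messages, such as the undefined 'nbsp' entity, are simply skipped by the pattern.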
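So I deleted these ASCII control characters from the xml and imported again. This time the author was correctly recognized as crifan, and all the posts imported normally.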
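I removed the characters by hand, but the same cleanup could also be done in the Python export script before the text is written into the xml. The following is a minimal sketch under that assumption; `strip_invalid_xml_chars` is a hypothetical helper name, and it only removes the control characters below 0x20 other than tab, line feed, and carriage return.

```python
import re

# Control characters below 0x20 that XML 1.0 does not allow:
# everything except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_INVALID_XML_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_invalid_xml_chars(text):
    """Return text with the disallowed ASCII control characters removed."""
    return _INVALID_XML_CONTROL.sub("", text)

# Example: DLE (0x10) and EOT (0x04) are dropped, tab and newline are kept.
dirty = "DocBook\x10 notes\x04,\tline one\nline two"
clean = strip_invalid_xml_chars(dirty)
print(repr(clean))
```

Deleting the characters outright matches what I did manually; replacing them with a space would be another option if word boundaries matter.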
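【Summary】
If WordPress Importer fails to recognize the author, or reports similar errors, while importing a WXR-format xml file, go to
line 17 of wp-content\plugins\wordpress-importer\wordpress-importer.php and change
define( 'IMPORT_DEBUG', false );
to:
define( 'IMPORT_DEBUG', true );
This turns on debug mode, so you can see in detail what happens during the import, find the actual cause, and then fix the right thing.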
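A rough pre-check can also be run on the xml before uploading it, using Python's built-in parser. This is a sketch under stated assumptions: `find_first_xml_error` is a hypothetical helper, Python's parser is not the one the importer uses, and it stops at the first error, so its messages will not match the importer's debug output line for line.

```python
import xml.etree.ElementTree as ET

def find_first_xml_error(xml_text):
    """Return (line, column, message) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None

# Tiny stand-ins for a WXR file; the second one contains a DLE (0x10) character.
good = "<rss><channel><item><title>ok</title></item></channel></rss>"
bad = "<rss><channel><item><title>bad\x10title</title></item></channel></rss>"
print(find_first_xml_error(good))
print(find_first_xml_error(bad))
```

For a real file, read its contents and pass them to the function; a result of None only means this parser found the file well-formed.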
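【Aside】
I have to say, the people who wrote WordPress, or at least WordPress Importer, really are experts.
The debug facilities are done so well that when something goes wrong, turning on debug makes it easy to find out exactly what happened.
The other thing worth saying is that the open-source world is great. With the source code in hand, any problem can be solved. And if you cannot solve it yourself because you lack the knowledge, someone who knows more can solve it for you.
So it all comes down to that classic line:
Got a problem? Read the fucking code!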
【Tip】
If you are not familiar with the ASCII control characters, see:
Function/control characters in the ASCII character set
http://bbs.chinaunix.net/thread-3608423-1-1.html
【Postscript 2012-01-08】
Later I tested which ASCII control characters WordPress Importer actually supports,
by inserting every ASCII control character into a post as a test:
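A test body like this can be generated rather than typed. The sketch below is only an illustration: `build_control_char_test` is a hypothetical name, and the layout (two hex digits, a space, then the raw character, one per line) is an assumption, not necessarily the exact layout I used.

```python
def build_control_char_test(include_nul=True):
    """Build one entry per code point: two hex digits, a space, then the raw character."""
    # 0x00-0x20 plus 0x7F, the range covered by the test described above.
    codes = list(range(0x00, 0x21)) + [0x7F]
    if not include_nul:
        codes.remove(0x00)
    return "\n".join(f"{code:02X} {chr(code)}" for code in codes)

body = build_control_char_test()
# Show the first few entries with the control characters made visible.
print(repr(body[:20]))
```

Calling it with `include_nul=False` gives the second variant of the test, with 0x0 left out.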

Importing it, with all the ASCII control characters 0-0x20 and 0x7F included, gave:
89:7 Char 0x0 out of allowed range
89:7 CData section not finished
All ASCII control char test
00 N
89:7 Premature end of data in tag encoded line 88
89:7 Premature end of data in tag item line 81
89:7 Premature end of data in tag channel line 31
89:7 Premature end of data in tag rss line 23
There was an error when reading this WXR file
Details are shown above. The importer will now try again with a different parser…
After removing 0x0=NUL, the result was:
89:7 CData section not finished
All ASCII control char test
01 S
89:7 PCDATA invalid Char value 1
90:7 PCDATA invalid Char value 2
91:7 PCDATA invalid Char value 3
92:7 PCDATA invalid Char value 4
93:7 PCDATA invalid Char value 5
94:7 PCDATA invalid Char value 6
95:7 PCDATA invalid Char value 7
96:6 PCDATA invalid Char value 8
100:6 PCDATA invalid Char value 11
101:6 PCDATA invalid Char value 12
103:6 PCDATA invalid Char value 14
104:6 PCDATA invalid Char value 15
105:7 PCDATA invalid Char value 16
106:7 PCDATA invalid Char value 17
107:7 PCDATA invalid Char value 18
108:7 PCDATA invalid Char value 19
109:7 PCDATA invalid Char value 20
110:7 PCDATA invalid Char value 21
111:7 PCDATA invalid Char value 22
112:7 PCDATA invalid Char value 23
113:7 PCDATA invalid Char value 24
114:6 PCDATA invalid Char value 25
115:7 PCDATA invalid Char value 26
116:7 PCDATA invalid Char value 27
117:6 PCDATA invalid Char value 28
118:6 PCDATA invalid Char value 29
119:6 PCDATA invalid Char value 30
120:6 PCDATA invalid Char value 31
123:1 Sequence ‘]]>’ not allowed in content
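【Conclusion】
This confirms that WordPress Importer, and so apparently the WordPress blog system itself, only supports these characters from the tested range:
9=\t=tab
10=\n=LF=Line Feed
13=\r=CR=carriage return
32= =space,
and does not support:
the remaining characters in 0x00-0x1F. (0x7F=DEL was also part of the test, but the log above shows no error for it.)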
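This result lines up with the character range that XML 1.0 itself defines, which can be checked with a few lines of Python. This is a sketch of the XML 1.0 "Char" rule, not the importer's own code; `is_xml10_char` is a hypothetical name. Note that this rule does include 0x7F, so by the specification alone DEL would be accepted.

```python
def is_xml10_char(code):
    """True if the code point is allowed by the XML 1.0 'Char' production."""
    return (
        code in (0x09, 0x0A, 0x0D)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )

# Split the tested range 0x00-0x20 into allowed and rejected code points.
allowed = [c for c in range(0x00, 0x21) if is_xml10_char(c)]
rejected = [c for c in range(0x00, 0x21) if not is_xml10_char(c)]
print("allowed:", allowed)
print("rejected:", rejected)
```

The rejected list is 0 through 8, 11, 12, and 14 through 31, which are the values reported in the two test logs above.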