
【Tutorial】Handling strings and files in various character encodings with Python's codecs

Category: Python · Author: crifan

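【Background】

I have run into this many times: reading strings that are not plain ASCII from a file, or writing them to one.

I already knew that the codecs module is the tool for this.

Later I saw more than one person asking about it:

【Q&A】How do I fix the encoding problem when a Python crawler saves its results to a txt file?

How does Python read a file whose name contains special characters, such as xiân.txt

They hit similar problems and did not know how to handle them, so I am writing this tutorial to briefly explain how to use codecs.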
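For the second question above, here is a minimal sketch. It assumes Python 3, whereas the tutorial's own demo below is Python 2. The file name is passed as an ordinary Unicode string and the content encoding is named explicitly. The helper `write_and_read`, the temporary directory, and the sample text are illustrative only, and whether a given name can be created still depends on the file system.

```python
import codecs
import os
import tempfile


def write_and_read(directory, filename, text, encoding="utf-8"):
    """Write text to directory/filename, read it back, and return it."""
    path = os.path.join(directory, filename)
    # Passing an encoding makes codecs.open encode on write ...
    with codecs.open(path, "w", encoding) as fp:
        fp.write(text)
    # ... and decode on read, so the result is already text.
    with codecs.open(path, "r", encoding) as fp:
        return fp.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    result = write_and_read(tmp_dir, "xiân.txt", "中文测试")
    print(result == "中文测试")
```

The file name and the file content are encoded independently: the name is handled by the file system encoding, while the content uses whatever encoding is passed to `codecs.open`.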

【Using codecs in Python to handle files in various character encodings】

The complete example code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 syntax (print statement, u"" literals)
"""
Function:
[Tutorial] Handling strings and files in various character encodings with Python's codecs

Author:     Crifan Li
Version:    2013-10-20
Contact:    https://www.crifan.org/about/me
"""

import codecs


def python_codecs_demo():
    """demo how to use codecs to handle file with specific encoding"""
    testStrUnicode = u"中文测试Unicode字符串"
    print "testStrUnicode=", testStrUnicode
    # encode the unicode string into byte strings by hand
    testStrUtf8 = testStrUnicode.encode("UTF-8")
    testStrGbk = testStrUnicode.encode("GBK")

    outputFilename = "outputFile.txt"

    print "------------ 1.UTF-8 write and read ------------"
    print "--- (1) write UTF-8 string into file ---"
    # 'a+': read,write,append
    # 'w' : clear before, then write
    # no encoding is passed here, so the already-encoded bytes are written as-is
    outputFp = codecs.open(outputFilename, 'w')
    outputFp.write(testStrUtf8)
    outputFp.flush()
    outputFp.close()

    print "--- (2) read out previously written UTF-8 content ---"
    readoutFp = codecs.open(outputFilename, 'r', 'UTF-8')
    # here already is unicode, for we have passed "UTF-8" to codecs.open
    readOutStrUnicodeFromUtf8 = readoutFp.read()
    readoutFp.close()
    print "readOutStrUnicodeFromUtf8=", readOutStrUnicodeFromUtf8

    print "------------ 2.GBK write and read ------------"
    print "--- (1) write GBK string into file ---"
    # 'a+': read,write,append
    # 'w' : clear before, then write
    outputFp = codecs.open(outputFilename, 'w')
    outputFp.write(testStrGbk)
    outputFp.flush()
    outputFp.close()

    print "--- (2) read out previously written GBK content ---"
    readoutFp = codecs.open(outputFilename, 'r', 'GBK')
    # here already is unicode, for we have passed "GBK" to codecs.open
    readOutStrUnicodeFromGbk = readoutFp.read()
    readoutFp.close()
    print "readOutStrUnicodeFromGbk=", readOutStrUnicodeFromGbk

    print "Note: "
    print "1. more about encoding, please refer:"
    print u"【详解】python中的文件操作模式"
    print u"https://www.crifan.org/summary_python_file_operation_mode/"


if __name__ == "__main__":
    python_codecs_demo()

The output is:

E:\dev_root\python\tutorial_summary\python_codecs_demo>python_codecs_demo.py
testStrUnicode= 中文测试Unicode字符串
------------ 1.UTF-8 write and read ------------
--- (1) write UTF-8 string into file ---
--- (2) read out previously written UTF-8 content ---
readOutStrUnicodeFromUtf8= 中文测试Unicode字符串
------------ 2.GBK write and read ------------
--- (1) write GBK string into file ---
--- (2) read out previously written GBK content ---
readOutStrUnicodeFromGbk= 中文测试Unicode字符串
Note:
1. more about encoding, please refer:
【详解】python中的文件操作模式
https://www.crifan.org/summary_python_file_operation_mode/

As shown in the screenshot:

[Image: summary_python_file_operation_mode output]
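The demo in this tutorial is Python 2 code. As a rough sketch of the same write-then-read round trip under Python 3 (an assumption on my part, not the author's code), the encoding can be passed to `codecs.open` for writing as well, so the file object does the encoding and no manual `.encode()` call is needed. The helper name `roundtrip` and the temporary file are illustrative.

```python
import codecs
import os
import tempfile

TEST_STR = "中文测试Unicode字符串"


def roundtrip(text, encoding):
    """Write text in the given encoding, then read it back as str."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # 'w' clears the file first; the encoding argument makes write() accept str
        with codecs.open(path, "w", encoding) as output_fp:
            output_fp.write(text)
        # the same encoding on read gives back decoded text, not raw bytes
        with codecs.open(path, "r", encoding) as readout_fp:
            return readout_fp.read()
    finally:
        os.remove(path)


for enc in ("UTF-8", "GBK"):
    print(enc, roundtrip(TEST_STR, enc) == TEST_STR)
```

In Python 3 the built-in `open(path, mode, encoding=...)` accepts the same kind of encoding argument, so it can stand in for `codecs.open` in code like this.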

 

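Notes:

1. If you are not familiar with character encodings, see:

Character Encodings Explained in Detail

2. If you are not familiar with file operation modes, see:

【Detailed Explanation】File operation modes in Python

3. If you are not familiar with string encodings in Python, see:

Python Topic Tutorial: Strings and Character Encodings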
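As a small illustration of note 3, here is a sketch written for Python 3 (an assumption; in the Python 2 demo the same roles are played by `unicode` and `str`). Encoding turns text into bytes whose length depends on the encoding, and decoding with a codec that does not match can fail outright. The variable names are illustrative.

```python
text = "中文测试Unicode字符串"  # 7 Chinese characters + 7 ASCII letters

utf8_bytes = text.encode("UTF-8")  # 3 bytes per Chinese character here
gbk_bytes = text.encode("GBK")     # 2 bytes per Chinese character

print(len(text), len(utf8_bytes), len(gbk_bytes))  # prints: 14 28 21

# Decoding with the matching codec restores the original text.
assert utf8_bytes.decode("UTF-8") == text
assert gbk_bytes.decode("GBK") == text

# Decoding these GBK bytes as UTF-8 raises an error instead.
try:
    gbk_bytes.decode("UTF-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True
print(mismatch_detected)  # prints: True
```

This is why the demo reads each file back with the same encoding that was used to produce its bytes.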

 

【Summary】

It still takes reading the official API documentation and plenty of practice before encodings gradually start to make real sense.


