I have previously introduced some networking basics:
【整理】关于抓取网页,分析网页内容,模拟登陆网站的逻辑/流程和注意事项
as well as how simple web-page crawling is implemented in Python.
This post continues from there and shows how to implement the basic flow of emulating a website login in Python.
One prerequisite before reading on: I assume you have already read
【整理】关于抓取网页,分析网页内容,模拟登陆网站的逻辑/流程和注意事项
and understand the basic networking concepts, and that you have read
【总结】浏览器中的开发人员工具(IE9的F12和Chrome的Ctrl+Shift+I)-网页分析的利器
and know how to use tools such as IE9's F12 to analyze what a web page does as it executes.
Here I take emulating login to the Baidu main page as the example of how to emulate a website login with Python.
1. Before emulating a website login, work out the site's internal login logic
Before you write Python code to emulate logging in to the Baidu main page, you first need to understand what actually happens, internally, when you log in to the site yourself.
For how to use the tools to work out the internal logic of the Baidu main-page login, see:
【教程】手把手教你如何利用工具(IE9的F12)去分析模拟登陆网站(百度首页)的内部逻辑过程
2. Only then implement that login logic in your language, here Python
Once you understand the internal login flow of the Baidu main page as analyzed with F12 above, implementing it in Python code is comparatively easy.
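Before diving into the Python 2 listings below, note that the whole flow rests on one shared cookie jar attached to an opener, so cookies set by one request are replayed automatically on the next. A minimal sketch of that setup, shown here in Python 3 (`http.cookiejar` and `urllib.request` are the Python 3 counterparts of the `cookielib` and `urllib2` modules used in the listings):

```python
import http.cookiejar
import urllib.request

# one shared jar: cookies set by step 1 (BAIDUID) are replayed in steps 2 and 3
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)  # later urlopen calls go through this opener

print(len(cj))  # 0 until some response actually sets a cookie
```

No request is made here; the point is only that after `install_opener`, every later `urlopen` call shares the same jar without any manual cookie handling.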
Notes:
(1) If you are not yet familiar with handling cookies in Python, first read:
【已解决】Python中如何获得访问网页所返回的cookie
【已解决】Python中实现带Cookie的Http的Post请求
(2) If you are not familiar with regular expressions in general, or with Python's regular expressions in particular, look up a regular-expression tutorial first.
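As a concrete taste of the regex work in step 2: the token sits in a JavaScript line inside the returned HTML and can be captured with a named group. A small Python 3 sketch (the sample HTML line uses an illustrative token value, matching the format seen in the captured traffic):

```python
import re

# a sample line as returned by the getapi URL (token value is illustrative)
getapiRespHtml = "bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';"

# the named group "tokenVal" captures just the token
foundTokenVal = re.search(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';", getapiRespHtml)
if foundTokenVal:
    tokenVal = foundTokenVal.group("tokenVal")
    print(tokenVal)  # 5ab690978812b0e7fbbe1bfc267b90b3
```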
Here is the analyzed flow again, so it can be compared against the code:
| Step | URL to access | Method | Data sent | Value to obtain/extract from the response |
| --- | --- | --- | --- | --- |
| 1 | http://www.baidu.com/ | GET | none | the BAIDUID value from the returned cookies |
| 2 | https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true | GET | the BAIDUID cookie | the token value, extracted from the returned HTML |
| 3 | https://passport.baidu.com/v2/api/?login | POST | a set of POST data, in which token is the value extracted earlier | verify that the returned cookies contain BDUSS, PTOKEN, STOKEN and SAVEUSERID |
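The step-3 POST body in the table above is sent as a URL-encoded form; `urlencode` percent-encodes each value, which is why `staticpage` appears as `http%3A%2F%2F...` in the captured traffic. A quick Python 3 sketch with two of the form fields for illustration (`urllib.parse.urlencode` is the Python 3 home of Python 2's `urllib.urlencode`):

```python
from urllib.parse import urlencode

# two of the step-3 form fields, for illustration
postDict = {
    'staticpage': "http://www.baidu.com/cache/user/html/jump.html",
    'tpl'       : "mn",
}
postData = urlencode(postDict)
print(postData)
# staticpage=http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html&tpl=mn
```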
With that, we can finally write the Python code that demonstrates emulating login to the Baidu main page.
【Version 1: complete Python code for emulating Baidu main-page login — minimal version】
This is the relatively minimal version:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/
Note: Before trying to understand the following code, please first read the related articles:
(1)【整理】关于抓取网页,分析网页内容,模拟登陆网站的逻辑/流程和注意事项
(2)【教程】手把手教你如何利用工具(IE9的F12)去分析模拟登陆网站(百度首页)的内部逻辑过程
(3)【教程】模拟登陆网站 之 Python版(内含两种版本的完整的可运行的代码)
Version: 2012-11-06
Author: Crifan
"""

import re;
import cookielib;
import urllib;
import urllib2;
import optparse;

#------------------------------------------------------------------------------
# check whether all cookies in cookieNameList exist in cookieJar
def checkAllCookiesExist(cookieNameList, cookieJar) :
    cookiesDict = {};
    for eachCookieName in cookieNameList :
        cookiesDict[eachCookieName] = False;
    allCookieFound = True;
    for cookie in cookieJar :
        if(cookie.name in cookiesDict) :
            cookiesDict[cookie.name] = True;
    for eachCookie in cookiesDict.keys() :
        if(not cookiesDict[eachCookie]) :
            allCookieFound = False;
            break;
    return allCookieFound;

#------------------------------------------------------------------------------
# just for printing a delimiter line
def printDelimiter():
    print '-'*80;

#------------------------------------------------------------------------------
# main function to emulate login baidu
def emulateLoginBaidu():
    print "Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/";
    print "Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword";
    printDelimiter();

    # parse input parameters
    parser = optparse.OptionParser();
    parser.add_option("-u","--username",action="store",type="string",default='',dest="username",help="Your Baidu Username");
    parser.add_option("-p","--password",action="store",type="string",default='',dest="password",help="Your Baidu password");
    (options, args) = parser.parse_args();
    username = options.username;
    password = options.password;

    printDelimiter();
    print "[preparation] using CookieJar & HTTPCookieProcessor to automatically handle cookies";
    cj = cookielib.CookieJar();
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj));
    urllib2.install_opener(opener);

    printDelimiter();
    print "[step1] to get cookie BAIDUID";
    baiduMainUrl = "http://www.baidu.com/";
    resp = urllib2.urlopen(baiduMainUrl);
    #respInfo = resp.info();
    #print "respInfo=",respInfo;
    for index, cookie in enumerate(cj):
        print '[',index, ']',cookie;

    printDelimiter();
    print "[step2] to get token value";
    getapiUrl = "https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true";
    getapiResp = urllib2.urlopen(getapiUrl);
    getapiRespHtml = getapiResp.read();
    # the returned html contains something like:
    # bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';
    foundTokenVal = re.search(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';", getapiRespHtml);
    if(foundTokenVal):
        tokenVal = foundTokenVal.group("tokenVal");
        print "tokenVal=",tokenVal;

        printDelimiter();
        print "[step3] emulate login baidu";
        staticpage = "http://www.baidu.com/cache/user/html/jump.html";
        baiduMainLoginUrl = "https://passport.baidu.com/v2/api/?login";
        postDict = {
            #'ppui_logintime': "",
            'charset'    : "utf-8",
            #'codestring' : "",
            'token'      : tokenVal, # eg: de3dbf1e8596642fa2ddf2921cd6257f
            'isPhone'    : "false",
            'index'      : "0",
            #'u'          : "",
            #'safeflg'    : "0",
            'staticpage' : staticpage, # will be encoded to http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
            'loginType'  : "1",
            'tpl'        : "mn",
            'callback'   : "parent.bdPass.api.login._postCallback",
            'username'   : username,
            'password'   : password,
            #'verifycode' : "",
            'mem_pass'   : "on",
        };
        # urlencode automatically percent-encodes the parameter values
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(baiduMainLoginUrl, postData);
        # in most cases, for a POST request, the Content-Type is application/x-www-form-urlencoded
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
        resp = urllib2.urlopen(req);
        cookiesToCheck = ['BDUSS', 'PTOKEN', 'STOKEN', 'SAVEUSERID'];
        loginBaiduOK = checkAllCookiesExist(cookiesToCheck, cj);
        if(loginBaiduOK):
            print "+++ Emulate login baidu is OK, ^_^";
        else:
            print "--- Failed to emulate login baidu !";
    else:
        print "Fail to extract token value from html=",getapiRespHtml;

if __name__ == "__main__":
    emulateLoginBaidu();
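The `checkAllCookiesExist` helper above only ever reads each cookie's `.name`, so its logic can be exercised without a live login. A Python 3 sketch (the `SimpleNamespace` objects are stand-ins for real cookie-jar cookies, used purely for illustration):

```python
from types import SimpleNamespace

def checkAllCookiesExist(cookieNameList, cookieJar):
    # mark each wanted cookie as not yet found
    cookiesFound = {name: False for name in cookieNameList}
    for cookie in cookieJar:
        if cookie.name in cookiesFound:
            cookiesFound[cookie.name] = True
    return all(cookiesFound.values())

# stand-in cookies: only the .name attribute is consulted
jar = [SimpleNamespace(name=n) for n in ["BAIDUID", "BDUSS", "PTOKEN", "STOKEN", "SAVEUSERID"]]

print(checkAllCookiesExist(["BDUSS", "PTOKEN", "STOKEN", "SAVEUSERID"], jar))  # True
print(checkAllCookiesExist(["BDUSS", "NO_SUCH_COOKIE"], jar))                  # False
```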
【Version 2: complete Python code for emulating Baidu main-page login — crifanLib.py version】
This is an alternative version, which uses functions from my own Python library, crifanLib.py:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/
          Uses the functions from crifanLib.py
Note: Before trying to understand the following code, please first read the related articles:
(1)【整理】关于抓取网页,分析网页内容,模拟登陆网站的逻辑/流程和注意事项
(2)【教程】手把手教你如何利用工具(IE9的F12)去分析模拟登陆网站(百度首页)的内部逻辑过程
(3)【教程】模拟登陆网站 之 Python版(内含两种版本的完整的可运行的代码)
Version: 2012-11-07
Author: Crifan
Contact: admin (at) crifan.org
"""

import re;
import cookielib;
import urllib;
import urllib2;
import optparse;

#===============================================================================
# following are some functions, extracted from my python library: crifanLib.py
# for the whole crifanLib.py:
# online browser: http://code.google.com/p/crifanlib/source/browse/trunk/python/crifanLib.py
# download      : http://code.google.com/p/crifanlib/downloads/list
#===============================================================================
import zlib;

gConst = {
    'constUserAgent' : 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E)',
    #'constUserAgent' : "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
}

################################################################################
# Network: urllib/urllib2/http
################################################################################

#------------------------------------------------------------------------------
# get response from url
# note: if a cookiejar has already been installed via install_opener,
# then urllib2 will automatically use it here
def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False) :
    # make sure url is str, not unicode, otherwise urllib2.urlopen will error
    url = str(url);
    if (postDict) :
        postData = urllib.urlencode(postDict);
        req = urllib2.Request(url, postData);
        req.add_header('Content-Type', "application/x-www-form-urlencoded");
    else :
        req = urllib2.Request(url);

    # add default headers first
    defHeaderDict = {
        'User-Agent'    : gConst['constUserAgent'],
        'Cache-Control' : 'no-cache',
        'Accept'        : '*/*',
        'Connection'    : 'Keep-Alive',
    };
    for eachDefHd in defHeaderDict.keys() :
        req.add_header(eachDefHd, defHeaderDict[eachDefHd]);
    if(useGzip) :
        req.add_header('Accept-Encoding', 'gzip, deflate');
    # add customized headers afterwards -> allow them to overwrite the defaults
    if(headerDict) :
        for key in headerDict.keys() :
            req.add_header(key, headerDict[key]);

    if(timeout > 0) :
        # set timeout value if necessary
        resp = urllib2.urlopen(req, timeout=timeout);
    else :
        resp = urllib2.urlopen(req);
    return resp;

#------------------------------------------------------------------------------
# get the response html (body) from url
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True) :
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip);
    respHtml = resp.read();
    if(useGzip) :
        respInfo = resp.info();
        # a truly gzipped response carries the header "Content-Encoding: gzip"
        # sometimes the request asks for gzip,deflate but the returned html is
        # actually not gzipped -> the response info lacks "Content-Encoding: gzip"
        # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decompress when the data indeed is gzipped
        if( ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip")) :
            respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);
    return respHtml;

################################################################################
# Cookies
################################################################################

#------------------------------------------------------------------------------
# check whether all cookies in cookieNameList exist in cookieJar
def checkAllCookiesExist(cookieNameList, cookieJar) :
    cookiesDict = {};
    for eachCookieName in cookieNameList :
        cookiesDict[eachCookieName] = False;
    allCookieFound = True;
    for cookie in cookieJar :
        if(cookie.name in cookiesDict) :
            cookiesDict[cookie.name] = True;
    for eachCookie in cookiesDict.keys() :
        if(not cookiesDict[eachCookie]) :
            allCookieFound = False;
            break;
    return allCookieFound;

#===============================================================================

#------------------------------------------------------------------------------
# just for printing a delimiter line
def printDelimiter():
    print '-'*80;

#------------------------------------------------------------------------------
# main function to emulate login baidu
def emulateLoginBaidu():
    print "Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/";
    print "Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword";
    printDelimiter();

    # parse input parameters
    parser = optparse.OptionParser();
    parser.add_option("-u","--username",action="store",type="string",default='',dest="username",help="Your Baidu Username");
    parser.add_option("-p","--password",action="store",type="string",default='',dest="password",help="Your Baidu password");
    (options, args) = parser.parse_args();
    username = options.username;
    password = options.password;

    printDelimiter();
    print "[preparation] using CookieJar & HTTPCookieProcessor to automatically handle cookies";
    cj = cookielib.CookieJar();
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj));
    urllib2.install_opener(opener);

    printDelimiter();
    print "[step1] to get cookie BAIDUID";
    baiduMainUrl = "http://www.baidu.com/";
    resp = getUrlResponse(baiduMainUrl);
    # here you should see: BAIDUID
    for index, cookie in enumerate(cj):
        print '[',index, ']',cookie;

    printDelimiter();
    print "[step2] to get token value";
    getapiUrl = "https://passport.baidu.com/v2/api/?getapi&class=login&tpl=mn&tangram=true";
    getapiRespHtml = getUrlRespHtml(getapiUrl);
    # the returned html contains something like:
    # bdPass.api.params.login_token='5ab690978812b0e7fbbe1bfc267b90b3';
    foundTokenVal = re.search(r"bdPass\.api\.params\.login_token='(?P<tokenVal>\w+)';", getapiRespHtml);
    if(foundTokenVal):
        tokenVal = foundTokenVal.group("tokenVal");
        print "tokenVal=",tokenVal;

        printDelimiter();
        print "[step3] emulate login baidu";
        staticpage = "http://www.baidu.com/cache/user/html/jump.html";
        baiduMainLoginUrl = "https://passport.baidu.com/v2/api/?login";
        postDict = {
            #'ppui_logintime': "",
            'charset'    : "utf-8",
            #'codestring' : "",
            'token'      : tokenVal, # eg: de3dbf1e8596642fa2ddf2921cd6257f
            'isPhone'    : "false",
            'index'      : "0",
            #'u'          : "",
            #'safeflg'    : "0",
            'staticpage' : staticpage, # will be encoded to http%3A%2F%2Fwww.baidu.com%2Fcache%2Fuser%2Fhtml%2Fjump.html
            'loginType'  : "1",
            'tpl'        : "mn",
            'callback'   : "parent.bdPass.api.login._postCallback",
            'username'   : username,
            'password'   : password,
            #'verifycode' : "",
            'mem_pass'   : "on",
        };
        loginRespHtml = getUrlRespHtml(baiduMainLoginUrl, postDict);
        cookiesToCheck = ['BDUSS', 'PTOKEN', 'STOKEN', 'SAVEUSERID'];
        loginBaiduOK = checkAllCookiesExist(cookiesToCheck, cj);
        if(loginBaiduOK):
            print "+++ Emulate login baidu is OK, ^_^";
        else:
            print "--- Failed to emulate login baidu !";
    else:
        print "Fail to extract token value from html=",getapiRespHtml;

if __name__ == "__main__":
    emulateLoginBaidu();

The point of this version is to let later readers use the network-related functions without caring about their internals.
The extracted functions can also be reused in other projects.
Note: the whole crifanLib.py can be browsed online: crifanLib.py
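The gzip branch in `getUrlRespHtml` above decompresses only when the response really carries `Content-Encoding: gzip`. The decompression call itself is easy to verify offline; a Python 3 sketch of the same `zlib` usage (`16 + zlib.MAX_WBITS` tells `zlib` to expect a gzip container rather than a raw deflate stream):

```python
import gzip
import zlib

original = b"<html>hello</html>"
gzippedBody = gzip.compress(original)  # emulate a gzip-encoded HTTP body

# 16 + zlib.MAX_WBITS: decode data wrapped in a gzip container
respHtml = zlib.decompress(gzippedBody, 16 + zlib.MAX_WBITS)
print(respHtml == original)  # True
```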
Both versions of the code produce the same output:

D:\tmp\tmp_dev_root\python\emulate_login_baidu_python>emulate_login_baidu_python.py -u crifan -p xxxxxx
Function: Used to demonstrate how to use Python code to emulate login baidu main page: http://www.baidu.com/
Usage: emulate_login_baidu_python.py -u yourBaiduUsername -p yourBaiduPassword
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
[preparation] using CookieJar & HTTPCookieProcessor to automatically handle cookies
--------------------------------------------------------------------------------
[step1] to get cookie BAIDUID
[ 0 ] <Cookie BAIDUID=8D85C6528FDF7B5F49C746A18524495B:FG=1 for .baidu.com/>
--------------------------------------------------------------------------------
[step2] to get token value
tokenVal= 4d3f004bbe3e6f0cfa435abd38dd9fec
--------------------------------------------------------------------------------
[step3] emulate login baidu
+++ Emulate login baidu is OK, ^_^
【Summary】
Overall, analyzing the internal logic of a site's login process is considerably harder than writing the code that implements it.
And grasping the rough logic of the login process matters far more than the mechanics of any particular analysis tool.
When I first worked through all of this myself, there was no complete tutorial to follow, which is exactly why this series of posts exists: to explain the whole path from concepts, to logic, to analysis, to implementation.
After reading them all, you should have a reasonable overview of the topic.
What remains is practice: working through it yourself.
I hope the concepts, logic, methods, and code above are useful to you.