After installing Scrapy, I went to try out the official tutorial.
1. Created a new project via:
scrapy startproject tutorial
2. Following its code, edited items.py to match what the tutorial gives.
3. Created dmoz_spider.py and pasted in the code from the tutorial.
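For step 2, the item class that the spider code below imports looks roughly like this — a sketch of tutorial/items.py reconstructed from the three fields the spider actually uses (title, link, desc), using the Scrapy 0.16-era Item/Field API:

```python
# tutorial/items.py -- sketch of the item class the tutorial has you define
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()  # link text of each <li>'s anchor
    link = Field()   # href of that anchor
    desc = Field()   # free text inside the <li>
```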
But then, annoyingly, the tutorial never explains where the "dmoz/spiders" folder is supposed to live, or at what point that folder gets created.
With no better option, I had to experiment on my own.
First I created a dmoz folder at the same level as scrapy.cfg and the tutorial folder, created a spiders folder under it, and put dmoz_spider.py inside.
Then I ran it, and it failed:
E:\Dev_Root\python\Scrapy>cd tutorial
E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:47:27+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
  File "E:\dev_install_root\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "E:\dev_install_root\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 156, in <module>
    execute()
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\commands\crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'

What a frustrating tutorial — it clearly never explains the paths properly.
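In hindsight the error makes sense once you look at the generated tutorial/settings.py: Scrapy discovers spiders through the SPIDER_MODULES setting, which points inside the tutorial package — not at any top-level dmoz folder. The generated file looks roughly like this (a sketch of the Scrapy 0.16-era defaults from `scrapy startproject tutorial`):

```python
# tutorial/settings.py (as generated by "scrapy startproject tutorial")
BOT_NAME = 'tutorial'

# Spiders are imported from these Python modules, so the .py file
# must sit in tutorial/tutorial/spiders/, not in a folder named "dmoz".
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
```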
Later, after consulting:
scrapy newbie: tutorial. error when running scrapy crawl dmoz
I moved dmoz_spider.py under tutorial/tutorial/spiders and reran it, and this time it worked:
E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:51:40+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-11-11 19:51:40+0800 [dmoz] INFO: Spider opened
2012-11-11 19:51:40+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] INFO: Closing spider (finished)
2012-11-11 19:51:41+0800 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13061,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 11, 11, 11, 51, 41, 506000),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2012, 11, 11, 11, 51, 40, 630000)}
2012-11-11 19:51:41+0800 [dmoz] INFO: Spider closed (finished)

The Scrapy project's documentation really does seem lacking. Even this most basic tutorial is unclear about the paths involved, which is genuinely confusing and leaves a poor impression.
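For reference, this is the layout that finally worked — the placement of the spider file is exactly the part the tutorial glosses over (sketch; the non-spider files are what `scrapy startproject` generates):

```
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dmoz_spider.py   <- the spider goes here
```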
4. After that, I kept following the code the tutorial gives and tested it. The final version of dmoz_spider.py:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
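The parse() callback above pulls title, link, and desc out of each `<li>`. To see what that extraction does without running Scrapy at all, here is a toy sketch using only the standard library — ElementTree standing in for HtmlXPathSelector, on a hand-written snippet (the HTML content here is made up purely for illustration, not fetched from dmoz):

```python
import xml.etree.ElementTree as ET

# Hand-written snippet standing in for a fetched page
html = """<ul>
  <li><a href="/Computers/">Computers</a> a directory category</li>
  <li><a href="/Python/">Python</a></li>
</ul>"""

root = ET.fromstring(html)
items = []
for li in root.findall("li"):           # roughly hxs.select('//ul/li')
    a = li.find("a")
    items.append({
        "title": a.text,                # like site.select('a/text()')
        "link": a.get("href"),          # like site.select('a/@href')
        "desc": (a.tail or "").strip()  # roughly site.select('text()')
    })
print(items)
```

The real Scrapy selectors return lists from .extract(), which is why every field in the JSON output below is a list rather than a plain string.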
Then run:
scrapy crawl dmoz -o items.json -t json
which produced the following items.json:
[{"desc": ["\n "], "link": ["/"], "title": ["Top"]},
{"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
{"desc": [], "link": ["/Computers/Programming/"], "title": ["Programming"]},
{"desc": [], "link": ["/Computers/Programming/Languages/"], "title": ["Languages"]},
{"desc": [], "link": ["/Computers/Programming/Languages/Python/"], "title": ["Python"]},
{"desc": ["\n \t", "\u00a0", "\n "], "link": [], "title": []},
{"desc": ["\n ", " \n ", "\n "], "link": ["/Computers/Programming/Languages/Python/Resources/"], "title": ["Computers: Programming: Languages: Python: Resources"]},
...
]

[Summary]
From a quick skim of the links it provides, Scrapy does seem quite powerful.
The rest I'll dig into when I have time:
the docs cover almost everything, and it's well worth tinkering with.
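One closing aside on the export: because .extract() returns lists, every field in items.json is a list, and some rows come out empty, as visible above. A small sketch of cleaning that up after the fact (the two sample rows are copied from the output shown earlier):

```python
import json

# Two rows mirroring the items.json output shown earlier
raw = '''[
  {"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
  {"desc": ["\\n \\t"], "link": [], "title": []}
]'''

items = json.loads(raw)
# Keep only rows where a title and link were actually extracted,
# and unwrap the one-element lists into plain strings.
cleaned = [
    {"title": row["title"][0], "link": row["link"][0]}
    for row in items
    if row["title"] and row["link"]
]
print(cleaned)  # [{'title': 'Computers', 'link': '/Computers/'}]
```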
Please credit when reposting: 在路上 » [Notes] Working through Scrapy's Tutorial