Which is more convenient for writing web crawlers, Python or Ruby?

This topic was created 4678 days ago; the information in it may have since developed or changed.

There seem to be more of them written in Python.

Ruby

Python

13 replies

cam

2012-12-09 18:07:14 +08:00

Ruby is a bit faster. If you only need to parse HTML, use Nokogiri;
if you need to handle sessions, use mechanize.
Trust me, you can't go wrong!
https://github.com/sparklemotion/mechanize
https://github.com/sparklemotion/nokogiri

wingoo

2012-12-09 18:26:50 +08:00

python scrapy

fwee

2012-12-09 18:38:27 +08:00

There's not much difference between them. Let me also recommend EventMachine, an asynchronous library for Ruby.

jerry

2012-12-09 18:45:52 +08:00

@wingoo +1 for Python Scrapy. With it you basically only need to write the rules.

liuxurong

2012-12-09 19:51:43 +08:00

requests

muxi

2012-12-10 00:15:58 +08:00

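To those using Scrapy: have you ever had to add something by hand, or rework it when it didn't meet your needs?
That period was an absolute nightmare for me. The thing is really complicated. Unless you are building a general-purpose crawler, I'd suggest not using it.

For targeted scraping, something like requests is probably a better fit.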
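
As a rough illustration of what targeted scraping without a framework can look like, here is a minimal sketch. It uses only the Python 3 standard library (`urllib.request` and `html.parser`) in place of requests so that it has no dependencies; the names `LinkCollector`, `extract_links`, and `fetch`, as well as the User-Agent string, are made up for this example and do not come from the thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Parse an HTML string and return the absolute links it contains."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=30):
    """Download one page and decode it as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `extract_links(fetch(url), url)` would then return the links on one page. With requests installed, `fetch` could presumably be reduced to `requests.get(url, timeout=30).text`.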

zuroc

2012-12-10 00:23:51 +08:00

http://matrix.42qu.com/10732773

zuroc

2012-12-10 00:25:06 +08:00

Example code:

#coding:utf-8
# Python 2 code (note the print statement below).
from spider.spider import route, Handler, spider
import _env
from os.path import abspath, dirname, join
from operator import itemgetter

from mako.template import Template

PREFIX = join(dirname(abspath(__file__)))
HTTP = 'http://www.ecocn.org/%s'


@route(r'/portal\.php')
class portal(Handler):
    def get(self):
        # Queue every article link found on the listing page.
        for link in self.extract_all('<dt class="xs2"><a href="', '"'):
            spider.put(HTTP % link)


@route(r'/article-\d+-\d+\.html')
class article(Handler):
    def get(self):
        # Follow the link to the bilingual (Chinese-English) forum thread.
        link = self.extract('class="pn" href="', '" target=""> 中英对照')
        if link:
            spider.put(HTTP % link)


@route(r'/forum\.php')
class forum(Handler):
    template = Template(filename=join(PREFIX, 'template/rss.xml'))

    # Shared across all handler instances: (tid, url, title, html) tuples.
    page = []

    def get(self):
        name = self.extract('id="thread_subject">', '</a>')
        if not name:
            return
        name = name.split(']', 1)[-1].strip()
        html = self.extract('<div class="t_fsz">', '<div id="comment_')
        if not html:
            return
        html = html[:html.rfind('</div>')]
        tid = int(self.get_argument('tid'))
        print tid, name
        self.page.append((tid, self.request.url, name, html))

    @classmethod
    def write(cls):
        # Newest threads (highest tid) first.
        page = cls.page
        page.sort(key=itemgetter(0), reverse=True)
        with open(join(PREFIX, 'ecocn_org.xml'), 'w') as rss:
            rss.write(
                cls.template.render(
                    rss_title='经济学人 . 中文网',
                    rss_link='http://www.ecocn.org',
                    li=[
                        dict(link=link, title=title, txt=txt)
                        for _tid, link, title, txt in page
                    ]
                )
            )


if __name__ == '__main__':
    spider.put('http://www.ecocn.org/portal.php?mod=list&catid=1')
    # 10 concurrent fetch threads, 30-second page read timeout
    spider.run(10, 30)
    forum.write()
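
The handlers above pull text out of pages by giving a start marker and an end marker rather than by parsing the HTML. The spider library's own implementation is not shown in this thread, so the following is only a guess at the idea, written as standalone Python 3 functions: `extract` and `extract_all` here take the page text as an explicit first argument, and their behaviour on a missing marker (returning `None` or stopping) is an assumption.

```python
def extract(text, begin, end):
    """Return the text between the first `begin` marker and the next `end`.

    Returns None when either marker is missing (assumed behaviour).
    """
    start = text.find(begin)
    if start < 0:
        return None
    start += len(begin)
    stop = text.find(end, start)
    if stop < 0:
        return None
    return text[start:stop]


def extract_all(text, begin, end):
    """Yield every non-overlapping span found between `begin` and `end`."""
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start < 0:
            return
        start += len(begin)
        stop = text.find(end, start)
        if stop < 0:
            return
        yield text[start:stop]
        # Continue searching after the end marker just consumed.
        pos = stop + len(end)
```

Marker-based extraction like this is quick to write for a single known site, but it breaks as soon as the page's markup changes, which is the usual trade-off against a real HTML parser.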

kenlen

2012-12-10 00:29:20 +08:00

I just saw an example of using Python and Scrapy together to scrape pretty pictures: http://bbs.chinaunix.net/thread-4057457-1-1.html

oa414

2012-12-10 01:04:26 +08:00

HowardMei

2012-12-10 10:48:22 +08:00

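When it comes to crawlers, Python is far ahead of other languages:
https://scraperwiki.com/docs/python/python_libraries/
Ruby's libraries are not as comprehensive as Python's, and many of them were ported over from Python.
This scraping platform is itself written in Python: https://bitbucket.org/ScraperWiki/scraperwiki/src

After all, the python is the king of crawling creatures :)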

HowardMei

2012-12-10 11:48:36 +08:00

There's also a framework for semantic analysis that is quite interesting: http://www.clips.ua.ac.be/pages/pattern

messense

2012-12-10 12:15:18 +08:00

Python: the pattern + pyquery libraries