multiprocessing 多进程的的方式读取一个目录下的几百万文件, 程序运行的前 10 几秒能快速处理完大约 100W 文件,然后性能急剧下降,每秒中只能处理几百个文件了,百思不得其解啊。 多次运行程序都是会在 100W 左右变慢,机器 64G 内存,8 核 16 线程,机器压力不算大
代码如下:
import traceback from multiprocessing import Pool, Queue, Process
def producer(path): try: with open(path) as f: for line in f: if line.startswith('<title>'): global bug buf.put(u'{0}\t{1}'.format(path, line).encode('utf-8')) break except: traceback.print_exc() print path
def consumer(buf): with open('baidu_wiki_title.tsv', 'w') as out: while True: try: line = buf.get(timeout=30) out.write(line) except: pass
def generate(): with open('wiki.txt') as f: count = 0 for line in f: count += 1 if count % 10000==0: print count yield 'wiki/' + line.strip().split()[-1]
buf = Queue(10000000) p = Process(target=consumer, args=(buf,)) p.start() pool = Pool(100) pool.imap(producer, list(generate())) pool.close() pool.join()
1
peihanw 2017-08-26 10:53:54 +08:00
有可能是磁盘 IO 瓶颈,"iostat -x 5" 看一下,如果%util 接近 100 就说明磁盘 IO 已占满,进程都在等 IO 操作返回。如果是这种情况比较难优化。
|