A Record of a Completely Pointless Scraping Experience

2016/9/6 posted in Python

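A while ago I got a little tired of my Mac's desktop background and wanted to find a few new wallpapers. I searched for Dota themes and found this rather nice series.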

RUBICK GRAND MAGUS

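The artist is SHERON1030, and the site has plenty of other pictures that would make good wallpapers.
But downloading the gallery one image at a time is tedious, and I could not find any way to grab it in bulk, so I decided to write a scraper.
I had never written a scraper before, and I had only been learning Python for a few days, so I would just learn as I wrote!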

import urllib2
import re

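def downImage(address):
    # Directory the images are saved into
    path = '/Users/LiSheng/PycharmProjects/untitled'
    # The part of the URL starting at 'dota_2' becomes the file name
    nameString = re.compile(r'dota_2\S*').findall(address)
    if not nameString:
        print('skipped, no dota_2 file name in ' + address)
        return
    # First label of the host name, used as a prefix for the file name
    n = address.split('/')[2].split('.')[0]
    name = path + '/' + n + '_' + nameString[0]
    print(name)
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    # headers must be passed by keyword: the second positional argument is the POST data
    req = urllib2.Request(address, headers={'User-Agent': user_agent})
    # Open the request object (not the bare URL) so the header is actually sent
    conn = urllib2.urlopen(req)
    with open(name, 'wb') as f:
        f.write(conn.read())
    conn.close()
    print(nameString[0] + ' download finish!')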
    
imageList=[]

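def addImage(address):
    # Append the image URL to a text file and keep it in the in-memory list;
    # the with block closes the file, which the original never did
    with open('/Users/LiSheng/PycharmProjects/untitled/imageAddress.txt', 'a') as f:
        f.write(address + '\n')
    imageList.append(address)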

def openAddress(address):
    # Fetch one gallery page and collect the links to the individual art pages
    response = urllib2.urlopen(address)
    unicodePage = response.read().decode('utf-8')
    # Dots are escaped so they match a literal '.', not any character
    pattern = re.compile(r'"http://\S*\.deviantart\.com/art/Dota-2-\S*"')
    urls = pattern.findall(unicodePage)

    for url in urls:
        url = url[1:-1]  # strip the surrounding double quotes
        newUnicodePage = urllib2.urlopen(url).read().decode('utf-8')
        imagePattern = re.compile(r'<img collect_rid="\S*" src="\S*\.jpg"')
        for i in imagePattern.findall(newUnicodePage):
            # i looks like: <img collect_rid="..." src="...jpg"
            x = i.split()
            xsrc = x[2][5:-1]  # drop the leading src=" and the trailing quote
            #downImage(xsrc)
            print(xsrc)
            addImage(xsrc)

url2='http://sheron1030.deviantart.com/gallery/?offset=24'
url3='http://sheron1030.deviantart.com/gallery/?offset=48'

openAddress('http://sheron1030.deviantart.com/gallery/')
openAddress(url2)
openAddress(url3)
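
The listing above is Python 2 (`urllib2`). For anyone reading this on Python 3, here is a rough sketch of the same download step using `urllib.request`. The helper names `gallery_pages`, `local_name`, `build_request` and `download_image` are made up for this sketch, the page size of 24 is only inferred from the `offset=24` and `offset=48` URLs, and the `example.net` address in the usage note is a placeholder, not a real image URL.

```python
import os
import re
import urllib.request

# Same User-Agent string as the listing above
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
FILE_PATTERN = re.compile(r'dota_2\S*')


def gallery_pages(base, page_size=24, pages=3):
    """Return the gallery URL followed by its '?offset=N' pages.

    page_size=24 is an assumption taken from the offsets used above.
    """
    return [base if i == 0 else '%s?offset=%d' % (base, i * page_size)
            for i in range(pages)]


def local_name(image_url, directory):
    """Build '<first host label>_<dota_2... file name>' inside directory."""
    host_label = image_url.split('/')[2].split('.')[0]
    match = FILE_PATTERN.search(image_url)
    if match is None:
        raise ValueError('no dota_2 file name in %r' % image_url)
    return os.path.join(directory, host_label + '_' + match.group(0))


def build_request(url):
    """Attach the User-Agent header (headers is passed by keyword)."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def download_image(image_url, directory, timeout=30):
    """Fetch one image and write it to disk; this part needs network access."""
    target = local_name(image_url, directory)
    with urllib.request.urlopen(build_request(image_url), timeout=timeout) as conn:
        data = conn.read()
    with open(target, 'wb') as f:
        f.write(data)
    return target
```

Something like `local_name('http://img00.example.net/abc/dota_2_rubick.jpg', '/tmp')` should give a path ending in `img00_dota_2_rubick.jpg`, and `gallery_pages('http://sheron1030.deviantart.com/gallery/')` reproduces the three gallery URLs used above.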

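This post simply commemorates the first Python code, and the first scraper, that I wrote properly. I have to admit, though, that because the campus network kept falling over, things went BOOM SHAKALAKA once the script was running: I ran it several times and not once did it manage to crawl everything. Pity.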
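
In hindsight, a flaky network is the kind of thing a small retry wrapper might have papered over. The sketch below is not what I ran; it is a minimal Python 3 illustration of retrying with exponential backoff. The name `fetch_with_retry` and its parameters are invented here, and `fetch` stands for any function that takes a URL and returns bytes.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with exponential backoff.

    In Python 3, urllib.error.URLError and socket timeouts are both
    subclasses of OSError, so network failures land in the except branch.
    The sleep argument is injectable so the waiting can be skipped in tests.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            # wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

With `urllib.request`, `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=30).read()`. Passing a timeout matters here, since without one a stalled connection can hang instead of failing and being retried.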

Although the task was never finished, I was very happy to get a chance to review regular expressions.
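
Two things stood out from that review. An unescaped `.` matches any character, and a capture group lets `findall` return just the URL, so no quote slicing is needed afterwards. Here is a small Python 3 sketch of both patterns written that way. The function names and the sample HTML in the usage note are made up for illustration, and the real page markup may well differ.

```python
import re

# \. matches a literal dot; [^"\s]* stops at the closing quote or whitespace.
# The parentheses make findall() return only the URL inside the quotes.
ART_LINK = re.compile(r'"(http://[^"\s]*\.deviantart\.com/art/Dota-2-[^"\s]*)"')
IMG_SRC = re.compile(r'<img collect_rid="[^"]*" src="([^"\s]*\.jpg)"')


def art_links(html):
    """Return the art page URLs found on a gallery page."""
    return ART_LINK.findall(html)


def image_sources(html):
    """Return the .jpg URLs of <img collect_rid=... src=...> tags."""
    return IMG_SRC.findall(html)
```

Given a made-up snippet such as `<a href="http://sheron1030.deviantart.com/art/Dota-2-Rubick-123">x</a>`, `art_links` should return the bare URL without the surrounding quotes.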

~~You ask why I posted such ugly code?~~
~~To make the blog look a little less empty, bro!~~


非洲小黑脸李某
