Web Scraping Traps

Prettifying HTML

Re-indent the HTML source so its nesting matches the page's structure, the way a browser's inspector shows it.

# page holds the raw page source (str or bytes)
from lxml import etree, html
root = html.fromstring(page)
print(etree.tostring(root, encoding='utf-8', pretty_print=True).decode('utf-8'))
from bs4 import BeautifulSoup as bs

# Load the page source straight into BeautifulSoup (see the note below
# on why the lxml.html round trip was dropped)
soup = bs(page, "lxml")
prettyHTML = soup.prettify()

Problem encountered: lxml.html.tostring cannot serialize a plain bytes or str object; the fix was to drop the lxml step and load the page source into BeautifulSoup directly, as done above.

Spotting traps on a page

Some pages plant links and form fields that a human visitor never sees: links styled with display:none, inputs of type "hidden", or elements pushed off screen with CSS. Following such a link or filling in such a field marks the client as a bot, so a scraper should check is_displayed() on every element before touching it.

## Python 3
import urllib.request
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
from bs4 import BeautifulSoup as bs

url = 'http://pythonscraping.com/pages/itsatrap.html'
response = urllib.request.urlopen(url)
page = response.read()

# root = lh.tostring(page.decode('utf-8'))
soup = bs(page.decode('utf-8'), "lxml")
prettyHTML = soup.prettify()
print(prettyHTML)

driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
driver.get("http://pythonscraping.com/pages/itsatrap.html")

# Links that exist in the DOM but are never rendered are likely traps
links = driver.find_elements_by_tag_name("a")
for link in links:
    if not link.is_displayed():
        print("Trap link: " + link.get_attribute("href"))

# The same goes for form fields: hidden inputs should be left untouched
fields = driver.find_elements_by_tag_name("input")
for field in fields:
    if not field.is_displayed():
        print("Hidden value: " + field.get_attribute("name"))
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/macbook/PycharmProjects/sg_bus/Spider/trap.py
<html>
 <head>
  <title>
   A bot-proof form
  </title>
 </head>
 <style>
  body {
   overflow-x:hidden;
  }
  .customHidden {
   position:absolute;
   right:50000px;
  }
 </style>
 <body>
  <h2>
   A bot-proof form
  </h2>
  <a href="http://pythonscraping.com/dontgohere" style="display:none;">
   Go here!
  </a>
  <a href="http://pythonscraping.com">
   Click me!
  </a>
  <form>
   <input name="phone" type="hidden" value="valueShouldNotBeModified"/>
   <p>
   </p>
   <input class="customHidden" name="email" type="text" value="intentionallyBlank"/>
   <p>
   </p>
   <input name="firstName" type="text"/>
   <p>
   </p>
   <input name="lastName" type="text"/>
   <p>
   </p>
   <input type="submit" value="Submit"/>
   <p>
   </p>
  </form>
 </body>
</html>
Trap link: http://pythonscraping.com/dontgohere
Hidden value: phone
Hidden value: email
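
The same traps can also be spotted statically, without driving a browser, by scanning the prettified HTML with BeautifulSoup. This is only a rough sketch: it reuses the prettyHTML string from the script above, and the selectors (the inline display:none style and the customHidden class) are assumptions tied to this particular page, whereas is_displayed() evaluates the real computed style.

from bs4 import BeautifulSoup

soup = BeautifulSoup(prettyHTML, "lxml")

# Links hidden with an inline style such as display:none
for a in soup.find_all("a", style=lambda s: s and "display:none" in s):
    print("Trap link:", a.get("href"))

# Inputs declared as hidden, or hidden through the off-screen customHidden class
for field in soup.find_all("input"):
    if field.get("type") == "hidden" or "customHidden" in field.get("class", []):
        print("Hidden value:", field.get("name"))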

How to avoid detection

Modifying headers with requests

When a site is visited with the built-in urllib, the request headers look like this:

Accept-Encoding: identity
User-Agent: Python-urllib/3.X
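
A quick way to confirm what urllib actually sends is to request a header echo service and print what the server received; the sketch below assumes httpbin.org is reachable and uses it purely for demonstration.

import urllib.request

# httpbin.org/headers echoes back the request headers it received
with urllib.request.urlopen("https://httpbin.org/headers") as response:
    print(response.read().decode("utf-8"))

# The echoed JSON typically contains
#   "Accept-Encoding": "identity"
#   "User-Agent": "Python-urllib/3.x"
# which is exactly what makes a stock urllib scraper easy to identify.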

The requests library makes it easy to send different header information; a handy site for checking which browser properties the server detects is https://www.whatismybrowser.com

import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
}

url = "https://www.whatismybrowser.com"
req = session.get(url=url, headers=headers)
bsObj = BeautifulSoup(req.text, "lxml")

# Print the part of the page that reports the detected browser properties
print(bsObj.findAll("div", class_="detected-column"))

# print(bsObj.find("table", {"class":"readout content"}).get_text)
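
Besides reading the page returned by whatismybrowser.com, the headers that were actually sent can be checked locally: requests attaches the prepared request to the response object. A small follow-up to the snippet above, reusing its req variable:

# Outgoing headers as they were sent, no external service needed
for name, value in req.request.headers.items():
    print(name + ": " + value)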

Handling cookies

Cookies are what keep you logged in across multiple pages, and many sites will not ask you to log in again as long as you present the same cookie; but sites with anti-scraping measures also use cookies to track how you move through the site.
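
For the ordinary case of staying logged in, requests.Session stores whatever cookies the server sets and sends them back on later requests automatically. A minimal sketch; httpbin.org is only a stand-in endpoint here (an assumption), any site that sets cookies behaves the same way.

import requests

session = requests.Session()

# This endpoint asks the server to set a cookie named "track"
session.get("https://httpbin.org/cookies/set/track/12345")

# The cookie now lives in the session and travels with every further request
print(session.cookies.get_dict())
print(session.get("https://httpbin.org/cookies").text)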

A handy tool for inspecting cookies is EditThisCookie (www.editthiscookie.com), which is also available as a Chrome extension.
Viewing cookies

from selenium import webdriver
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
driver.get("http://weibo.com")
driver.implicitly_wait(1)
print(driver.get_cookies())

Some sites, Google Analytics being a typical example, only create their cookies once client-side scripts have run (or once events such as clicks happen on the page). Because requests cannot execute that JavaScript, it never receives the cookies produced this way.
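
A quick way to see the gap is to compare the cookies requests receives over plain HTTP with what a JavaScript-capable browser ends up holding after the page's scripts have run; a hedged sketch reusing the weibo.com example from above:

import requests
from selenium import webdriver

url = "http://weibo.com"

# Cookies visible to requests: only those delivered in Set-Cookie response headers
resp = requests.get(url)
print("requests sees:", [c.name for c in resp.cookies])

# Cookies present after a real browser engine has executed the page's JavaScript
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
driver.get(url)
driver.implicitly_wait(1)
print("browser sees:", [c['name'] for c in driver.get_cookies()])
driver.quit()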

By using Selenium together with PhantomJS, the cookies collected by one webdriver can also be handed over to a fresh webdriver, so the scraper keeps looking like the same browser session.

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
driver.get("http://weibo.com")
driver.implicitly_wait(1)
print(driver.get_cookies())

savedCookies = driver.get_cookies()

driver2 = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
## Load the site first so the driver knows which site the cookies belong to
driver2.get("http://weibo.com")
driver2.delete_all_cookies()
for cookie in savedCookies:
    # # phantomjs cookies need to be prepended by a '.' for some strange reason
    # if cookie['domain'][0] != '.':
    #     cookie['domain'] = '.' + cookie['domain']
    # driver2.add_cookie({k: cookie[k] for k in ('name', 'value', 'domain', 'path', 'expiry')})

    # Hand each saved cookie to the new webdriver
    driver2.add_cookie({
        'domain': '.weibo.com',  # note the dot at the beginning
        'name': cookie['name'],
        'value': cookie['value'],
        'path': '/',
        'expires': None
    })

## Load the page again and compare the cookies
driver2.get("http://weibo.com")
driver2.implicitly_wait(1)
print(driver2.get_cookies())

Setting delays

This is the kind of behaviour that benefits both you and the site.
Even though multithreading can load pages faster (one thread parsing data while another fetches pages), a scraper should stick to one page load at a time and keep the number of requests to a minimum.

import time
time.sleep(3)
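
In practice the pause goes between consecutive requests; a minimal sketch of a polite fetch loop (the URL list is just a placeholder):

import time
import requests

urls = [
    "http://pythonscraping.com/pages/page1.html",
    "http://pythonscraping.com/pages/page2.html",
]

for url in urls:
    page = requests.get(url).text
    # ... parse and store the data here ...
    time.sleep(3)  # wait a few seconds before hitting the server again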