lc1296925640

[Help] [Sock-puppet account confirmed] Scraping a novel

Code for scraping a novel

[ Last edited by 月只蓝 on 2018-6-6 at 21:34 ]

Floor 1, 2018-05-25 16:41:15

木小瓷0512

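Thanks for participating; helper index +1
lc1296925640: coins +300, ★★★★★ best answer, 2018-05-25 16:41:53
jjdg: coins +2, thanks for participating, 2018-05-25 17:14:06
月只蓝: Attention! An investigation indicates that you are suspected of transferring coins. Within 72 hours, please explain your relationship with lc1296925640 by site private message; otherwise this will be handled as a coin transfer. Criteria for coin transfers: http://muchong.com/bbs/viewthread.php?tid=3711903 2018-05-26 10:46:00
月只蓝: coins -120, sock-puppet account confirmed, illegally obtained coins deducted, 2018-06-06 21:30:46
月只蓝: coins -120, sock-puppet account confirmed, 240 illegally obtained coins deducted in total, 2018-06-06 21:30:57
月只蓝: coins -100, helper index -2, sock-puppet record filed; coins deducted as a penalty for repeated sock-puppet behavior! 2018-06-06 21:33:25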

CODE:

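import urllib.request
import re  # regular expressions

# 1. Fetch the index page source
# 2. Extract the chapter links
# 3. Fetch each chapter's content
# 4. Save the novel to disk

# camelCase naming
def getNovelContent():
    # Fetch the index page. .read() returns the raw bytes; to check the HTTP
    # status first, keep the response object and print its .status before reading.
    html = urllib.request.urlopen('http://www.quanshuwang.com/book/0/269/').read()
    # print(html.status)  # only valid on the response object, not on the bytes
    # exit()

    # Decode the bytes (this code assumes the page is GBK-encoded) so the source can be searched
    html = html.decode('gbk')

    # Extract the chapter links. Each entry looks like:
    # <li><a href="http://www.quanshuwang.com/book/0/269/78850.html" title="第一章 山边小村，共2741字">第一章 山边小村</a></li>
    # .*? is a non-greedy wildcard. Parenthesised groups are the parts that
    # findall returns; the rest must still match but is not captured.
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    # Compile the pattern string into a pattern object so it can be reused efficiently
    reg = re.compile(reg)
    urls = reg.findall(html)  # list of (url, title) tuples for every match in the page
    # print(urls)

    for i in urls:
        # print(i[0])
        novel_url = i[0]
        novel_title = i[1]
        chapt = urllib.request.urlopen(novel_url).read()  # chapter page source
        chapt_html = chapt.decode('gbk')
        # print(chapt_html)

        # Pattern for the chapter text, kept as posted. The forum may have altered it
        # (for example, HTML entities rendered as spaces), so check it against the real page.
        # The r prefix makes a raw string, e.g. r'\d', so backslashes stay literal.
        reg = r'</script>    (.*?)<script type="text/">'
        # re.S lets . match newlines too, so the capture can span several lines
        reg = re.compile(reg, re.S)
        chapt_content = reg.findall(chapt_html)
        if not chapt_content:
            # Nothing matched: skip this chapter instead of failing on chapt_content[0]
            print("No content matched for %s, skipping" % novel_title)
            continue

        # Clean up. findall returns a list; [0] is the chapter text, and after this
        # line chapt_content is a plain string, so [0] is no longer needed.
        chapt_content = chapt_content[0].replace('<br />', "")
        chapt_content = chapt_content.replace('    ', "")
        # print(chapt_content)

        # Save
        print("Saving %s" % novel_title)
        # 'w' is text write mode; 'wb' is binary write mode, typically used for images and video.
        # The with block closes the file automatically.
        with open('{}.txt'.format(novel_title), 'w', encoding='utf-8') as f:
            f.write(chapt_content)

getNovelContent()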
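
If you want to try the regex steps without hitting the site, here is a minimal offline sketch. It is not the poster's code. `SAMPLE_INDEX` is hypothetical markup modelled on the link format quoted in the comments above (the second entry is made up), and `extract_chapters`, `safe_filename` and `capture_body` are names introduced only for illustration. The real page may be structured differently.

```python
import re

# Hypothetical sample markup; the first link follows the format quoted in the
# post, the second entry is invented purely for the demonstration.
SAMPLE_INDEX = (
    '<ul>'
    '<li><a href="http://www.quanshuwang.com/book/0/269/78850.html" '
    'title="第一章 山边小村，共2741字">第一章 山边小村</a></li>'
    '<li><a href="http://example.com/book/2.html" '
    'title="Chapter 2">Chapter 2</a></li>'
    '</ul>'
)

# Same link pattern as in the post: two capture groups (URL and link text).
LINK_RE = re.compile(r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>')


def extract_chapters(index_html):
    """Return a list of (url, title) tuples found in the index markup."""
    return LINK_RE.findall(index_html)


def safe_filename(title):
    """Replace characters that are invalid in file names on common systems."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


def capture_body(text, flags=0):
    """Capture everything between the literal markers 'start' and 'end'."""
    return re.findall(r'start(.*?)end', text, flags)


if __name__ == "__main__":
    for url, title in extract_chapters(SAMPLE_INDEX):
        print(url, "->", safe_filename(title) + ".txt")

    multi_line = "start\nline one\nline two\nend"
    # Without re.S the dot stops at newlines, so nothing is captured.
    print(capture_body(multi_line))
    # With re.S the dot also matches newlines, so the capture spans lines.
    print(capture_body(multi_line, re.S))
```

Chapter titles end up as file names, so a helper like `safe_filename` may be worth adding before the `open` call if a title can contain characters such as `?` or `:`.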

Learning never ends

Floor 2, 2018-05-25 16:41:37