[Programming Help] Help with a Python web crawler_Python crawler

AnswerDSL

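Asking the experts for help: below is a small Python crawler. I wanted to scrape the news titles, dates, and click counts from a website, but none of them show up in the output; all I get is "日期：点击量：" (the "date:" and "clicks:" labels with nothing after them). What is the problem? I'm a beginner who has only just started, so any guidance is welcome!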

#! /usr/bin/env python
#coding=gbk

import urllib2
import sys
import re
import os

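def extract_url(info):
    # Collect the relative article links from the list page
    rege="<li><span class=\"title\"><a href=\"(.*?)\">"
    re_url = re.findall(rege, info)
    n=len(re_url)
    for i in range(0,n):
        # Turn each relative link into an absolute URL
        re_url[i]="http://news.swjtu.edu.cn/"+re_url[i]
    return re_url

def extract_title(sub_web):
    # Expects the title on its own line inside <h4>, with \r\n line endings
    re_key = "<h4>\r\n (.*)\r\n </h4>"
    title = re.findall(re_key,sub_web) or [""]
    return title

def extract_date(sub_web):
    # Text after the "日期：" (date) label, up to the next space
    re_key = "日期：(.*?) "
    date = re.findall(re_key,sub_web) or [""]
    return date

def extract_counts(sub_web):
    # Text after the "点击数：" (click count) label, up to two spaces
    re_key = "点击数：(.*?)  "
    counts = re.findall(re_key,sub_web) or [""]
    return counts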

fp=open('output.txt','w')
content = urllib2.urlopen('http://news.swjtu.edu.cn/ShowList-82-0-1.shtml').read()
url=extract_url(content)
string=""
n=len(url)
print n

for i in range(0,n):
    # Fetch each article page and append its title, date and click count
    sub_web = urllib2.urlopen(url[i]).read()
    sub_title = extract_title(sub_web)
    string+=sub_title[0]
    string+=''
    sub_date = extract_date(sub_web)
    string+="日期："+sub_date[0]
    string+=''
    sub_counts = extract_counts(sub_web)
    string+="点击数："+sub_counts[0]
    string+='\n'

print string
fp.close()

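shenzhenwan10 · posted on 2016-10-20 16:39:52

Last edited by shenzhenwan10 on 2016-10-20 16:42

As far as I can see, this program uses regular expressions to extract the content.
If you are sure the pages being crawled do contain the news titles, the next step is to test whether the extraction regular expressions are actually correct.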
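One way to run that test is to try each pattern against a saved copy of an article page before wiring it into the crawler. The sketch below is only an illustration, written for Python 3 rather than the Python 2 of the question. SAMPLE_PAGE is a made-up fragment, not the real markup of news.swjtu.edu.cn, and the looser patterns and the helper names first_match and report are assumptions of mine. It shows how the title pattern from the question returns nothing as soon as the page uses "\n" instead of "\r\n", while a pattern built on \s* still matches.

```python
import re

# Hypothetical article fragment; the real page markup may well differ.
SAMPLE_PAGE = (
    "<h4>\n  Campus news headline\n</h4>\n"
    "<p>日期：2016-10-20&nbsp;&nbsp;点击数：128&nbsp;</p>\n"
)

# Title pattern from the question: needs \r\n line endings and exact spacing.
STRICT_TITLE = r"<h4>\r\n (.*)\r\n </h4>"

# Looser alternatives (assumed formats): \s* absorbs any spaces or line breaks.
LOOSE_TITLE = r"<h4>\s*(.*?)\s*</h4>"
LOOSE_DATE = r"日期：\s*(\d{4}-\d{1,2}-\d{1,2})"
LOOSE_COUNTS = r"点击数：\s*(\d+)"


def first_match(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else None


def report(patterns, text):
    """Map each pattern name to its first match so that failures show up as None."""
    return {name: first_match(pattern, text) for name, pattern in patterns.items()}


if __name__ == "__main__":
    patterns = {
        "strict_title": STRICT_TITLE,
        "title": LOOSE_TITLE,
        "date": LOOSE_DATE,
        "counts": LOOSE_COUNTS,
    }
    for name, value in report(patterns, SAMPLE_PAGE).items():
        print(name, "->", value)
```

If a pattern still comes back as None against the real page source, it is also worth checking that the page and the script use the same character encoding, since the "日期：" and "点击数：" literals can only match text in the same encoding.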

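Fuller · posted on 2016-10-20 17:43:38

Debugging regular expressions really is a chore. We have proposed a different approach: build a general-purpose web crawler in Python whose extraction rules are injected from outside, using the rules generated by MS谋数台 directly. For details see: http://www.gooseeker.com/doc/thread-1740-1-1.html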
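The idea of keeping the extraction rules outside the crawler code can be sketched in a few lines. To be clear, this is not GooSeeker's API and does not use rules generated by MS谋数台. It swaps in plain regular expressions stored as JSON, and the names RULES_JSON, load_rules and extract, as well as the sample page, are all hypothetical. It is written for Python 3.

```python
import json
import re

# Hypothetical rule set: field name -> regular expression with one capture group.
# In practice this text could live in a separate file and be edited without
# touching the crawler code.
RULES_JSON = r'''
{
    "title": "<h4>\\s*(.*?)\\s*</h4>",
    "date": "日期：\\s*([0-9-]+)",
    "counts": "点击数：\\s*([0-9]+)"
}
'''


def load_rules(rules_text):
    """Parse the JSON rule text and compile each pattern once."""
    raw_rules = json.loads(rules_text)
    return {field: re.compile(pattern, re.S) for field, pattern in raw_rules.items()}


def extract(page, rules):
    """Apply every rule to the page; a field whose rule does not match maps to ''."""
    record = {}
    for field, regex in rules.items():
        match = regex.search(page)
        record[field] = match.group(1) if match else ""
    return record


if __name__ == "__main__":
    # Made-up page fragment used only to exercise the rules.
    page = "<h4>\r\n Some headline\r\n </h4><p>日期：2016-10-20 点击数：42  </p>"
    print(extract(page, load_rules(RULES_JSON)))
```

With this layout, a pattern that stops matching after a site redesign is fixed by editing the rule text alone, and the fetch loop stays unchanged.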

2 replies to this post in total. Last reply on 2016-10-20 17:43
