Help! Scraping website data with ASP.NET (100 points)

zgz1989410 2012-02-17 06:11:16
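I need to scrape every school listed on http://www.sooker.com/xuexiao/ (school name, school image path, school link URL).
Right now I request the page with HttpWebRequest, but with a regular expression I can only extract the school link URLs from the response.
What I want is the name, image path, and link URL of every school, output in the format "name,image path,link URL|name,image path,link URL".

The desired result looks like this: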
新尚教育人民广场校区,http://www.sooker.com/data/files/store_551148/other/school_logo.jpg,http://www.sooker.com/551148/|
上海精锐教育黄浦西藏南路中心,http://www.sooker.com/data/files/store_321706/other/school_banner.png,http://www.sooker.com/321706/|
上海精锐教育嘉定丰庄中心,http://www.sooker.com/data/files/store_321719/other/school_banner.png,http://www.sooker.com/321719/|
.....
贝乐学科英语,http://www.sooker.com/data/files/store_554906/other/school_logo.jpg,http://www.sooker.com/554906/|

Could someone help me solve this and post the code? Thanks!
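A minimal sketch of the target format described above, kept separate from the scraping itself. The `School` class, its field names, and the example.com URLs are placeholders invented for illustration; they are not part of the original question.

```csharp
using System;
using System.Collections.Generic;

public static class SchoolFormatter
{
    // One scraped entry. The type and field names are hypothetical.
    public sealed class School
    {
        public string Name;
        public string ImageUrl;
        public string LinkUrl;

        public School(string name, string imageUrl, string linkUrl)
        {
            Name = name;
            ImageUrl = imageUrl;
            LinkUrl = linkUrl;
        }
    }

    // Builds "name,image,link|name,image,link" with no trailing separator.
    public static string Format(IEnumerable<School> schools)
    {
        var parts = new List<string>();
        foreach (School s in schools)
        {
            parts.Add(string.Format("{0},{1},{2}", s.Name, s.ImageUrl, s.LinkUrl));
        }
        return string.Join("|", parts);
    }

    public static void Main()
    {
        var schools = new List<School>
        {
            new School("A", "http://example.com/a.jpg", "http://example.com/a/"),
            new School("B", "http://example.com/b.jpg", "http://example.com/b/")
        };
        // Prints: A,http://example.com/a.jpg,http://example.com/a/|B,http://example.com/b.jpg,http://example.com/b/
        Console.WriteLine(Format(schools));
    }
}
```

Note that this format has no escaping, so a school name containing "," or "|" would make the output ambiguous for whoever splits it later.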
zgz1989410 2012-02-24
Sorry for the delay, I was tied up for the past few days. Many thanks to "dalmeeme".
rr998 2012-02-20
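[Quote=Quoting reply #11 by dalmeeme:]

.//div[@class='pic']/a/img
Under the current (li) element: a descendant div whose class attribute is pic, then a child a of that div, then a child img of that a.

.//a[@class='school-name']
Under the current (li) element: a descendant a whose class attribute is school-name.

Note the difference between a child and a descendant.
[/Quote]

OK, thanks.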
夜色镇歌 2012-02-20
+1 for HtmlAgilityPack

[Quote=Quoting reply #2 by dalmeeme:]

Use HtmlAgilityPack (download the DLL yourself from the web) to extract the data:
C# code
HttpWebRequest httpWebRequest = WebRequest.Create(@"http://www.sooker.com/xuexiao/") as HttpWebRequest;
HttpWebResponse httpWebResponse = ht……
[/Quote]
dalmeeme 2012-02-20
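.//div[@class='pic']/a/img
Under the current (li) element: a descendant div whose class attribute is pic, then a child a of that div, then a child img of that a.

.//a[@class='school-name']
Under the current (li) element: a descendant a whose class attribute is school-name.

Note the difference between a child and a descendant: "/" selects direct children only, while "//" selects descendants at any depth.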
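To make the child/descendant distinction concrete, here is a minimal sketch that runs the same two expressions against a small fragment. It uses the built-in System.Xml classes rather than HtmlAgilityPack so that it runs without any extra DLL; the sample markup, including the extra wrapper `div` and `p` elements, is invented for illustration and is not the real page structure.

```csharp
using System;
using System.Xml;

public static class XPathAxesDemo
{
    // Hypothetical, well-formed stand-in for one <li> of the listing.
    // The pic div and the school-name anchor are deliberately nested
    // one level down so that "./" and ".//" give different results.
    private const string Sample =
        "<li>" +
        "<div class='wrap'><div class='pic'><a href='/1/'><img src='/logo.jpg'/></a></div></div>" +
        "<p><a class='school-name' href='/1/'>Demo School</a></p>" +
        "</li>";

    // Counts the nodes that the XPath selects relative to the <li> element.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        XmlNode li = doc.DocumentElement;
        return li.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Descendant div at any depth, then child a, then child img: prints 1
        Console.WriteLine(Count(".//div[@class='pic']/a/img"));
        // The div would have to be a direct child of li: prints 0
        Console.WriteLine(Count("./div[@class='pic']/a/img"));
        // Descendant a at any depth: prints 1
        Console.WriteLine(Count(".//a[@class='school-name']"));
        // The a would have to be a direct child of li: prints 0
        Console.WriteLine(Count("./a[@class='school-name']"));
    }
}
```

Starting the expression with "." keeps the search inside the current li; an expression starting with "//" would search the whole document again on every loop iteration.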
rr998 2012-02-20
[Quote=Quoting reply #4 by dalmeeme:]
Sorry, I had the order reversed. Here is the corrected version:

C# code

HttpWebRequest httpWebRequest = WebRequest.Create(@"http://www.sooker.com/xuexiao/") as HttpWebRequest;
HttpWebResponse httpWebResponse = httpWebRequest.G……
[/Quote]

Could you explain the rules behind writing the selectors this way?


HtmlNode img = lis[i].SelectSingleNode(".//div[@class='pic']/a/img");
HtmlNode anchor = lis[i].SelectSingleNode(".//a[@class='school-name']");

rr998 2012-02-20
Learned something new! Bookmarking this.
dalmeeme 2012-02-19
Add using HtmlAgilityPack; at the top of the file.
dalmeeme 2012-02-19
The HtmlDocument here is the one from HtmlAgilityPack, not System.Windows.Forms.HtmlDocument: http://www.codeplex.com/htmlagilitypack
zgz1989410 2012-02-19
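Compiler Error Message: CS0234: The type or namespace name 'Windows' does not exist in the namespace 'System' (are you missing an assembly reference?)
Source Error:
Line 13: using System.Net;
Line 14: using System.Text;
Line 15: using System.Windows.Forms;


It seems a B/S (browser/server, i.e. web) project has no System.Windows.Forms.HtmlDocument. How do I fix this?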

dalmeeme 2012-02-17
Sorry, I had the order reversed. Here is the corrected version:
    // Requires: using System.IO; using System.Net; using System.Text; using HtmlAgilityPack;
    HttpWebRequest httpWebRequest = WebRequest.Create(@"http://www.sooker.com/xuexiao/") as HttpWebRequest;
    string s;
    // The using blocks close the response, stream, and reader even if reading throws.
    using (HttpWebResponse httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse)
    using (Stream stream = httpWebResponse.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
    {
        s = reader.ReadToEnd();
    }
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(s);
    // One li per school.
    HtmlNodeCollection lis = htmlDoc.DocumentNode.SelectNodes(@"//ul[@class='curriculumUl']/li");
    // SelectNodes returns null, not an empty collection, when nothing matches.
    int count = lis == null ? 0 : lis.Count;
    string[] results = new string[count];
    for (int i = 0; i < count; i++)
    {
        HtmlNode img = lis[i].SelectSingleNode(".//div[@class='pic']/a/img");
        HtmlNode anchor = lis[i].SelectSingleNode(".//a[@class='school-name']");
        // Order: name, image path, link URL.
        results[i] = string.Format("{0},{1},{2}", anchor.InnerHtml, img.Attributes["src"].Value, anchor.Attributes["href"].Value);
    }
    string r = string.Join("|", results);
    Response.Write(r);
    Response.End();

Output:
新尚教育人民广场校区,http://www.sooker.com/data/files/store_551148/other/school_logo.jpg,http://www.sooker.com/551148/|上海精锐教育黄浦西藏南路中心,http://www.sooker.com/data/files/store_321706/other/school_banner.png,http://www.sooker.com/321706/|上海精锐教育嘉定丰庄中心,http://www.sooker.com/data/files/store_321719/other/school_banner.png,http://www.sooker.com/321719/|上海精锐教育闵行龙柏中心,http://www.sooker.com/data/files/store_548150/other/school_banner.png,http://www.sooker.com/548150/|杭州京翰教育庆春校区,http://www.sooker.com/data/files/store_550463/other/school_banner.jpg,http://www.sooker.com/550463/|北京艺海星图艺术培训机构,http://www.sooker.com/data/files/store_549020/other/school_logo.jpg,http://www.sooker.com/549020/|北京博雅环球,http://www.sooker.com/data/files/store_551650/other/school_logo.jpg,http://www.sooker.com/551650/|启德学府,http://www.sooker.com/data/files/store_320298/other/school_logo.png,http://www.sooker.com/320298/|石家庄师大美术辅导美意培训中心,http://www.sooker.com/data/files/store_321068/other/school_banner.jpg,http://www.sooker.com/321068/|贝乐学科英语,http://www.sooker.com/data/files/store_554906/other/school_logo.jpg,http://www.sooker.com/554906/
dalmeeme 2012-02-17
In the format the original poster asked for:
    // Requires: using System.IO; using System.Net; using System.Text; using HtmlAgilityPack;
    HttpWebRequest httpWebRequest = WebRequest.Create(@"http://www.sooker.com/xuexiao/") as HttpWebRequest;
    string s;
    // The using blocks close the response, stream, and reader even if reading throws.
    using (HttpWebResponse httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse)
    using (Stream stream = httpWebResponse.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
    {
        s = reader.ReadToEnd();
    }
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(s);
    // One li per school.
    HtmlNodeCollection lis = htmlDoc.DocumentNode.SelectNodes(@"//ul[@class='curriculumUl']/li");
    // SelectNodes returns null, not an empty collection, when nothing matches.
    int count = lis == null ? 0 : lis.Count;
    string[] results = new string[count];
    for (int i = 0; i < count; i++)
    {
        HtmlNode img = lis[i].SelectSingleNode(".//div[@class='pic']/a/img");
        HtmlNode anchor = lis[i].SelectSingleNode(".//a[@class='school-name']");
        results[i] = string.Format("{0},{1},{2}", img.Attributes["src"].Value, anchor.InnerHtml, anchor.Attributes["href"].Value);
    }
    string r = string.Join("|", results);
    Response.Write(r);
    Response.End();

Output:
http://www.sooker.com/data/files/store_551148/other/school_logo.jpg,新尚教育人民广场校区,http://www.sooker.com/551148/|http://www.sooker.com/data/files/store_321706/other/school_banner.png,上海精锐教育黄浦西藏南路中心,http://www.sooker.com/321706/|http://www.sooker.com/data/files/store_321719/other/school_banner.png,上海精锐教育嘉定丰庄中心,http://www.sooker.com/321719/|http://www.sooker.com/data/files/store_548150/other/school_banner.png,上海精锐教育闵行龙柏中心,http://www.sooker.com/548150/|http://www.sooker.com/data/files/store_550463/other/school_banner.jpg,杭州京翰教育庆春校区,http://www.sooker.com/550463/|http://www.sooker.com/data/files/store_549020/other/school_logo.jpg,北京艺海星图艺术培训机构,http://www.sooker.com/549020/|http://www.sooker.com/data/files/store_551650/other/school_logo.jpg,北京博雅环球,http://www.sooker.com/551650/|http://www.sooker.com/data/files/store_320298/other/school_logo.png,启德学府,http://www.sooker.com/320298/|http://www.sooker.com/data/files/store_321068/other/school_banner.jpg,石家庄师大美术辅导美意培训中心,http://www.sooker.com/321068/|http://www.sooker.com/data/files/store_554906/other/school_logo.jpg,贝乐学科英语,http://www.sooker.com/554906/
dalmeeme 2012-02-17
Use HtmlAgilityPack (download the DLL yourself from the web) to extract the data:
    // Requires: using System.IO; using System.Net; using System.Text; using HtmlAgilityPack;
    HttpWebRequest httpWebRequest = WebRequest.Create(@"http://www.sooker.com/xuexiao/") as HttpWebRequest;
    string s;
    // The using blocks close the response, stream, and reader even if reading throws.
    using (HttpWebResponse httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse)
    using (Stream stream = httpWebResponse.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
    {
        s = reader.ReadToEnd();
    }
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(s);
    // All school images first. SelectNodes returns null when nothing matches.
    HtmlNodeCollection imgs = htmlDoc.DocumentNode.SelectNodes(@"//ul[@class='curriculumUl']/li//div[@class='pic']/a/img");
    if (imgs != null)
    {
        foreach (HtmlNode img in imgs)
            Response.Write(img.Attributes["src"].Value + "<br/>");
    }
    // Then each school's link URL followed by its name.
    HtmlNodeCollection anchors = htmlDoc.DocumentNode.SelectNodes(@"//ul[@class='curriculumUl']/li//a[@class='school-name']");
    if (anchors != null)
    {
        foreach (HtmlNode anchor in anchors)
        {
            Response.Write(anchor.Attributes["href"].Value + "<br/>");
            Response.Write(anchor.InnerHtml + "<br/>");
        }
    }
    Response.End();

Output:
http://www.sooker.com/data/files/store_551148/other/school_logo.jpg
http://www.sooker.com/data/files/store_321706/other/school_banner.png
http://www.sooker.com/data/files/store_321719/other/school_banner.png
http://www.sooker.com/data/files/store_548150/other/school_banner.png
http://www.sooker.com/data/files/store_550463/other/school_banner.jpg
http://www.sooker.com/data/files/store_549020/other/school_logo.jpg
http://www.sooker.com/data/files/store_551650/other/school_logo.jpg
http://www.sooker.com/data/files/store_320298/other/school_logo.png
http://www.sooker.com/data/files/store_321068/other/school_banner.jpg
http://www.sooker.com/data/files/store_554906/other/school_logo.jpg
http://www.sooker.com/551148/
新尚教育人民广场校区
http://www.sooker.com/321706/
上海精锐教育黄浦西藏南路中心
http://www.sooker.com/321719/
上海精锐教育嘉定丰庄中心
http://www.sooker.com/548150/
上海精锐教育闵行龙柏中心
http://www.sooker.com/550463/
杭州京翰教育庆春校区
http://www.sooker.com/549020/
北京艺海星图艺术培训机构
http://www.sooker.com/551650/
北京博雅环球
http://www.sooker.com/320298/
启德学府
http://www.sooker.com/321068/
石家庄师大美术辅导美意培训中心
http://www.sooker.com/554906/
贝乐学科英语
喜阳阳 2012-02-17