Question: how can I scrape the contact details from this page?

czmws 2010-05-06 04:29:38
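An Alibaba company-information page, for example
http://bjdfmsp.cn.alibaba.com/athena/contact/bjdfmsp.html
How can I scrape the contact details from this page? Does the page use some anti-scraping technique? The usual scraping approach does not get the contact details. Could an expert please help?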
18 replies
李小冲 2010-05-20
I'll get the program sorted out for you once I'm back at the office, haha!
pei2lala 2010-05-18
Come on, man. You can scrape it now and you still haven't awarded the points? I won't answer your questions next time.
czmws 2010-05-10
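[Quote=Quoting pei2lala's reply in post #15:]
Oh, that approach of yours won't work.
I'll post mine; run it and you'll see.
Heh, the code is copied straight out of a project, so I haven't tidied it up.

This is the scraping page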
default.cs
private System.Net.CookieContainer cookie = new System.Net.CookieContainer();
string result = HttpHelper.GetHtml("……
[/Quote]

Thanks a lot! I can scrape it now.
pei2lala 2010-05-07
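Oh, that approach of yours won't work.
I'll post mine; run it and you'll see.
Heh, the code is copied straight out of a project, so I haven't tidied it up.

This is the scraping page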
default.cs
private System.Net.CookieContainer cookie = new System.Net.CookieContainer();
string result = HttpHelper.GetHtml("http://bjdfmsp.cn.alibaba.com/athena/contact/bjdfmsp.html", "", false, cookie);




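HttpHelper.cs
// Needs: using System; using System.IO; using System.Net; using System.Text; using System.Threading;
// The fields below are used by the methods but were not included in the post.
// Their values are assumptions, not the original project's settings.
private static int NetworkDelay = 500;   // assumed: delay before each request, in milliseconds
private static int maxTry = 3;           // assumed: maximum number of attempts
private static int currentTry = 0;       // attempts currently in progress
private static string contentType = "application/x-www-form-urlencoded"; // assumed
private static string accept = "*/*";    // assumed
private static string userAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"; // assumed
private static CookieContainer cc = new CookieContainer(); // assumed: shared default cookie container
private static Encoding encoding = Encoding.GetEncoding("gb2312");
#region Fetch HTML
/// <summary>
/// Fetches the HTML of a page.
/// </summary>
/// <param name="url">Address to request</param>
/// <param name="postData">String to submit in the POST body</param>
/// <param name="isPost">Whether to send the request as a POST</param>
/// <param name="cookieContainer">CookieContainer</param>
/// <returns>The HTML</returns>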
public static string GetHtml(string url, string postData, bool isPost, CookieContainer cookieContainer)
{
    // With no body to send, fall back to the plain GET overload.
    if (string.IsNullOrEmpty(postData))
    {
        return GetHtml(url, cookieContainer);
    }

    Thread.Sleep(NetworkDelay); // wait between requests

    currentTry++;

    HttpWebRequest httpWebRequest = null;
    HttpWebResponse httpWebResponse = null;
    try
    {
        byte[] byteRequest = Encoding.Default.GetBytes(postData);

        httpWebRequest = (HttpWebRequest)HttpWebRequest.Create(url);
        httpWebRequest.CookieContainer = cookieContainer;
        httpWebRequest.ContentType = contentType;
        httpWebRequest.ServicePoint.ConnectionLimit = maxTry;
        httpWebRequest.Referer = url;
        httpWebRequest.Accept = accept;
        httpWebRequest.UserAgent = userAgent;
        httpWebRequest.Method = isPost ? "POST" : "GET";

        // A GET request cannot carry a body, so only write postData for a POST.
        if (isPost)
        {
            httpWebRequest.ContentLength = byteRequest.Length;
            Stream stream = httpWebRequest.GetRequestStream();
            stream.Write(byteRequest, 0, byteRequest.Length);
            stream.Close();
        }

        httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
        Stream responseStream = httpWebResponse.GetResponseStream();
        StreamReader streamReader = new StreamReader(responseStream, encoding);
        string html = streamReader.ReadToEnd();
        streamReader.Close();
        responseStream.Close();

        // Balance the increment above (the original reset the counter to 0 here,
        // which left it negative once the outer retry frames decremented it).
        currentTry--;

        httpWebRequest.Abort();
        httpWebResponse.Close();

        return html;
    }
    catch (Exception)
    {
        // Release the failed request before retrying.
        if (httpWebRequest != null)
        {
            httpWebRequest.Abort();
        }
        if (httpWebResponse != null)
        {
            httpWebResponse.Close();
        }

        // Retry while attempts remain and return the retried result
        // (the original called GetHtml again but discarded what it returned).
        string retried = string.Empty;
        if (currentTry <= maxTry)
        {
            retried = GetHtml(url, postData, isPost, cookieContainer);
        }
        currentTry--;
        return retried;
    }
}
/// <summary>
/// Fetches the HTML of a page with a GET request.
/// </summary>
/// <param name="url">Address to request</param>
/// <param name="cookieContainer">CookieContainer</param>
/// <returns>The HTML</returns>
public static string GetHtml(string url, CookieContainer cookieContainer)
{
    Thread.Sleep(NetworkDelay); // wait between requests

    currentTry++;
    HttpWebRequest httpWebRequest = null;
    HttpWebResponse httpWebResponse = null;
    try
    {
        httpWebRequest = (HttpWebRequest)HttpWebRequest.Create(url);
        httpWebRequest.CookieContainer = cookieContainer;
        httpWebRequest.ContentType = contentType;
        httpWebRequest.ServicePoint.ConnectionLimit = maxTry;
        httpWebRequest.Referer = url;
        httpWebRequest.Accept = accept;
        httpWebRequest.UserAgent = userAgent;
        httpWebRequest.Method = "GET";

        httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
        Stream responseStream = httpWebResponse.GetResponseStream();
        StreamReader streamReader = new StreamReader(responseStream, encoding);
        string html = streamReader.ReadToEnd();
        streamReader.Close();
        responseStream.Close();

        currentTry--;

        httpWebRequest.Abort();
        httpWebResponse.Close();

        return html;
    }
    catch (Exception)
    {
        // Release the failed request before retrying.
        if (httpWebRequest != null)
        {
            httpWebRequest.Abort();
        }
        if (httpWebResponse != null)
        {
            httpWebResponse.Close();
        }

        // Retry while attempts remain and return the retried result
        // (the original called GetHtml again but discarded what it returned).
        string retried = string.Empty;
        if (currentTry <= maxTry)
        {
            retried = GetHtml(url, cookieContainer);
        }
        currentTry--;
        return retried;
    }
}
/// <summary>
/// Fetches the HTML of a page using the shared cookie container.
/// </summary>
/// <param name="url">Address to request</param>
/// <returns>The HTML</returns>
public static string GetHtml(string url)
{
    return GetHtml(url, cc);
}
/// <summary>
/// Fetches the HTML of a page using the shared cookie container.
/// </summary>
/// <param name="url">Address to request</param>
/// <param name="postData">String to submit</param>
/// <param name="isPost">Whether to send the request as a POST</param>
/// <returns>The HTML</returns>
public static string GetHtml(string url, string postData, bool isPost)
{
    return GetHtml(url, postData, isPost, cc);
}
/// <summary>
/// Gets the response stream of a page.
/// </summary>
/// <param name="url">Address to request</param>
/// <param name="cookieContainer">cookieContainer</param>
/// <returns>Stream</returns>
public static Stream GetStream(string url, CookieContainer cookieContainer)
{
    //Thread.Sleep(delay);

    currentTry++;
    HttpWebRequest httpWebRequest = null;
    HttpWebResponse httpWebResponse = null;

    try
    {
        httpWebRequest = (HttpWebRequest)HttpWebRequest.Create(url);
        httpWebRequest.CookieContainer = cookieContainer;
        httpWebRequest.ContentType = contentType;
        httpWebRequest.ServicePoint.ConnectionLimit = maxTry;
        httpWebRequest.Referer = url;
        httpWebRequest.Accept = accept;
        httpWebRequest.UserAgent = userAgent;
        httpWebRequest.Method = "GET";

        httpWebResponse = (HttpWebResponse)httpWebRequest.GetResponse();
        Stream responseStream = httpWebResponse.GetResponseStream();
        currentTry--;

        // The request and response stay open on purpose so that the caller
        // can read the stream; the caller is responsible for closing it.
        //httpWebRequest.Abort();
        //httpWebResponse.Close();

        return responseStream;
    }
    catch (Exception)
    {
        // Release the failed request before retrying.
        if (httpWebRequest != null)
        {
            httpWebRequest.Abort();
        }
        if (httpWebResponse != null)
        {
            httpWebResponse.Close();
        }

        // Retry while attempts remain and return the retried stream
        // (the original retried with GetHtml and discarded the result).
        Stream retried = null;
        if (currentTry <= maxTry)
        {
            retried = GetStream(url, cookieContainer);
        }
        currentTry--;
        return retried;
    }
}
#endregion
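
The helper above retries by calling itself and tracking attempts in a shared static counter (currentTry), which is hard to keep balanced and is not safe if several threads scrape at once. One possible alternative, shown only as a sketch, is a loop with a local attempt counter. RetrySketch and its parameters are hypothetical names, not part of the posted project.

```csharp
using System;
using System.Threading;

public static class RetrySketch
{
    // Runs action up to maxAttempts times, sleeping delayMs after each failed
    // attempt except the last. Returns fallback if every attempt throws.
    public static T Run<T>(Func<T> action, int maxAttempts, int delayMs, T fallback)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return action();
            }
            catch (Exception)
            {
                if (attempt < maxAttempts)
                {
                    Thread.Sleep(delayMs);
                }
            }
        }
        return fallback;
    }
}
```

A fetch could then be wrapped as RetrySketch.Run(() => FetchOnce(url), 3, 500, string.Empty), where FetchOnce stands for a single-attempt download method that throws on failure (also hypothetical).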

czmws 2010-05-07
[Quote=Quoting pei2lala's reply in post #12:]
http://www.weather.com.cn/html/weather/101010100.shtml

http://bjdfmsp.cn.alibaba.com/athena/contact/bjdfmsp.html

Which one are you actually trying to scrape? Scraping works fine on my end; I can get the page.
[/Quote]
Hello pei2lala,
I want to scrape this one: http://bjdfmsp.cn.alibaba.com/athena/contact/bjdfmsp.html
Can you scrape it? How did you do it?
czmws 2010-05-07
[Quote=Quoting kkbac's reply in post #11:]
Post your scraping code so we can take a look.
[/Quote]

private string GetHtml(string url)
{
    string strHtml = string.Empty;
    try
    {
        StreamReader sr = null;
        System.Text.Encoding code = Encoding.Default;
        WebRequest HttpWebRequest = null;
        WebResponse HttpWebResponse = null;
        HttpWebRequest = WebRequest.Create(url);
        HttpWebResponse = HttpWebRequest.GetResponse();

        sr = new StreamReader(HttpWebResponse.GetResponseStream(), code);
        strHtml = sr.ReadToEnd();
        sr.Close();
        HttpWebResponse.Close();
    }
    catch (Exception ex)
    {
        throw new Exception(ex.Message);
    }
    return strHtml;
}

The URL is: http://bjdfmsp.cn.alibaba.com/athena/contact/bjdfmsp.html
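
One visible difference between this snippet and pei2lala's HttpHelper is the decoding step: this code reads the response with Encoding.Default (the machine's code page), while the helper uses gb2312. Whether that explains the failure here is not established in the thread. As a sketch only, the encoding could be taken from the charset in the response's Content-Type header, with a caller-supplied fallback. CharsetSketch and Resolve are hypothetical names.

```csharp
using System;
using System.Text;

public static class CharsetSketch
{
    // Returns the encoding named by "charset=..." in a Content-Type header value,
    // or fallback when the header is missing, has no charset, or names an
    // encoding this runtime does not know.
    public static Encoding Resolve(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }
        foreach (string part in contentType.Split(';'))
        {
            string p = part.Trim();
            if (p.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                string name = p.Substring("charset=".Length).Trim('"', '\'', ' ');
                try
                {
                    return Encoding.GetEncoding(name);
                }
                catch (ArgumentException)
                {
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With the code above it might be used as new StreamReader(stream, CharsetSketch.Resolve(HttpWebResponse.ContentType, Encoding.GetEncoding("gb2312"))). On current .NET (as opposed to the .NET Framework of the time), gb2312 is only available after registering the code-pages encoding provider.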
pei2lala 2010-05-07
http://www.weather.com.cn/html/weather/101010100.shtml

http://bjdfmsp.cn.alibaba.com/athena/contact/bjdfmsp.html

Which one are you actually trying to scrape? Scraping works fine on my end; I can get the page.
kkbac 2010-05-07
Post your scraping code so we can take a look.
jiankeqcaf 2010-05-07
Waiting for an expert to answer.
czmws 2010-05-06
[Quote=Quoting jack15850798154's reply in post #6:]
This is what I tested; it should work. Could the problem be in how your regular expression is written?
string html = @"<ul class='mainTextColor'>
<li >电&nbsp;&nbsp;&nbsp;&nbsp;话: 86 010 65420280</li>
<li >移动电话: 13699287621</li>
<li >传&nbsp……
[/Quote]

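The problem right now is that I can't fetch the page at all. I haven't even reached the step of extracting information with a regular expression.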
czmws 2010-05-06
[Quote=Quoting kkbac's reply in post #4:]
Is it that you can't fetch the page?
[/Quote]

Yes, I can't fetch the page.
zhouwei7682719 2010-05-06
Just here to learn!!
jack15850798154 2010-05-06
This is what I tested; it should work. Could the problem be in how your regular expression is written?
string html = @"<ul class='mainTextColor'>
<li >电    话: 86 010 65420280</li>
<li >移动电话: 13699287621</li>
<li >传    真: 86 010 65489805</li>
<li >地    址: 中国 北京 北京市朝阳区 石各庄541号</li>
<li >邮    编: 100201</li>
<li>公司主页:
<a class='draft_no_link' href='http://www.bjdfmsp.com.cn' target='_blank'>http://www.bjdfmsp.com.cn</a>
<br/>
<a style='margin-left:67px' class='draft_no_link' href='http://bjdfmsp.cn.alibaba.com' target='_blank'>http://bjdfmsp.cn.alibaba.com</a>
</li>
</ul>";


string pstr = "<ul class='mainTextColor'><li >电    话: 86 010 65420280</li>";
// Phone-number validation pattern
// Match m = Regex.Match(html, " (\\d+-)?(\\d{4}-?\\d{7}|\\d{3}-?\\d{8}|^\\d{7,8})(-\\d+)?", RegexOptions.Singleline);
Match m1 = Regex.Match(html, "(\\d{11})|(\\d{3}|(\\d{3}|\\d{4})-)?(\\d{8}|\\d{7})|([1][2]\\d{1}|[0]\\d{3}-)?(\\d{7}|\\d{8})", RegexOptions.Singleline);
// Response.Write(m.Value);
Response.Write(m1.Value);
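
The pattern above returns the first run of digits that looks like a phone number, without saying which field it came from. If the markup really has the "<li>label: value</li>" shape of the sample string, a labelled extraction is one alternative. This is a sketch under that assumption; ContactSketch and Extract are hypothetical names, and items whose value contains nested tags (such as the homepage links) are deliberately skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContactSketch
{
    // Matches one <li>label: value</li> item whose value contains no nested tags.
    private static readonly Regex Item = new Regex(
        @"<li\s*>\s*(?<label>[^:<]+?)\s*:\s*(?<value>[^<]+?)\s*</li>",
        RegexOptions.IgnoreCase);

    // Returns a label -> value map for every simple list item found in html.
    public static Dictionary<string, string> Extract(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match m in Item.Matches(html))
        {
            // Labels are padded with &nbsp; entities or spaces on the page; drop them.
            string label = Regex.Replace(m.Groups["label"].Value, @"&nbsp;|\s", "");
            fields[label] = m.Groups["value"].Value.Trim();
        }
        return fields;
    }
}
```

Calling ContactSketch.Extract(html) on the sample string and reading the "移动电话" key would then give the mobile number as its own field.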
czmws 2010-05-06
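The page address I request is clearly http://www.weather.com.cn/html/weather/101010100.shtml, so why does the string I get back after scraping turn out to be the source of some other page?

newdigitime, may I ask: if the page does check the Request.UrlReferrer request source, how should I handle that?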
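
If a server does check the referrer, the usual client-side counterpart is to send a Referer header, and often a browser-like User-Agent, with the request. The sketch below builds such a request and, separately, reports the status code and final address of the response, which helps when the returned HTML belongs to a different page than the one requested (for example after a redirect). It is an illustration, not a confirmed fix for this site: RequestSketch, Build and DescribeResponse are hypothetical names, and the User-Agent string is a placeholder. On current .NET, HttpWebRequest still works but is marked obsolete.

```csharp
using System;
using System.Net;

public static class RequestSketch
{
    // Builds a GET request that sends a Referer, a User-Agent and cookies.
    // Nothing is sent over the network until GetResponse() is called.
    public static HttpWebRequest Build(string url, string referer, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Referer = referer;
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"; // placeholder value
        request.CookieContainer = cookies;
        request.AllowAutoRedirect = true;
        return request;
    }

    // Sends the request and returns "<status code> <final URL>".
    // ResponseUri differs from the requested URL when a redirect was followed.
    public static string DescribeResponse(HttpWebRequest request)
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            return (int)response.StatusCode + " " + response.ResponseUri;
        }
    }
}
```

For the page in question, the referer passed to Build might be the shop's own address, http://bjdfmsp.cn.alibaba.com/, though which value (if any) the site expects is a guess.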
kkbac 2010-05-06
Is it that you can't fetch the page?
newdigitime 2010-05-06
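Unless the page checks the Request.UrlReferrer request source.
Other than that, I don't see any anti-scraping measures on it.

Your regular expression is probably wrong.

You could print out the entire string you get back after scraping and take a look.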
jack15850798154 2010-05-06
Here are a few examples, buddy; work it out from them: http://topic.csdn.net/u/20090304/16/774442d1-2605-4e51-80e6-da6ebc91b39d.html


http://www.zz68.net/program/aspnet/200905/0503H009.html
jack15850798154 2010-05-06
I haven't written one myself, but I'd guess you'll need regular expressions...
