Garbled text when fetching Baidu pages

Aimarzhang 2008-01-21 04:26:44

I wrote a test program that fetches Baidu's search results page. The code is as follows:
Encoding encoding = Encoding.GetEncoding("gb2312");

Uri uri = new Uri("http://www.baidu.com/s?wd=美女&cl=3");
System.Net.WebRequest myrequest = System.Net.WebRequest.Create(uri);
System.Net.WebResponse myresponse;

myresponse = myrequest.GetResponse();
System.IO.Stream st = myresponse.GetResponseStream();
System.IO.StreamReader sr = new System.IO.StreamReader(st, encoding);
string html = sr.ReadToEnd();
sr.Close();
st.Close();
The resulting html is garbled. The <title> part of the fetched result looks like this:
<title>百度搜索_缇庡コ </title>\r

This is clearly garbled.

I also tried the following:
System.IO.StreamReader sr = new System.IO.StreamReader(st, System.Text.Encoding.Default);
The result is the same garbled text.

Does anyone know how to fix this?

6 replies

Aimarzhang 2008-01-21

Solved. The keyword in Baidu's URL has to be GB2312-encoded.
Thanks, everyone.
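
For anyone landing here later, the fix can be sketched as follows. This is only a minimal illustration, not the exact code I ended up with: the class name Gb2312Url and the method EncodeKeyword are hypothetical names made up for this example. It converts the keyword to GB2312 bytes and percent-encodes them before the URL is built. The RegisterProvider line is an assumption about the runtime: it is needed on .NET Core / .NET 5+, while on the classic .NET Framework gb2312 is built in and that line can be dropped.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class Gb2312Url
{
    // Converts the keyword to GB2312 bytes and percent-encodes every byte
    // that is not an unreserved URL character.
    public static string EncodeKeyword(string keyword)
    {
        // Assumption: running on .NET Core / .NET 5+, where code-page
        // encodings such as gb2312 must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("gb2312");

        StringBuilder sb = new StringBuilder();
        foreach (byte b in gb2312.GetBytes(keyword))
        {
            bool unreserved =
                (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
                (b >= '0' && b <= '9') ||
                b == '-' || b == '_' || b == '.' || b == '~';
            if (unreserved)
                sb.Append((char)b);
            else
                sb.Append('%').Append(b.ToString("X2"));
        }
        return sb.ToString();
    }

    // Builds the search URL from the question with the keyword pre-encoded.
    public static string BuildSearchUrl(string keyword)
    {
        return "http://www.baidu.com/s?wd=" + EncodeKeyword(keyword) + "&cl=3";
    }
}
```

With this, Gb2312Url.EncodeKeyword("美女") gives %C3%C0%C5%AE, which matches the wd value in the URL TNT_1st_excellence posted. The string returned by BuildSearchUrl can then be passed to WebRequest.Create in place of the raw URL.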

Aimarzhang 2008-01-21

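TNT_1st_excellence:
You are right. But if I convert the Chinese characters to their Unicode codes, the result is different from what Baidu uses. Does Baidu apply its own encoding?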
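
To make the difference concrete, here is a small sketch that prints the bytes of the same keyword in two encodings. The class KeywordBytes and its Hex method are hypothetical names used only for this illustration, and the RegisterProvider call assumes .NET Core / .NET 5+ (on the classic .NET Framework it is not needed).

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class KeywordBytes
{
    // Returns the bytes of text in the named encoding as hyphen-separated hex.
    public static string Hex(string text, string encodingName)
    {
        // Assumption: .NET Core / .NET 5+, where gb2312 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] bytes = Encoding.GetEncoding(encodingName).GetBytes(text);
        return BitConverter.ToString(bytes);
    }
}
```

KeywordBytes.Hex("美女", "gb2312") gives C3-C0-C5-AE, while KeywordBytes.Hex("美女", "utf-8") gives E7-BE-8E-E5-A5-B3. The Uri class escapes non-ASCII characters as UTF-8, so the original request most likely carried the second byte sequence, and reading those six bytes back as GB2312 text would plausibly produce the garbled title shown in the question.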

TNT_1st_excellence 2008-01-21

http://www.baidu.com/s?wd=美女&cl=3

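The problem occurs because the keyword you entered is Chinese (美女). With an English keyword there is no garbling.
Baidu probably applies some encoding when the input contains Chinese characters.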

http://www.baidu.com/s?ie=gb2312&bs=%C3%C0%C8%CB&sr=&z=&cl=3&f=8&wd=%C3%C0%C5%AE&ct=0

fowolf 2008-01-21

/// <summary>
/// Fetches the HTML source of a web page.
/// </summary>
/// <param name="a_strUrl">URL of the page to fetch</param>
/// <returns>The HTML source, or an error message on failure</returns>
public string Get_Http_10(string a_strUrl)
{
    try
    {
        WebRequest hwr = WebRequest.Create(a_strUrl);
        // using blocks release the response and reader even if reading throws.
        using (HttpWebResponse rep = (HttpWebResponse)hwr.GetResponse())
        using (StreamReader sr = new StreamReader(rep.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            StringBuilder response = new StringBuilder();
            string temp;
            // Read line by line, normalizing line endings to \r\n.
            while ((temp = sr.ReadLine()) != null)
            {
                response.Append(temp).Append("\r\n");
            }
            // errorStr is assumed to be a bool field of the enclosing class (declaration not shown here).
            errorStr = false;
            return response.ToString();
        }
    }
    catch (Exception exception1)
    {
        errorStr = true;
        return "Error: " + exception1.Message;
    }
}

Aimarzhang 2008-01-21