Java crawler: no data from Sina Weibo

M阳光 2017-01-11 01:50:23

I'm new to crawlers and wrote a simple program with jsoup. For some reason I just can't scrape Sina Weibo pages. Am I missing a step?
CSDN pages work fine.

package com.ksource.spider.netSpider;

import java.io.IOException;

import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StartS {

	public static void main(String[] args) {
		try {
			Response response = executeLink("http://www.weibo.com/?c=spr_sinamkt_buy_srwj1_weibo_t111");
			// Parse the raw response body; jsoup does not execute JavaScript.
			Document document = response.parse();
			System.out.println(document.toString());
			System.out.println("--------------------------------------------------");
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	// Fetches the URL with a desktop browser User-Agent and a 12-second timeout.
	private static Response executeLink(String href) throws IOException {
		return Jsoup.connect(href)
				.ignoreContentType(true)
				.userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36")
				.timeout(12000)
				.followRedirects(true)
				.execute();
	}
}

11 replies

M阳光 2017-01-12

Quoting reply #8 by qq_17280849:

@M173475237 What you have there is just an object-type JSON; fastjson can parse it.

Got it. Thanks!

M阳光 2017-01-12

Quoting reply #7 by qq_17280849:

You can first grab it via class=view: String data = {"pid":"pl_unlogin_home_hotpersoncategory","js":[],"css":[],"html":"<div class=\"WB_cardwrap S_bg2\">\n <div class=\"DSC_text_b DSC_text_b1\">\n <div class=\"WB_cardtitle_b S_line2\">\n ......"} , then: JSONObject objs = JSONObject.parseObject(data);
String result = objs.getString("html"); to get the content of the html field.

FM.view is a JS function. The page calls it to write the HTML into the browser. I don't know why htmlunit didn't parse this.

雨上小公举 2017-01-12

@M173475237 What you have there is just an object-type JSON; fastjson can parse it.

雨上小公举 2017-01-12

You can first grab it via class=view: String data = {"pid":"pl_unlogin_home_hotpersoncategory","js":[],"css":[],"html":"<div class=\"WB_cardwrap S_bg2\">\n <div class=\"DSC_text_b DSC_text_b1\">\n <div class=\"WB_cardtitle_b S_line2\">\n ......"} , then: JSONObject objs = JSONObject.parseObject(data); String result = objs.getString("html"); to get the content of the html field.
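
As a rough sketch of the first step (pulling the JSON object out of each FM.view(...) call in the fetched page source), something like the following might work using only the JDK. The names FmViewExtractor and extractPayloads are made up for illustration, and the code assumes each call's argument is a single JSON object literal, as in the samples in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup or fastjson.
class FmViewExtractor {

    private static final String MARKER = "FM.view(";

    /** Returns the JSON argument of every complete FM.view({...}) call in the page source. */
    static List<String> extractPayloads(String pageSource) {
        List<String> payloads = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = pageSource.indexOf(MARKER, from);
            if (start < 0) {
                break;
            }
            int open = start + MARKER.length();
            int end = findMatchingBrace(pageSource, open);
            if (end < 0) {
                from = open; // not an object literal, or truncated: skip this call
                continue;
            }
            payloads.add(pageSource.substring(open, end + 1));
            from = end + 1;
        }
        return payloads;
    }

    /** Index of the '}' that closes the object starting at open, or -1. Braces inside JSON strings are ignored. */
    private static int findMatchingBrace(String s, int open) {
        if (open >= s.length() || s.charAt(open) != '{') {
            return -1;
        }
        int depth = 0;
        boolean inString = false;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }
}
```

Each returned string can then be handed to JSONObject.parseObject. Scanning for the matching brace, rather than cutting at the first ")", matters because the embedded HTML may itself contain parentheses and braces.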

M阳光 2017-01-12

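Quoting reply #3 by qq_17280849:

Document body = Jsoup.connect(curl).timeout(timeout).get();
Elements productsTag = body.getElementsByClass("products");
String text = productsTag.text();

// Gets the text of the elements with class=products on the page at curl

This is what I fetched with htmlunit. Is there a good way to parse it? Sina has really gone to great lengths to thwart crawlers.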

FM.view({"ns":"pl.content.homeFeed.index","domid":"Pl_Official_MyProfileFeed__29","css":["style/css/module/list/comb_WB_feed_profile.css?version=aa44b85252d881b4"],"js":"page/js/pl/content/homeFeed/index.js?version=f3a6ca617210d1fb","html":"                <div class=\"WB_feed WB_feed_v3 WB_feed_v4\" pageNum=\"\" node-type='feed_list' module-type=\"feed\">\r\n        <div style=\"position:relative;\" node-type=\"feedconfig\" data-queryfix=is_hot=1>\r\n            <div style=\"position:absolute;top:-110px;left:0;width:0;height:0;\" id=\"feedtop\" name=\"feedtop\"><\/div>\r\n        <\/div>\r\n                    \t        \t\t    \t\t    \t\t    \t\t    \t        \t<div  tbinfo=\"ouid=5710586189\" action-type=\"feed_list_item\" diss-data=\"\"  mid=\"4062721538071172\"  class=\"WB_cardwrap WB_feed_type S_bg2 WB_feed_vipcover \">\n        <div class=\"WB_feed_detail clearfix\" node-type=\"feed_content\"\n
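
The html value in a payload like the one above is a JSON string, so sequences such as \" , \/ , \r\n and \t have to be decoded before jsoup can parse the markup. fastjson's getString does this for you; the sketch below only illustrates what that decoding amounts to, using the JDK alone. FmViewHtml and htmlField are hypothetical names, and the code assumes well-formed escapes and that the key appears exactly as "html":" with no whitespace, as in the samples here.

```java
// Hypothetical helper, not part of jsoup or fastjson.
class FmViewHtml {

    /** Returns the decoded value of the "html" string field, or null if it is absent or truncated. */
    static String htmlField(String json) {
        String key = "\"html\":\"";
        int start = json.indexOf(key);
        if (start < 0) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        for (int i = start + key.length(); i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '"') {
                return out.toString(); // an unescaped quote ends the JSON string
            }
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (++i >= json.length()) {
                break; // input ends right after a backslash
            }
            char e = json.charAt(i);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 't': out.append('\t'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    if (i + 4 >= json.length()) {
                        return null; // truncated \\uXXXX escape
                    }
                    out.append((char) Integer.parseInt(json.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:
                    out.append(e); // covers \" , \\ and \/
            }
        }
        return null; // the string was never closed (truncated input)
    }
}
```

The decoded string is ordinary HTML, so it can be passed to Jsoup.parse and queried with selectors in the usual way.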

M阳光 2017-01-12

Quoting reply #3 by qq_17280849:

Document body = Jsoup.connect(curl).timeout(timeout).get();
Elements productsTag = body.getElementsByClass("products");
String text = productsTag.text();

// Gets the text of the elements with class=products on the page at curl

How should this kind of content be handled?

<script charset="utf-8">FM.view({"pid":"pl_unlogin_home_hotpersoncategory","js":[],"css":[],"html":"<div class=\"WB_cardwrap S_bg2\">\n  <div class=\"DSC_text_b DSC_text_b1\">\n    <div class=\"WB_cardtitle_b S_line2\">\n ......

bcsflilong 2017-01-12

Quoting reply #3 by qq_17280849:

Document body = Jsoup.connect(curl).timeout(timeout).get();
Elements productsTag = body.getElementsByClass("products");
String text = productsTag.text();

// Gets the text of the elements with class=products on the page at curl

雨上小公举 2017-01-12

Document body = Jsoup.connect(curl).timeout(timeout).get();
Elements productsTag = body.getElementsByClass("products");
String text = productsTag.text();

// Gets the text of the elements with class=products on the page at curl

M阳光 2017-01-12