How can I grab all the text on a web page? Details inside

sdice 2005-08-23 06:55:40

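For example, something like Ctrl+A, which grabs all the text so it can be pasted into WordPad.
I'm using the WebBrowser control. At the moment I send IDM_SELECTALL and IDM_COPY to the control to simulate Ctrl+A and Ctrl+C, but this fails on pages that block copying. Is there a better solution?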

sdice 2005-09-20

Problem solved. Thanks, everyone! Closing the thread.

sdice 2005-08-25

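I tried the method above and it still doesn't work. As soon as frames are involved it gets nowhere. Could we approach this from the frames themselves? Sigh...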

goodboyws 2005-08-25

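For the frame (domain) case, this code should help somewhat:
// pElement is an IHTMLElement* obtained while enumerating the document.
IHTMLDocument2* pDoc2 = NULL;
CComBSTR tagName;
pElement->get_tagName(&tagName);
CString str(tagName);
str.MakeUpper();
if (str == _T("FRAME") || str == _T("IFRAME"))
{
    IHTMLFrameBase2* pHTMLFrameBase2 = NULL;
    HRESULT hr = pElement->QueryInterface(IID_IHTMLFrameBase2, (void**)&pHTMLFrameBase2);
    pElement->Release();
    if (SUCCEEDED(hr))
    {
        IHTMLWindow2* pHTMLWindow = NULL;
        hr = pHTMLFrameBase2->get_contentWindow(&pHTMLWindow);
        pHTMLFrameBase2->Release();
        if (SUCCEEDED(hr) && pHTMLWindow)
        {
            // For a cross-domain frame this call is expected to fail (access denied).
            hr = pHTMLWindow->get_document(&pDoc2);
            pHTMLWindow->Release();
        }
    }
}
Then, if pDoc2 is non-NULL, use that IHTMLDocument2 to work on the frame's document, and Release it when done.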
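
To make the overall shape of this frame-by-frame approach easier to see, here is a minimal, platform-independent sketch. It does not use MSHTML at all: `Doc`, `accessible`, and `CollectText` are hypothetical names invented for illustration. `bodyText` stands in for what get_outerText on the body would return, and `accessible == false` stands in for a frame whose document cannot be obtained (for example, a cross-domain frame). The idea is a plain recursive walk: take the text of the current document, then descend into each nested frame that allows access.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an HTML document; NOT an MSHTML type.
struct Doc {
    std::string bodyText;       // models the body's outer text
    bool accessible = true;     // false models a frame that denies access
    std::vector<Doc> frames;    // documents hosted in FRAME/IFRAME elements
};

// Appends the text of doc and of every reachable nested frame to out,
// one document per line. Returns how many frames were skipped because
// their document was not accessible.
int CollectText(const Doc& doc, std::string& out) {
    if (!doc.accessible)
        return 1;               // nothing can be read from this frame
    out += doc.bodyText;
    out += '\n';
    int skipped = 0;
    for (const Doc& frame : doc.frames)
        skipped += CollectText(frame, out);
    return skipped;
}
```

In a real WebBrowser host, the recursion step would presumably be the FRAME/IFRAME handling shown above, and the skipped count would tell you how much of the page stayed out of reach.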

蒋晟 2005-08-25

http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/ifaces/document2/domain.asp

蒋晟 2005-08-25

Cross-domain frame access? As far as I remember, the default security settings forbid that.

xiao_xiao_zi 2005-08-25

DOM~~~

goodboyws 2005-08-24

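Try this: get the text of each element.
IHTMLElementCollection* pCollection = NULL;
HRESULT hr = pHTMLDocument->get_all(&pCollection);
if (SUCCEEDED(hr) && pCollection)
{
    long len = 0;
    pCollection->get_length(&len);
    for (long l = 0; l < len; l++)
    {
        VARIANT varIndex, var2;
        VariantInit(&varIndex);
        VariantInit(&var2);
        varIndex.vt = VT_I4;
        varIndex.lVal = l;
        IDispatch* pDisp = NULL;   // must be a pointer, not an IDispatch object
        hr = pCollection->item(varIndex, var2, &pDisp);
        if (FAILED(hr) || !pDisp)
            continue;
        IHTMLElement* pElem = NULL;
        hr = pDisp->QueryInterface(IID_IHTMLElement, (LPVOID*)&pElem);
        pDisp->Release();
        if (FAILED(hr))
            continue;
        // Note: a parent element's outer text also contains its children's text.
        BSTR bstrHTMLText = NULL;
        pElem->get_outerText(&bstrHTMLText);
        CString strText(bstrHTMLText);
        SysFreeString(bstrHTMLText);
        pElem->Release();
    }
    pCollection->Release();
}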

sdice 2005-08-24

Still waiting...

goodboyws 2005-08-23

If cross-domain frame access is involved, I don't think there is any good trick.

蒋晟 2005-08-23

Retrieving the HTML of the current selection
If you want to limit the HTML to just what a user has selected, instead of the entire document, you can use the IHTMLXxx COM interfaces. The first thing you need to do is get access to the IHTMLDocument interface for the current document. IWebBrowser2 gives you access through its Document property. The Document property returns an IDispatch interface, so you need to QueryInterface the IDispatch interface for an IHTMLDocument interface, like so (raw C++):

IDispatch* pDocDisp = 0;
HRESULT hr = pWebBrowser->get_Document(&pDocDisp);
if (SUCCEEDED(hr) && pDocDisp) {
    IHTMLDocument2* pDoc = 0;
    hr = pDocDisp->QueryInterface(IID_IHTMLDocument2, (void**)&pDoc);
    if (SUCCEEDED(hr)) {

        //...

        pDoc->Release();
    }

    pDocDisp->Release();
}

The IHTMLXxx interfaces follow the W3C DOM specification used for JavaScript very closely. If you're familiar with those objects, the IHTMLXxx interfaces will be easy to grasp. In fact, if you know how to do something using JavaScript, you can duplicate it in your compiled code using the IHTMLXxx interfaces.

That said, you can get the current selection as an IHTMLTxtRange from the document's selection object. Once you have a text range, you can retrieve the plain text or HTML text as shown below:

IHTMLDocument2* pDoc = ...;

IHTMLSelectionObject* pSelection = 0;
HRESULT hr = pDoc->get_selection(&pSelection);
if (SUCCEEDED(hr)) {
    IDispatch* pDispRange = 0;
    hr = pSelection->createRange(&pDispRange);
    if (SUCCEEDED(hr)) {
        IHTMLTxtRange* pTextRange = 0;
        hr = pDispRange->QueryInterface(IID_IHTMLTxtRange, (void**)&pTextRange);
        if (SUCCEEDED(hr)) {
            // Separate variables: taking the address of a non-empty CComBSTR
            // to receive a second value would leak or assert.
            CComBSTR sText;
            pTextRange->get_text(&sText);
            // or
            CComBSTR sHtml;
            pTextRange->get_htmlText(&sHtml);
            //...
            pTextRange->Release();
        }
        pDispRange->Release();
    }
    pSelection->Release();
}

pDoc->Release();

Applying get_text to the <body> element or the <html> element may fail when the element is missing.

You can also use Microsoft Word as a converter. See http://engine.keeboo.com/admin/KeeBookCreator.txt.

sdice 2005-08-23

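Found the problem. The methods from the two posters above work fine on ordinary pages, but they fail on a few pages such as the 163.com mailbox page. I'd appreciate help from you two experts. Thanks!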

goodboyws 2005-08-23

The code is fine, so it looks like the page itself is the issue. Use the method from the first reply (1st floor) instead.

sdice 2005-08-23

Correction to my post above: I meant get_outerHTML and get_outerText.

sdice 2005-08-23

get_htmlText returns the page source, but get_outerText always returns an empty string.

sdice 2005-08-23

Why is the bstrHTMLText I get from pBody->get_outerText always empty?
The code snippet is as follows:

IHTMLDocument2 *pHTMLDocument=NULL;
IHTMLElement* pBody;
if (!(pHTMLDocument = (IHTMLDocument2*)m_ie2.GetDocument()))
break;
hr = pHTMLDocument->get_body(&pBody);
if(SUCCEEDED(hr))
{
BSTR bstrHTMLText;
hr = pBody->get_outerText(&bstrHTMLText);
CString strText = bstrHTMLText;
SysFreeString( bstrHTMLText);
pBody->Release();
}

goodboyws 2005-08-23

Usually you can just call pDoc->get_body and then pBody->get_outerText. There is no need to select anything first, and cases where the body element does not exist are rare.
IDispatch* pDocDisp = 0;
HRESULT hr = pWebBrowser->get_Document(&pDocDisp);
if (SUCCEEDED(hr) && pDocDisp) {
    IHTMLDocument2* pDoc = 0;
    hr = pDocDisp->QueryInterface(IID_IHTMLDocument2, (void**)&pDoc);
    if (SUCCEEDED(hr)) {
        IHTMLElement* pBody = 0;
        hr = pDoc->get_body(&pBody);
        if (SUCCEEDED(hr) && pBody)
        {
            BSTR bstrHTMLText = 0;
            hr = pBody->get_outerText(&bstrHTMLText);
            // This is the text of the page
            CString strText(bstrHTMLText);
            // ......
            SysFreeString(bstrHTMLText);
            pBody->Release();
        }
        pDoc->Release();
    }
    pDocDisp->Release();
}
