How do I grab all the text on a web page? Details inside

sdice 2005-08-23 06:55:40
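For example, something equivalent to pressing CTRL+A to get all the text so that it can be pasted into WordPad.
I am using the WebBrowser control. At the moment I send IDM_SELECTALL and IDM_COPY commands to the control to simulate CTRL+A and CTRL+C, but this does not work on pages that block copying. Is there a better solution?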
18 replies (newest first)
sdice 2005-09-20
OO
sdice 2005-09-20
Problem solved. Thanks, everyone! Closing the thread.
sdice 2005-08-25
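I tried the method above and it still does not work. As soon as frames are involved it fails. Could I approach this from the frames themselves? Sigh...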
goodboyws 2005-08-25
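For the domain (frame) issue, this code helps to some extent:
// pElement is an IHTMLElement* obtained while walking the document
IHTMLDocument2* pDoc2 = NULL;
CComBSTR tagName;
pElement->get_tagName(&tagName);
CString str(tagName);
str.MakeUpper();
if (str == _T("FRAME") || str == _T("IFRAME"))
{
    IHTMLFrameBase2* pHTMLFrameBase2 = NULL;
    HRESULT hr = pElement->QueryInterface(IID_IHTMLFrameBase2, (void**)&pHTMLFrameBase2);
    if (SUCCEEDED(hr) && pHTMLFrameBase2)
    {
        IHTMLWindow2* pHTMLWindow = NULL;
        hr = pHTMLFrameBase2->get_contentWindow(&pHTMLWindow);
        pHTMLFrameBase2->Release();
        if (SUCCEEDED(hr) && pHTMLWindow)
        {
            // This call may fail when the frame comes from another domain
            hr = pHTMLWindow->get_document(&pDoc2);
            pHTMLWindow->Release();
        }
    }
}
pElement->Release();
Then use the IHTMLDocument2 (pDoc2) to work on the frame's document.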
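The frame lookup above handles one FRAME/IFRAME element; pages such as webmail front ends often nest frames inside frames, so the lookup has to be repeated recursively. The following is only a portable sketch of that traversal, not MSHTML code: MockDocument and CollectText are hypothetical names invented for illustration, with bodyText standing in for what get_outerText on the body would return and the accessible flag standing in for a frame whose document cannot be obtained (for example, a cross-domain frame). In a real program each MockDocument would presumably be an IHTMLDocument2 reached through the frame lookup shown above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a loaded document (NOT an MSHTML type):
// its own body text plus the documents hosted in its FRAME/IFRAME elements.
struct MockDocument {
    std::string bodyText;              // models the body's outerText
    bool accessible;                   // false models a frame document that cannot be read
    std::vector<MockDocument> frames;  // child frame documents, in page order

    MockDocument(const std::string& text, bool readable = true)
        : bodyText(text), accessible(readable) {}
};

// Appends the text of doc and of every readable nested frame to out,
// one document per line. Returns how many frame documents were skipped
// because they were not readable.
int CollectText(const MockDocument& doc, std::string& out) {
    if (!doc.accessible) {
        return 1;  // cannot read this document or anything beneath it
    }
    if (!doc.bodyText.empty()) {
        if (!out.empty()) {
            out += '\n';
        }
        out += doc.bodyText;
    }
    int skipped = 0;
    for (const MockDocument& child : doc.frames) {
        skipped += CollectText(child, out);
    }
    return skipped;
}
```

Returning the skipped count lets the caller tell "the page has no text" apart from "some frames could not be read", which is the situation described in this thread.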
蒋晟 2005-08-25
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/ifaces/document2/domain.asp
蒋晟 2005-08-25
Cross-domain frame access? As far as I remember, the default security settings forbid it.
xiao_xiao_zi 2005-08-25
DOM~~~
goodboyws 2005-08-24
Try this: get the text of each element.
IHTMLElementCollection* pCollection = NULL;
HRESULT hr = pHTMLDocument->get_all(&pCollection);
if (SUCCEEDED(hr) && pCollection)
{
    long len = 0;
    pCollection->get_length(&len);
    for (long l = 0; l < len; l++)
    {
        VARIANT varIndex, var2;
        VariantInit(&varIndex);
        VariantInit(&var2);
        varIndex.vt = VT_I4;
        varIndex.lVal = l;
        IDispatch* pDisp = NULL;
        hr = pCollection->item(varIndex, var2, &pDisp);
        if (FAILED(hr) || !pDisp)
            continue;
        IHTMLElement* pElem = NULL;
        hr = pDisp->QueryInterface(IID_IHTMLElement, (LPVOID*)&pElem);
        pDisp->Release();
        if (FAILED(hr) || !pElem)
            continue;
        BSTR bstrHTMLText = NULL;
        pElem->get_outerText(&bstrHTMLText);
        CString strText(bstrHTMLText);  // text of this element
        SysFreeString(bstrHTMLText);
        pElem->Release();
    }
    pCollection->Release();
}
sdice 2005-08-24
Still waiting...
sdice 2005-08-23
UP
goodboyws 2005-08-23
If cross-domain frame access is involved, there does not seem to be any good trick.
蒋晟 2005-08-23

Retrieving the HTML of the current selection
If you want to limit the HTML to just what a user has selected, instead of the entire document, you can use the IHTMLXxx COM interfaces. The first thing you need to do is get access to the IHTMLDocument2 interface for the current document. IWebBrowser2 gives you access through its Document property. The Document property returns an IDispatch interface, so you need to QueryInterface the IDispatch interface for an IHTMLDocument2 interface, like so (raw C++):


IDispatch* pDocDisp = 0;
HRESULT hr = pWebBrowser->get_Document(&pDocDisp);
if (SUCCEEDED(hr) && pDocDisp) {  // pDocDisp can be null when no document is loaded
    IHTMLDocument2* pDoc = 0;
    hr = pDocDisp->QueryInterface(IID_IHTMLDocument2, (void**)&pDoc);
    if (SUCCEEDED(hr)) {

        //...

        pDoc->Release();
    }

    pDocDisp->Release();
}

The IHTMLXxx interfaces follow the W3C DOM specification used for JavaScript very closely. If you're familiar with those objects, the IHTMLXxx interfaces will be easy to grasp. In fact, if you know how to do something using JavaScript, you can duplicate it in your compiled code using the IHTMLXxx interfaces.

That said, you can get the current selection as an IHTMLTxtRange from the document. Once you have a text range, you can retrieve the plain text or HTML text as shown below:


IHTMLDocument2* pDoc = ...;

IHTMLSelectionObject* pSelection = 0;
HRESULT hr = pDoc->get_selection(&pSelection);
if (SUCCEEDED(hr)) {
IDispatch* pDispRange = 0;
hr = pSelection->createRange(&pDispRange);
if (SUCCEEDED(hr)) {
IHTMLTxtRange* pTextRange = 0;
hr = pDispRange->QueryInterface(IID_IHTMLTxtRange, (void**)&pTextRange);
if (SUCCEEDED(hr)) {
CComBSTR sText;
pTextRange->get_text(&sText);
// or
sText.Empty(); // free the first string before reusing sText as an out parameter
pTextRange->get_htmlText(&sText);
//...
pTextRange->Release();
}
pDispRange->Release();
}
pSelection->Release();
}

pDoc->Release();

Applying get_text to the <Body> element or <Html> element may fail when the element is missing.

You can also use Microsoft Word as a converter. See http://engine.keeboo.com/admin/KeeBookCreator.txt.
sdice 2005-08-23
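I found the problem. With the methods from the two earlier posters, ordinary pages work fine, but a few pages, such as the 163.com mailbox page, do not work. I hope you two experts can help. Thanks!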
goodboyws 2005-08-23
The code is fine, so it looks like the page is the issue. Use the method from the first reply.
sdice 2005-08-23
Correction to my previous post: I meant get_outerHTML and get_outerText.
sdice 2005-08-23
get_htmlText does return the page source, but get_outerText always returns an empty string.
sdice 2005-08-23
Why is the bstrHTMLText I get from pBody->get_outerText always empty?
The code is as follows:

IHTMLDocument2 *pHTMLDocument=NULL;
IHTMLElement* pBody;
if (!(pHTMLDocument = (IHTMLDocument2*)m_ie2.GetDocument()))
break;
hr = pHTMLDocument->get_body(&pBody);
if(SUCCEEDED(hr))
{
BSTR bstrHTMLText;
hr = pBody->get_outerText(&bstrHTMLText);
CString strText = bstrHTMLText;
SysFreeString( bstrHTMLText);
pBody->Release();
}
goodboyws 2005-08-23
Usually you can just call pDoc->get_body and then pBody->get_outerText; there is no need to select anything. Cases where the body element does not exist are rare.
IDispatch* pDocDisp = 0;
HRESULT hr = pWebBrowser->get_Document(&pDocDisp);
if (SUCCEEDED(hr) && pDocDisp) {
    IHTMLDocument2* pDoc = 0;
    hr = pDocDisp->QueryInterface(IID_IHTMLDocument2, (void**)&pDoc);
    if (SUCCEEDED(hr)) {
        IHTMLElement* pBody = 0;
        hr = pDoc->get_body(&pBody);
        if (SUCCEEDED(hr) && pBody)
        {
            BSTR bstrHTMLText = 0;
            hr = pBody->get_outerText(&bstrHTMLText);
            // This is the text of the page
            CString strText = bstrHTMLText;
            ......
            SysFreeString(bstrHTMLText);
            pBody->Release();
        }
        pDoc->Release();
    }
    pDocDisp->Release();
}