I need to query some HTML, the kind of task that is often called scraping. I am creating a document from an HTML fragment in the clipboard. Everything works except enumerating the elements, which I originally added to verify that the HTML is actually in the document. The problem is that I am not getting the first few elements. I am using IHTMLDOMNode::get_firstChild and IHTMLDOMNode::get_nextSibling to enumerate the elements.

The beginning of the input HTML is:

    <TABLE cellSpacing=0 cellPadding=0 width="100%" bgColor=white border=0>
    <TBODY>
    <TR>
    <TD vAlign=top><FONT size=1><A href="http://www.microsoft.com">Microsoft</A>

The beginning of the output is:

    FONT
    /TD
    TD
    A

I don't know where those elements are coming from, but that is not the beginning of the document; the first element should be the TABLE element. This is only the beginning of the data; much more follows it.

My code is as follows. Is there a reason why I am not getting the siblings starting from the TABLE element?

    MSHTML::IHTMLDocument2Ptr FromDocument;
    MSHTML::IHTMLDOMNodePtr AppendedNode;
    MSHTML::IHTMLDOMNodePtr ChildNode, SavedNode;
    MSHTML::IHTMLElementPtr FromBodyElement;
    MSHTML::IHTMLDOMNodePtr FromBodyNode;
    MSHTML::IHTMLElementPtr FromElement;
    MSHTML::IHTMLDOMNodePtr FromNode;
    std::string Name;
    BSTR bsName;
    std::ostringstream oss;
    HRESULT hr;

    // FromDocument has been initialized using IPersistStreamInit::InitNew
    hr = FromDocument->get_body(&FromBodyElement);

    // Wrap the clipboard fragment (Text) in a new DIV element
    FromDocument->createElement(_bstr_t("Div"), &FromElement);
    FromElement->put_innerHTML(_bstr_t(Text.c_str()));

    // Assigning between smart pointer types queries for IHTMLDOMNode
    FromBodyNode = FromBodyElement;
    FromNode = FromElement;
    hr = FromBodyNode->appendChild(FromNode, &AppendedNode);

    // Enumerate the direct children of the appended DIV
    hr = AppendedNode->get_firstChild(&ChildNode);
    while (hr == S_OK && ChildNode != 0) {
        ChildNode->get_nodeName(&bsName);
        // Convert the BSTR to a narrow string; passing false makes _bstr_t
        // take ownership, so the BSTR is freed when the temporary goes away
        Name = static_cast<const char*>(_bstr_t(bsName, false));
        oss << Name << '\n';
        SavedNode = ChildNode;
        hr = SavedNode->get_nextSibling(&ChildNode);
    }
I don't need the Div element that I appended to. I now append to the body element instead and call get_firstChild on that. The problem I described in my original question still occurs.
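
For comparison, here is a minimal sketch of what I would expect the two kinds of walk to produce on a tree shaped like the input above. It does not use MSHTML at all: `Node`, `siblingWalk`, `depthFirstWalk` and `buildSample` are hypothetical names made up for illustration, and the tree is built by hand on the assumption that the fragment parses into DIV > TABLE > TBODY > TR > TD > FONT > A. It shows only that a firstChild/nextSibling loop visits one level, while a recursive walk is needed to reach the nested elements.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a DOM node. This is NOT the MSHTML IHTMLDOMNode interface.
struct Node {
    std::string name;
    std::vector<std::unique_ptr<Node>> children;

    explicit Node(std::string n) : name(std::move(n)) {}

    // Append a child and return a non-owning pointer to it.
    Node* add(std::string childName) {
        children.push_back(std::unique_ptr<Node>(new Node(std::move(childName))));
        return children.back().get();
    }
};

// Equivalent of a get_firstChild / get_nextSibling loop:
// visits only the direct children of `parent`.
std::vector<std::string> siblingWalk(const Node& parent) {
    std::vector<std::string> names;
    for (const auto& child : parent.children) {
        names.push_back(child->name);
    }
    return names;
}

// Depth-first walk: records each child, then descends into it,
// so nested elements are reported in document order.
void depthFirstWalk(const Node& parent, std::vector<std::string>& out) {
    for (const auto& child : parent.children) {
        out.push_back(child->name);
        depthFirstWalk(*child, out);
    }
}

// Hand-built tree, assuming the fragment parses as
// DIV > TABLE > TBODY > TR > TD > FONT > A.
std::unique_ptr<Node> buildSample() {
    std::unique_ptr<Node> div(new Node("DIV"));
    Node* table = div->add("TABLE");
    Node* tbody = table->add("TBODY");
    Node* tr = tbody->add("TR");
    Node* td = tr->add("TD");
    Node* font = td->add("FONT");
    font->add("A");
    return div;
}
```

On this toy tree, `siblingWalk(*buildSample())` yields only TABLE, while `depthFirstWalk` yields TABLE, TBODY, TR, TD, FONT, A in that order. Neither matches the FONT, /TD, TD, A sequence I am actually seeing, which is why I suspect the document does not contain the structure I assumed.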