C# - How to remove nodes using HTML Agility Pack and XPath so as to clean the HTML page
I need to extract text from web pages related to business news. Say the HTML page is as follows:
<html>
  <body>
    <div>
      <p> <span>desired content - 1</span></p>
      <p> <span>desired content - 2</span></p>
      <p> <span>desired content - 3</span></p>
    </div>
  </body>
</html>
I have a sample stored in a string that takes me to desired content - 1 directly, so I can collect that content. But I also need to collect desired content - 2 and 3.
For that, starting from the current location, i.e. the span node of desired content - 1, I used ParentNode to move to the outer node, i.e. the paragraph node, and got its content. What I actually need is the entire desired content in the div. How do I do it? You might tell me to go to the div directly by taking the parent of the parent of the span, but this is a specific example; I need a general idea.
Mostly, news articles have the desired content in a division, and my search lands directly on some nested inner node of that division. I need to come out of the inner nodes until I encounter a division and then take its InnerText.
I am using XPath and HtmlAgilityPack.
The XPath I am using is:
variable = doc.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchdata + "')]").ParentNode.ParentNode.InnerText;
Here "searchdata" is a variable holding a sample of desired content - 1, used to search the entire body of the web page for the node that has the news.
What I am thinking of is to clean the web pages so that they have only the main tags such as html, body, tables, divisions and paragraphs, with no spans or other formatting elements. But some other websites might use spans instead of divs, so I am not sure how to implement this requirement.
The basic requirement is to extract news content from different web pages (almost 250 different websites). I cannot write code specific to each web page, so I need a generic method.
Any ideas are appreciated. Thank you.
This XPath expression selects the innermost div element whose string value contains the value of the $searchdata variable reference:
//div[contains(.,$searchdata)] [not(.//div[contains(.,$searchdata)])]
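A minimal sketch of how this expression might be applied from C#, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (NewsExtractor, ToXPathLiteral, ExtractContainerText) are hypothetical and only for illustration. Since SelectSingleNode is called here with a plain string, the sketch substitutes the search text into the expression instead of using a real XPath variable, and quotes it so that apostrophes in the text do not break the expression.

```csharp
using System;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class NewsExtractor
{
    // Hypothetical helper: turns arbitrary text into an XPath 1.0 string literal.
    // XPath 1.0 has no escape character, so text containing both quote kinds
    // has to be assembled with concat().
    public static string ToXPathLiteral(string value)
    {
        if (!value.Contains("'")) return "'" + value + "'";
        if (!value.Contains("\"")) return "\"" + value + "\"";
        return "concat('" + value.Replace("'", "', \"'\", '") + "')";
    }

    // Returns the InnerText of the innermost div whose string value contains
    // searchData, or null when no div matches.
    public static string ExtractContainerText(string html, string searchData)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string literal = ToXPathLiteral(searchData);

        // Same shape as the expression above: a div that contains the text,
        // but has no descendant div that also contains it.
        string xpath = "//div[contains(., " + literal + ")]"
                     + "[not(.//div[contains(., " + literal + ")])]";

        HtmlNode div = doc.DocumentNode.SelectSingleNode(xpath);
        return div == null ? null : div.InnerText;
    }
}
```

With the sample page from the question, calling NewsExtractor.ExtractContainerText(html, "desired content - 1") should return the text of the whole div, so contents 2 and 3 come along with content 1. For sites that wrap the article in something other than a div, the element name in the expression would have to be adjusted.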