C# - How to remove nodes using HTML Agility Pack and XPath so as to clean the HTML page

- January 15, 2015

I need to extract text from web pages related to business news. A sample HTML page looks like this:

<html>
  <body>
    <div>
      <p> <span>desired content - 1</span></p>
      <p> <span>desired content - 2</span></p>
      <p> <span>desired content - 3</span></p>
    </div>
  </body>
</html>

I have a sample stored in a string that takes me to desired content - 1 directly, and I can collect that content. I also need to collect desired content - 2 and 3.

To do that, from my current location (the span node of desired content - 1) I used parentof and moved to the outer node, i.e. the paragraph node, and got its content, but I need the entire desired content in the div. How can I do it? You might tell me to go to the div directly using parentof.parentof from the span, but that is specific to this example. I need a general idea.

Most news articles have the desired content in a division, and I land directly on some nested inner node of that division. I need to come out of the inner nodes until I encounter a division, and then take its InnerText.
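The "come out of the inner nodes until a division is encountered" idea can be sketched as a loop over ParentNode. This is only a minimal illustration, assuming HtmlAgilityPack is referenced; the class name AncestorWalk and the helper NearestDiv are hypothetical names introduced here, not part of the library.

```csharp
using System;
using HtmlAgilityPack;

static class AncestorWalk
{
    // Hypothetical helper: climbs from a node to its nearest enclosing <div>.
    // Returns null if no <div> ancestor exists.
    public static HtmlNode NearestDiv(HtmlNode start)
    {
        HtmlNode node = start;
        while (node != null &&
               !string.Equals(node.Name, "div", StringComparison.OrdinalIgnoreCase))
        {
            node = node.ParentNode; // becomes null above the document root
        }
        return node;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div><p><span>desired content - 1</span></p>" +
                     "<p><span>desired content - 2</span></p></div></body></html>");

        // Same kind of lookup as in the question: find the node holding the sample text.
        HtmlNode hit = doc.DocumentNode.SelectSingleNode(
            "//*[contains(text(),'desired content - 1')]");

        HtmlNode div = hit == null ? null : NearestDiv(hit);
        Console.WriteLine(div != null ? div.InnerText : "(no enclosing div)");
    }
}
```

Because the loop stops at the first div it meets, it does not depend on how many levels of span or p sit between the matched node and the division.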

I am using XPath and HtmlAgilityPack.

The XPath I am using is:

variable = doc.DocumentNode.SelectSingleNode("//*[contains(text(),'" + searchdata + "')]").ParentNode.ParentNode.InnerText;

Here "searchdata" is a variable holding a sample of desired content - 1. It is used to search the entire body of the web page for the node that contains the news.

What I am thinking of is cleaning the web pages so that they keep only the main tags (html, body, tables, divisions and paragraphs), with no spans or other formatting elements. But other websites might use spans instead of divs, so I am not sure how to implement this requirement.

The basic requirement is to extract news content from different web pages (almost 250 different websites). I cannot write code specific to each web page; I need a generic method.

Any ideas are appreciated. Thank you.

This XPath expression selects the innermost div element whose string value contains the value referenced by the $searchdata variable:

//div[contains(., $searchdata)][not(.//div[contains(., $searchdata)])]
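A rough sketch of how that expression might be used from HtmlAgilityPack follows. It assumes the search text is spliced into the XPath string as a literal rather than bound as a real $searchdata variable; the names InnermostDiv, Find and ToXPathLiteral are hypothetical helpers introduced for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class InnermostDiv
{
    // Hypothetical helper: XPath 1.0 has no escape character inside string
    // literals, so a value containing both quote kinds is built with concat().
    static string ToXPathLiteral(string value)
    {
        if (!value.Contains("'")) return "'" + value + "'";
        if (!value.Contains("\"")) return "\"" + value + "\"";
        var parts = value.Split('\'').Select(p => "'" + p + "'");
        return "concat(" + string.Join(", \"'\", ", parts) + ")";
    }

    // Returns the innermost <div> whose string value contains searchData,
    // or null when nothing matches.
    public static HtmlNode Find(HtmlDocument doc, string searchData)
    {
        string lit = ToXPathLiteral(searchData);
        string xpath = "//div[contains(., " + lit + ")]" +
                       "[not(.//div[contains(., " + lit + ")])]";
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><div><p><span>desired content - 1</span></p>" +
                     "<p><span>desired content - 2</span></p></div></body></html>");

        HtmlNode div = Find(doc, "desired content - 1");
        Console.WriteLine(div != null ? div.InnerText : "(no match)");
    }
}
```

The second predicate is what makes the result the innermost match: any outer div that also contains the text is rejected because it has a descendant div that contains it too.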
