Python - Beautiful Soup malformed start tag error


    >>> soup = BeautifulSoup( data )
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
        self._feed(isHTML=isHTML)
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
        self.builder.feed(markup)
      File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: malformed start tag, at line 5518, column 822

    >>> for each in l[5515:5520]:
    ...     print each
    ...
    <script>
    registerimage("original_image", "http://ecx.images-amazon.com/images/i/41h7uhc1jml._sl500_aa240_.jpg","<a href="+'"'+"http://rads.stackoverflow.com/amzn/click/1592406017"+'"'+" target="+'"'+"amazonhelp"+'"'+" onclick="+'"'+"return amz_js_popwin(this.href,'amazonhelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+"  ><img onload="+'"'+"if (typeof uet == 'function') { uet('af'); }"+'"'+" src="+'"'+"http://ecx.images-amazon.com/images/i/41h7uhc1jml._sl500_aa240_.jpg"+'"'+" id="+'"'+"prodimage"+'"'+"  width="+'"'+"240"+'"'+" height="+'"'+"240"+'"'+"   border="+'"'+"0"+'"'+" alt="+'"'+"life, on line: chef's story of chasing greatness, facing death, , redefining way eat"+'"'+" onmouseover="+'"'+""+'"'+" /></a>", "<br /><a href="+'"'+"http://rads.stackoverflow.com/amzn/click/1592406017"+'"'+" target="+'"'+"amazonhelp"+'"'+" onclick="+'"'+"return amz_js_popwin(this.href,'amazonhelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+"  >see larger image</a>", "");
    var ivstrings = new object();
    </script>
    >>>
    >>> l[5518-1][822]
    'h'
    >>>

Note: I am using Python 2.6.5 on Ubuntu 10.04.

Isn't BeautifulSoup supposed to ignore script tags?
I can't figure out a way out of this :(
Any suggestions?

Pyparsing has some HTML tag support that makes for more robust scripts than plain regular expressions. And since it doesn't try to parse or process the entire HTML body, but instead just looks for matching string expressions, it can handle badly formed HTML:

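    html = """<script>
    registerimage("original_image",
    "this is a closing </script> tag in quotes"
    etc....
    </script>
    """

    # code to strip <script> tags from an HTML page
    from pyparsing import makeHTMLTags, SkipTo, quotedString

    # makeHTMLTags returns matching expressions for the opening and closing tag
    script, scriptEnd = makeHTMLTags("script")
    # skip to the closing tag, ignoring any "</script>" inside a quoted string
    scriptBody = script + SkipTo(scriptEnd, ignore=quotedString) + scriptEnd

    # suppress every matched script block and return the remaining text
    descriptedHtml = scriptBody.suppress().transformString(html)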

Depending on the kind of HTML scraping you are trying to do, you might be able to do the whole thing using pyparsing.
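
If installing pyparsing is not an option, the idea behind `ignore=quotedString` can be roughly approximated with the standard library alone. The following is only a sketch under stated assumptions, not pyparsing's implementation: `strip_scripts` is a hypothetical helper name, it assumes quoted strings do not span lines, and it does not understand JavaScript comments or regex literals. It scans each script body token by token, consuming quoted strings whole so that a `</script>` inside quotes is never taken as the end of the block.

```python
import re

# Quoted strings are listed first, so they are consumed as single tokens and
# any "</script>" inside them is never seen by the closing-tag alternative.
_TOKEN = re.compile(
    r'''"(?:\\.|[^"\\\n])*"        # double-quoted string (single line)
      | '(?:\\.|[^'\\\n])*'        # single-quoted string (single line)
      | (?P<close></script\s*>)    # closing tag outside any quotes
    ''',
    re.IGNORECASE | re.VERBOSE,
)
_OPEN = re.compile(r"<script\b[^>]*>", re.IGNORECASE)


def strip_scripts(html):
    """Return html with <script>...</script> blocks removed (hypothetical helper)."""
    out = []
    pos = 0
    while True:
        start = _OPEN.search(html, pos)
        if start is None:
            out.append(html[pos:])  # no more script blocks
            break
        out.append(html[pos:start.start()])
        end = None
        # Walk the tokens after the opening tag until a real closing tag.
        for tok in _TOKEN.finditer(html, start.end()):
            if tok.group("close"):
                end = tok.end()
                break
        if end is None:
            break  # unclosed script block: drop the rest of the input
        pos = end
    return "".join(out)


sample = '<p>a</p><script>f("has </script> inside");</script><p>b</p>'
cleaned = strip_scripts(sample)
```

The cleaned text can then be handed to BeautifulSoup in place of the raw page. A plain non-greedy pattern such as `<script\b.*?</script>` would stop at the quoted `</script>` in `sample` and leave half of the script behind, which is the failure the quoted-string handling is meant to avoid.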

