Python - Beautiful Soup malformed start tag error
    >>> soup = BeautifulSoup( data )
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
        self._feed(isHTML=isHTML)
      File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
        self.builder.feed(markup)
      File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: malformed start tag, at line 5518, column 822
    >>> for each in l[5515:5520]:
    ...     print each
    ...
    <script>
    registerimage("original_image", "http://ecx.images-amazon.com/images/i/41h7uhc1jml._sl500_aa240_.jpg","<a href="+'"'+"http://rads.stackoverflow.com/amzn/click/1592406017"+'"'+" target="+'"'+"amazonhelp"+'"'+" onclick="+'"'+"return amz_js_popwin(this.href,'amazonhelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+" ><img onload="+'"'+"if (typeof uet == 'function') { uet('af'); }"+'"'+" src="+'"'+"http://ecx.images-amazon.com/images/i/41h7uhc1jml._sl500_aa240_.jpg"+'"'+" id="+'"'+"prodimage"+'"'+" width="+'"'+"240"+'"'+" height="+'"'+"240"+'"'+" border="+'"'+"0"+'"'+" alt="+'"'+"life, on line: chef's story of chasing greatness, facing death, , redefining way eat"+'"'+" onmouseover="+'"'+""+'"'+" /></a>", "<br /><a href="+'"'+"http://rads.stackoverflow.com/amzn/click/1592406017"+'"'+" target="+'"'+"amazonhelp"+'"'+" onclick="+'"'+"return amz_js_popwin(this.href,'amazonhelp','width=700,height=600,resizable=1,scrollbars=1,toolbar=0,status=1');"+'"'+" >see larger image</a>", "");
    var ivstrings = new object();
    </script>
    >>>
    >>> l[5518-1][822]
    'h'
    >>>
Note: I am using Python 2.6.5 on Ubuntu 10.04.
Isn't BeautifulSoup supposed to ignore script tags?
I can't figure out a way out of this :(
Any suggestions?
Pyparsing has some HTML tag support that makes for more robust scripts than just straight re's. And since it doesn't try to parse/process the entire HTML body, but instead just looks for matching string expressions, it can handle badly formed HTML:
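    html = """<script>
    registerimage("original_image",
    "this is a closing </script> tag in quotes"
    etc....
    </script>
    """

    # code to strip <script> tags from an HTML page
    from pyparsing import makeHTMLTags,SkipTo,quotedString
    script,scriptEnd = makeHTMLTags("script")
    # ignore=quotedString keeps SkipTo from stopping at a </script> inside a quoted string
    scriptBody = script + SkipTo(scriptEnd, ignore=quotedString) + scriptEnd

    # suppress() drops every match, so transformString returns the HTML without the scripts
    descriptedHTML = scriptBody.suppress().transformString(html)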
Depending on what kind of HTML scraping you are trying to do, you might be able to do the whole thing using pyparsing.
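If installing pyparsing is not an option, the same idea (skip over quoted strings while looking for the closing tag) can be roughly sketched with only the standard library. This is not how pyparsing works internally, just an illustration of the technique; the function name strip_scripts and the two regular expressions are made up for this example, and it only handles simple single- and double-quoted strings with backslash escapes.

```python
import re

# Opening <script ...> tag, with optional attributes.
_OPEN = re.compile(r'<script\b[^>]*>', re.IGNORECASE)

# Either a quoted string (which is skipped) or the closing tag (group 1),
# whichever comes first when scanning forward.
_QUOTED_OR_CLOSE = re.compile(
    r'''"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|(</script\s*>)''',
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts(html):
    """Return html with <script>...</script> blocks removed.

    A </script> that appears inside a quoted string is not treated
    as the end of the block.
    """
    out = []
    pos = 0
    while True:
        start = _OPEN.search(html, pos)
        if start is None:
            # No more script blocks: keep the rest of the document.
            out.append(html[pos:])
            break
        # Keep everything before the opening tag.
        out.append(html[pos:start.start()])
        scan = start.end()
        # Step over quoted strings until a closing tag outside quotes is found.
        while True:
            token = _QUOTED_OR_CLOSE.search(html, scan)
            if token is None:
                # Unterminated script block: drop the remainder (a design
                # choice for this sketch, not a requirement).
                return ''.join(out)
            scan = token.end()
            if token.group(1):
                break
        pos = scan
    return ''.join(out)


page = ('<p>before</p><script>registerimage("x", '
        '"a closing </script> tag in quotes");</script><p>after</p>')
cleaned = strip_scripts(page)
```

The cleaned string can then be handed to BeautifulSoup( cleaned ) in place of the raw page. A plain non-greedy regex such as `<script\b[^>]*>.*?</script\s*>` would stop at the quoted `</script>` and leave part of the JavaScript behind, which is the case the quote skipping is meant to cover.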