Interpreting a relative path in a URL
I'm writing a web crawler in Python that takes a URL and does a depth-first search, following links down to a limited depth. The problem I'm having is interpreting relative paths in URLs.
On the page http://learnyouahaskell.com/introduction/ there is a "starting out" link; it looks like <a href="starting-out" class="nxtlink">starting out</a>. How can I determine whether this link refers to "http://learnyouahaskell.com/introduction/starting-out" or "http://learnyouahaskell.com/starting-out"? The second one is correct according to my browser.
Yet on the page http://math.colgate.edu/~mionescu/math399s11/ there is a link <a href="finalprojects.pdf">here</a> that resolves to "http://math.colgate.edu/~mionescu/math399s11/finalprojects.pdf".
Can anyone explain this inconsistency to me? How can I determine how these paths should be resolved in my crawler?
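The reason for the apparent inconsistency is that the learnyouahaskell site uses a <base href=""> tag in its source. This directs relative hrefs (those without a domain) to use the base URL as their starting point instead of the page's own URL.
Without the base tag, the link would have resolved as you expected (the first URL in your post) and behaved like the math.colgate.edu link.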
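For the crawler itself, one way to handle both cases is to look for a base tag first and fall back to the page URL when there isn't one. This is only a rough sketch using the standard library's html.parser and urllib.parse.urljoin; the names LinkExtractor and resolve_links are made up for illustration, and the base value "http://learnyouahaskell.com/" in the usage below is an assumption about what that page declares, not something taken from its source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects <a href> values and the first <base href>."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            # Missing or empty href: nothing to resolve.
            return
        if tag == "base" and self.base_href is None:
            # Only the first <base> with an href is remembered.
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute URLs for every <a href> in html, honoring <base href>."""
    parser = LinkExtractor()
    parser.feed(html)
    # A base href may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href in parser.hrefs]
```

Usage, with the two pages from the question (the base tag in the first snippet is the assumed one):

```python
haskell_page = (
    '<head><base href="http://learnyouahaskell.com/"></head>'
    '<a href="starting-out" class="nxtlink">starting out</a>'
)
print(resolve_links("http://learnyouahaskell.com/introduction/", haskell_page))

colgate_page = '<a href="finalprojects.pdf">here</a>'
print(resolve_links("http://math.colgate.edu/~mionescu/math399s11/", colgate_page))
```

Collecting every link before resolving keeps the sketch simple. If you resolve links while parsing instead, bear in mind that a base tag can appear after some links in malformed pages.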