Interpreting a Relative Path in a URL

- February 15, 2010

I'm writing a web crawler in Python that takes a URL and does a depth-first search, following links down to a limited depth. The problem I'm having is interpreting relative paths in URLs.

On the page http://learnyouahaskell.com/introduction/ there is a "starting out" link; it looks like <a href="starting-out" class="nxtlink">starting out</a>. How can I determine whether this link refers to "http://learnyouahaskell.com/introduction/starting-out" or "http://learnyouahaskell.com/starting-out"? The second one is correct according to my browser.

Yet on the page http://math.colgate.edu/~mionescu/math399s11/ there is a link <a href="finalprojects.pdf">here</a> that resolves to "http://math.colgate.edu/~mionescu/math399s11/finalprojects.pdf".

Can someone explain this inconsistency to me? How can I determine how these paths should be resolved in my crawler?

The reason for the 'apparent' inconsistency is that the learnyouahaskell site is using a <base href=""> tag in its source. This directs all domainless hrefs to use the base as their starting point instead of the URL of the page itself.
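A crawler can mimic this by looking for a <base> tag in the fetched page before resolving any links. Here is a minimal sketch, assuming Python 3 and only the standard library (html.parser and urllib.parse). The names BaseHrefFinder and effective_base are made up for illustration, and the base value http://learnyouahaskell.com/ in the usage lines is an assumption, since the actual attribute value isn't shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Hypothetical helper: remembers the href of the first <base> tag, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base> tag with a non-empty href is honoured.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base(page_url, html):
    """Return the URL that relative links on this page should be resolved against."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href is None:
        return page_url  # no <base> tag: the page's own URL is the base
    # The base href may itself be relative, so resolve it against the page URL.
    return urljoin(page_url, finder.base_href)


# Usage: the base href value here is assumed, not taken from the real page.
sample_html = (
    '<html><head><base href="http://learnyouahaskell.com/"></head>'
    '<body><a href="starting-out" class="nxtlink">starting out</a></body></html>'
)
base = effective_base("http://learnyouahaskell.com/introduction/", sample_html)
resolved = urljoin(base, "starting-out")
print(resolved)
```

With the assumed base value, the link resolves to the site root rather than under /introduction/, which matches what the browser did.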

Without the base tag, it would have resolved as you expected (the first link in your post) and acted like the math.colgate.edu link.
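For that default case, where a page has no base tag, Python's standard urllib.parse.urljoin already applies the usual resolution rules. A small sketch, assuming Python 3; the variable names are illustrative only:

```python
from urllib.parse import urljoin

# No <base> tag: the relative href is resolved against the page's own URL.
colgate_page = "http://math.colgate.edu/~mionescu/math399s11/"
pdf_url = urljoin(colgate_page, "finalprojects.pdf")
print(pdf_url)
# http://math.colgate.edu/~mionescu/math399s11/finalprojects.pdf

# The same rule applied to the other page, ignoring its base tag,
# gives the first of the two candidate URLs from the question.
haskell_page = "http://learnyouahaskell.com/introduction/"
chapter_url = urljoin(haskell_page, "starting-out")
print(chapter_url)
# http://learnyouahaskell.com/introduction/starting-out

# The trailing slash matters: without it, "introduction" is treated as
# a file name and is replaced rather than extended.
no_slash_url = urljoin("http://learnyouahaskell.com/introduction", "starting-out")
print(no_slash_url)
# http://learnyouahaskell.com/starting-out
```

So in a crawler it should be enough to pick the right base (the page URL, or the base href when one is present) and pass it to urljoin along with each href.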
