Bugs (?) in 1.3b while spidering

Bug Reporting and Feature Suggestions/Improvements go here.

Moderator: pgolovko

Post by Alain » 12/31/2006 1:29 pm

I am getting quite a few 404s on links like this:
Code:
http://site.co.uk/x.oSrc;

where I believe ADP is trying to parse and follow embedded JavaScript.

Another parsing problem is with links like
Code:
http://www.site.com/_frameworks_static/js/scriptaculous.js?load=effects

where ADP receives a 200, but the content is not TEXT/HTML. I thought I could get around
this by adding 'js' to the Non Follow Match list (which looks like this here):
Code:
/.(jpg|jpeg|gif|png|bmp|pdf|swf|css|js)$/i

but that doesn't help. I didn't take off the $, but wanted to hear
the 'official opinion' on these bugs (if I am allowed to call
them that) first.

TIA,
Alain

Post by pgolovko » 12/31/2006 11:23 pm

So, did ADP record this URL to the database or not?
Code:
http://www.site.com/_frameworks_static/js/scriptaculous.js?load=effects

It receives a 200 OK code because it's there. It doesn't receive TEXT/HTML because it isn't HTML. ADP should just skip it and move on.
What's the problem with it?


The problem with these:
Code:
http://site.co.uk/x.oSrc;

is most likely that ADP is not extracting the links from that particular JavaScript correctly. It is a bit hard to come up with a regular expression that would match links in all the JavaScript junk people like to code. I'll see what I can do about this problem :scratch:
__________________
Best regards,
Pavel Golovko

Post by Alain » 01/01/2007 2:39 am

admin wrote: So, did ADP record this URL to the database or not?
Code:
http://www.site.com/_frameworks_static/js/scriptaculous.js?load=effects

It receives a 200 OK code because it's there. It doesn't receive TEXT/HTML because it isn't HTML. ADP should just skip it and move on.
What's the problem with it?


No problem, other than unnecessary network traffic and time consumption, which can be avoided, I should add.

For a 5 kB js file this is no issue, but it can become one if ADP downloads 100 MB PDFs. Unlikely? Not so. Many web analytics people append a '?' followed by a code for tracking purposes, so the URL no longer ends in '.pdf' and the Non Follow Match does not catch it.

So you really should strip everything from the '?' onward from a URL, and then apply the Non Follow Match.
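
To illustrate the suggestion above, here is a minimal Python sketch (not ADP's actual code). It assumes the Non Follow Match list quoted earlier, with the dot escaped so that it matches a literal '.'; the names NON_FOLLOW and should_skip are hypothetical.

```python
import re
from urllib.parse import urlsplit

# The Non Follow Match pattern from the first post, with the dot escaped
# (an assumption about what was intended).
NON_FOLLOW = re.compile(r"\.(jpg|jpeg|gif|png|bmp|pdf|swf|css|js)$", re.IGNORECASE)


def should_skip(url: str) -> bool:
    """Return True if the URL's path, ignoring any query string or
    fragment, ends in one of the non-follow extensions."""
    path = urlsplit(url).path  # drops '?query' and '#fragment'
    return NON_FOLLOW.search(path) is not None


url = "http://www.site.com/_frameworks_static/js/scriptaculous.js?load=effects"
print(NON_FOLLOW.search(url) is not None)  # pattern applied to the full URL
print(should_skip(url))                    # pattern applied to the path only
```

Applied to the full URL, the `$` anchor sits after `?load=effects`, so the pattern never sees the `.js` ending; applied to the path alone, it does.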

The problem with these:
Code:
http://site.co.uk/x.oSrc;

is most likely that ADP is not extracting the links from that particular JavaScript correctly. It is a bit hard to come up with a regular expression that would match links in all the JavaScript junk people like to code.


Well, this can't be done with REs: js is a programming language and thus cannot be parsed reliably with regular expressions. But again, in order to avoid unnecessary traffic, why don't you apply an RE to check that you at least have a valid (or reasonable) URL before going out to the net? In this example, a URL which terminates in ';' could be skipped.
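
A rough Python sketch of such a pre-flight check, under my own assumptions about what counts as "reasonable": the function name looks_fetchable and the list of rejected characters are hypothetical and would need tuning against real sites.

```python
from urllib.parse import urlsplit

# Characters that rarely appear unescaped in real links but often leak in
# when a crawler scrapes strings out of JavaScript source (an assumption).
SUSPICIOUS_CHARS = set(" \t\r\n'\"{}<>")


def looks_fetchable(url: str) -> bool:
    """Cheap plausibility check to run before issuing an HTTP request."""
    parts = urlsplit(url)
    # Only plain web links with a host are worth requesting.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    # A path ending in ';' is almost certainly a JavaScript statement fragment.
    if parts.path.endswith(";"):
        return False
    # Reject URLs containing characters typical of script code.
    if any(ch in SUSPICIOUS_CHARS for ch in url):
        return False
    return True


print(looks_fetchable("http://site.co.uk/x.oSrc;"))
print(looks_fetchable("http://www.site.com/index.html"))
```

This does not try to parse JavaScript at all; it only filters out candidates that are obviously not URLs, which is enough to avoid the wasted request in the example.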

Btw: Sorry for posting this in the General Forum; please shift over to the Bugs Arena.

Happy New Year !

Alain

Post by pgolovko » 01/01/2007 2:47 am

Thank you Alain, I will look deeper into these URL related problems for the next ADP release.
There have been numerous bug reports lately, which makes me think that I might not be able to release the next ADP version in January, though I will do my best. The more reports I get from ADP users, the better ADP will become. I thank you again.
__________________
Best regards,
Pavel Golovko

Post by trapprs » 01/01/2007 1:35 pm

You could make it so that when it sees a file with an extension other than those it is supposed to index, it skips the file.

Side note: make sure you add a pattern so that it can still crawl PHP pages (such as the /text\/html/ one).
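
One way to read this suggestion is as an allowlist instead of a blocklist. The Python sketch below is only an illustration; the extension set and the name is_indexable are assumptions, not anything ADP ships with.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical set of extensions that usually serve HTML. The empty string
# covers directory-style URLs such as "http://site.com/news/".
INDEXABLE_EXTENSIONS = {"", ".html", ".htm", ".php", ".asp", ".aspx", ".jsp"}


def is_indexable(url: str) -> bool:
    """Return True if the URL's path has no extension or one that
    normally serves HTML; everything else is skipped unfetched."""
    path = urlsplit(url).path  # ignore the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in INDEXABLE_EXTENSIONS


print(is_indexable("http://www.site.com/index.php?id=3"))
print(is_indexable("http://www.site.com/js/scriptaculous.js?load=effects"))
```

An allowlist is stricter than the Non Follow Match: unknown extensions are skipped by default, so pages served under unusual extensions would be missed unless they are added to the set.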

