One of the things I wanted indexedbygoogle.com to do was discard obvious errors when a URL is entered, for example, site instead of site.com (a weird typo, but bear with me). This would be a very shallow check: nothing major like downloading the page or checking the headers, just a simple domain check so that things like abdd or dodah/post were not processed, i.e., a check for whether the .com or .net (etc.) was missing.
I thought urlparse would do the job: feed it the URL and it would spit out a tuple breaking the URL into its component parts. But that did not work. It surprised me that urlparse would not recognize a domain as the network location when the scheme was missing; I expected it to parse that successfully.
For example:
>>> from urlparse import urlparse
>>> urlparse('google.com')
('', '', 'google.com', '', '', '')
>>>
The domain should be the second item in the tuple, the network location, but it's the third item, the path. Including the scheme, however, parses the URL correctly:
>>> urlparse('http://google.com')
('http', 'google.com', '', '', '', '')
>>>
Here, the scheme 'http' and the network location 'google.com' are the first and second items in the tuple and things are as they should be.
So I had to include the scheme if I wanted to parse the URL. No problem. A simple line like so solved the issue:
>>> if not url.lower().startswith('http://'): url = 'http://' + url
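The same idea can be wrapped in a small helper. This is only a minimal sketch: the names ensure_scheme and netloc_of are hypothetical, the check for '://' (rather than just 'http://') is my own variation so that https URLs are left alone, and the fallback import assumes Python 3, where urlparse lives in urllib.parse.

```python
try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3 (assumed fallback)


def ensure_scheme(url, default='http://'):
    """Prepend a default scheme when the URL has none (hypothetical helper)."""
    if '://' not in url:
        return default + url
    return url


def netloc_of(url):
    """Return the network location, the second item of the urlparse tuple."""
    return urlparse(ensure_scheme(url))[1]
```

With this in place, netloc_of('google.com') and netloc_of('http://google.com') both give 'google.com'.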
But at this point I was nowhere near a solution, because a URL like 'http://garbage' was also parsed with unsatisfactory results:
>>> urlparse('http://garbage')
('http', 'garbage', '', '', '', '')
I wanted 'garbage' to be recognized as the path and not the network location, since it was missing the top level domain (.com, .net, etc.). But you know what, that is not what urlparse is designed to do. In the end, I had to come up with my own algorithm. I still needed to use urlparse, but with a few extra lines of code thrown in to validate the network location. I did not want to go nuts here. It was enough that the URL had a valid top level domain. For example, 'google.ca' would pass, but 'google.commo' or 'google' would fail.
In the end, I got a list containing all 250+ top level domains from http://data.iana.org/TLD/tlds-alpha-by-domain.txt. I used urlparse to parse the URL so that I could isolate the network location into a variable called site. The following bit of code did the rest:
from urlparse import urlparse

DOMAINS = ['com', 'net', 'edu', 'gov', 'info']  # etc...

def is_domain(url):
    # urlparse returns (scheme, netloc, path, params, query, fragment)
    scheme, site, a, b, c, d = urlparse(url)
    try:
        domain, tld = site.rsplit('.', 1)
    except ValueError:
        # An error here signifies a URL without a top level domain,
        # e.g. google was entered instead of google.com
        return False
    return tld.lower() in DOMAINS
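For completeness, here is one way the list of top level domains could be built from the IANA file rather than typed by hand. Treat it as a sketch under a few assumptions: the file is taken to hold one uppercase TLD per line with comment lines starting with '#', the SAMPLE_TLD_TEXT string is a made-up stand-in for the downloaded file, parse_tlds is a hypothetical helper, and is_domain takes the TLD set as a second argument, which differs from the function above.

```python
try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3 (assumed fallback)

# Made-up stand-in for the contents of tlds-alpha-by-domain.txt.
SAMPLE_TLD_TEXT = """# sample header line
COM
NET
CA
"""


def parse_tlds(text):
    """Collect lowercase TLDs, skipping blank lines and '#' comments."""
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        tlds.add(line.lower())
    return tlds


def is_domain(url, tlds):
    """Return True if the network location ends in a known TLD."""
    site = urlparse(url)[1]
    if '.' not in site:
        return False  # e.g. 'garbage' has no top level domain at all
    domain, tld = site.rsplit('.', 1)
    return tld.lower() in tlds


TLDS = parse_tlds(SAMPLE_TLD_TEXT)
print(is_domain('http://google.ca', TLDS))     # True
print(is_domain('http://google.commo', TLDS))  # False
print(is_domain('http://garbage', TLDS))       # False
```

Using a set instead of a list keeps the membership test cheap, although with a few hundred entries either would do.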
And the lesson I learned from all this: Python is explicit. Don't expect the standard library to do something it was not designed to do. My assumption was that because urlparse parsed the URL into its component parts, it would automatically recognize erroneous network locations. But it did not, and in the end I wrote a function that did what I wanted, and it was easy and fun.
Labels: cgi, indexedbygoogle.com, python, urlparse