Sunday, April 12, 2009

List Comprehension for filtering

In the previous blog post, I wrote about the list of domains I use to check whether a domain looks valid for indexedbygoogle.com. The list is retrieved from http://data.iana.org/TLD/tlds-alpha-by-domain.txt, with much thanks to IANA. They seem to update the list on a regular basis, so it will do as a resource for now. Here is the bit of code used to download the file and turn it into a list. I want to exclude the first line, which is a comment (its first character is a #), and any empty lines.

import urllib2

url = 'http://data.iana.org/TLD/tlds-alpha-by-domain.txt'

domain_file = urllib2.urlopen(url).read()
domain_list = domain_file.split('\n')
DOMAINS = [tld for tld in domain_list if not (tld.startswith('#') or tld == '')]

The interesting bit is in the last line. The if part of the list comprehension will filter out any blanks and any list items that start with a #.
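
To try the filter without hitting the network, here is a small sketch that runs the same comprehension over a made-up sample string. The name parse_tlds and the sample text (including its header line) are assumptions for illustration only, not the real file contents.

```python
# Hypothetical sample standing in for the downloaded file contents.
sample = "# Version 2009041200, Last Updated Sun Apr 12 2009\nAC\nAD\n\nCOM\n"

def parse_tlds(text):
    """Return the lines of text that are neither blank nor comments."""
    lines = text.split('\n')
    # Same condition as above: drop '#' comment lines and empty strings.
    return [tld for tld in lines if not (tld.startswith('#') or tld == '')]

print(parse_tlds(sample))
```

The trailing newline in the sample produces an empty string at the end of the split, which is exactly the kind of item the tld == '' test removes.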

It does not make sense to make this part of the CGI app itself since the code above will run whenever a user uses indexedbygoogle.com. A better alternative might be to pickle DOMAINS to a file and load it on demand, updating the contents of DOMAINS daily or weekly via a cron job.
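
A rough sketch of the pickling idea might look like the following. The function names save_domains and load_domains and the cache path are hypothetical; a cron job would call the first and the CGI app the second.

```python
import os
import pickle
import tempfile

def save_domains(domains, path):
    """Write the list to path; the scheduled job would call this."""
    with open(path, 'wb') as f:
        pickle.dump(domains, f)

def load_domains(path):
    """Read the list back; the CGI app would call this on demand."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Hypothetical cache location, used only for this demonstration.
cache_path = os.path.join(tempfile.mkdtemp(), 'tlds.pickle')
save_domains(['AC', 'AD', 'COM'], cache_path)
print(load_domains(cache_path))
```

This way the download happens once per cron run rather than once per visitor, and the app only pays for reading a small local file.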

Finally, here it is condensed into a one-liner, but keep in mind that this might not be the best way to write it, since it sacrifices readability and clarity for less code. Clarity should always trump brevity! :)

DOMAINS = [tld for tld in urllib2.urlopen('http://data.iana.org/TLD/tlds-alpha-by-domain.txt').read().split('\n') if not (tld.startswith('#') or tld == '')]


Wednesday, April 08, 2009

Urlparse can't do everything

One of the things I wanted indexedbygoogle.com to do was discard obvious errors when a URL is entered. For example, site instead of site.com (a weird typo, but bear with me). This would be a very shallow check. Nothing major like downloading the page, checking the header, etc. Just a simple domain check so that things like abdd or dodah/post were not processed, i.e., check whether the .com or .net (etc.) was missing.

I thought urlparse would do the job. Feed it the URL and it would spit out a tuple breaking the URL into its component parts. But that did not work. It surprised me that urlparse did not recognize a domain entered without a scheme; I had expected it to parse that successfully.

For example:

>>> from urlparse import urlparse
>>> urlparse('google.com')
('', '', 'google.com', '', '', '')
>>>
The domain should be the second item in the tuple, the network location, but it's the third item, the path. Including the scheme, however, parses the URL correctly.

>>> urlparse('http://google.com')
('http', 'google.com', '', '', '', '')
>>>

Here, the scheme 'http' and the network location 'google.com' are the first and second items in the tuple and things are as they should be.

So I had to include the scheme if I wanted to parse the URL. No problem. A simple line like so solved the issue:
>>> if not url.lower().startswith('http://'): url = 'http://' + url
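
The same idea can be wrapped in a small helper. This is only a sketch: the name ensure_scheme is hypothetical, and treating https:// as already valid is my own assumption rather than something the one-liner above does.

```python
def ensure_scheme(url, default='http://'):
    """Prepend default unless url already starts with http:// or https://."""
    # str.startswith accepts a tuple of prefixes to test against.
    if url.lower().startswith(('http://', 'https://')):
        return url
    return default + url

print(ensure_scheme('google.com'))
print(ensure_scheme('HTTP://google.com'))
```

The first call gains the http:// prefix; the second is returned unchanged because the check is case-insensitive.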
But at this point, I was nowhere near a solution because a url like 'http://garbage' was also parsed with unsatisfactory results:
>>> urlparse('http://garbage')
('http', 'garbage', '', '', '', '')
I wanted 'garbage' to be recognized as the path and not the network location, since it was missing the top level domain (.com, .net, etc.). But you know what, that is not what urlparse is designed to do. In the end, I had to come up with my own algorithm. I still needed to use urlparse, but with a few extra lines of code thrown in to validate the network location. I did not want to go nuts here. It was enough that the URL had a valid top level domain. For example, 'google.ca' would pass, but 'google.commo' or 'google' would fail.

In the end, I got a list containing all 250+ top level domains from http://data.iana.org/TLD/tlds-alpha-by-domain.txt. I used urlparse to parse the URL so that I could isolate the network location into a variable called site. The following bit of code did the rest:
from urlparse import urlparse

DOMAINS = ['com','net','edu','gov','info'] # etc...

def is_domain(url):
    # urlparse returns (scheme, netloc, path, params, query, fragment)
    scheme, site, a, b, c, d = urlparse(url)
    try:
        domain, tld = site.rsplit('.', 1)
    except ValueError:
        # an error here signifies a url without a top level domain,
        # e.g. google was entered instead of google.com
        return False
    if tld.lower() in DOMAINS:
        return True
    return False
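
To see the check in action, here is a self-contained sketch that runs the examples mentioned earlier. The try/except import is only there so it also runs on Python 3, where urlparse lives in urllib.parse, and the short DOMAINS list (with 'ca' added) is an illustrative subset, not the full IANA list.

```python
try:
    from urlparse import urlparse        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlparse    # Python 3 location

DOMAINS = ['com', 'net', 'edu', 'gov', 'info', 'ca']  # illustrative subset

def is_domain(url):
    """Return True if the network location ends in a known TLD."""
    site = urlparse(url)[1]  # index 1 is the network location
    try:
        domain, tld = site.rsplit('.', 1)
    except ValueError:
        return False  # no dot at all, e.g. 'garbage'
    return tld.lower() in DOMAINS

for candidate in ['http://google.ca', 'http://google.commo', 'http://garbage']:
    print('%s -> %s' % (candidate, is_domain(candidate)))
```

Only the first candidate passes. Note that this sketch compares against lower-case entries, so a list loaded from a file in a different case would need lower-casing first, and a network location carrying a port (such as google.com:80) would be rejected.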


And the lesson I learned from all this: Python is explicit. Don't expect the standard library to do something it was not designed to do. My assumption was that because urlparse parsed the URL into its component parts, it would automatically recognize erroneous network locations. But it did not, and in the end I wrote a function that did what I wanted, and it was easy and fun.
