9 Feb 2009

Converting accented characters to plain ASCII

Posted by ged

Today, I wanted to improve our blog-title-to-permalink function so that (French) accented characters are not simply stripped but converted to their unaccented equivalents. For example, “é” would be converted to “e”.

After some googling and (slightly) tweaking what I found, here is the function I use:

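# Translation table for Python 2 byte strings: bytes 0-191 map to themselves,
# bytes 192-255 (the Latin-1 accented letters) map to plain ASCII look-alikes.
noaccents_table = ''.join(map(chr, range(192))) + \
    "AAAAAAACEEEEIIIIDNOOOOOxOUUUUYTsaaaaaaaceeeeiiiidnooooo/ouuuuyty"

def latin1_to_ascii(u_str):
    # Characters outside Latin-1 come out as '?' because of 'replace'.
    return u_str.encode('latin1', 'replace').translate(noaccents_table)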

As you can see, it takes a unicode string as its argument. Here is how you use it:

>>> latin1_to_ascii(u'évidemment')
'evidemment'
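
The function above relies on Python 2 string behaviour. For what it's worth, the same lookup-table idea could be sketched in Python 3 roughly as follows. The names NOACCENTS_TABLE and latin1_to_ascii_py3 are made up for this illustration, and decoding the result back to a str is my assumption about what a caller would want:

```python
# Python 3 sketch of the same table-based approach (names are illustrative).
# Bytes 0-191 map to themselves; bytes 192-255 map to unaccented letters.
NOACCENTS_TABLE = bytes(range(192)) + (
    b"AAAAAAACEEEEIIIIDNOOOOOxOUUUUYTs"
    b"aaaaaaaceeeeiiiidnooooo/ouuuuyty"
)


def latin1_to_ascii_py3(text):
    """Replace Latin-1 accented letters in `text` with unaccented ones.

    Characters that do not exist in Latin-1 are turned into '?'.
    """
    encoded = text.encode("latin-1", "replace")
    # The table is 256 bytes long, as bytes.translate() requires.
    translated = encoded.translate(NOACCENTS_TABLE)
    # Bytes 128-191 are left untouched, so decode as Latin-1, not ASCII.
    return translated.decode("latin-1")


print(latin1_to_ascii_py3("évidemment"))
```

As in the original, anything outside Latin-1 (for instance “œ”) is replaced by a question mark rather than transliterated.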

Note for later: if I ever need to do it in a more generalized way (not only for latin1), the iconv module (http://pypi.python.org/pypi/iconv) might (or might not) be useful.
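
Another possible route for the more general case, which is not the iconv module mentioned above but the standard library's unicodedata module, is to decompose each character and drop the combining marks. This is only a minimal Python 3 sketch; strip_accents is a name I picked for the example:

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("évidemment"))
```

Unlike the table above, this is not limited to Latin-1, but letters that have no decomposition (such as “ø” or “ß”) pass through unchanged, so the result is not guaranteed to be pure ASCII.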
