unicode – Intrepid Blog

When you want to generate usernames from real names, which can contain non-ASCII characters, you can't simply drop the non-ASCII characters. For instance, danielle@blaat.org is the right e-mail address for Daniëlle; danille@blaat.org isn't.

There's a trick. Unicode has a single code point for ë itself, but it also has a combining code point which (simplified) adds ¨ on top of the previous character. The Unicode standard defines normal forms (NFD and NFKD) in which every character that can be split this way is represented as a base character followed by such combining marks. If you then simply drop the code points that can't be represented in ASCII, you'll get the desired result.

In Python: unicodedata.normalize('NFKD', txt).encode('ASCII', 'ignore'). (On Python 3 this returns bytes; append .decode('ASCII') to get a str back.)
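As a rough sketch, the one-liner can be wrapped in a small helper. The name to_ascii is made up for illustration, and the code assumes Python 3, where encode() returns bytes that need decoding back to a str:

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Hypothetical helper: strip accents by decomposing, then dropping non-ASCII."""
    # NFKD splits 'ë' into 'e' followed by the combining diaeresis U+0308.
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently drops every code point that has no ASCII encoding,
    # which removes the combining marks and leaves the base letters.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


username = to_ascii('Daniëlle').lower()
print(username)  # danielle
```

Note that anything without an ASCII base letter disappears entirely, so the result is worth checking for being empty before using it as a username.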

However, this isn't always the right solution. For instance, in German one prefers ue over u as a replacement for ü.
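One possible way around this, sketched below, is to apply a language-specific replacement table before falling back to the normalization trick. The table GERMAN_MAP and the function to_ascii_german are assumptions for illustration, not part of any library, and the table only covers the usual German conventions:

```python
import unicodedata

# Illustrative table of common German transliterations (an assumption,
# not an exhaustive or authoritative list).
GERMAN_MAP = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',  # ß has no decomposition, so NFKD plus 'ignore' would drop it
})


def to_ascii_german(text: str) -> str:
    """Hypothetical helper: German-style replacements first, then the generic fallback."""
    # Compose first, so 'u' + combining diaeresis is seen as the single code point 'ü'.
    composed = unicodedata.normalize('NFC', text)
    replaced = composed.translate(GERMAN_MAP)
    # Generic fallback for everything the table does not cover.
    decomposed = unicodedata.normalize('NFKD', replaced)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii_german('Müller'))  # Mueller
print(to_ascii_german('Straße'))  # Strasse
```

The catch is that the same character needs different treatment depending on the language: with this table Daniëlle still becomes Danielle, but a Dutch name containing ü would get the German ue, so the language of the name has to be known up front.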

Unicode to ASCII