<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Intrepid Blog &#187; unicode</title>
	<atom:link href="http://blog.affien.com/archives/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.affien.com</link>
	<description>A few thoughts</description>
	<lastBuildDate>Mon, 01 Mar 2010 00:58:01 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Unicode to ASCII (1)</title>
		<link>http://blog.affien.com/archives/2009/06/19/unicode-to-ascii-1/</link>
		<comments>http://blog.affien.com/archives/2009/06/19/unicode-to-ascii-1/#comments</comments>
		<pubDate>Fri, 19 Jun 2009 13:19:10 +0000</pubDate>
		<dc:creator>Bas Westerbaan</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[hack]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[unicode]]></category>

		<guid isPermaLink="false">http://blog.affien.com/?p=387</guid>
		<description><![CDATA[When generating usernames from real names, which can contain non-ASCII characters, you can&#8217;t simply drop the non-ASCII characters.  For instance, danielle@blaat.org is the right e-mail address for Daniëlle; danille@blaat.org isn&#8217;t.
There&#8217;s a trick.  Unicode has a single code point for ë itself, but it also has a code point which (simplified) [...]]]></description>
			<content:encoded><![CDATA[<p>When generating usernames from real names, which can contain non-ASCII characters, you can&#8217;t simply drop the non-ASCII characters.  For instance, danielle@blaat.org is the right e-mail address for Daniëlle; danille@blaat.org isn&#8217;t.</p>
<p>There&#8217;s a trick.  Unicode has a single code point for ë itself, but it also has a combining code point which (simplified) adds ¨ on top of the previous character.  The Unicode standard defines a normal form in which (at least) every character that can be written as a base character plus such modifiers is written that way.  If you then simply drop the code points that ASCII cannot represent, you&#8217;ll get the desired result.</p>
<p>In Python: <code>unicodedata.normalize('NFKD', txt).encode('ASCII', 'ignore')</code>.</p>
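<p>As a minimal sketch of the one-liner above, assuming Python 3 (where <code>encode</code> returns bytes, so the result is decoded back to a string); the function name <code>to_ascii</code> is made up for illustration:</p>

```python
import unicodedata


def to_ascii(txt):
    """Decompose characters (NFKD), then drop whatever ASCII cannot represent."""
    decomposed = unicodedata.normalize('NFKD', txt)
    # 'ignore' silently drops the combining marks (and any other non-ASCII code point).
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Dropping non-ASCII characters without normalizing first loses the whole letter.
naive = 'Daniëlle'.encode('ascii', 'ignore').decode('ascii')

print(naive)                 # Danille
print(to_ascii('Daniëlle'))  # Danielle
```

<p>Lower-casing the result and appending the domain would then give the e-mail address from the example.</p>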
<p>However, this isn&#8217;t <em>the</em> right solution. For instance, in German, one prefers ue over u as a replacement for ü.</p>
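<p>One possible way around that, sketched here as an assumption rather than a complete answer, is to apply a language-specific replacement table before decomposing. The table <code>GERMAN</code> and the function <code>to_ascii_german</code> are hypothetical names, and the table only covers the usual German conventions:</p>

```python
import unicodedata

# Hypothetical replacement table for German; other languages need their own.
GERMAN = {
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
}


def to_ascii_german(txt):
    """Apply German replacements first, then fall back to NFKD plus dropping."""
    # Compose first, so that u + combining diaeresis also matches the 'ü' key.
    composed = unicodedata.normalize('NFC', txt)
    replaced = ''.join(GERMAN.get(ch, ch) for ch in composed)
    decomposed = unicodedata.normalize('NFKD', replaced)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii_german('Müller'))  # Mueller
print(to_ascii_german('Straße'))  # Strasse
```

<p>Note that this treats every ë, ü and so on the same way regardless of the name&#8217;s origin, so the language still has to be known up front.</p>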
]]></content:encoded>
			<wfw:commentRss>http://blog.affien.com/archives/2009/06/19/unicode-to-ascii-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
