<p><a href="https://thephysicsvirtuosi.com/posts/old/a-tweet-is-worth-at-least-140-words/">A Tweet is Worth (at least) 140 Words</a> (Alemi)</p>
<div><p><a href="http://2.bp.blogspot.com/-VJ3MBvt13Z4/Tl2Q7Z4J5WI/AAAAAAAAAWw/GG50fsyHvoo/s1600/twittercompression.png"><img alt="image" src="http://2.bp.blogspot.com/-VJ3MBvt13Z4/Tl2Q7Z4J5WI/AAAAAAAAAWw/GG50fsyHvoo/s400/twittercompression.png"></a></p>
<p>So, I recently read <a href="http://books.google.com/books?id=fXxde44_0zsC&printsec=frontcover&dq=An+Introduction+to+Information+Theory&hl=en&ei=7opdTrjhMMXrOarHmdIC&sa=X&oi=book_result&ct=result&resnum=1&ved=0CC0Q6AEwAA#v=onepage&q&f=false">An Introduction to Information Theory: Symbols,
Signals and
Noise</a>.
It is a very nice popular introduction to <a href="http://en.wikipedia.org/wiki/Information_Theory">Information
Theory</a>, a modern
scientific pursuit to quantify information started by <a href="http://en.wikipedia.org/wiki/Claude_Shannon">Claude
Shannon</a> in 1948. This got
me thinking. Increasingly, people try to hold conversations on
<a href="http://twitter.com/">Twitter</a>, where posts are limited to 140
characters. Just how much information could you convey in 140
characters? After some coding and investigation, I created
<a href="http://pages.physics.cornell.edu/~aalemi/twitter/">this</a>, an
experimental twitter English compression algorithm capable of
compressing around 140 words into 140 characters. So, what's the story?
Warning: it's a bit of a story; the juicy bits are at the end. UPDATE:
Tomo in the comments below made <a href="http://www.saigonist.com/b/twitter-decoder-ring">a Chrome
extension</a> for the
algorithm.</p>
<h4>Entropy</h4>
<p>Ultimately, we need some way to assess how much information is contained
in a signal. What does it mean for a signal to contain information
anyway? Is 'this is a test of twitter compression.' more meaningful than
'歒堙丁顜善咮旮呂'? The first is understandable by any English speaker,
and requires 38 characters. You might think the second is meaningful to
a speaker of Chinese, but I'm fairly certain it is gibberish, and it takes only
8 characters. But the thing is, if you put those 8 characters into <a href="http://pages.physics.cornell.edu/~aalemi/twitter/">the
bottom form here</a>,
you'll recover the first. So, in some sense, the two messages are
equivalent. They contain the same amount of information. Shannon tried
to quantify how we could estimate just how much information any message
contains. Of course it would be very hard to try to track down every
intelligent being in the universe and ask them if any particular message
had any meaning to them. Instead, Shannon restricted himself to trying to
quantify how much information was contained in a message produced by a
random source. In this regard, the question of how much information a
message contains becomes a more tractable question: How unlike is a
particular message from all other messages produced by the same random
source? This question might sound a little familiar. It is similar to a
question that comes up a lot in <a href="http://en.wikipedia.org/wiki/Statistical_physics">Statistical
Physics</a>, where we are
interested in just how unlike a particular configuration of a system is
from all possible configurations of a system. In Statistical physics,
the quantity that helps us answer questions like this is the
<a href="http://en.wikipedia.org/wiki/Entropy">Entropy</a>, where the entropy is
defined as $$ S = -\sum_i p_i \log p_i $$ where p_i stands for the
probability of a particular configuration, and we are supposed to sum
over all possible configurations of the system. Similarly, for our
random message source, we can define the entropy in exactly the same
way, but for convenience, let's replace the logarithm with the logarithm
base 2. $$ S = -\sum_i p_i \log_2 p_i $$ At this point, the
<a href="http://en.wikipedia.org/wiki/Shannon_entropy">Shannon Entropy, or Information
Entropy</a> takes on a real
quantitative meaning. It reflects how many bits of information the
message source produces per character. The result of all of this aligns
quite well with intuition. If we have a source that outputs two symbols,
0 or 1, randomly, each with probability 1/2, the Shannon entropy comes
out to be 1, meaning each of the symbols of our source is worth one bit,
which we already knew. If instead of two symbols our source can output
16 symbols, say 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F, the Shannon entropy
comes out to be 4 bits per symbol, which again we should have suspected,
since four bits are exactly enough to label 16 values in <a href="http://en.wikipedia.org/wiki/Binary_numeral_system">base
2</a> (e.g. 0000 - 0,
0001 - 1, 0010 - 2, etc.). Where it begins to get interesting is when
our symbols don't all occur with equal probability. To get a sense of
this situation, I'll show 5 example outputs:</p>
<pre class="code literal-block"><span></span>'000001000100000000010000010000'
'000000000010000000000001000000'
'010100000000000000000000111000'
'010100000000000000000000111000'
'000000000100000000110000000010'
</pre>
<p>Looking at these examples, it begins to become clear that, since we have
a lot more zeros than ones, each of these messages contains less
information than in the case when 0 and 1 occur with equal probability. In
fact, if 0 occurs 90% of the time and 1 occurs 10% of the
time, the Shannon entropy comes out to be 0.47, meaning each symbol is
worth just less than half a bit. We should expect our messages in this
case to have to be about twice as long to encode the same amount of
information. In an extreme example, imagine you were trying to transmit
a message to someone in binary, but for some reason, your device had a
sticky 0 key so that every time you pushed 0, it transmitted 0 10 times
in a row. It should be clear in this case that as far as the receiver is
concerned, this is not a very efficient transmission scheme.</p>
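<p>As a quick sanity check, a few lines of Python (just a sketch, not the code behind
the demo linked above) reproduce these numbers directly from the entropy formula:</p>
<pre class="code literal-block">import math

def shannon_entropy(probs):
    """Entropy in bits per symbol: S = -sum_i p_i log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))     # fair binary source: 1.0 bit per symbol
print(shannon_entropy([1 / 16] * 16))  # 16 equally likely symbols: 4.0 bits per symbol
print(shannon_entropy([0.9, 0.1]))     # the lopsided source above: ~0.47 bits per symbol
</pre>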
<h4>English</h4>
<p>What does this have to do with anything? Well, all of that and I really
only wanted to build up a fact you already know. The fact is, the
English language is not very efficient on a per symbol basis. For
example, I'm sure everyone knows exactly what word will come at the end
of this <strong><em>____</em></strong>. There you go, I was able to express exactly the
same thought with at least 8 fewer characters. n fct, w cn d a lt bttr
[in fact, we can do a lot better], using 22 characters to express a
thought that normally takes 31 characters. In fact, Shannon has a <a href="http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf">nice
paper</a> where he
attempted to measure the entropy of the English language itself. Using
more sophisticated methods, he concludes that English has an information
entropy of between 0.6 and 1.3 bits per character; let's call it 1 bit
per character. By contrast, if each of the 27 symbols (26 letters + space)
we commonly use showed up equally frequently, we would have 4.75
bits per character available. Of course, from a practical communication
standpoint, having redundancies in human language can be a useful thing,
as it allows us to still understand one another even over noisy phone
lines and with very bad handwriting. But, with modern computers and
faithful transmission of information, we really ought to be able to do
better.</p>
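<p>To make the 4.75-versus-1 comparison concrete, here is a rough sketch of an
order-0 estimate, using single-character frequencies only (Shannon's 0.6 to 1.3
bits per character comes from much longer-range statistics, so this crude
estimate will land higher; the sample text is just an illustration):</p>
<pre class="code literal-block">import math
from collections import Counter

# Any long chunk of ordinary English will do; this short sample is illustrative.
text = "once upon a midnight dreary while i pondered weak and weary"

counts = Counter(c for c in text.lower() if c.isalpha() or c == " ")
total = sum(counts.values())
entropy = -sum(n / total * math.log2(n / total) for n in counts.values())

print(f"single-character entropy: {entropy:.2f} bits per character")
print(f"maximum for 27 symbols:   {math.log2(27):.2f} bits per character")
</pre>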
<h4>Twitter</h4>
<p>This brings me back to <a href="http://twitter.com/">twitter</a>. If you are
unaware, twitter allows users to post short, 140 character messages for
the rest of the world to enjoy. 140 characters is not a lot to go on.
Assuming 4.5 characters per word, this means that in traditionally
written English you're lucky to fit 25 words in a standard tweet. But
we know now that we can do better. In fact, if we could come up with
some kind of crazy scheme to compress English in such a way as to use
each of the 27 usual characters so that each of those characters
appeared with roughly equal probability, we've seen that we could get
4.75 bits per character. With 140 characters and 5.5 symbols per word,
this would allow us to fit not 25 words in a tweet but 120 words, a
factor of 4.8 improvement. Of course, we would have to discover this
miraculous encoding transformation, which to my knowledge remains
undiscovered. But we can do better still. It turns out that Twitter allows
you to use <a href="http://en.wikipedia.org/wiki/Unicode">Unicode</a> characters in
your tweets. Beyond enabling you to talk about Lagrangians (ℒ) and play
cards (♣), it enables international communication by including foreign
alphabets. So, in fact, we don't need to limit ourselves to the 27
commonly used English symbols. We could use a much larger alphabet,
say Chinese. I chose Chinese because there are over 20,900 Chinese
characters in Unicode. Using all of these characters, we could
theoretically encode 14.3 bits of information per character. With 140
characters, 1 bit per English character, and 5.5 symbols per English
word, we could then fit over 365 English words in a single
tweet. But alas, we would have to discover some magical encoding
algorithm that could map typed English to Chinese characters such that
each of the Chinese symbols occurred with equal probability. I wasn't
able to do that well, but I did make an attempt.</p>
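<p>The arithmetic behind those word counts is simple enough to spell out. The
constants below are the rough figures quoted above, so treat the outputs as
estimates rather than exact numbers:</p>
<pre class="code literal-block">import math

TWEET_CHARS = 140
BITS_PER_ENGLISH_CHAR = 1.0  # Shannon's rough estimate for English
CHARS_PER_WORD = 5.5         # ~4.5 letters plus a space

def words_per_tweet(alphabet_size):
    """English words that fit in a tweet if every tweet character carried
    log2(alphabet_size) bits."""
    bits_per_tweet = TWEET_CHARS * math.log2(alphabet_size)
    bits_per_english_word = BITS_PER_ENGLISH_CHAR * CHARS_PER_WORD
    return bits_per_tweet / bits_per_english_word

print(words_per_tweet(27))     # ~121 words with the 27 usual English symbols
print(words_per_tweet(20900))  # ~365 words with ~20,900 Chinese characters
</pre>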
<h4>My Attempt</h4>
<p>So, I tried to compress the English language, and design an effective
mapping from written English to the Chinese character set of Unicode. We
know that we aim to have each of these Chinese characters occur with
equal probability, so my algorithm was quite simple. Let's just look at
a bunch of English and see which pair of characters occurs with the
highest probability, and map that pair to the first Chinese character in the
Unicode set. Replace every occurrence of that pair in the text, rinse, and repeat.
This technique is guaranteed to reduce the probability at which the most
common character occurs at every step, by taking some of its occurrences
and replacing them, so it at least aims to achieve our ultimate goal.
That's it. Of course, I tried to bootstrap the algorithm a little bit by
first mapping the most common 1500 words to their own symbols. For
example, consider the first stanza of <a href="http://en.wikipedia.org/wiki/The_raven">The Raven by Edgar Allan
Poe</a>:</p>
<pre class="code literal-block"><span></span>Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore--
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
"'Tis some visiter," I muttered, "tapping at my chamber door--
Only this and nothing more."
</pre>
<p>The most common character is ' ' (the space). The most common pair is 'e
' (e followed by space), so let's replace 'e ' with the first Chinese
Unicode character '一'. We obtain:</p>
<pre class="code literal-block"><span></span>Onc一upon a midnight dreary, whil一I pondered, weak and weary,
Over many a quaint and curious volum一of forgotten lore--
Whil一I nodded, nearly napping, suddenly ther一cam一a tapping,
As of som一on一gently rapping, rapping at my chamber door.
"'Tis som一visiter," I muttered, "tapping at my chamber door--
Only this and nothing more.'
</pre>
<p>So we've reduced the number of spaces a bit. Doing one more step, now
the most common pair of characters is 'in', which we replace by '丁'
obtaining:</p>
<pre class="code literal-block"><span></span>Onc一upon a midnight dreary, whil一I pondered, weak and weary,
Over many a qua丁t and curious volum一of forgotten lore--
Whil一I nodded, nearly napp丁g, suddenly ther一cam一a tapp丁g,
As of som一on一gently rapp丁g, rapp丁g at my chamber door.
"'Tis som一visiter," I muttered, "tapp丁g at my chamber door--
Only this and noth丁g more.'
</pre>
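<p>For the curious, a minimal sketch of the greedy pair-replacement step looks
something like the following (this is not the gist linked at the end of the
post, and it skips the common-word bootstrapping; the idea is essentially what
is now called byte-pair encoding):</p>
<pre class="code literal-block">from collections import Counter

def compress(text, n_rounds=500, first_codepoint=0x4E00):
    """Greedily replace the most common adjacent character pair with an
    unused CJK character ('一' is U+4E00), returning the shortened text
    and the replacement table."""
    table = {}
    for i in range(n_rounds):
        pairs = Counter(a + b for a, b in zip(text, text[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count == 1:  # no pair repeats, so nothing left to gain
            break
        symbol = chr(first_codepoint + i)
        text = text.replace(pair, symbol)
        table[symbol] = pair
    return text, table

def decompress(text, table):
    """Undo the replacements in reverse order of creation."""
    for symbol, pair in reversed(list(table.items())):
        text = text.replace(symbol, pair)
    return text
</pre>
<p>Each round strictly shrinks the text, and decompression just plays the
replacement table back in reverse.</p>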
<p>And so on. The end results of the effort are <a href="http://pages.physics.cornell.edu/~aalemi/twitter/">demoed
here</a>. Feel free to
play around with it. For the most part, typing some standard English, I
seem to be able to get compression ratios around 5 or so. Let me know
how it does for you. I'll leave you with this final message:</p>
<pre class="code literal-block"><span></span>儌咹乺悃巄格丌凣亥乄叜
</pre>
<p>Python code that I used to do the heavy lifting is available as <a href="https://gist.github.com/1182747">a
gist</a>.</p></div>
<p>Tags: entropy, information theory, python, twitter. Posted Tue, 30 Aug 2011.</p>
<p><a href="https://thephysicsvirtuosi.com/posts/old/darts/">Darts</a> (Alemi)</p>
<div><p><a href="http://4.bp.blogspot.com/_YOjDhtygcuA/TS9kYgpoSMI/AAAAAAAAAPc/2xDWZVjOGC8/s1600/dart_target.jpg"><img alt="image" src="http://4.bp.blogspot.com/_YOjDhtygcuA/TS9kYgpoSMI/AAAAAAAAAPc/2xDWZVjOGC8/s320/dart_target.jpg"></a></p>
<p>Over break I went out with a buddy of mine and played some darts. This
got me to thinking, where exactly should someone aim in order to get the
largest expected number of points? Now, when you are playing a
game like <a href="http://en.wikipedia.org/wiki/Cricket_(darts)">Cricket</a>, where
you should aim is fairly obvious: you are trying to hit particular
numbers on the board. But in the most popular darts game,
<a href="http://en.wikipedia.org/wiki/Darts#Playing_darts">501</a>, for most of
the game you are just trying to accumulate points. So, where should you
shoot on the board to get the most points? Well, something that I didn't
quite realize before I started this adventure is that while the double
bullseye in the center is worth 50 points, the triple 20 is worth more:
60 points. For the uninitiated, in games like 501 you score points based
on where the dart falls. The center is the bullseye, where the innermost
circle is worth 50 and the ring around it is worth 25. Outside of that,
you score depending on which of the pie slices the dart lands in, the
points being the number on the slice. The thin ring around the outside
is worth double points, and the thin ring at about half the board
radius is worth triple points. So perhaps the triple 20 is where you
should be aiming all the time. But you'll notice that to the left and
right of the 20 section are low numbers 1 and 5. So you might suspect
that if you can't throw all that accurately, you'll be paying a price
for shooting at the triple 20.</p>
<h4>The Model</h4>
<p><a href="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9mjSvWVMI/AAAAAAAAAPk/Q3dKlgTH47M/s1600/dartsdistsig1p0.png"><img alt="image" src="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9mjSvWVMI/AAAAAAAAAPk/Q3dKlgTH47M/s320/dartsdistsig1p0.png"></a></p>
<p>In order to answer a question like that, we need to develop a model for
dart throwing. In this case, I thought it was safe to assume that dart
throws are <a href="http://en.wikipedia.org/wiki/Normal_distribution">normally
distributed</a> about the
place you aim, with some sigma determined by your skill level. To the
left is an example of what normally-distributed dart throws look like
when the aim is at the center and the throws have a 1 inch standard
deviation. The dashed line marks a one inch ring to give a sense of how
scattered the darts are at one standard deviation.</p>
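<p>In code, that model amounts to nothing more than two independent normal draws
per dart. A minimal sketch (my assumed setup, with sigma in inches and the
bullseye at the origin):</p>
<pre class="code literal-block">import numpy as np

rng = np.random.default_rng(0)

def throw_darts(aim_x, aim_y, sigma_inches, n=1000):
    """Simulate n dart throws aimed at (aim_x, aim_y), scattered with an
    isotropic Gaussian of the given standard deviation (in inches)."""
    x = rng.normal(aim_x, sigma_inches, n)
    y = rng.normal(aim_y, sigma_inches, n)
    return x, y

# e.g. 1000 throws aimed at the bullseye with a 1 inch sigma, as in the figure
x, y = throw_darts(0.0, 0.0, 1.0)
</pre>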
<h4>Results</h4>
<p>So off I went. Having drawn a dart board (to regulation) in Gimp and
colored each section in grayscale according to its point value, I used
Python to perform all of the necessary computations (primarily using
the ndimage package in scipy). The result can be seen below.</p>
<p><a href="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9j9ivNItI/AAAAAAAAAPU/QtuSM7MZr48/s1600/dartscircleplusdot.png"><img alt="image" src="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9j9ivNItI/AAAAAAAAAPU/QtuSM7MZr48/s400/dartscircleplusdot.png"></a></p>
<p>This image shows the optimal position on the board to aim for as a
function of how good of a player you are. The rings denote the sigmas,
and the dots the center point to aim for. The colorscale gives a
quantitative measure of the sigma, in inches. As you can see, the best
players should (and do, according to YouTube) aim for the triple 20,
since they are good enough to hit it most of the time, but once your
throw is at about a 1 inch sigma, you should be aiming for the triple 19
in the bottom left. As you can see on the numbered board at the top, the
triple 19 is buffered on either side by the 3 and the 7, which are both
2 points above the 20 section's neighbors (1 and 5). So as you might
expect if you have a reasonable chance of hitting the sections to either
side, the triple 19 offers a higher expected score in the long run. The
other limit we can understand is the limit of really bad throws. If you
have a nontrivial chance of missing the board altogether, then obviously
you should just aim for the center of the board, in the hopes that you
at least hit the thing. But the track that the
optimal aiming point takes in between the two limits is interesting. It tends to the
center (as we should expect), but it takes a curvy sort of route along
the bottom left quadrant of the board. Neat.</p>
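<p>The computation behind these results can be summarized compactly: the expected
score when aiming at a point is just the score map blurred by the throw
distribution. Here is a minimal sketch (assuming a precomputed 2D array
<code>scores</code> of point values per pixel and a known pixel scale; this is
not the original analysis script):</p>
<pre class="code literal-block">import numpy as np
from scipy import ndimage

def expected_score_map(scores, sigma_inches, pixels_per_inch):
    """Convolve the per-pixel point values with the Gaussian throw
    distribution; off-board pixels count as zero points."""
    return ndimage.gaussian_filter(scores.astype(float),
                                   sigma=sigma_inches * pixels_per_inch,
                                   mode="constant", cval=0.0)

def best_aim(scores, sigma_inches, pixels_per_inch):
    """Pixel coordinates (row, column) of the highest expected score."""
    expected = expected_score_map(scores, sigma_inches, pixels_per_inch)
    return np.unravel_index(np.argmax(expected), expected.shape)
</pre>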
<h4>Heat Maps</h4>
<p>In order to get a better feel for why the track takes the
path it does, I decided to look at heat maps of the expected score
at every location on the board for a set of given sigmas. So, in the
images below, the colors overlaid on the board indicate the relative score
expected if you aimed at that point.</p>
<p><a href="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9piZleQaI/AAAAAAAAAPs/xKR1XVK4oM0/s1600/darts-sig0p25flair.png"><img alt="image" src="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9piZleQaI/AAAAAAAAAPs/xKR1XVK4oM0/s200/darts-sig0p25flair.png"></a></p>
<p>Above is for a quarter inch sigma throw [Click to zoom]. Notice that the
triple 20 is the place to hit, as expected.</p>
<p><a href="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9pwXSiutI/AAAAAAAAAP0/UW-U3zETxkU/s1600/darts-sig0p50flair.png"><img alt="image" src="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9pwXSiutI/AAAAAAAAAP0/UW-U3zETxkU/s200/darts-sig0p50flair.png"></a></p>
<p>Above is a half inch sigma throw. The triple 20 is still in the lead,
but not by a whole lot. If your aim is as good as
a half inch sigma, the triple spots still stand out as true
features.</p>
<p><a href="http://2.bp.blogspot.com/_YOjDhtygcuA/TS9qDpBCjOI/AAAAAAAAAP8/nnP8us-V3yU/s1600/darts-sig1p00flair.png"><img alt="image" src="http://2.bp.blogspot.com/_YOjDhtygcuA/TS9qDpBCjOI/AAAAAAAAAP8/nnP8us-V3yU/s200/darts-sig1p00flair.png"></a></p>
<p>Above is a 1 inch sigma throw. Now the lower left hand quadrant has
taken over as the optimal place to throw. Notice that both the triple 16
and triple 19 make decent targets. The triple 14 also makes a showing,
due probably to its large neighbors.</p>
<p><a href="http://4.bp.blogspot.com/_YOjDhtygcuA/TS9qf0GXLiI/AAAAAAAAAQE/64Jua-PBMtE/s1600/darts-sig1p50flair.png"><img alt="image" src="http://4.bp.blogspot.com/_YOjDhtygcuA/TS9qf0GXLiI/AAAAAAAAAQE/64Jua-PBMtE/s200/darts-sig1p50flair.png"></a></p>
<p>Above is a 1.5" sigma. The triple 20 is nearly gone as a place of
interest on the board, since we are no longer good enough to really
capitalize on it. The lower left hand portion of the board is the place
to be. We've really sort of lost any distinct features of the triple
spots, and now are just looking at quadrants of the board as a whole.
Our aim seems to tend to center a bit, as we are now in a little danger
of falling off the board.</p>
<p><a href="http://2.bp.blogspot.com/_YOjDhtygcuA/TS9rAU2TGsI/AAAAAAAAAQM/uqIvzG_jqoE/s1600/darts-sig2p00flair.png"><img alt="image" src="http://2.bp.blogspot.com/_YOjDhtygcuA/TS9rAU2TGsI/AAAAAAAAAQM/uqIvzG_jqoE/s200/darts-sig2p00flair.png"></a></p>
<p>At 2" sigma, we can really only hope to aim left-of-center.</p>
<p><a href="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9rLG9LRKI/AAAAAAAAAQU/1cz6YZer9Bs/s1600/darts-sig2p50flair.png"><img alt="image" src="http://1.bp.blogspot.com/_YOjDhtygcuA/TS9rLG9LRKI/AAAAAAAAAQU/1cz6YZer9Bs/s200/darts-sig2p50flair.png"></a></p>
<p>At 2.5" sigma, we really just want to hit the board.</p>
<h4>Lesson</h4>
<p>So now I know: personally, I really ought to just aim a little left of
center.</p></div>
<p>Tags: darts, fun, python. Posted Thu, 13 Jan 2011.</p>