Skip to main content

Comma Free Codes

We awe at Donald Knuth. I wondered, if I can understand a subject taught by Knuth and derive satisfaction of learning something directly from the master. I attended his most recent lecture on "comma free codes", felt that it was accessible and could be understood by putting some effort. This is my attempt to grasp the topic of "comma free codes", taught by Knuth for his 21st annual christmas tree lecture on Dec 2015. We will use some definitions directly from Williard Eastman's paper, reference the topics in wikipedia, look at Knuth's explanation.

We talk of codes in the context of information theory. A code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form or representation. A sequence of symbols, like a sequence of binary symbols, sequence of base-10 decimals or a sequence of English language alphabets can all be termed as "code". A block code is a set of codes having the same length.

Comma Free Block Code

Comma free code is a code that can be easily synchronized without any external unit like comma or space, "likethis". Comma free block code is set of same length codes having the comma free property.

The four letter words in "goodgame" is recognizable, it easy to derive those as "good" and "game". Other possible substring four letter words in that phrase "oodg", "odga", "dgga" are invalid words in english (or non code-words) and thus we did not have any problem separating the codewords when they were not separated by delimiters like space or comma. Anecdotally, Chinese and Thai languages do not use space between words.

Take an alternate example, "fujiverb". Can you say deterministically if the word "jive" is my code word? Or my code words consists only of "fuji" and "verb". You cannot determine it from this message and thus, "fuji" and "verb" do not form valid a "comma free block codes".

The same applies to a periodic code word like "gaga". If a message "gagagaga" occurs, then the middle word "gaga" will be ambiguous as it is composed of 2-letter suffix and a 2-prefix of our code word and we wont be able to differentiate it.

Mathematical definition

Comma free code words are defined like this.

A block code, C containing words of length n is called comma free if, and only if, for any words \(w = w_1, w_2 ... w_n. \: and \: x = x_1, x_2 ... x_n\) belonging to C, the n letter overlaps \(w_k ... w_nx_1 .... x_{k-1} (k = 2, ... n)\) are not words in the code.

This simply means that if two code words are joined together, than in that joined word, any substring from second letter to the last of the block code length should not be a code word.

How to find them?

Backtracking.

The general idea to find comma free block codes is use a backtracking solution and for every word that we want to add to the list, prune through through already added words and find if the new word can be a substring of two words joined together from the existing list. Knuth gave a demo of finding the maximum comma free subset of the four letter words.

commafree_check.py (Source)

def check_comma_free(input_string):
  if check_periodic(input_string):
    print("input string is periodic, it cannot be commafree.")
    return
  if len(comma_free_words) == 0:
    comma_free_words.append(input_string)
  else:
    parts = get_parts(input_string)
    for head, tail in parts:
      if (any_starts_with(head) and any_ends_with(tail)) or (any_starts_with(tail) and any_ends_with(head)):
        print("%s|%s are part of the previous words." % (head, tail))
        return
    comma_free_words.append(input_string)

This logic is dependent on the order in which comma free block codes are analyzed. For finding a maximal set in a given alphabet size in any order a proper backtracking based solution should be devised, which considers all the cases of insertions.

How many are there?

Backtracking based solution requires us to intelligently prune the search space. Finding effective strategies for pruning the search space becomes our the next problem in finding the comma free codes. We will have to determine how many comma free block codes are possible for a given alphabet size and for a given length.

For 4 letter words, (n = 4) of the alphabet size m, we know that there are \(m^4\) possible words (permutation with repetition). But we're restricted to aperiodic words of length 4, of which there are \(m^4 - m^2\). Notice further that if word, item has been chosen, we aren't allowed to include any of its cyclic shifts temi, emit*, or mite, because they all appear within itemitem. Hence the maximum number of codewords in our commafree code cannot exceed \((m^4 - m^2)/4\).

Let us consider the binary case, m = 2 and length n = 4, C(2, 4). We can choose four-bit "words" like this.

[0001] = {0001, 0010, 0100, 1000},

[0011] = {0011, 0110, 1100, 1001},

[0111] = {0111, 1100, 1101, 1011},

The maximum number of code words from our formula will be \(2^4 - 2^2/4 \: = \: 3\). Can we choose three four-bit "words" from the above cyclic classes? Yes and choosing the lowest in each cyclic class will simply do. But choosing the lowest will not work for all n and m.

In the class taught by Knuth, we analyzed the choosing codes when m = 3 {0, 1, 2} and for n = 3, C(3, 3). The words in the category were

000 111 222 # Invalid since they are periodic

001 010 100 # A set of cyclic shifts, only one can taken as a valid code word.

002 020 200

011 110 101

012 120 201

021 210 102

112 121 211

220 202 022

221 212 122

The number 3-alphabet code words of length 3 is 27 ( = \(3^3\)). The set of valid code words in this will be \((3^3-3) / 3 = 8\).

Choosing the lowest index will not work here for e.g, if we choose 021 and 220, and we send the word 220021 the word 002 is conflicting as it is part of our code word. With any back-tracking based solution, we will have to determine the correct non-cyclic words to choose in each set to form our maximal set of 8 code words.

The problem of finding comma free code words increases exponentially to the size of the length of the code word and on the code word size. For e.g, The task of finding all four-letter comma free codes is not difficult when m = 3, and only 18 cycle classes are involved. But it already becomes challenging when m = 4, because we must then deal with \((4^4 - 4^2) / 4 = 60\) classes. Therefore we'll want to give it some careful thought as we try to set it up for backtracking.

Willard Eastman came up with clever solution to find a code word for any odd word length n over an infinite alphabet size. Eastman proposed a solution wherein if we give a n letter word (n should be odd), the algorithm will output the correct shift required to make the n letter word a code word.

Eastman's Algorithm

Construction of Comma Free Codes

The following elegant construction yields a comma free code of maximum size for any odd block length n, over any alphabet.

Given a sequence of \(x =x_0x_1...x_{n-1}\) of nonnegative integers, where x differs from each of its other cyclic shifts \(x_k...x_{n-1}x_0..x_{k-1}\) for 0 < k < n, the procedure outputs a cyclic shift \(\sigma x\) with the property that the set of all such \(\sigma x\) is a commafree.

We regard x as an infinite periodic sequence \(<x_n>\) with \(x_k = x_{k-n}\) for all \(k \ge n\). Each cyclic shift then has the form \(x_kx_{k+1}...x_{k+n-1}\). The simplest nontrivial example occurs when n = 3, where \(x=x_0 x_1 x_2 x_0 x_1 x_2 x_0 ...\) and we don't have \(x_0 = x_1 = x_2\). In this case, the algorithm outputs \(x_kx_{k+1}x_{k+2}\) where \(x_k > x_{k+1} \le x_{k+2}\); and the set of all such triples clearly satisfies the commafree condition.

The idea expressed is to choose a triplet (a, b, c) of the form.

\begin{equation*} a \: \gt b \: \le c \end{equation*}

Why does this work?

If we take two words, xyz and abc following this property, combining them we have,

\begin{equation*} x \: \gt y \: \le z \quad a \: \gt b \: \le c \end{equation*}
  • yza cannot be a word because z cannot be > than y.

  • zab cannot be a word because a cannot be < than b.

There by none of the substrings will be a code word and we can satisfy the comma free property.

And if we use this condition to determine the code words in our C(3,3) set, we will come up with the following codes which can form valid code words.

000 111 222
001 010 100
002 020 200
011 110 101
012 120 201
021 210 102
112 121 211
220 202 022
221 212 122

The highlighted words will form valid code words and all of these satisfy the criteria, \(a \: \gt b \: \le c\) Now, if you are given a word like 211201212, you know for sure that they are composed of 211, 201 and 212 as none of other intermediaries like (112, 120, 201, 012, 121) occur in our set.

Eastman's algorithm helps in finding the correct shift required to make any word a code word.

For e.g,

Input: 001 Output: Shift by 2, thus producing 100

Input: 221 Output: Shift by 1, thus producing 212

And the beauty is, it is not just for words of length 3, but for any odd word length n.

The key idea is to think of x as partitioned into t substrings by boundary marked by \(b_j\) where \(0 \le b_0 \lt b_1 \lt ... \lt b_{t-1} < n\) and \(b_j = b_{j-t} + n\) for \(j \ge t\). Then substring \(y_j\) is \(x_{b_j} x_{b_{j+1}-1}\). The number t of substrings is always odd. Initially, t = n and \(b_j = j\) for all j; ultimately t = 1 and \(\sigma x = y0\) is the desired output.

Eastman's algorithm is based on comparison of adjacent substrings \(y_{j-1} and y_j\). If those substring have the same length, we use lexicographic comparison; otherwise we declare that the longer string is bigger.

The number of t substring is always odd because we went with an odd string length (n).

The comparison of adjacent substring form the recursive nature of the algorithm, we start with small substring of length 1 adjacent to each other and then we find compare higher length substring, whose markers have been found by the previous step. This will become clear as we look the hand demo.

http://ecx.images-amazon.com/images/I/41KZVIUGswL._SX332_BO1,204,203,200_.jpg

Basin and Ranges

It's convenient to describe the algorithm using the terminology based on the topograph of Nevada. Say that i is a basin if the substrings satisfy \(y_{i-1} \gt y_i \le y_{i+1}\). There must be at least one basin; otherwise all the \(y_i\) would be equal, and x would equal one of its cyclic shifts. We look at consecutive basins, i and j; this means that i < j and that i and j are basins, and that i+1 through j - 1 are not basins. If there's only one basin we have \(j = i + t\). The indices between consecutive basins are called ranges.

The basin and ranges is Knuth's terminology, taken from the book Basin and Ranges by John McPhee which describes the topology of Nevada. It is easier to imagine the construct we are looking for if we start to think in terms of basin and ranges.

Since t is odd, there is an odd number of consecutive basins for which \(j - i\) is odd. Each round of Eastman's algorithm retains exactly one boundary point in the range between such basins and deletes all the others. The retained point is the smallest \(k = i + 2l\) such that \(y_k \gt y_{k+1}\). At the end of a round, we reset t to the number of retained boundary points, and we begin another round if t > 1.

Word of length 19

Let's work through the algorithm by hand when n = 19 and x = 3141592653589793238

Phase 1

  • First markers differentiate each character.

  • We use . to denote the cyclic repetition of the 19 letter word.

3 | 1 | 4 |  1 | 5 | 9 | 2 | 6 | 5 | 3 | 5 | 8 | 9 | 7 | 9 | 3 | 2 | 3 | 8 . 3 | 1 | 4 | 1 | 5
  • Next we go about identifying basins. We identify the basins where for any 3 numbers (a, b, c), \(a \: \gt b \le c\) and put the markers below them

  • After the cyclic repetition we see the repetition of the basin. Like the last line below 1 is same as the first line. It is the basin that is repeated.

3  1  4  1  5  9  2  6  5  3  5  8  9  7  9  3  2  3  8  3  1  4  1 5

   |     |        |        |           |        |        .  |
  • We mark the ranges as odd length or even length ones.

3  1  4  1  5  9  2  6  5  3  5  8  9  7  9  3  2  3  8  3  1  4  1 5

---|--e--|---o----|---o----|-----e-----|---o----|-----e--.--|--------
  • Next, take all the odd length basin markers, go by steps of 2, 4, 6 so on and identify the first greater than number and place the new basin markers before them.

For e.g, in 1-5-9-2. The 2 length path is "1-5-9" and first higher will be 9 and we have to place the marker ahead of it. So, the phase 0 of eastman algorithm will output, 5, 8 and 15. denoting the indices where our basins are after the first phase.

If you are watching the video with Knuth giving a demo, there is a mistake in the video that second basin identifier is placed after 5, instead of before 5 (We should go by steps of 2 and place it before the first greater than number).

3  1  4  1  5  | 9  2  6  |  5  3  5  8  9  7  9  | 3  2  3  8  . 3  1  4 1  5

Phase 2

  • In the second phase, we use the basin markers of the previous phase and compare the sub strings denoted by the basin.

  • We take the substring of length 19, but now denoted by basins. The repetition of the string in the previous steps helped us here.

9  2  6  |  5  3  5  8  9  7  9  | 3  2  3  8  3  1  4 1  5
  • We apply the algorithm recursively on the strings 926, 5358979 and 323831415. We find that the string 323831415 is greater than the rest, so we can keep the basin marker ahead of it.

9  2  6  5  3  5  8  9  7  9  | 3  2  3  8  3  1  4 1  5

At the end of Phase 2, the algorithm outputs index 15, as the shift required to create the code word out of 19 word string. And thus our code word found by the eastman's algorithm is

3  2  3  8  3  1  4 1  5  9  2  6  5  3  5  8  9  7  9

Knuth's gave a demo with his implementation in CWEB. He shared a thought that even though algorithm is expressed recursively, the iterative implementation was straight forward. For the rest of the lecture he explores the algorithm on a binary string of PI of n = 19 and finds the shift required. Also, gives the probability of Eastman's algorithm finishing in one round, that is, just the phase 1.

All these are covered as exercises and answers in the pre-fascicle 5B of his volume 5 of The Art of Computer Programming, which can be explored in further depth.

Video

References

Tidbits

Earth is spherical

Short notes from my readings.

We all know that earth is a spherical, it has been taught to us since our childhood. The notion is so common that we somehow suppress our original thought, when we observe outside, that earth could be flat.

That brings an interesting question on how and when did human first come to realize that earth is not flat. Mere observation may not help as when we see around us, both cities and plains, that land we see is infinitely flat. Occasionally, when we look at the mountains, we observe the land going up and then coming down, it is the same, but in the reverse order when we observe it in the valley. If we iron out these oddities, we still land up on the same conclusion that earth could be flat.

However, careful observation and thinking by looking at the mountain, could help us derive at a different conclusion. If we look up the mountain, we see a smooth line of the earth and the sky touching. If the earth were flat, than that smooth line could be considered the boundary, in other words, horizon of the surface denoted by that horizontal line. But, if we venture past that mountain top, we know that that is not end as we saw from the bottom, but there is a slope down and then mountain was curved surfaced which looked horizontal to our eyes.

The same analogy could be drawn to a ship on the surface of an ocean and if we see a ship on the horizon, it slowly goes down the other side, little by little and never jumps off the cliff. This can help us derive that earth is not just flat, but it is a curved surface.

And we see this phenomenon throughout the ocean surfaces, which makes us believe that earth's curvature is uniform, in other words, spherical.

This kind of reason was first explained by pythogoras in 500 BC. Later in 340 BC, Aristotle settled this matter once and for all by providing numerous examples. Like, when observing lunar eclipse, when earth cast it's shadow on moon and we can observe that earth's surface is curved.

Modern world does not need much proof, we have photos of earth captured from space, which asserts that pythagoros guided the human thinking in the correct direction.

Trip to Santa Cruz, Carmel and Big Sur

I planned for a family weekend trip to few coastal towns in California. My dad was visiting us and before he returned back to India, I thought it was a good idea to tour some nearby places. Plus, it was a welcome relief after all the work of moving to the new home.

As my wife would say, something that I should avoid doing, it was last minute plan. On wednesday, I went about planning for my leave at office and started booking hotels. Hotels near tourist spots often get booked during weekends and I had to pay my price of getting hold of something which was a bit away from tourist spots, but in a beautiful small town called Salinas.

Using wikipedia and google maps, I noted down the following places that will be worth visiting.

We could make it most of them and I had to cancel a few because were very far from the place we stayed.

I have visited Monetery twice earlier, I was enthusiastic to travel to the peninsula again and take Siddhartha to Aquarium second time in row.

Santa Cruz beach has free concerts on Friday evenings and we got a chance to watch the live performance of of Y and T

At Bigsur, we could be on time for a 02:00 pm tour of Point Sur light house. It's trek lasting for 3 hours up to the light house led by a guide, who has a good narrative on light houses in california, living in early 1900s in the light house and current preservation efforts.

Next day, we visited Mission San Carlos at Carmel, which was one of the earliest missions established in California by Junipero Sierra. These monuments provide a window to peek at events back in time. I enjoy watching through it.

On Monday, on our way back to home, we made a trip to Hanuman Temple at Mount Madonna, Watsonville. This is actually a Yoga retreat setup by folks and is guided by an Indian guru by name Baba Ram Doss. The location of this Hindu temple was excellent and at atmosphere was serene.

That was our last spot in the journey. All throughout, we tried different local foods, and sometim food chains as it was easy choice. We planned a little, executed mostly on our plans.

Here are some of our Tour Photos.

Greatest in History according to H.G.Wells

This is a newspaper article written by H.G. Wells published in 1935. The scanned version is not very readable, but the content is very interesting. I thought, I will republish it here for everyone's enjoyment.


Greatest In History
CHOICE OF MR. H. G. WELLS CHRIST, BUDDHA, AND ARISTOTLE
Mr. H. G. Wells names Jesus of Nazareth, Buddha, and Aristotle as having had greater effect on world history than any other known figures.

Who are the three greatest men in history?

SOME 13 years ago I was asked to name the six greatest men in the world, says,  Mr. Wells in the "Readers' Digest." I did so. Lately. I have been confronted with my former answer and asked if I still adhere to it. Not altogether. Three of my great names stand as they stood then-but three, I must add it, seem to have lost emphasis.

The fact is that there are not six greatest names to cite.There are only; three.

When I was asked which single individual has left the most permanent immpression on the world, the manner of the questioner almost carried the implication that it was Jesus of Nazareth. I agreed.

He is, I think, a quite cardinal figure in human history, and it will be long before Western men decide-if ever they do decide, to abandon his life as the turning point in their reckoning of time. I am speaking of him, of course, as a man. The historian must treat him as a man, just as the painter must paint him as a man. We do not know as much about him as we would like to know; but the four gospels, though sometimes contradictory, agree in giving us a picture of a very definite personality; they carry a conviction of reality. To assume that he never lived, that the accounts of his life are inventions, is more difficult and raises far more problems for the historian than to accept the essential elements of the gospel stories as fact.

Of course, the reader and I live in countries where to millions of persons believe Jesus is more than a man. But the historian must disregard that fact. He must adhere to the evidence that would pass unchallenged if his book were to be read in every nation under the sun. Now, it is interesting and significant that a historian, without any theological bias whatever, should find that he cannot portray the progress of humanity honestly without giving a foremost place to a peniless teacher from Nazareth.

The old Roman historians ignored Jesus entirely; they left no impress on the historical records of his time. Yet, more than 1900 years later, a historian like myself, who does not even call himself a Christian, finds the picture centering irresistibly around the life and character of this most significant.

It is one of the most revolutionary changes of outlook that has ever stirred and changed human thought. No age has even yet understood fully the tremendous challenge it carries to the established institutions and subjugations of mankind. But the world began to be a different world from the day that doctrine was preached, and every step toward wider understanding and tolerance and good will is a step in the direction of that universal brotherhood Christ proclaimed.

The historian's test of an Individual's greatness is: "What did he leave to grow? Did he start men to thinking along fresh lines with a vigour that persisted after him?" By this test Jesus stands first. As with Jesus, so with Buddha, whom I would put very near in importance to Christ. You see clearly a man, simple, devout, lonely, battling for light, a vivid human personality, not a myth.

He, too. gave a message to mankind universal in its character. Many of our best modern ideas are in closest harmony with it. All the miseries and discontents of life are due, he taught, to selfishness. Selfishness takes three forms-one, the desire to satisfy the senses; another, the craving for immortality; and the third is the desire for prosperity, worldliness.

Before a man can become serene he must cease to live for his senses or himself. Then he merges into a greater being. Buddha in different language called men to self-forgetfulness 500 years before Christ. In some ways he was nearer to us and our needs. He was more lucid upon our individual importance in service than Christ and less ambiguous upon the question of personal immortality.

Aristotle Next. I would write the name of Aristotle, who is as cardinal in the story of the human intelligence as Christ and Buddha in the story of the human will. Aristotle began a great new thing in the world--the classifying and analysing of information. He was the father of the scientific synthesis. There had been thinkers in the world before, but he taught men to think together. He was the tutor of Alexander the Great, whose support made it possible for him to organise study on a scale and in a manner never before attempted.

At one time he had a thousand men, scattered throughout Asia and Greece, collecting material for his natural history. Political as well as natural science began with him. His students made an analysis of 153 political constitutions. Aristotle's Insistence on facts and their rigid analysis, the determination to look the truth in the face, was a vast new step in human progress.

These are three great names. I could write down 20 or 30 names. For the next three places. Plato? Mohammed? Confucius? I turn over names like Robert Owen, tie real founder of modrern Socialism. I can even weigh my pet aversion, Karl Marx, for a place. He made the world think of economic realities, even If he made it think a little askew.

Then what of those great astronomers who broke the crystal globe in whic man's imagination had been confined and let it out into limitless space. Bacon's Predictions then in that original selection of mine. I find that my own particular weakness for Roger Bacon crept in. He voiced a passionate insistence upon the need for experiment and of collecting knowledge. He predicted more than six hundred years ago, the advent of ships and trains that could be mechanically propelled; he also prophesied flying machines. He, too, set men thinking along new, fresh lines, and left an influence that has lived for the benefit of all generations. But when I come to put him beside Christ, Buddha, and Aristotle-it won't do.

Do you want an American in the list? Lincoln, better than any other, seemed to me to embody the essential characteristics of America. He stood for equality of opportunity, for the right and the chance of the child of the humblest home to reach the higest place. His simplicity, his humour, his patience, his deep-abiding optimism, based on the conviction that right would prevail-all these seemed to typify the best that America had to give mankind. Put, against those three who are enduring symbols of brotherhood and individual divinity, of service in self forgetfulness, and of the intellectual synthesis of mankind, what was rugged Abraham Lincoln?  Do you really want an American in the list yet? America is still young.

I think I will leave it at three.

Herbert George Wells

Papanasam

Papanasam is the first movie wherein I noticed that Kamal Hassan as a character in the movie didn't make any fun of god. In fact, in one scene wherein he tussles with the inspector at the tea shop, he refers to people as capable of being judged only by "the almighty". That's more philosophical, but in a later scene he mocks a cinema operator as "Nee yeena Karuppu Sattai ya?" (Are you an atheist?)  taking the opposite stance he usually plays on the subject.

The movie is same as Drishyam and I didn't mind watching it again in Tamil. What I liked about the whole thing was even an accomplished actor like Kamal agreeing to act ditto and not shying away from a remake. This, in my opinion, is a credit both to the story as well as to the creative artist who wants to do the same project in a different language because the project itself is a brilliant one.

Review: Page-a-Minute Memory Book

Page-a-Minute Memory Book

Page-a-Minute Memory Book by Harry Lorayne

My rating: 5 of 5 stars



Our brain and memory works in an associative manner. Author of this book stresses upon this point and gives good tools to create a associations for everything that we will like to remember.H is consiciously driving the reader to make associations and explaining the fact that "applying your mind" to something essentially means creating association.

I applied these techniques and found my joy in remembering new things and programming my mind.



View all my reviews

Uttama Villian

It is well known that Kamal Hassan wants screenplay of movies to be of the standards of literature.  If Uttma Villian's screenplay is written as a book, then Indian readers will recognize and love the humor of Tennali Raman embedded within a story of a modern vacuous movie star, who as soon as he comes to face with his knowledge of death, strives for immortality through art, relationships and love.