In problem 15.4-5, there appears to be a direct reduction to the LCS
problem without additional dynamic programming. Either approach is
fine.
In problem 15-3 part a, define a multi-dimensional table. Say what
the indices mean and say what the meaning is of each value stored in
the table. Show how to compute the edges of the table trivially and
show how to compute the interior cells as functions of other
cells. Next, *prove* that what you have computed is really what you
claim should be in the cells. Finally, show that the overall answer
can be read from the table at the end. (It may be convenient to
treat the anomalous "kill" operation separately.)
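To illustrate the shape of such a table, here is a sketch for the plain edit-distance case, using only copy, replace, insert, and delete with illustrative unit costs (twiddle and kill are omitted; kill in particular is easiest to handle in a separate pass at the end, as suggested above). This is not a solution to the full problem, just the skeleton: edges filled trivially, interior cells computed from neighbors, answer read from the final cell.

```python
def edit_distance(x, y):
    m, n = len(x), len(y)
    # d[i][j] = cheapest way to transform the prefix x[:i] into y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # edge: delete all of x[:i]
        d[i][0] = i
    for j in range(1, n + 1):          # edge: insert all of y[:j]
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # copy is free only when the characters match; replace costs 1
            copy_or_replace = d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
            delete = d[i - 1][j] + 1
            insert = d[i][j - 1] + 1
            d[i][j] = min(copy_or_replace, delete, insert)
    return d[m][n]                     # overall answer sits in the last cell

print(edit_distance("kitten", "sitting"))   # 3
```

The proof obligation in the problem is exactly the claim the comment makes: that d[i][j] really is the cheapest transformation of x[:i] into y[:j].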
In part b, assign a cost to each of the operations copy, twiddle,
etc., so that the edit distance under these costs is the same as the
alignment score. You can disallow certain operations (e.g.,
kill?) or, equivalently, assign them a cost of infinity.
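For concreteness, here is one such assignment, sketched under the assumption that the alignment scoring is the usual +1 for a match, -1 for a mismatch, and -2 for a gap. With these costs, every alignment's cost is the negative of its score, so minimizing edit cost is the same as maximizing alignment score.

```python
# Illustrative cost table (the +1/-1/-2 scoring is an assumption; adjust
# to match the problem's actual scoring if it differs).
INF = float("inf")
cost = {
    "copy":    -1,   # aligning two equal characters scores +1
    "replace": +1,   # a mismatch scores -1
    "insert":  +2,   # each gap scores -2
    "delete":  +2,
    "twiddle": INF,  # disallowed
    "kill":    INF,  # disallowed
}
```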
In problem 16.3-8 (a slight change from class: one bit to two bits):
We are trying to compress uniformly random files of exactly n
characters, where n is known and there are 256 = 2^8 different
characters. We will compress the file x by the bit string f(x). We
require that, if x and y are different, then f(x) and f(y) are
different. It is not necessary that the range of f be prefix.
Show that, for any such f, E[|f(x)|] > 8n - 2.
Note that, in the straightforward encoding, for all x, we have
|f(x)| = 8n.
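The straightforward encoding referred to above just spells out each character as its 8-bit value, so every n-character file maps to exactly 8n bits:

```python
def straightforward(file_bytes):
    # each of the 256 characters becomes a fixed 8-bit codeword
    return "".join(format(b, "08b") for b in file_bytes)

code = straightforward(b"hi")
print(len(code))   # 16 = 8 * 2
```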
COMMENTARY for 16.3-8 FOLLOWS (unnecessary for doing the problem):
The moral is that one cannot compress random data. The 2 bits of
savings come from a sort of cheating. We want to be able to
compress several files, one after another without endmarkers
separating the codewords, and be able to recover all the files.
Call such a code "self-delimiting." It costs something, either
additional bits or a new character, to put an endmarker after each
codeword and we should count this cost.
Note that some non-prefix codes are self-delimiting. For example,
take the prefix code:
{0, 10, 11}
and reverse all the codewords, getting
{0, 01, 11},
which is not prefix. Still, we can parse it from right to left as
we would a prefix code. For example, the string 0010110 must parse
as:
0010110
001011 0
0010 11 0
001 0 11 0
0 01 0 11 0,
even though
0 01 01 ...
is a legitimate start.
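The right-to-left parse above can be sketched in code. Since the reversed codewords form a prefix code, at each step exactly one codeword of {0, 01, 11} can be peeled off the right end:

```python
def parse_right_to_left(s):
    """Parse a concatenation of codewords from {0, 01, 11}, right to left."""
    words = []
    while s:
        # at most one codeword matches as a suffix, because the reversed
        # code {0, 10, 11} is prefix
        for w in ("01", "11", "0"):
            if s.endswith(w):
                words.append(w)
                s = s[: -len(w)]
                break
        else:
            raise ValueError("not a concatenation of codewords")
    words.reverse()
    return words

print(parse_right_to_left("0010110"))   # ['0', '01', '0', '11', '0']
```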
So there are several possible ways to interpret the question, even
still assuming that the length n of the file is known:
1. Show that any prefix encoding needs 8n bits. (This is now
fairly easy, using our knowledge about Huffman's algorithm for
finding a near-optimal prefix code. It turns out that the
straightforward encoding is optimal.)
2. Show that any self-delimiting code needs 8n bits. This is true,
but hard to show with our tools.
3. Show that any code, self-delimiting or not, needs more than
8n - 2 bits in expectation. (Unfortunately, 8n - 1 is not
true.) This is a reasonable amount of work for a problem and
still shows something moderately interesting, though the real
problem is for self-delimiting codes.
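A toy computation (not a proof, and not part of the assignment) shows why the constant in interpretation 3 is 2: with M = 2^b equally likely inputs, the best any injective code can do is use the M shortest binary strings, and the resulting expected length lands strictly between b - 2 and b - 1.

```python
def min_expected_length(num_inputs):
    """Average length of the num_inputs shortest binary strings
    (one of length 0, two of length 1, four of length 2, ...)."""
    total_bits, length, remaining = 0, 0, num_inputs
    while remaining > 0:
        take = min(remaining, 2 ** length)
        total_bits += take * length
        remaining -= take
        length += 1
    return total_bits / num_inputs

b = 8                                  # one random character, i.e. n = 1
avg = min_expected_length(2 ** b)
print(avg)                             # 6.0390625: above b - 2 = 6, below b - 1 = 7
```

This is also why 8n - 1 fails: the optimum genuinely dips below it.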
It is also true (and easy to show) that any code, self-delimiting or
not, needs 8n bits in the WORST case. This may be what the authors
had in mind (a loose use of the terms "random" and "expect").