AB5 Quickselect and Huffman Coding

Implement the following functions using only the already imported functions. Additionally, the following language constructs are not permitted:

Part 1: Quickselect

The Quickselect algorithm computes the \(k\)-th smallest element of a list in place, i.e. it requires no additional space apart from a few indices into the list. The algorithm consists of two parts: partition and selection.

In the partition part a list is rearranged around a pivot element. More specifically, after executing this part all elements in the list which are smaller than the pivot element are on its left side and all others on its right side. Example (the pivot element is the last element, 9; the comparison performed in each step is shown to the right of the list):

3 1 12 7 18 22 14 0 9     3 < 9  → continue
3 1 12 7 18 22 14 0 9     1 < 9  → continue
3 1 12 7 18 22 14 0 9    12 ≥ 9  → swap
3 1 9 7 18 22 14 0 12     7 < 9  → swap
3 1 7 9 18 22 14 0 12    18 ≥ 9  → continue
3 1 7 9 18 22 14 0 12    22 ≥ 9  → continue
3 1 7 9 18 22 14 0 12    14 ≥ 9  → continue
3 1 7 9 18 22 14 0 12     0 < 9  → swap 2x
3 1 7 0 9 22 14 18 12
In the second-to-last row the element 0 is smaller than the pivot element 9, so it must end up on the pivot's left side. If one simply swaps the two elements, then the elements 18, 22 and 14 end up on the left side of 9 as well. However, we already know that these are greater than or equal to 9; therefore we additionally swap 18 and 9. The partition part finishes after \(n-1\) elements have been compared with the pivot element, where \(n\) is the length of the list (\(n=9\) in the above example).
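Under the assumption that the pivot is the last element of the (sub)list, the partition step can be sketched functionally in Haskell (a sketch with the hypothetical name partitionLast, not the in-place .sp program required in this exercise):

```haskell
-- Functional sketch of the partition step, assuming the pivot is the
-- last element of the (sub)list. Returns the partitioned list together
-- with the resulting index of the pivot element. The relative order of
-- the elements to the right of the pivot may differ from the swap
-- sequence shown above; the pivot index is the same.
partitionLast :: [Int] -> ([Int], Int)
partitionLast xs = (smaller ++ [p] ++ larger, length smaller)
  where
    p       = last xs
    smaller = filter (< p) (init xs)   -- elements left of the pivot
    larger  = filter (>= p) (init xs)  -- elements right of the pivot
```

For the example list, partitionLast [3,1,12,7,18,22,14,0,9] evaluates to ([3,1,7,0,9,12,18,22,14], 4): the pivot index 4 matches the trace above, while the elements to the right of the pivot appear in a different (but irrelevant) order.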

The selection part works as follows. First, a pivot element is selected according to a given rule; you may assume that the last element of the remaining list is always selected as the pivot element. Then the partition part is executed. Let \(p\) be the index of the pivot element after partition. We call the \(k\)-th smallest element the desired element.

The remaining list in which the desired element resides is encoded by its start and end index. If \(p = k-1\), the pivot element is the desired element; in the two cases \(p < k-1\) and \(p > k-1\) the selection is repeated for the corresponding part of the remaining list. In the above example the pivot element 9 has index 4 after partition and is therefore the 5-th smallest element of the list.
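The selection loop can likewise be sketched in Haskell (hypothetical name selectK; again a functional sketch rather than the required .sp program):

```haskell
-- Functional sketch of the selection part. The last element of the
-- remaining list is always chosen as pivot, as stated above.
-- Assumes 1 <= k <= length xs and a duplicate-free list.
selectK :: Int -> [Int] -> Int
selectK k xs
  | k - 1 == i = p                           -- pivot is the desired element
  | k - 1 <  i = selectK k smaller           -- desired element left of pivot
  | otherwise  = selectK (k - i - 1) larger  -- desired element right of pivot
  where
    p       = last xs
    smaller = filter (< p) (init xs)
    larger  = filter (>= p) (init xs)
    i       = length smaller                 -- index of pivot after partition
```

Consistent with the example, selectK 5 [3,1,12,7,18,22,14,0,9] evaluates to 9.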

partition_list  :: [Int] -> Int -> Int -> Int -> [Int]
partition_pivot :: [Int] -> Int -> Int -> Int -> Int
select :: Int -> [Int] -> Int
The function partition_list returns the list arg1 after the sublist between indices arg2 and arg3, with the pivot element at index arg4, has been partitioned.
The function partition_pivot returns the index of the pivot element in the list (partition_list arg1 arg2 arg3 arg4).
The function select returns the arg1-th smallest element in the list arg2. You may assume that 1 <= arg1 <= (length arg2) and that arg2 is non-empty and does not contain duplicates.
Implement the functions by completing the programs in the files AB5_1/partition.sp and AB5_1/select.sp such that they implement the algorithm described above. Do not add any additional variables to the programs. Also, all variables besides list and res may only assume values between \(0\) and \(n\) where \(n\) is the length of list. The variable res is not allowed to appear in any predicate or in any operation on the right-hand side.

Part 2: Huffman Coding

Huffman coding is a method to assign code words of variable length to a finite set of source symbols \(\Sigma\). Code words are words over a given code alphabet \(\Gamma\). Source symbols which occur more frequently should be assigned shorter code words. The relative frequency \( p \colon \Sigma \rightarrow [0,1] \) of every source symbol is assumed to be given; furthermore, it is assumed that every source symbol occurs at least once, i.e. \(p(x) > 0\) holds for all \(x \in \Sigma\).

The Huffman encoding \(h \colon \Sigma \rightarrow \Gamma^* \) is constructed as follows. A set of trees is iteratively merged until only a single tree remains whose leaves correspond to the source symbols and the path to a leaf describes its code word. The nodes are labeled with subsets \(X\) of \(\Sigma\), and the cumulative relative frequency of such a set is \(p(X) = \sum_{x \in X} p(x)\).

In the beginning, a tree is created for each source symbol \(x\) which consists of a single node labeled with \(\{x\}\). Then the following step is repeated until only a single tree remains.

Choose the \(|\Gamma|\) trees with the lowest relative frequencies. The relative frequency of a tree \(T\) is \(p(X)\) where \(X\) is the label of the root node of \(T\). If this selection criterion is ambiguous, prefer trees with smaller depth; if it is still ambiguous, prefer trees whose root node has a lexicographically smaller label. Merge these trees into a single one.

To ensure the unambiguity of this construction, attach the selected trees to the new tree in lexicographic order w.r.t. the labels of their root nodes.
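The iterative merging can be sketched in Haskell as follows. This is a sketch under the assumptions that the number of source symbols allows full \(|\Gamma|\)-ary merges (as in the examples below) and that the arity is passed as an Int rather than as the code alphabet; buildTree, weight and depth are hypothetical names:

```haskell
import Data.List (sort, sortBy)
import Data.Ord (comparing)
import Data.Ratio (Ratio, (%))
import Data.Tree (Tree (..))
import qualified Data.Map as M

-- Cumulative relative frequency p(X) of a tree's root label X,
-- computed exactly with rationals from (numerator, denominator) pairs.
weight :: M.Map Char (Int, Int) -> Tree String -> Ratio Int
weight probs t = sum [ n % d | c <- rootLabel t, let (n, d) = probs M.! c ]

depth :: Tree a -> Int
depth (Node _ []) = 0
depth (Node _ ts) = 1 + maximum (map depth ts)

-- Repeatedly merge the `arity` trees with the lowest relative frequency,
-- breaking ties first by depth, then by lexicographic order of the root
-- labels; the merged children are attached in lexicographic order.
buildTree :: Int -> M.Map Char (Int, Int) -> Tree String
buildTree arity probs = go [ Node [c] [] | c <- M.keys probs ]
  where
    go []  = error "no source symbols"
    go [t] = t
    go ts  =
      let ordered        = sortBy (comparing (weight probs)
                                   <> comparing depth
                                   <> comparing rootLabel) ts
          (chosen, rest) = splitAt arity ordered
          merged         = Node (sort (concatMap rootLabel chosen))
                                (sortBy (comparing rootLabel) chosen)
      in  go (merged : rest)
```

With arity 2 and the distribution from ht1 below, buildTree reproduces the tree shown in the examples.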

Example: given \(\Sigma = \{a,b,c,d,e\}\), \(\Gamma = \{0,1\}\) and \( p = \left(\begin{smallmatrix}a&b&c&d&e\\\frac{4}{20}&\frac{3}{20}&\frac{8}{20}&\frac{1}{20}&\frac{4}{20}\end{smallmatrix}\right)\), we get the tree shown as ht1 in the examples below.
type CodeAlphabet   = [Char]
type Probabilities  = Map Char (Int,Int)
type HuffmanTree    = (Tree [Char])

huffmanTree   :: CodeAlphabet -> Probabilities -> HuffmanTree
huffmanEncode :: HuffmanTree -> CodeAlphabet -> String -> String
huffmanDecode :: HuffmanTree -> CodeAlphabet -> String -> String
The function huffmanTree returns the Huffman tree for a code alphabet arg1 and a probability distribution arg2. The set of source symbols is implicitly given as (map fst (toList arg2)).
Examples
ht1 = huffmanTree "01" (fromList [('a',(4,20)),('b',(3,20)),('c',(8,20)),('d',(1,20)),('e',(4,20))])
"abcde"
  "ae"
    "a"
    "e"
  "bcd"
    "bd"
      "b"
      "d"
    "c"
    
ht2 = huffmanTree "01" (fromList [('a',(4,20)),('b',(8,40)),('c',(4,20)),('d',(8,40)),('e',(4,20))])    
"abcde"
  "abe"
    "ab"
      "a"
      "b"
    "e"
  "cd"
    "c"
    "d"

ht3 = huffmanTree "012" (fromList [('a',(6,24)),('b',(2,24)),('c',(6,24)),('d',(2,24)),('e',(5,24)),('f',(2,24)),('g',(1,24))])
"abcdefg"
  "a"
  "bdefg"
    "bdg"
      "b"
      "d"
      "g"
    "e"
    "f"
  "c"
The function huffmanEncode converts arg3 into a code word (a string over the code alphabet). The \(i\)-th subtree of a node of arg1 should correspond to the \(i\)-th character in arg2.

The function huffmanDecode decodes arg3 back into a string over the source alphabet, i.e. (huffmanDecode x y (huffmanEncode x y z)) == z. Implement this function by completing the program in the file AB5_2/decode.sp. The variable input should be treated as an input stream and the variable output as an output stream. This means input may only be accessed via the operation o_next (reads the next character from the input stream) and the predicate p_endOfStream (true iff the input stream has reached its end, i.e. there are no more characters to be read). The variable output may only be accessed via the operation o_flush (the buffer is appended to output).
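As a cross-check of the encode/decode specification, here is a functional sketch in Haskell (hypothetical names hEncode, hDecode and encodeChar; the required decode.sp program additionally has to respect the stream discipline described above). The \(i\)-th subtree corresponds to the \(i\)-th character of the code alphabet, as specified:

```haskell
import Data.Tree (Tree (..))

-- Encode one source symbol: descend to the leaf whose label contains it,
-- emitting the code-alphabet character of each chosen subtree.
encodeChar :: Tree String -> String -> Char -> String
encodeChar (Node _ []) _     _ = ""
encodeChar (Node _ ts) gamma c =
  head [ g : encodeChar t gamma c
       | (g, t) <- zip gamma ts, c `elem` rootLabel t ]

hEncode :: Tree String -> String -> String -> String
hEncode tree gamma = concatMap (encodeChar tree gamma)

-- Decode by walking the tree: each code character selects a subtree;
-- on reaching a leaf, emit its symbol and restart at the root.
-- Assumes at least two source symbols (the root is not itself a leaf).
hDecode :: Tree String -> String -> String -> String
hDecode root gamma = go root
  where
    go (Node [c] []) cs       = c : go root cs
    go _             []       = []
    go (Node _ ts)   (g : cs) =
      go (head [ t | (g', t) <- zip gamma ts, g' == g ]) cs
```

With the tree ht1 from the examples, hEncode ht1 "01" "ab" yields "00100" and hDecode ht1 "01" "00100" yields "ab" again.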

The following questions are open-ended.
Follow-up question 1. Assume a German text \(T\) of length \(n\) consisting of the characters a through z, äöüß, whitespace and period (32 different characters) is given. The text \(T\) can be encoded using \(5n\) bits (5 bits per character, since \(2^5 = 32\)). How would you choose the set of source symbols \(\Sigma\) for \(T\) in order for the Huffman-encoded version \(T_H\) of \(T\) to be as short as possible (\(\Gamma = \{0,1\}\))? What would you expect the quotient \(|T_H|/5n\) to look like?
Follow-up question 2. How could Huffman coding be used to compress binary data (pictures, music, videos, applications, etc.)?