
Hash table – Separate Chaining, Open Hashing

ref – https://en.wikipedia.org/wiki/Associative_array
https://github.com/redmacdev1988/LinkedListHashTable

Open hashing – in this strategy, none of the objects are actually stored in the hash table’s array; instead, once an object is hashed, it is stored in a list that is separate from the hash table’s internal array. “Open” refers to the freedom we get by leaving the hash table and using a separate list. By the way, “separate list” hints at why open hashing is also known as “separate chaining”.

The most frequently used general purpose implementation of an associative array is a hash table: an array combined with a hash function that maps each key to a “bucket” of the array. The basic idea behind a hash table is that accessing an element of an array via its index is a simple, constant-time operation. Therefore, the average overhead of an operation for a hash table is only the computation of the key’s hash, combined with accessing the corresponding bucket within the array. As such, hash tables usually perform in O(1) time, and outperform alternatives in most situations.

Hash tables need to be able to handle collisions: when the hash function maps two different keys to the same bucket of the array. The two most widespread approaches to this problem are separate chaining and open addressing.

https://en.wikipedia.org/wiki/Hash_table#Separate_chaining

         Average | Worst case

Space    O(n)    | O(n)
Search   O(1)    | O(n)
Insert   O(1)    | O(n)
Delete   O(1)    | O(n)

The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found:

index = f(key, array_size)

hash = hashfunc(key) // where hash is some number
index = hash % array_size // the hash is reduced to an index using the modulo operator, so that it fits within the array

In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%) .
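As a minimal sketch in Python (here the built-in hash() stands in for whatever hashfunc you choose):

def bucket_index(key, array_size):
    # hash() is a stand-in for hashfunc(key); any function that
    # maps a key to an integer works here.
    h = hash(key)          # hash is some (possibly large) number
    return h % array_size  # reduced to an index in [0, array_size - 1]

print(bucket_index("apple", 16))  # always some index between 0 and 15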

Choosing a hash function

http://eternallyconfuzzled.com/tuts/datastructures/jsw_tut_hashtable.aspx

Table size and range finding

The hash functions introduced in The Art of Hashing were designed to return a value in the full unsigned range of an integer. For a 32-bit integer, this means that the hash functions will return a value in the range [0..4,294,967,296). Because it is extremely likely that your table will be smaller than this, it is possible that the hash value may exceed the boundaries of the array.

The solution to this problem is to force the range down so that it fits the table size.

For example, if the table size is 888, and we get 8,403,958, how do we fit this value within the table?

A table size should not be chosen randomly because most of the collision resolution methods require that certain conditions be met for the table size or they will not work correctly. Most of the time, this required size is either a power of two, or a prime number.

Why a power of two? A table size that is a power of two may be desirable on some implementations where bitwise operations offer performance benefits: forcing a value into the range of a power of two can be performed quickly with a masking operation.

For example, to force the range of any value into eight bits, you simply use the bitwise AND operation on a mask of 0xff (hexadecimal for 255):

0x8a AND 0xff = 0x8a

from hex to decimal:

Note that 8 bits = 1 byte, and each bit takes one of 2 values, so a byte has 2^8 = 256 possible values.

0x8a -> 8 (1000), a (1010) -> 1000 1010 in binary

binary to decimal: 1 * 128 + 0 * 64 + 0 * 32 + 0 * 16 + 1 * 8 + 0 * 4 + 1 * 2 + 0 * 1 = 128 + 8 + 2 = 138.

from decimal to hex:

Thus, if we receive the value 138 and our table has 2^8 = 256 slots, we force the value into range with an AND against 0xff.

In code, you get the parameter 138. You convert it to 1000 1010, which is 0x8a in hex. Applying the AND bit op gives 0x8a AND 0xff = 0x8a, so 138 already fits within 256.

So 138 fits in 256. But what if you get some larger number, like 888?
888 in hex is 0x378.

0x378 AND 0x0ff = 0x78

We prepend a 0 to 0xff because we’re dealing with 3 hex places. Applying the AND keeps only the low byte, 0x78, which is decimal 120. Thus, if you get hash value 888, it maps to index 120.

table[hash(key) & 0xff]
This is a fast operation, but it only works with powers of two. If the table size is not a power of two, the remainder of division can be used to force the value into a desired range with the remainder operator. Note that this is slightly different from masking: while the mask was the upper value that you will allow, the divisor must be one larger than the upper value to include it in the range. This operation is also slower in theory than masking (in practice, most compilers will optimize both into the same machine code):

table[hash(key) % 256]
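A small Python sketch comparing the two reductions (reusing the hash value 8,403,958 from the earlier example):

TABLE_SIZE = 256          # a power of two: 2**8
h = 8403958               # example hash value from earlier

# Masking keeps only the low 8 bits; the mask is TABLE_SIZE - 1 = 0xFF.
index_mask = h & (TABLE_SIZE - 1)

# The remainder operator works for any table size, power of two or not.
index_mod = h % TABLE_SIZE

print(index_mask, index_mod)  # 246 246 -- identical for a power-of-two size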
When it comes to hash tables, the most recommended table size is any prime number.

This recommendation is made because hashing in general is misunderstood, and poor hash functions require an extra mixing step of division by a prime to resemble a uniform distribution. (https://cs.stackexchange.com/questions/11029/why-is-it-best-to-use-a-prime-number-as-a-mod-in-a-hashing-function)

Another reason that a prime table size is recommended is because several of the collision resolution methods require it to work. In reality, this is a generalization and is actually false (a power of two with odd step sizes will typically work just as well for most collision resolution strategies), but not many people consider the alternatives and in the world of hash tables, prime rules.

Advantages in using Hash Table

The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized.

If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by:
1) a careful choice of the hash function,
2) bucket table size,
3) internal data structures.

In particular, one may be able to devise a hash function that is collision-free, or even perfect. In this case the keys need not be stored in the table.

Open Addressing

Another popular technique is open addressing:

  1. at each index of our list we store one and only one key-value pair
  2. when trying to store a pair at index x, if there’s already a key-value pair there, try to store our new pair at x + 1
    if x + 1 is taken, try x + 2 and so on…
  3. when retrieving an element, hash the key and see if the element at that position (x) matches our key. If not, try the element at position x + 1. Rinse and repeat until you find the key, or until you reach an empty index, which means our element is not in the hash table
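Here is a minimal linear-probing sketch of those rules in Python (fixed table size, no deletion or resizing; a real implementation would grow the table as it fills):

class OpenAddressingTable:
    def __init__(self, size=8):
        self.slots = [None] * size            # one (key, value) pair per slot

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)  # empty slot (or same key): store here
                return
            i = (i + 1) % len(self.slots)     # slot taken: try x + 1, x + 2, ...
        raise RuntimeError("table is full")

    def get(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is None:         # empty slot: the key is absent
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None

t = OpenAddressingTable()
t.put("fox", 1)
print(t.get("fox"))  # 1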

Separate chaining with linked lists

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—that is, it is roughly proportional to the load factor.

For this reason, chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list.

For separate-chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.

The bucket chains are often searched sequentially using the order the entries were added to the bucket. If the load factor is large and some keys are more likely to come up than others, then rearranging the chain with a move-to-front heuristic may be effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in a worst-case scenario. However, using a larger table and/or a better hash function may be even more effective in those cases.
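As a sketch, here is separate chaining in Python, with plain lists standing in for the per-bucket chains (a dedicated linked list, as in the repo linked at the top, plays the same role):

class ChainedHashTable:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]  # one chain per bucket

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                  # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))        # otherwise append to the chain

    def get(self, key):
        for k, v in self._chain(key):     # scan only the selected bucket
            if k == key:
                return v
        return None

table = ChainedHashTable()
table.put("apple", 3)
table.put("pear", 5)
print(table.get("apple"), table.get("pear"))  # 3 5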

Unicode

ref –

  • stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16
    www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
  • https://blog.hubspot.com/website/what-is-utf-8

Text is one of many assets that computers store and process. Text is made up of individual characters, each of which is represented in computers by a string of bits.

Problem is, which string of bits should match up to which letter? ASCII came up with a table to solve this.

So, the sentence “The quick brown fox jumps over the lazy dog.” represented in ASCII binary would be:

01010100 01101000 01100101 00100000 01110001
01110101 01101001 01100011 01101011 00100000
01100010 01110010 01101111 01110111 01101110
00100000 01100110 01101111 01111000 00100000
01101010 01110101 01101101 01110000 01110011
00100000 01101111 01110110 01100101 01110010
00100000 01110100 01101000 01100101 00100000
01101100 01100001 01111010 01111001 00100000
01100100 01101111 01100111 00101110

Because there are 8 bits in 1 byte, a byte can take 2^8 (256) distinct values. When ASCII was introduced in the 1960s, this was fine, since developers needed only 128 code points to represent all the English characters and symbols they needed.

Unicode is a character set

Unicode assigns every character in existence a unique number called a code point.

In other words, a letter maps to something called a code point

As Joel would explain it, every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639

This magic number is called a code point.

The U+ means “Unicode”
The numbers are hexadecimal.

The English letter A would be U+0041.

Why is it U+0041?

0041 is in hex.
We convert it to decimal like so: 0 * 16^3 + 0 * 16^2 + 4 * 16^1 + 1 * 16^0 = 0 + 0 + 64 + 1 = 65.
65 is represented as A.

What is Unicode?

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.

Unicode comprises 1,114,112 code points in the range hex 0 to hex 10FFFF. The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2^16) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.

10FFFF (hex)

Convert each hex digit to a number 0-9, A(10) – F(15): 1 0 15 15 15 15

Convert each digit to binary: 0001 0000 1111 1111 1111 1111

That binary value in decimal is 1,114,111.

Since we also have decimal 0, the total is 1,114,112 representations of characters.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A –> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:

A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.

That’s where encodings come in.

Step 1 – Code point to decimal

Let’s look at an example. In Unicode the character A is given code point U+0041. The U+ denotes that it is Unicode. The 0041 is in hex.

hex 0041 to decimal is converted like so:

0 * 16^3 + 0 * 16^2 + 4 * 16^1 + 1 * 16^0 =
0 + 0 + 64 + 1 = 65.

Thus, decimal 65 is represented as A.
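If you want to check this in Python, int() with base 16 performs exactly this positional computation, and chr() maps the resulting code point to its character:

print(int("0041", 16))  # 65  -- hex 0041 to decimal
print(chr(65))          # 'A' -- decimal 65 is the letter A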

Step 2 – Convert decimal to binary

Bi-nary means two.

Binary is represented by bits.
1 bit represents 0 or 1: 2^1 = 2 combinations.
2 bits represent 00, 01, 10, 11: 2^2 = 4 combinations.

Code points represent text in computers, telecommunications equipment, and other devices.
The code point maps the character “A” to the number 65.

How do we convert this decimal to binary?

Division Way

(the proper way is to divide by 2 repeatedly, and use each remainder as a bit)
http://www.electronics-tutorials.ws/binary/bin_2.html

65 / 2 = 32 R1 binary bits: [1]
32 / 2 = 16 R0 binary bits: [0 1]
16 / 2 = 8 R0 binary bits: [0 0 1]
8 / 2 = 4 R0 binary bits: [0 0 0 1]
4 / 2 = 2 R0 binary bits: [0 0 0 0 1]
2 / 2 = 1 R0 binary bits: [0 0 0 0 0 1]
1 / 2 = 0 R1 binary bits: [1 0 0 0 0 0 1]

So as you can see the bits are 1 0 0 0 0 0 1
In 8-bit format that is 0 1 0 0 0 0 0 1

That is the binary conversion of decimal 65.
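The same repeated-division procedure as a small Python function:

def to_binary(n):
    """Divide by 2 repeatedly; the remainders, read in reverse, are the bits."""
    bits = []
    while n > 0:
        bits.append(n % 2)   # the remainder is the next bit (least significant first)
        n //= 2
    return bits[::-1] or [0]

print(to_binary(65))      # [1, 0, 0, 0, 0, 0, 1]
print(format(65, "08b"))  # '01000001' -- the same value in 8-bit format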

Layout visual way
First, we must lay out the binary and show what decimal it represents

The 0th bit is represented by 2 ^ 0 = 1
The 1st bit is represented by 2 ^ 1 = 2
The 2nd bit is represented by 2 ^ 2 = 4
The 3rd bit is represented by 2 ^ 3 = 8
The 4th bit is represented by 2 ^ 4 = 16
The 5th bit is represented by 2 ^ 5 = 32
The 6th bit is represented by 2 ^ 6 = 64
…so on.

8 bits
0 0 0 0 0 0 0 0

We lay them out and try to see which of these decimals add up to 65. The trick is to take the largest bit value that is less than or equal to 65. In our case, that is the 6th bit (64). Thus, we mark the 6th bit with a 1.

0 1 0 0 0 0 0 0

65 – 64 = 1.

Then we find a bit whose value is less than or equal to the remaining 1.

That is the 0th bit, so we mark it with a 1.

0 1 0 0 0 0 0 1

64 + 1 = 65, which matches the ASCII code for A.
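This layout method is a greedy subtraction, and it also translates directly to Python (the width=8 parameter assumes the 8-bit layout above):

def to_binary_greedy(n, width=8):
    # Repeatedly subtract the largest power of two that still fits,
    # marking that bit with a 1.
    bits = [0] * width
    for i in range(width - 1, -1, -1):   # from the high bit down to 2^0
        if 2 ** i <= n:
            bits[width - 1 - i] = 1
            n -= 2 ** i
    return bits

print(to_binary_greedy(65))  # [0, 1, 0, 0, 0, 0, 0, 1]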


We see that the 6th bit represents 64. Then all we need is 1, which is at 2^0.
Hence, in the binary, we mark 1 for the 6th bit and the 0th bit in order to represent the decimal values we need:

0 * 2^7 + 1 * 2^6 + 0 * 2^5 + 0 * 2^4 + 0 * 2^3 + 0 * 2^2 + 0 * 2^1 + 1 * 2^0 = 64 + 1 = 65

Make sure you use only 0 or 1 because we are dealing with binary.


Finally, at the binary level, simply write the binary needed to represent the numbers:

01000001


Step 3 – Binary to Hex

First, a hex has 16 code points: 0-15

where 0 – 9 takes on the numbers 0 – 9

and 10 – 15 takes on A, B, C, D, E, F

Thus a hex digit has 16 possible values: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F

This is represented by 4 binary bits because 2^4 = 16

Hence we need to break our binary into groups of 4 bits, each of which converts to one hex digit

Thus 01000001 gets divided into 4 bits so that each 4 bits can be converted to a hex: 0100 0001

Let’s convert it to hex:

0100 = 0 * (2^3) + 1 * (2^2) + 0 * (2^1) + 0 * (2^0) = 4 hex
0001 = 0 * (2^3) + 0 * (2^2) + 0 * (2^1) + 1 * (2^0) = 1 hex

(if you get 1111, that is 15 –> F)

Thus, ASCII character ‘A’ is code point 65, binary 0100 0001, or hex 0x41.
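The 4-bit grouping can be checked mechanically in Python:

b = "01000001"                           # binary for 65
groups = [b[i:i + 4] for i in range(0, len(b), 4)]
print(groups)                            # ['0100', '0001']
print([hex(int(g, 2)) for g in groups])  # ['0x4', '0x1']
print(hex(int(b, 2)))                    # '0x41' -- the whole value at once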

And if we check the Unicode table, we see that indeed A is mapped to decimal 65, with a hex of 0x41 for UTF-8, and a hex of 0x0041 for UTF-16.

https://unicode-table.com/en/


Thus, you have now successfully converted a Unicode code point to its binary representation.

UTF 8

UTF-8 is an encoding system for Unicode.

It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”

There are other encoding systems for Unicode besides UTF-8, but UTF-8 is unique because it represents characters in one-byte units. Remember that one byte consists of eight bits, hence the “-8” in its name.

A character in UTF-8 can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard, and it is backwards compatible with ASCII. UTF-8 is the preferred encoding for e-mail and web pages.

The first 128 characters of Unicode (which correspond one-to-one with ASCII) are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.

Unicode is a character set. UTF-8 is an encoding.

UTF-8 is defined to encode code points (any of the numerical values that make up the code space that contains a symbol) in one to four bytes.

UTF-8 uses one byte to represent code points from 0-127. These first 128 Unicode code points correspond one-to-one with ASCII character mappings, so ASCII characters are also valid UTF-8 characters.

The first UTF-8 byte signals how many bytes will follow it. Then the code point bits are “distributed” over the following bytes.

For example:
character: é
Unicode: U+00E9

Calculate Decimal:
0 * 16^3 + 0 * 16^2 + E * 16^1 + 9 * 16^0 =
0 + 0 + 224 + 9 = 233
So the decimal for é is 233

Decimal to Binary:

233/2 = 116 R1 [1]
116/2 = 58 R0 [0 1]
58/2 = 29 R0 [0 0 1]
29/2 = 14 R1 [1 0 0 1]
14/2 = 7 R0 [0 1 0 0 1]
7/2 = 3 R1 [1 0 1 0 0 1]
3/2 = 1 R1 [1 1 0 1 0 0 1]
1/2 = 0 R1 [1 1 1 0 1 0 0 1]

So the binary for character é is [1 1 1 0 1 0 0 1]

233 is not part of the ASCII character set (0–127). The raw value fits in one byte, but a single UTF-8 byte only has room for 7 code point bits; once the header bits are included, two bytes are needed. Hence UTF-8 represents this eight-bit number using two bytes.

The first byte begins with 110: 110XXXXX
The two 1s indicate that this is a two-byte sequence.
The 0 indicates that the code point bits will follow.

The second byte begins with 10 to signal that it is a continuation in a UTF-8 sequence: 10XXXXXX

Hence we have 110XXXXX 10XXXXXX in order to represent [1 1 1 0 1 0 0 1]

This leaves 11 slots for the bits. Hence we replace the X with our bits, starting from the right.

110XXX[11] 10[101001]

We have 3 X’s left. UTF-8 pads the leading bits with 0s to fill out the remaining spaces.

UTF-8 representation is: 11000011 10101001
We then split the result into 4-bit groups and convert each group’s binary value into hex.

1100 = 1 * 2^3 + 1 * 2^2 + 0 * 2^1 + 0 * 2^0 = 12 0xC
0011 = 0 * 2^3 + 0 * 2^2 + 1 * 2^1 + 1 * 2^0 = 3 0x3
1010 = 1 * 2^3 + 0 * 2^2 + 1 * 2^1 + 0 * 2^0 = 10 0xA
1001 = 1 * 2^3 + 0 * 2^2 + 0 * 2^1 + 1 * 2^0 = 9 0x9

Hence, the UTF-8 code units are C3 A9. This is the UTF-8 encoding of U+00E9, the character é.
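Python’s built-in encoder confirms the hand calculation:

s = "\u00e9"                  # é, code point U+00E9, decimal 233
encoded = s.encode("utf-8")
print(encoded.hex())          # 'c3a9'
print([format(byte, "08b") for byte in encoded])
# ['11000011', '10101001'] -- the 110xxxxx / 10xxxxxx pattern above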

Let’s convert Unicode character 八 into different representations

The code point is given as: U+516B

code point to decimal
5 * 16^3 + 1 * 16^2 + 6 * 16^1 + B * 16^0 =
5*4096 + 256 + 96 + 11 = 20843

The decimal representation is 20843

decimal to binary

20843/2 = 10421 R1 [1]
10421/2 = 5210 R1 [1 1]
5210/2 = 2605 R0 [0 1 1]
2605/2 = 1302 R1 [1 0 1 1]

… (the divisions continue) …

5/2 = 2 R1 [1 0 0 0 1 0 1 1 0 1 0 1 1]
2/2 = 1 R0 [0 1 0 0 0 1 0 1 1 0 1 0 1 1]
1/2 = 0 R1 [1 0 1 0 0 0 1 0 1 1 0 1 0 1 1]

Hence the binary representation is 1 0 1 0 0 0 1 0 1 1 0 1 0 1 1.
These are the code point bits that UTF-8 must distribute.

We first note that it is 15 bits, so the raw value needs a full 2 bytes on its own. Once we include the UTF-8 header bits, we need a total of 3 bytes to represent it.

The first byte begins with 1110.
The three 1s indicate that it is a three-byte sequence.
The 0 indicates that the code point bits will follow.

The second byte begins with 10 to signal that it is a continuation in a UTF-8 sequence.

The third byte begins with 10 to signal that it is a continuation in a UTF-8 sequence.

Therefore, the 3 byte utf-8 looks like this:
1110XXXX 10XXXXXX 10XXXXXX

We then fill in our binary representation starting from the right
1 0 1 0 0 0 1 0 1 1 0 1 0 1 1
[1110]X101 [10]000101 [10]101011

Padding bits: We now have one spot left over, so we’ll just fill it with a 0.
11100101 10000101 10101011

Split this into 4 bit groups:
1110 0101 1000 0101 1010 1011

each group represents hexadecimal.

Calculate the decimal (and hex) value of each group:
1110: 14 0xE
0101: 5 0x5
1000: 8 0x8
0101: 5 0x5
1010: 10 0xA
1011: 11 0xB

UTF-8 representation is: E5 85 AB
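The same bit distribution can be written by hand and checked against Python’s encoder (note this helper only covers code points that need exactly three bytes, U+0800 through U+FFFF):

def utf8_three_bytes(code_point):
    # Valid for U+0800..U+FFFF only: 1110xxxx 10xxxxxx 10xxxxxx.
    b1 = 0b11100000 | (code_point >> 12)           # top 4 code point bits
    b2 = 0b10000000 | ((code_point >> 6) & 0x3F)   # middle 6 bits
    b3 = 0b10000000 | (code_point & 0x3F)          # low 6 bits
    return bytes([b1, b2, b3])

print(utf8_three_bytes(0x516B).hex())  # 'e585ab'
print("\u516b".encode("utf-8").hex())  # 'e585ab' -- same result for 八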

Binary Data and Base64

Binary Data – any data represented in binary form. It is a sequence of bytes. Each byte is 8 bits.

The term “binary file” is often used as a term meaning “non-text file”.

Binary files typically contain bytes that are intended to be interpreted as something other than text characters.

* Base64 is a group of similar binary-to-text encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation.

* Base64 is a way to encode binary data into a character set known to pretty much every computer system, in order to transmit the data without loss or modification of the contents itself.

When you have some binary data that you want to ship across a network, you generally don’t do it by just streaming the bits and bytes over the wire in a raw format.

Why?

…because some media are made for streaming text. You never know — some protocols may interpret your binary data as control characters (like a modem), or your binary data could be screwed up because the underlying protocol might think that you’ve entered a special character combination (like how FTP translates line endings).

So to get around this, people encode the binary data into characters. Base64 is one of these types of encodings. Why 64? Because you can generally rely on the same 64 characters being present in many character sets, and you can be reasonably confident that your data’s going to end up on the other side of the wire uncorrupted.

* Some transportation protocols only allow alphanumerical characters to be transmitted. Just imagine a situation where control characters are used to trigger special actions and/or that only supports a limited bit width per character. Base64 transforms any input into an encoding that only uses alphanumeric characters, +, / and the = as a padding character.

Text content

M a n

ASCII

‘M’ 77 (0x4d) ‘a’ 97 (0x61) ‘n’ 110 (0x6e)

Bit pattern

‘M’, 77 = 64 + 0 + 0 + 8 + 4 + 0 + 1
01001101

‘a’, 97 = 64 + 32 + 0 + 0 + 0 + 0 + 1
01100001

‘n’ 110 = 64 + 32 + 0 + 8 + 4 + 2 + 0
01101110

Base 64 means that groups of 6 bits (6 bits have a maximum of 2^6 = 64 different binary values) are converted into individual numbers from left to right (in this case, there are four numbers in a 24-bit string), which are then converted into their corresponding Base64 character values:

01001101 01100001 01101110

becomes

010011 010110 000101 101110
where
010011 is 19
010110 is 22
000101 is 5
101110 is 46

The Base64 table (as opposed to the ASCII table)
converts 0-25 as ‘A’ – ‘Z’,
26-51 as ‘a’ – ‘z’,
52-61 as 0 – 9,
and 62 and 63 as ‘+’ and ‘/’

Hence
Base64-encoded

010011 (19) is T
010110 (22) is W
000101 (5) is F
101110 (46) is u
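So “Man” encodes to “TWFu”. The whole pipeline can be verified in Python, both with the standard library and by hand:

import base64

print(base64.b64encode(b"Man"))  # b'TWFu'

# By hand: 24 bits -> four 6-bit groups -> Base64 table lookup
bits = "".join(format(b, "08b") for b in b"Man")   # '010011010110000101101110'
groups = [int(bits[i:i + 6], 2) for i in range(0, 24, 6)]
print(groups)                                      # [19, 22, 5, 46]

alphabet = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")
print("".join(alphabet[g] for g in groups))        # 'TWFu'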