
Unicode

Reference.
Unicode is the "new" standard for character encoding.
- New as in the first version of the standard was issued in 1991
- We are currently on version 16 (9/2024), with version 17 in the works
Unicode assigns every character a numeric code point. Its encoding forms, UTF-8, UTF-16 and UTF-32, store those code points in 8-, 16- and 32-bit units.
- All characters can be represented in 32 bits.
- It is based on ASCII.
- In fact, the ASCII table is the first "table" in Unicode
  - see This reference
  - In HTML, &#xnnnn; displays the Unicode character whose hex code point is nnnn
  - For example, &#x47; displays G, which is hex 47
- Look at the list of Character sets supported
  - The Greek letters are 0370₁₆ to 03ff₁₆
  - see this chart
  - These cover Α to Ω and α to ω
  - Note that, just as in ASCII, the upper case letters come first, then the lower case letters, each in alphabetical order.
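
The points above about ASCII, HTML character references and the Greek range can be checked with a short Python sketch. It uses only the standard library; the helper name `in_greek_block` is made up for this illustration.

```python
import html

# ASCII is the first 128 code points of Unicode, so G keeps its ASCII value.
g_code_point = ord("G")
print(f"G is U+{g_code_point:04X}")

# The HTML character reference &#x47; names the same code point.
print(html.unescape("&#x47;"))


def in_greek_block(ch: str) -> bool:
    """True if ch lies in the range 0370 to 03FF (hex) given in the notes."""
    return 0x0370 <= ord(ch) <= 0x03FF


# Capital alpha, capital omega, small alpha, small omega.
greek_samples = ["\u0391", "\u03a9", "\u03b1", "\u03c9"]
for ch in greek_samples:
    print(ch, f"U+{ord(ch):04X}", in_greek_block(ch))

# Upper case comes before lower case, for ASCII and for Greek alike.
print(ord("A") < ord("a"), ord("\u0391") < ord("\u03b1"))
```
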
- I can't type this up without some 𓅹 𓂀 Hieroglyphs!
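- The standard includes
  - about 155,000 assigned characters (as of version 16)
  - roughly 820,000 code points that are still unassigned
  - 6,400 private-use code points in the Basic Multilingual Plane, plus two further planes reserved for private use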
- Some characters are used for composition
  - For example, ü can be written as u followed by the combining diaeresis ̈ (hex 0308)
- There are three encoding forms: UTF-8, UTF-16 and UTF-32
  - We have been looking at UTF-16
  - All three forms need at most 4 bytes to represent any character.
- Unicode is more complex to use in a language like C++, but that is not our problem now.
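
As a rough illustration of the composition and encoding-form points above, the following Python sketch counts the bytes each encoding form needs for a few characters, then composes u with the combining diaeresis. The helper name `encoded_sizes` is hypothetical. The little-endian codecs are chosen only so that Python does not prepend a byte-order mark to the output.

```python
import unicodedata


def encoded_sizes(ch: str) -> dict:
    """Bytes needed for ch in each encoding form (no byte-order mark)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, a Greek letter, and an Egyptian hieroglyph.
for ch in ["G", "\u03b1", "\U00013080"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Composition: u followed by the combining diaeresis (hex 0308)
# normalizes to the single precomposed character ü (hex 00FC).
decomposed = "u\u0308"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), composed == "\u00fc")
```

No character needs more than 4 bytes in any of the three forms, but only UTF-32 uses the same width for every character.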