Content Encoding

Junghoo Cho

Today’s Topic

  • MIME
  • Text encoding standards
    • ASCII
    • UNICODE

Multimedia over Internet

  • Q: Only “bits” are transmitted over the Internet. How does a browser/application interpret the bits and display them correctly?
  • Content-Type: header
    • e.g., Content-Type: text/html

MIME (Multi-purpose Internet Mail Extensions)

  • Standard way to indicate the type of the transmitted content
    • text vs image vs video vs …
  • Originally developed for email attachments, but currently used for all Internet data transmission
    • RFC2046
    • IANA (Internet Assigned Number Authority) manages the official registry of all media types

MIME Type Specification

  • Format: “type/subtype”
    • e.g., text/html
  • Popular MIME types (case insensitive)
    • Text: text/plain, text/html, text/css, …
    • Image: image/jpeg, image/png, image/gif, …
    • Audio: audio/mpeg (.mp3), audio/mp4 (.mpa), …
    • Video: video/mp4, video/h264, …
    • Application: application/pdf, application/octet-stream, …
    • Multipart: more on this in a later lecture

Browser Support

  • MIME type is specified in “Content-Type” HTTP header
    • E.g., Content-Type: text/html
  • Q: What multimedia types/format should a browser support?
  • No particular support is required
    • HTML5 is content-type/codec agnostic
    • But users expect supports for “popular” codecs, such as JPG, PNG, etc.

Legal Issues

  • 1999 UNISYS patent claim on GIF
  • 2010 uncertainty on H.264 Web streaming
  • Ongoing uncertainty on H.265
    • MPEG/LA vs HEVC Advance
    • Google’s push for AV1

Text Encoding

  • Q: For text, how does a browser map a sequence bits to characters?
  • Character encoding/Character set
    • Numbers ↔ characters
    • Many character encoding standards exist

Early Standards

  • ASCII (1963)
    • 7bits. 128 characters
    • extended to many 8-bit standards (e.g., ISO-8859-1)
    • basis of current standards for roman characters
  • EBCDIC (1963)
    • create by IBM for IBM mainframes
    • 8bits. designed to be easy to represent in punch cards
    • still used by some IBM mainframes

Local/Regional Encoding

  • Local character codes developed by each country
  • DBCS (Double Byte Code Character Set)
    • one or two bytes are used to represent a character
    • frequently used in Asia
    • Example: GB2312 (Simplified Chinese), EUC-KR (Korean), …

Code Page

  • Q: How does a computer know what encoding standard is used for a file in the system?
  • Early solution: system-wide specification
    • OS sets the global code page for all files in the computer
  • Code page (= character encoding)
    • a unique number given to a particular character encoding by a system
    • On Windows: Hebrew (862), Greek (727)
  • Q: Any problem with a system-wide code-page setting?

UNICODE

  • Motivation:
    • One standard for all existing characters in the world
    • Assign a unique number for every character in the world!
  • V1.0 was published in October 1991
    • managed by Unicode Consortium
    • (almost) yearly release of a new Unicode version

Code Point

  • Every character maps to a CODE POINT
    • A → U+0041
    • Hello → U+0048 U+0065 U+006C U+006C U+006F.
  • Originally defined to be a 16bit standard
    • No longer true. Currently 21bits (0x000000 – 0x10FFFFFF)
  • A CODE POINT is encoded into a sequence of bytes through an encoding scheme

UCS-2 (2-byte Universal Character Set)

  • First Unicode encoding scheme
  • Represent the (original) unicode characters with two bytes
    • U+0041 → 00 41
  • Unicode byte order mark: U+FEFF
    • little endian/big endian issue
    • gives hints on the endian mode
    • stored at the beginning of a Unicode string

UCS-2 Problems

  • Q: What will C program do for unicode-encoded data ‘a’ (00 41)?
  • Q: What will a UNICODE program do for the ASCII text input 41 42 43 44?
  • Due to backward compatibility issues, UCS-2 did not take off much on the Web

UTF-8 to the Rescue

  • We need to make Unicode backward compatible with ASCII!
  • Q: But how?
  • Idea:
    • Both UTF-8 and ASCII encoding should map all ASCII characters to the same 1-byte number
      • e.g., A: U+0041 → 41
      • Q: Why?

UTF-8: Variable-Length Encoding

  • Use 1-4 bytes depending on the CODE POINT range
    • Variable length encoding

UTF-8 Encoding

  • U+0000 - U+007F: encoded to 1 byte
    • [00000000] [0zzzzzzz] → [0zzzzzzz]
  • U+0080 - U+07FF: encoded to 2 bytes
    • [00000yyy] [yyzzzzzz] → [110yyyyy] [10zzzzzz]
  • U+0800 - U+FFFF: encoded to 3 bytes
    • [xxxxyyyy] [yyzzzzzz] →
      [1110xxxx] [10yyyyyy] [10zzzzzz]
  • U+10000 - U+10FFFF: encoded to 4 bytes
    • [___wwwxx] [xxxxyyyy] [yyzzzzzz] →
      [11110www] [10xxxxxx] [10yyyyyy] [10zzzzzz]

UTF-8 Examples

  • ‘A’: U+0041 (range 0000-007F)
    • [00000000] [01000001] → [01000001]
  • ‘Ɛ’: U+0190 (range 0080-07FF)
    • [00000001] [10010000] → [11000110] [10010000]
  • ‘한’: U+D55C (range 0800-FFFF)
    • [11010101] [01011100] → [11101101] [100101 01] [10011100]

UTF-8: Questions

  • Q: How many bytes are used to represent an ASCII character?
  • All existing ASCII-encoded data is UTF-8 encoded!
    • Due to backward compatibility, UTF-8 is most popular on the Web
    • Used by > 90% web sites
  • Q: If two texts have the same number of characters, do their UTF-8 encodings use the same number of bytes?
  • UTF-8 is variable-length encoding

UTF-16

  • Extension of UCS-2 to cover 21 bit code points
  • Variable length: either 2 bytes or 4 bytes
    • U+0000 to U+D7FF and U+E000 to U+FFFF: 2-byte encoding just like UCS-2
    • U+10000 to U+10FFFF: 4-byte encoding
  • Other Unicode encodings also exist
    • e.g., UTF-32: “32bit fixed-length encoding”, …

Using UNICODE (1)

  • Q: How can we use UNICODE?
  • HTTP
    • Character encoding is specified as the charset parameter of Content-Type header
    • E.g., Content-Type: text/html; charset=UTF-8
    • UTF-8 encoding is by far the most popular encoding standard
  • HTML:
    • A for U+0041 (A)

Using UNICODE (2)

  • Most modern OS’s support Unicode natively
    • Windows, macOS: UTF-16, Linux: UTF-8, …
  • Most modern languages, like Java, Javascript, and Python3, use unicode as the default string type
    • provide multiple encoding/decoding functions for UTF-8, UTF-16, ISO-8859-1,…

Using UNICODE (3)

  • Unicode support in C++ is messy
    • On Unix, standard libraries, like std::string, support UTF-8
      • wchar_t means different things depending on the OS
    • Windows supports UTF-16:
      • wchar_t (wide char) instead of char
      • wcs functions instead of str functions.
        • e.g., wcslen instead of strlen
      • prefix string constant with L, like L"Hello"
    • Mac supports UTF-16

References