Content Encoding
Junghoo Cho
Today’s Topic
- MIME
- Text encoding standards
Multimedia over Internet
- Q: Only “bits” are transmitted over the Internet. How does a
browser/application interpret the bits and display them correctly?
- Content-Type header
- e.g., Content-Type: text/html
MIME (Multi-purpose Internet Mail Extensions)
- Standard way to indicate the type of the transmitted content
- text vs image vs video vs …
- Originally developed for email attachments, but currently used
for all Internet data transmission
- RFC2046
- IANA (Internet Assigned Numbers Authority) manages the
official registry of all media types
MIME Type Specification
- Format: “type/subtype”
- Popular MIME types (case insensitive)
- Text: text/plain, text/html, text/css, …
- Image: image/jpeg, image/png, image/gif, …
- Audio: audio/mpeg (.mp3), audio/mp4 (.m4a), …
- Video: video/mp4, video/h264, …
- Application: application/pdf, application/octet-stream, …
- Multipart: more on this in a later lecture
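As a small illustration of the "type/subtype" format, the sketch below uses Python's standard mimetypes module to guess a media type from a file extension. The file names and the guess_mime helper are hypothetical, chosen only for this example; a real server may decide the type in other ways.

```python
import mimetypes

# Hypothetical file names, used only for illustration.
file_names = ["index.html", "photo.png", "paper.pdf"]


def guess_mime(name):
    """Return the "type/subtype" string guessed from a file extension."""
    mime_type, _encoding = mimetypes.guess_type(name)
    return mime_type


for name in file_names:
    mime = guess_mime(name)
    # Split "type/subtype" into its two parts.
    main_type, subtype = mime.split("/")
    print(f"{name}: type={main_type}, subtype={subtype}")
```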
Browser Support
- MIME type is specified in “Content-Type” HTTP header
- E.g., Content-Type: text/html
- Q: What multimedia types/format should a browser support?
- No particular support is required
- HTML5 is content-type/codec agnostic
- But users expect support for “popular” formats, such as JPEG, PNG, etc.
Legal Issues
- 1999 UNISYS patent claim on GIF
- 2010 uncertainty on H.264 Web streaming
- Ongoing uncertainty on H.265
- MPEG LA vs. HEVC Advance
- Google’s push for AV1
Text Encoding
- Q: For text, how does a browser map a sequence of bits to characters?
- Character encoding/Character set
- Numbers ↔ characters
- Many character encoding standards exist
Early Standards
- ASCII (1963)
- 7bits. 128 characters
- extended to many 8-bit standards (e.g., ISO-8859-1)
- basis of current standards for roman characters
- EBCDIC (1963)
- created by IBM for IBM mainframes
- 8 bits. designed to be easy to represent on punch cards
- still used by some IBM mainframes
Local/Regional Encoding
- Local character codes developed by each country
- DBCS (Double-Byte Character Set)
- one or two bytes are used to represent a character
- frequently used in Asia
- Example: GB2312 (Simplified Chinese), EUC-KR (Korean), …
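The "one or two bytes per character" behavior can be checked with Python's built-in codecs for the two example encodings. This is a minimal sketch; the byte_lengths helper and the sample strings are assumptions made for illustration.

```python
def byte_lengths(text, codec):
    """Return how many bytes each character of `text` takes in `codec`."""
    return [len(ch.encode(codec)) for ch in text]


# An ASCII letter followed by one Korean / one Simplified Chinese character.
korean = byte_lengths("A한", "euc_kr")
chinese = byte_lengths("A汉", "gb2312")

print("EUC-KR byte lengths:", korean)
print("GB2312 byte lengths:", chinese)
```

In both encodings the ASCII letter stays one byte, while the regional character takes two.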
Code Page
- Q: How does a computer know what encoding standard is used for a file in the system?
- Early solution: system-wide specification
- OS sets the global code page for all files in the computer
- Code page (= character encoding)
- a unique number given to a particular character encoding by a system
- On Windows: Hebrew (862), Greek (737)
- Q: Any problem with a system-wide code-page setting?
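One way to explore the question above is to decode the same bytes under two different code pages. The sketch below assumes Python's built-in cp862 and cp737 codecs; the sample bytes are arbitrary values above the ASCII range.

```python
# Three arbitrary byte values above the ASCII range.
raw = bytes([0x80, 0x81, 0x82])

# The same bytes read under two different code pages.
as_hebrew = raw.decode("cp862")
as_greek = raw.decode("cp737")

print("cp862:", as_hebrew)
print("cp737:", as_greek)
print("same text?", as_hebrew == as_greek)
```

A file written on a machine with one global code page is therefore likely to display as different characters on a machine configured with another, since the bytes themselves carry no record of the encoding.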
UNICODE
- Motivation:
- One standard for all existing characters in the world
- Assign a unique number for every character in the world!
- V1.0 was published in October 1991
- managed by Unicode Consortium
- (almost) yearly release of a new Unicode version
Code Point
- Every character maps to a CODE POINT
- A → U+0041
- Hello → U+0048 U+0065 U+006C U+006C U+006F.
- Originally defined to be a 16bit standard
- No longer true. Currently 21 bits (0x000000 – 0x10FFFF)
- A CODE POINT is encoded into a sequence of bytes through an encoding scheme
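In Python, the built-in ord and chr functions convert between characters and code points, which gives a quick way to reproduce the mappings above. The code_points helper is a hypothetical name used only in this sketch.

```python
import sys


def code_points(text):
    """Format each character of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points("A"))
print(code_points("Hello"))

# chr() goes the other way: code point -> character.
print(chr(0x41))

# The largest valid code point in Python matches the Unicode limit.
print(hex(sys.maxunicode))
```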
UCS-2 (2-byte Universal Character Set)
- First Unicode encoding scheme
- Represent the (original) Unicode characters with two bytes
- Unicode byte order mark: U+FEFF
- little endian/big endian issue
- gives hints on the endian mode
- stored at the beginning of a Unicode string
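The byte-order issue can be seen with Python's UTF-16 codecs, which produce the same bytes as UCS-2 for characters that fit in 16 bits. This is a sketch under that assumption; Python has no dedicated UCS-2 codec.

```python
import codecs

# The same character in the two byte orders.
big = "A".encode("utf-16-be")     # bytes 00 41
little = "A".encode("utf-16-le")  # bytes 41 00

# U+FEFF written in each byte order.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())

# With a byte order mark in front, the generic "utf-16" decoder
# detects the byte order and strips the mark.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(from_big, from_little)
```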
UCS-2 Problems
- Q: What will a C program do for Unicode-encoded data ‘A’ (00 41)?
- Q: What will a UNICODE program do for the ASCII text input 41 42 43 44?
- Due to backward compatibility issues, UCS-2 did not take off much on the Web
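Both questions can be explored with a short simulation. The c_strlen helper below is a hypothetical stand-in for C's strlen (it counts bytes up to the first zero byte), and big-endian byte order is assumed for the two-byte data.

```python
def c_strlen(data):
    """Mimic C's strlen: count the bytes before the first zero byte."""
    return data.index(0) if 0 in data else len(data)


# 'A' as two bytes (00 41): a C string function sees the terminator first.
ucs2_a = "A".encode("utf-16-be")
print("strlen of 00 41:", c_strlen(ucs2_a))

# ASCII text 41 42 43 44 read as two-byte units: pairs of bytes fuse
# into two unrelated characters instead of "ABCD".
ascii_bytes = bytes([0x41, 0x42, 0x43, 0x44])
misread = ascii_bytes.decode("utf-16-be")
print([f"U+{ord(ch):04X}" for ch in misread])
```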
UTF-8 to the Rescue
- We need to make Unicode backward compatible with ASCII!
- Q: But how?
- Idea:
- Both UTF-8 and ASCII encoding should map all ASCII characters
to the same 1-byte number
- e.g., A: U+0041 → 41
- Q: Why?
UTF-8: Variable-Length Encoding
- Use 1-4 bytes depending on the CODE POINT range
UTF-8 Encoding
- U+0000 - U+007F: encoded to 1 byte
- [00000000] [0zzzzzzz] → [0zzzzzzz]
- U+0080 - U+07FF: encoded to 2 bytes
- [00000yyy] [yyzzzzzz] → [110yyyyy] [10zzzzzz]
- U+0800 - U+FFFF: encoded to 3 bytes
- [xxxxyyyy] [yyzzzzzz] →
[1110xxxx] [10yyyyyy] [10zzzzzz]
- U+10000 - U+10FFFF: encoded to 4 bytes
- [___wwwxx] [xxxxyyyy] [yyzzzzzz] →
[11110www] [10xxxxxx] [10yyyyyy] [10zzzzzz]
UTF-8 Examples
- ‘A’: U+0041 (range 0000-007F)
- [00000000] [01000001] → [01000001]
- ‘Ɛ’: U+0190 (range 0080-07FF)
- [00000001] [10010000] → [11000110] [10010000]
- ‘한’: U+D55C (range 0800-FFFF)
- [11010101] [01011100] → [11101101] [10010101] [10011100]
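The encoding rules and the three examples above can be checked with a hand-written encoder. This is a minimal sketch for illustration, not a replacement for the built-in codec; utf8_encode is a hypothetical name, and unlike str.encode it does not reject surrogate code points (U+D800 to U+DFFF).

```python
def utf8_encode(cp):
    """Encode one code point into UTF-8 bytes using the bit patterns above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:       # 1 byte: 0zzzzzzz
        return bytes([cp])
    if cp <= 0x7FF:      # 2 bytes: 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 3 bytes: 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110www 10xxxxxx 10yyyyyy 10zzzzzz
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "AƐ한":
    encoded = utf8_encode(ord(ch))
    bits = " ".join(f"{b:08b}" for b in encoded)
    # Compare against Python's built-in encoder.
    print(f"U+{ord(ch):04X} -> {bits}  matches built-in: {encoded == ch.encode('utf-8')}")
```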
UTF-8: Questions
- Q: How many bytes are used to represent an ASCII character?
- All existing ASCII-encoded data is UTF-8 encoded!
- Due to backward compatibility, UTF-8 is most popular on the Web
- Used by more than 90% of websites
- Q: If two texts have the same number of characters, do their UTF-8 encodings use the same number of bytes?
- UTF-8 is a variable-length encoding
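Both questions can be tried out directly in Python. The two sample strings are arbitrary choices with the same number of characters.

```python
english = "Hello"
korean = "안녕하세요"

# Same number of characters...
print(len(english), len(korean))

# ...but different numbers of UTF-8 bytes.
english_bytes = english.encode("utf-8")
korean_bytes = korean.encode("utf-8")
print(len(english_bytes), len(korean_bytes))

# ASCII text produces identical bytes under ASCII and UTF-8.
print(english.encode("ascii") == english_bytes)
```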
UTF-16
- Extension of UCS-2 to cover 21 bit code points
- Variable length: either 2 bytes or 4 bytes
- U+0000 to U+D7FF and U+E000 to U+FFFF: 2-byte encoding just like UCS-2
- U+10000 to U+10FFFF: 4-byte encoding
- Other Unicode encodings also exist
- e.g., UTF-32: “32bit fixed-length encoding”, …
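The 4-byte case works by splitting a code point into a pair of 16-bit units (a "surrogate pair") taken from the reserved range U+D800 to U+DFFF, which is why that range is excluded from the 2-byte case above. The slide does not spell out this mechanism, so the sketch below goes slightly beyond it; utf16_units and utf16_be_bytes are hypothetical names, and big-endian output is assumed.

```python
def utf16_units(cp):
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if cp < 0x10000:
        return [cp]                   # 2-byte case, same as UCS-2
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits
    return [high, low]


def utf16_be_bytes(cp):
    """Serialize the code units in big-endian byte order."""
    return b"".join(unit.to_bytes(2, "big") for unit in utf16_units(cp))


for cp in (0xD55C, 0x1F600):
    units = [f"{u:04X}" for u in utf16_units(cp)]
    print(f"U+{cp:04X} -> {units}")
```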
Using UNICODE (1)
- Q: How can we use UNICODE?
- HTTP
- Character encoding is specified as the charset parameter of the Content-Type header
- E.g., Content-Type: text/html; charset=UTF-8
- UTF-8 encoding is by far the most popular encoding standard
- HTML: <meta charset="UTF-8"> inside <head>
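A client has to read the charset parameter before it can decode the response body. The sketch below parses a Content-Type value with Python's standard email.message module, which handles MIME-style headers; parse_content_type is a hypothetical helper and the body bytes are made up for the example.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(media_type, charset)

# Use the declared charset to turn the body bytes back into text.
body = "한".encode("utf-8")
text = body.decode(charset)
print(text)
```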
Using UNICODE (2)
- Most modern OS’s support Unicode natively
- Windows, macOS: UTF-16, Linux: UTF-8, …
- Most modern languages, like Java, JavaScript, and Python 3, use Unicode as the default string type
- provide multiple encoding/decoding functions for UTF-8, UTF-16, ISO-8859-1,…
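In Python 3, for instance, a str holds Unicode text and the encode/decode methods convert to and from bytes in a named encoding. A short sketch, with arbitrary sample characters:

```python
text = "한"

# One string, several byte representations.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
print(utf8.hex(), utf16.hex())

# Decoding reverses the mapping.
print(utf8.decode("utf-8") == text)

# ISO-8859-1 covers only 256 characters, so encoding can fail.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print("encodable in ISO-8859-1:", latin1_ok)
```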
Using UNICODE (3)
- Unicode support in C++ is messy
- On Unix, standard libraries, like std::string, support UTF-8
- wchar_t means different things depending on the OS
- Windows supports UTF-16:
- wchar_t (wide char) instead of char
- wcs functions instead of str functions
- e.g., wcslen instead of strlen
- prefix string constant with L, like L"Hello"
- Mac supports UTF-16
- …
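The platform dependence of wchar_t can be observed from Python through the standard ctypes module, whose c_wchar type corresponds to the C compiler's wchar_t. This is only a quick probe; the size is typically 2 bytes on Windows and 4 bytes on Linux and macOS.

```python
import ctypes
import sys

# Size in bytes of the platform's wchar_t, as seen through ctypes.
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

print(f"platform={sys.platform}, wchar_t is {wchar_bytes} bytes")
```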