How are text characters stored as numbers, and why did the world move from ASCII to Unicode?
Explain how characters are encoded using ASCII and Unicode, including code points and UTF-8, and the implications for storage and internationalisation
A focused answer to the H2 Computing outcome on character encoding. ASCII and its limits, Unicode code points, the UTF-8 variable-length scheme, and the implications for storage and supporting many languages.
Reviewed by: AI editorial process; not yet individually human-reviewed
What this dot point is asking
SEAB wants you to explain how text characters are stored as numbers under ASCII and Unicode, what a code point is, how the UTF-8 variable-length encoding works, and what these schemes mean for storage size and supporting the world's writing systems. The core idea is a two-stage mapping: character to code point, then code point to bytes.
The answer
ASCII and its limits
ASCII (American Standard Code for Information Interchange) maps each character to a 7-bit number, so it has 2^7 = 128 codes, numbered 0 to 127. These cover the English letters, digits, punctuation and a set of non-printing control codes (such as newline and tab). The capital letter A is code 65, and lower-case letters sit 32 above their capitals, so a is 97.
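The code values quoted above can be checked directly. The following is a minimal Python sketch using only the built-in ord() and chr() functions; the names ASCII_SIZE, case_offset and lowered are illustrative, not part of any syllabus.

```python
# Illustrative sketch: inspect ASCII codes with the built-ins ord() and chr().
ASCII_SIZE = 2 ** 7  # 7 bits give 128 distinct codes, 0 to 127

for ch in ["A", "a", "0", "\n", "\t"]:
    code = ord(ch)
    # Show the character, its denary code and its 7-bit binary pattern.
    print(repr(ch), code, format(code, "07b"))

# Lower-case letters sit 32 above their capitals.
case_offset = ord("a") - ord("A")
lowered = chr(ord("Q") + case_offset)
print(case_offset, lowered)
```

Adding the offset to a capital letter's code gives its lower-case partner, which is one reason the ASCII layout is convenient for simple text processing.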
ASCII's limitation is its size: 128 codes cannot fit accented European letters, let alone non-Latin scripts like Chinese, Tamil, Arabic or the thousands of other characters in use worldwide. Extended 8-bit variants added another 128 codes but still fell far short and disagreed with one another.
Unicode and code points
Unicode is a single universal standard that assigns every character in every writing system a unique number called a code point, written as U+ followed by the number in hexadecimal, such as U+0041 for A. Unicode has room for 1,114,112 code points (U+0000 to U+10FFFF), enough for modern and historic scripts, mathematical symbols and emoji. Crucially, Unicode defines the code points; how those numbers become bytes is a separate decision.
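As a rough illustration of the first stage of the mapping (character to code point), the sketch below prints the U+ label for a few characters. The helper name code_point_label is hypothetical; ord() is the Python built-in that returns a character's code point.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; format it as at least
    # four upper-case hexadecimal digits.
    return f"U+{ord(ch):04X}"

# Escapes are used so the example does not depend on the file's own encoding.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
for ch in samples:
    print(ch, code_point_label(ch))

# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
CODE_SPACE_SIZE = 0x10FFFF + 1
print(CODE_SPACE_SIZE)
```

Note that nothing here involves bytes yet: a code point is just a number, and the choice of encoding comes afterwards.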
UTF-8: a variable-length encoding
UTF-8 encodes a code point in one to four bytes depending on its value:
- Code points 0 to 127 (the ASCII set) use one byte, identical to ASCII.
- Larger code points use two, three or four bytes. The high bits of the first byte announce the length (110, 1110 or 11110 for two, three or four bytes), and every continuation byte begins with the bits 10.
'A' U+0041 -> 1 byte
'é' U+00E9 -> 2 bytes
'中' U+4E2D -> 3 bytes
'😀' U+1F600 -> 4 bytes
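To make the second stage (code point to bytes) concrete, here is a hand-written encoder sketched in Python. It is for illustration only: the function name utf8_encode is hypothetical, it skips validity checks such as rejecting surrogates, and real programs would simply call the built-in str.encode("utf-8"), which the last lines use as a cross-check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand (illustrative; no validity checks)."""
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    # Cross-check against Python's own encoder.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", encoded.hex(" "))
```

Each branch copies the code point's bits into the x positions of the pattern in its comment, six bits per continuation byte.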
Implications for storage and internationalisation
Because UTF-8 spends only one byte on ASCII, English-dominant text stays compact, while any document can still contain any script by using more bytes where needed. UTF-8 is also backward compatible: every ASCII file is already valid UTF-8. This combination is why UTF-8 is the dominant encoding of the web.
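The storage claim can be checked with a small experiment. The sketch below compares UTF-8 with a fixed 4-byte scheme, using Python's "utf-32-le" codec as a stand-in for "four bytes per character" (that choice, and the sample strings, are assumptions made for illustration).

```python
english = "Hello, world"          # 12 characters, all ASCII
mixed = "Hello, \u4e2d"           # 7 ASCII characters plus one Chinese character

# UTF-8 uses one byte per ASCII character; UTF-32 always uses four.
english_utf8 = len(english.encode("utf-8"))
english_utf32 = len(english.encode("utf-32-le"))
print(english_utf8, english_utf32)

# The non-ASCII character costs three bytes in UTF-8, only where needed.
mixed_utf8 = len(mixed.encode("utf-8"))
mixed_utf32 = len(mixed.encode("utf-32-le"))
print(mixed_utf8, mixed_utf32)

# Backward compatibility: pure ASCII text has identical bytes in both encodings.
same_bytes = english.encode("ascii") == english.encode("utf-8")
print(same_bytes)
```

For the all-English string the UTF-8 form is a quarter of the fixed-width size, and the ASCII bytes are unchanged, which matches the two points made above.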
Examples in context
Example 1. Multilingual websites. A Singapore government page may mix English, Chinese, Malay and Tamil in one document. UTF-8 lets all four languages coexist in a single file, with each character taking only as many bytes as it needs, which is why nearly every web page declares charset=utf-8.
Example 2. Sorting and searching text. Because each character has a numeric code point, a program can compare and sort strings by comparing their codes. Knowing that digits, then capitals, then lower-case letters fall in ascending ASCII order explains why a naive sort places Zebra before apple.
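The sorting behaviour in Example 2 is easy to reproduce. This is a small Python sketch with a made-up word list; sorted() compares strings code point by code point, and passing key=str.lower is shown as one common way to get a case-insensitive order, not as the only option.

```python
words = ["apple", "Zebra", "42", "banana"]

# Default sort compares code points: digits < capitals < lower-case letters.
naive = sorted(words)
print(naive)

# Comparing lower-cased copies removes the capital/lower-case split.
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)

# The codes that decide the naive order.
print(ord("4"), ord("Z"), ord("a"))
```

The naive result puts Zebra before apple because Z is code 90 and a is code 97.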
Try this
Q1. How many characters can standard 7-bit ASCII encode? [1 mark]
- Cue. 2^7 = 128 distinct characters.
Q2. State the difference between a Unicode code point and a UTF-8 byte sequence. [2 marks]
- Cue. The code point is the abstract number identifying a character (U+00E9); the UTF-8 byte sequence is how that number is stored, here two bytes.
Q3. Give one reason UTF-8 is preferred over a fixed 4-byte encoding for English text. [1 mark]
- Cue. English is almost all ASCII, so UTF-8 uses one byte per character, about a quarter of the storage of a 4-byte scheme.
Exam-style practice questions
Practice questions written in the style of SEAB exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.
Original, 4 marks
(a) State how many distinct characters standard 7-bit ASCII can represent and why.
(b) Explain one limitation of ASCII that Unicode was created to solve.
(c) State the relationship between a character and its code point.
Worked answer
(a) Standard ASCII uses 7 bits, so it can represent 2^7 = 128 distinct characters: the digits, upper and lower case English letters, punctuation and control characters.
(b) ASCII covers only the English alphabet and basic symbols. It cannot represent accented letters, non-Latin scripts such as Chinese, Tamil or Arabic, or emoji. Unicode was created to give every character in every writing system its own unique code, enabling text in any language.
(c) A character is mapped to a numeric code point - an integer that identifies it in the standard. For example the letter A is code point 65 in ASCII and U+0041 in Unicode, which is the same number because hexadecimal 41 is denary 65. The code point is then encoded into bytes for storage.
Markers reward 128 with the 7-bit reason, a concrete limitation (no non-English scripts), and the character-to-code-point mapping.
Original, 5 marks
Explain how UTF-8 encodes Unicode code points, and give two advantages of UTF-8 over a fixed 4-byte-per-character scheme for storing a document that is mostly English with occasional non-Latin characters.
Worked answer
UTF-8 is a variable-length encoding. A code point is stored in one to four bytes depending on its value: the 128 ASCII characters use a single byte, and larger code points use two, three or four bytes. The leading bits of the first byte signal how many bytes the character spans, and each continuation byte begins with the bits 10 so it cannot be mistaken for the start of a character.
Two advantages for a mostly-English document:
- Storage efficiency. English text is almost all ASCII, so each character is one byte in UTF-8 versus four bytes in a fixed 4-byte scheme - roughly a quarter of the size.
- Backward compatibility. Any valid ASCII file is already valid UTF-8, so older tools and existing English text keep working unchanged.
A third point markers may accept: it still represents every Unicode character, so the occasional non-Latin characters are handled with the extra bytes only where needed.
Markers reward the variable-length one-to-four byte scheme, ASCII as one byte, and two genuine advantages (space saving and ASCII compatibility).
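For revision, the length-signalling rule can be sketched from the reader's side: given the first byte of a character, work out how many bytes the character occupies. The function name utf8_sequence_length is hypothetical, and the sketch only inspects the leading bits rather than fully validating the input.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character spans, from its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: single-byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError("not the first byte of a UTF-8 character")

for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    data = ch.encode("utf-8")
    print(f"{data[0]:08b}", utf8_sequence_length(data[0]), len(data))
```

Because continuation bytes always start with 10, a decoder that lands in the middle of a character can tell and skip forward to the next lead byte.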