How are text characters stored as numbers, and why did the world move from ASCII to Unicode?

Explain how characters are encoded using ASCII and Unicode, including code points and UTF-8, and the implications for storage and internationalisation

A focused answer to the H2 Computing outcome on character encoding. It covers ASCII and its limits, Unicode code points, the UTF-8 variable-length scheme, and the implications for storage and for supporting many languages.

Generated by Claude Opus 4.8. 7 min answer. Updated 2026-06-06.

Reviewed by: AI editorial process; not yet individually human-reviewed

What this dot point is asking

SEAB wants you to explain how text characters are stored as numbers under ASCII and Unicode, what a code point is, how the UTF-8 variable-length encoding works, and what these schemes mean for storage size and supporting the world's writing systems. The core idea is a two-stage mapping: character to code point, then code point to bytes.

The answer

ASCII and its limits

ASCII (American Standard Code for Information Interchange) maps each character to a 7-bit number, so it has $2^7 = 128$ codes. These cover the English letters, digits, punctuation and a set of non-printing control codes (such as newline and tab). The capital letter A is code 65, and lower-case letters sit 32 above their capitals, so a is 97.
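The codes quoted above can be checked with a short Python sketch. It relies only on the built-in `ord()` and `chr()` functions; the helper name `to_lower_ascii` is made up for illustration and is not part of any library.

```python
# Look up character codes with Python's built-in ord() and chr().
upper_a = ord("A")            # numeric code of capital A
lower_a = ord("a")            # numeric code of lower-case a
case_gap = lower_a - upper_a  # distance between a capital and its lower-case form


def to_lower_ascii(letter: str) -> str:
    """Convert one capital ASCII letter to lower case by adding 32 (illustrative helper)."""
    if "A" <= letter <= "Z":
        return chr(ord(letter) + 32)
    return letter  # anything else is returned unchanged


print(upper_a, lower_a, case_gap)  # 65 97 32
print(to_lower_ascii("Q"))         # q
print(format(upper_a, "07b"))      # 1000001, so 65 fits in 7 bits
```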

ASCII's limitation is its size: 128 codes cannot fit accented European letters, let alone non-Latin scripts like Chinese, Tamil, Arabic or the thousands of other characters in use worldwide. Extended 8-bit variants added another 128 codes but still fell far short and disagreed with one another.

Unicode and code points

Unicode is a single universal standard that aims to assign every character in every writing system a unique number called a code point, written in hexadecimal like U+0041 for A. The code space runs from U+0000 to U+10FFFF, giving room for 1,114,112 code points, enough to cover modern and historic scripts, mathematical symbols and emoji. Crucially, Unicode defines the code points; how those numbers become bytes is a separate decision.
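As a rough illustration of the character-to-code-point stage, the sketch below prints the code point of a few characters in the U+XXXX notation. `code_point_label` is a hypothetical helper name; `ord()` and `sys.maxunicode` are standard Python.

```python
import sys


def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


for ch in ["A", "é", "中", "😀"]:
    # ord() gives the code point as a decimal integer; the label shows it in hex.
    print(ch, ord(ch), code_point_label(ch))

# The highest code point is U+10FFFF, so the code space holds 1,114,112 values.
print(sys.maxunicode + 1)
```

Note that nothing here involves bytes yet: a code point is just an integer, and the choice of encoding comes afterwards.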

UTF-8: a variable-length encoding

UTF-8 encodes a code point in one to four bytes depending on its value:

Code points 0 to 127 (the ASCII set) use one byte, identical to ASCII.
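Larger code points use two, three or four bytes. The high bits of the first byte announce the length (110, 1110 or 11110 for a two-, three- or four-byte sequence), and every continuation byte begins with 10 so it can never be mistaken for the start of a character.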

'A'  U+0041  -> 1 byte
'é'  U+00E9  -> 2 bytes
'中' U+4E2D  -> 3 bytes
'😀' U+1F600 -> 4 bytes
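To make the code-point-to-bytes stage concrete, here is a minimal hand-written encoder, offered as a teaching sketch rather than a replacement for Python's built-in `str.encode`. The function name `utf8_encode` is an assumption for this example, and it simplifies by not rejecting the surrogate range U+D800 to U+DFFF, which a real encoder would refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point into UTF-8 bytes by hand (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point < 0x80:      # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    encoded = utf8_encode(cp)
    # Show the code point, the byte count and the bytes in hex.
    print(f"U+{cp:04X}", len(encoded), encoded.hex(" "))
# U+0041 1 41
# U+00E9 2 c3 a9
# U+4E2D 3 e4 b8 ad
# U+1F600 4 f0 9f 98 80
```

The byte counts match the table above, and the results agree with what `chr(cp).encode("utf-8")` returns for these four characters.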

Implications for storage and internationalisation

Because UTF-8 spends only one byte on ASCII, English-dominant text stays compact, while any document can still contain any script by using more bytes where needed. UTF-8 is also backward compatible: every ASCII file is already valid UTF-8. This combination is why UTF-8 is the dominant encoding of the web.
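The storage trade-off can be measured directly. The sketch below compares UTF-8 with a fixed four-byte encoding, using Python's `utf-32-be` codec as a stand-in for "four bytes per character" (it writes no byte-order mark). The sample strings and the helper name `encoded_sizes` are made up for illustration.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (characters, UTF-8 bytes, fixed 4-byte bytes) for a string (hypothetical helper)."""
    utf8_size = len(text.encode("utf-8"))
    fixed_size = len(text.encode("utf-32-be"))  # always 4 bytes per code point
    return len(text), utf8_size, fixed_size


english = "Unicode is a universal standard."
mixed = "Hello 世界"

print(encoded_sizes(english))  # (32, 32, 128): one byte per character in UTF-8
print(encoded_sizes(mixed))    # (8, 12, 32): the two Chinese characters take 3 bytes each

# Backward compatibility: bytes written as ASCII decode unchanged as UTF-8.
ascii_bytes = "Cafe!".encode("ascii")
print(ascii_bytes.decode("utf-8"))  # Cafe!
```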

Worked example

A text file contains the five characters Cafe! followed by an accented é, so the visible text is Cafe!é. Compare its size in bytes under pure ASCII (where possible) and under UTF-8.

Step 1: Identify each character's code point

C, a, f, e, ! are all ASCII (codes 67, 97, 102, 101, 33). é is U+00E9, outside ASCII.

Step 2: Try pure ASCII

ASCII cannot represent é at all - there is no 7-bit code for it - so the file cannot be stored as standard ASCII. This is exactly the limitation Unicode solves.

Step 3: Encode in UTF-8

The five ASCII characters take one byte each (5 bytes). The é at U+00E9 takes two bytes in UTF-8.

Step 4: Total the size

$5 \times 1 + 1 \times 2 = 7$ bytes in UTF-8, and the accented character is represented correctly - which pure ASCII could not do at all.
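The four steps can be reproduced in a few lines of Python as a check on the arithmetic. The string is written with the escape `\u00e9` so that é is definitely the single code point U+00E9 used in the worked example; the variable names are arbitrary.

```python
text = "Cafe!\u00e9"  # Cafe!é, with é as the single code point U+00E9

# Step 2: pure ASCII fails because é has no 7-bit code.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Steps 3 and 4: count the UTF-8 bytes per character, then in total.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]
total_bytes = len(text.encode("utf-8"))

print(ascii_ok)        # False
print(bytes_per_char)  # [1, 1, 1, 1, 1, 2]
print(total_bytes)     # 7
```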

Examples in context

Example 1. Multilingual websites. A Singapore government page may mix English, Chinese, Malay and Tamil in one document. UTF-8 lets all four scripts coexist in a single file, with each character taking only as many bytes as it needs, which is why nearly every web page declares charset=utf-8.

Example 2. Sorting and searching text. Because each character has a numeric code point, a program can compare and sort strings by comparing their codes. Knowing that digits, then capitals, then lower-case letters fall in ascending ASCII order explains why a naive sort places Zebra before apple.
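A small Python sketch shows the effect. Python's default string comparison works code point by code point, so `sorted()` reproduces the naive ordering; the word list is invented for the example, and the `str.casefold` key is included only as one common way to get a case-insensitive order.

```python
words = ["apple", "Zebra", "42", "banana"]

# Default sort compares code points: digits < capitals < lower-case letters.
naive_order = sorted(words)
print(naive_order)  # ['42', 'Zebra', 'apple', 'banana']

# Comparing case-folded copies ignores the 32-code gap between cases.
case_insensitive_order = sorted(words, key=str.casefold)
print(case_insensitive_order)  # ['42', 'apple', 'banana', 'Zebra']

# The underlying reason: Z is 90 and a is 97.
print(ord("Z"), ord("a"))  # 90 97
```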

Try this

Q1. How many characters can standard 7-bit ASCII encode? [1 mark]

Cue. $2^7 = 128$ distinct characters.

Q2. State the difference between a Unicode code point and a UTF-8 byte sequence. [2 marks]

Cue. The code point is the abstract number identifying a character (U+00E9); the UTF-8 byte sequence is how that number is stored, here two bytes.

Q3. Give one reason UTF-8 is preferred over a fixed 4-byte encoding for English text. [1 mark]

Cue. English is almost all ASCII, so UTF-8 uses one byte per character, about a quarter of the storage of a 4-byte scheme.

Exam-style practice questions

Practice questions written in the style of SEAB exam questions on this dot point, with worked answer explainers. The year tag is the paper they imitate, not the source.

Original [4 marks]

(a) State how many distinct characters standard 7-bit ASCII can represent and why.
(b) Explain one limitation of ASCII that Unicode was created to solve.
(c) State the relationship between a character and its code point.

(a) Standard ASCII uses 7 bits, so it can represent $2^7 = 128$ distinct characters: the digits, upper and lower case English letters, punctuation and control characters.

(b) ASCII covers only the English alphabet and basic symbols. It cannot represent accented letters, non-Latin scripts such as Chinese, Tamil or Arabic, or emoji. Unicode was created to give every character in every writing system its own unique code, enabling text in any language.

(c) A character is mapped to a numeric code point - an integer that identifies it in the standard. For example the letter A is code 65 in ASCII and U+0041 in Unicode, which is the same value because hexadecimal 41 is decimal 65. The code point is then encoded into bytes for storage.

Markers reward $2^7 = 128$ with the bit reason, a concrete limitation (no non-English scripts), and the character-to-code-point mapping.

Original [5 marks]

Explain how UTF-8 encodes Unicode code points, and give two advantages of UTF-8 over a fixed 4-byte-per-character scheme for storing a document that is mostly English with occasional non-Latin characters.

UTF-8 is a variable-length encoding. A code point is stored in one to four bytes depending on its value: the 128 ASCII characters use a single byte, and larger code points use two, three or four bytes. The leading bits of the first byte signal how many bytes the character spans, and each continuation byte begins with the bits 10.

Two advantages for a mostly-English document:

Storage efficiency. English text is almost all ASCII, so each character is one byte in UTF-8 versus four bytes in a fixed 4-byte scheme - roughly a quarter of the size.
Backward compatibility. Any valid ASCII file is already valid UTF-8, so older tools and existing English text keep working unchanged.

A third point markers may accept: it still represents every Unicode character, so the occasional non-Latin characters are handled with the extra bytes only where needed.

Markers reward the variable-length one-to-four byte scheme, ASCII as one byte, and two genuine advantages (space saving and ASCII compatibility).