Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
536 views
in Technique[技术] by (71.8m points)

character encoding - Is ASCII code 7-bit or 8-bit?

My teacher told me ASCII is 8-bit character coding scheme. But it is defined only for 0-127 codes which means it can be fit into 7-bits. So can't it be argued that ASCII bit is actually 7-bit code?

And what do we mean to say at all when saying ASCII is 8-bit code at all?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

ASCII was indeed originally conceived as a 7-bit code. This was done well before 8-bit bytes became ubiquitous, and even into the 1990s you could find software that assumed it could use the 8th bit of each byte of text for its own purposes ("not 8-bit clean"). Nowadays people think of it as an 8-bit coding in which bytes 0x80 through 0xFF have no defined meaning, but that's a retcon.

There are dozens of text encodings that make use of the 8th bit; they can be classified as ASCII-compatible or not, and fixed- or variable-width. ASCII-compatible means that regardless of context, single bytes with values from 0x00 through 0x7F encode the same characters that they would in ASCII. You don't want to have anything to do with a non-ASCII-compatible text encoding if you can possibly avoid it; naive programs expecting ASCII tend to misinterpret them in catastrophic, often security-breaking fashion. They are so deprecated nowadays that (for instance) HTML5 forbids their use on the public Web, with the unfortunate exception of UTF-16. I'm not going to talk about them any more.

A fixed-width encoding means what it sounds like: all characters are encoded using the same number of bytes. To be ASCII-compatible, a fixed-with encoding must encode all its characters using only one byte, so it can have no more than 256 characters. The most common such encoding nowadays is Windows-1252, an extension of ISO 8859-1.

There's only one variable-width ASCII-compatible encoding worth knowing about nowadays, but it's very important: UTF-8, which packs all of Unicode into an ASCII-compatible encoding. You really want to be using this if you can manage it.

As a final note, "ASCII" nowadays takes its practical definition from Unicode, not its original standard (ANSI X3.4-1968), because historically there were several dozen variations on the ASCII 127-character repertoire -- for instance, some of the punctuation might be replaced with accented letters to facilitate the transmission of French text. Nowadays all of those variations are obsolescent, and when people say "ASCII" they mean that the bytes with value 0x00 through 0x7F encode Unicode codepoints U+0000 through U+007F. This will probably only matter to you if you ever find yourself writing a technical standard.

If you're interested in the history of ASCII and the encodings that preceded it, start with the paper "The Evolution of Character Codes, 1874-1968" (samizdat copy at http://falsedoor.com/doc/ascii_evolution-of-character-codes.pdf) and then chase its references (many of which are not available online and may be hard to find even with access to a university library, I regret to say).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...