
Wide character (wchar_t)
Multibyte character
Character
Code point
- The number assigned to a character.
Character set
- The mapping between characters and code points.
Encoding
- The bytes actually stored in memory or on disk for a code point.
- UTF-8 is one of the encodings of Unicode.
- ASCII is an encoding.
- A code point and an encoding are two different concepts.
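
To make the difference between a code point and its encoding concrete, here is a minimal sketch that turns one code point into its UTF-8 bytes. The function name `encode_utf8` is made up for this illustration and is not part of any library mentioned on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-8.
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the code point U+4E2D is a single number, while its UTF-8 encoding is the three bytes E4 B8 AD.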
Byte order mark (BOM)
- Indicates whether the encoded data that follows is big-endian or little-endian.
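
A byte order mark can be recognised by looking at the first few bytes of the data. The sketch below is only an illustration: the `Bom` enum and `detect_bom` function are hypothetical names, and the byte patterns are the encodings of U+FEFF in each form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the leading bytes of a buffer and report which BOM, if any, is present.
Bom detect_bom(const std::string& bytes) {
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    const std::size_t n = bytes.size();
    // UTF-32 LE is checked before UTF-16 LE because both start with FF FE.
    if (n >= 4 && at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00)
        return Bom::Utf32LE;
    if (n >= 4 && at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF)
        return Bom::Utf32BE;
    if (n >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

Note that a UTF-16 LE text whose first character is U+0000 is indistinguishable from a UTF-32 LE BOM by this check, so the result is a heuristic rather than a guarantee.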

Code page
- The mapping from characters to their encoding.

Byte string: char*.
Character string: wchar_t*.

Summary:
- Unicode is a character set.
- Unicode has the following encodings:
  - UCS-2, UCS-4, UTF-8, UTF-16 (used by Windows), and UTF-32.

Windows

Programming with Unicode (Windows)

Text and Strings in Visual C++
- Unicode and MBCS
  - Unicode is a 16-bit character encoding, providing enough encodings for all languages. All ASCII characters are included in Unicode as widened characters.
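  - Note: "16-bit" in the quotation above holds only for code points up to U+FFFF. Unicode code points run up to U+10FFFF, and UTF-16 stores the higher ones as two 16-bit code units.
  - Windows uses UTF-16 as its encoding of the Unicode character set.
- Unicode: not supported on Windows ME, Windows 98, and earlier platforms.
- MBCS: an alternative to Unicode that is supported on all Windows platforms. MBCS is not recommended for new software; use Unicode directly.
- SBCS: i.e. ASCII.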
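
Since Windows uses UTF-16, it may help to see how a code point maps to 16-bit code units. The sketch below uses `char16_t` so that it does not depend on the platform's `wchar_t` width; `encode_utf16` is a hypothetical name chosen for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-16 code units.
// Returns an empty string for surrogates or values above U+10FFFF.
std::u16string encode_utf16(std::uint32_t cp) {
    std::u16string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return out;
    if (cp < 0x10000) {
        // Code points in the Basic Multilingual Plane fit in one code unit.
        out += static_cast<char16_t>(cp);
    } else {
        // Higher code points are split into a surrogate pair:
        // subtract 0x10000, then spread the remaining 20 bits over two units.
        const std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

U+4E2D becomes the single unit 0x4E2D, while U+1F600 becomes the pair 0xD83D 0xDE00, which is why one 16-bit unit is not always one character.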

Generic-Text Mappings in Tchar.h
- The _TCHAR macro maps as follows:
  - Unicode (UTF-16): wchar_t. This is mandated by Windows.
  - MBCS: char
  - SBCS: char
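
The idea behind the generic-text mapping can be imitated in portable C++. The sketch below does not use Tchar.h: `MY_UNICODE`, `my_tchar`, `MY_T`, and `my_tstring` are hypothetical stand-ins that only mirror the way one source line can compile to either `char` or `wchar_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for _TCHAR and _T; the real definitions live in Tchar.h.
#ifdef MY_UNICODE
typedef wchar_t my_tchar;   // "Unicode" build: wide characters
#define MY_T(s) L##s        // prefix the literal with L
#else
typedef char my_tchar;      // MBCS/SBCS build: narrow characters
#define MY_T(s) s           // leave the literal unchanged
#endif

// A string type that follows the selected character type.
typedef std::basic_string<my_tchar> my_tstring;

// The same source compiles for either character type.
std::size_t text_length(const my_tstring& s) {
    return s.size();
}
```

Compiled without `MY_UNICODE`, `MY_T("abc")` is a plain `char` literal; with it defined, the same line yields `L"abc"`.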

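Summary:
- Always set the project to "Use Unicode Character Set".
- Declare string constants and variables with TCHAR.
- Use the functions that begin with _t.
- "How to do text on Windows" advises against the two practices above.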

Conversion

How to: Convert Between Various String Types
- A char * string (also known as a C style string) uses a null character to indicate the end of the string. C style strings usually require one byte per character, but can also use two bytes. In the examples below, char * strings are sometimes referred to as multibyte character strings because of the string data that results from converting from Unicode strings.
- How to convert char* to wchar_t*?

Multibyte character string to wide character string

mbstowcs_s

    // Create and display a C style string, and then use it 
    // to create different kinds of strings.
    const char *orig = "Hello, World!";
    cout << orig << " (char *)" << endl;
 
    // newsize describes the length of the 
    // wchar_t string called wcstring in terms of the number 
    // of wide characters, not the number of bytes.
    size_t newsize = strlen(orig) + 1;
 
    // The following creates a buffer large enough to contain 
    // the exact number of characters in the original string
    // in the new format. If you want to add more characters
    // to the end of the string, increase the value of newsize
    // to increase the size of the buffer.
    wchar_t * wcstring = new wchar_t[newsize];
 
    // Convert char* string to a wchar_t* string.
    size_t convertedChars = 0;
    mbstowcs_s(&convertedChars, wcstring, newsize, orig, _TRUNCATE);
    // Display the result and indicate the type of string that it is.
    wcout << wcstring << _T(" (wchar_t *)") << endl;

    // Release the buffer allocated with new[] once it is no longer needed.
    delete[] wcstring;

errno_t mbstowcs_s(
   size_t *pReturnValue,
   wchar_t *wcstr,
   size_t sizeInWords,
   const char *mbstr,
   size_t count 
);

mbstowcs_s uses the current locale for any locale-dependent behavior; _mbstowcs_s_l is identical except that it uses the locale passed in instead.
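
For comparison, a similar conversion can be sketched with the standard `std::mbstowcs`, which is also locale-dependent. The wrapper name `to_wide` is hypothetical. The sketch relies on the length query with a null destination, which POSIX documents and common C libraries support; treat that as an assumption on other platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical wrapper: convert a multibyte string to a wide string using the
// current C locale. Returns false if the input is not valid in that locale.
bool to_wide(const char* src, std::wstring& out) {
    // With a null destination, mbstowcs reports how many wide characters are
    // needed, not counting the terminating null (assumed POSIX-style behaviour).
    const std::size_t len = std::mbstowcs(nullptr, src, 0);
    if (len == static_cast<std::size_t>(-1)) return false;

    out.assign(len, L'\0');
    if (len == 0) return true;

    // Convert into the string's own storage; exactly len characters are written.
    const std::size_t written = std::mbstowcs(&out[0], src, len);
    return written == len;
}
```

Using `std::wstring` as the destination avoids the manual `new[]`/`delete[]` pair. Non-ASCII input only converts correctly if a suitable locale has been selected first, for example with `std::setlocale`.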

MultiByteToWideChar

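    // Ask how many wide characters are needed to hold the converted pszValue,
    // including the terminating null. This is a character count, not a byte
    // count. A return value of 0 means the call failed.
    int n = ::MultiByteToWideChar(CP_ACP,0,(const char *)pszValue,-1,NULL,0);
 
    // Allocate the output buffer.
    wchar_t* buffer = new wchar_t[n];
 
    // Convert the input string (pszValue) into the output buffer (buffer).
    ::MultiByteToWideChar(CP_ACP,0,(const char *)pszValue,-1,buffer,n);
 
    // Keep a copy of the converted data.
    m_strValue = tstring(buffer);
 
    // Release the buffer.
    delete [] buffer;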

int MultiByteToWideChar(
  _In_       UINT CodePage,
  _In_       DWORD dwFlags,
  _In_       LPCSTR lpMultiByteStr,
  _In_       int cbMultiByte,
  _Out_opt_  LPWSTR lpWideCharStr,
  _In_       int cchWideChar
);

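CodePage
- The encoding (code page) of the input string.
lpMultiByteStr
- Pointer to the input string.
cbMultiByte
- The number of bytes of the input string to process.
- If -1, the input string is treated as null-terminated, and the return value of MultiByteToWideChar counts the converted characters including the terminating null character.
lpWideCharStr
- The output buffer. May be NULL.
cchWideChar
- The size of the output buffer, in characters.
- If 0, the return value of MultiByteToWideChar is the number of characters required for the output buffer (lpWideCharStr).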

wstringstream

#include <sstream>
#include "tstring.h"
 
std::wstringstream wss;
wss << pszValue;
m_strValue = tstring(wss.str().c_str());

也谈C++中char*与wchar_t*之间的转换 (an article on converting between char* and wchar_t* in C++)
Correctness not yet verified.

Wide character string to multibyte character string

UTF-8 can generally be stored in a char*, because the UTF-8 code unit and char are both 8 bits wide. Whether char is signed or unsigned does not affect this.
- UTF8 to/from wide char conversion in STL
  - ConvertUTF.h
- How can char[] represent an UTF-8 string?
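
Although signedness does not matter for storing UTF-8 in `char`, it does matter when inspecting byte values, so the sketch below casts to `unsigned char` first. It counts code points in a UTF-8 string by skipping continuation bytes. `utf8_length` is a hypothetical name, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a well-formed UTF-8 string.
// Every code point has exactly one byte that is not a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (char c : s) {
        // Cast first so the comparison behaves the same whether char is
        // signed or unsigned on this platform.
        const unsigned char b = static_cast<unsigned char>(c);
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

The six bytes E4 B8 AD E6 96 87 therefore count as two characters, while `std::string::size()` would report six.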

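Summary:
- Applying this to a compiler: internally the compiler should use wide character strings, i.e. the Unicode that Windows supports natively; all external input and output should be treated as multibyte character strings, which may be ASCII or UTF-8.
- The lexer (Lex) used by the compiler should treat its input as UTF-8, convert it to wide characters, and hand those to the later stages.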
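
The lexer step described above, reading UTF-8 and producing wider characters, could look roughly like the following. This is a sketch under assumptions: it decodes to `char32_t` rather than `wchar_t` so that it behaves the same on every platform (on Windows, where `wchar_t` is 16 bits, a further UTF-16 step would be needed), and `decode_utf8` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into code points.
// Returns false on malformed input: a stray or missing continuation byte,
// a truncated sequence, an overlong form, a surrogate, or a value > U+10FFFF.
bool decode_utf8(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        char32_t min = 0;       // smallest value allowed for this length
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min = 0x10000; }
        else return false;                          // invalid lead byte

        if (in.size() - i <= extra) return false;   // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                           // overlong, too large, or surrogate
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

A lexer built this way can reject malformed source text up front and then work on whole code points in its later stages.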

Other

Why does my application require Visual C++ Redistributable package

Converting char* to LPCTSTR

Linux

Chinese encodings

References

UTF-8 and Unicode
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- 每個軟體開發者都絕對一定要會的Unicode及字元集必備知識(沒有藉口！) (Traditional Chinese translation of the article above)
- Clarification on Joel Spolsky's Unicode Article
UTF-8 Everywhere
Unicode HOWTO
Unicode 編碼速查 (Unicode quick lookup)
- Unicode lookup/search tool
- Unicode Character Search
自製編程語言, Chapter 5
