String Encoding, Decoding, and Garbled Text in C#

Published 2024-12-24 16:21:52 on this site

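Recently, while using a StringBuilder in C# to receive a string returned from a DLL call, I found that Chinese characters came back garbled: the original string was "hello 你好", but the StringBuilder contained "hello 浣犲ソ". The call was declared as: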

[DllImport("user32")]

public static extern IntPtr SendMessage(IntPtr hWnd, NppMsg Msg, int wParam, [MarshalAs(UnmanagedType.LPWStr)] StringBuilder lParam);

I first tried adding CharSet, as in [DllImport("user32", CharSet = CharSet.Auto)] and [DllImport("user32", CharSet = CharSet.Unicode)], but neither helped.

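Next I considered re-encoding and decoding the string. Judging by the symptom, English text was fine and only the Chinese was garbled, which points to a multi-byte character encoding problem. .NET identifies each encoding by an int value called the CodePage. I found a rather brute-force method elsewhere: iterate over every pair of code pages, encode the garbled string with the first and decode the resulting bytes with the second, and look for the pairs that produce the correct text: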

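using System.IO;
using System.Text;

static void Main(string[] args)
{
    StringBuilder sb = new StringBuilder();
    string source = "hello 浣犲ソ"; // the garbled string to analyze
    foreach (var e1 in Encoding.GetEncodings())
    {
        // Encode the garbled string into bytes with candidate code page e1.
        byte[] unknown = Encoding.GetEncoding(e1.CodePage).GetBytes(source);
        foreach (var e2 in Encoding.GetEncodings())
        {
            // Decode those bytes with candidate code page e2.
            string result = Encoding.GetEncoding(e2.CodePage).GetString(unknown);
            sb.AppendLine(string.Format("{0} => {1} : {2}", e1.CodePage, e2.CodePage, result));
        }
    }
    // Write all combinations once, after both loops have finished.
    File.WriteAllText("test.txt", sb.ToString());
}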

The results are written to a file. Searching that file for "hello 你好" finds:

Line 3503: 936 => 65001 : hello 你好

Line 17319: 50227 => 65001 : hello 你好

Line 17599: 51936 => 65001 : hello 你好

Line 18051: 54936 => 65001 : hello 你好

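So encoding the garbled string into bytes with code page 936, 50227, 51936, or 54936, and then decoding those bytes with code page 65001, yields the correct result. Looking up the code page numbers shows that 936, 50227, 51936, and 54936 all correspond to Simplified Chinese, which is my system language, while 65001 corresponds to Unicode (UTF-8), which is my target encoding.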
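The result above suggests that the original bytes were UTF-8 and were at some point decoded with the system code page. The following is a minimal sketch that reproduces the garbling and reverses it, not a description of what the DLL actually does. The class name MojibakeDemo and its methods are hypothetical, and code page 936 is assumed to stand in for the system code page. It also assumes .NET Core or .NET 5+, where CodePagesEncodingProvider must be registered before code page 936 can be used.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: code page 936 (Simplified Chinese) stands in for the system code page.
    private static Encoding SystemCodePage()
    {
        // Required on .NET Core / .NET 5+, where code page 936 is not available by default.
        // On .NET Framework this registration is not needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(936);
    }

    // Simulates the bug: UTF-8 bytes wrongly decoded with the system code page.
    public static string Garble(string original)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        return SystemCodePage().GetString(utf8Bytes);
    }

    // Reverses the bug: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] raw = SystemCodePage().GetBytes(garbled);
        return Encoding.UTF8.GetString(raw);
    }

    public static void Main()
    {
        // UTF-8 bytes of the two Chinese characters: E4-BD-A0-E5-A5-BD
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("你好")));

        string garbled = Garble("hello 你好");
        Console.WriteLine(garbled == "hello 浣犲ソ");          // True
        Console.WriteLine(Repair(garbled) == "hello 你好");    // True
    }
}
```

The ASCII part survives because "hello " has the same byte values in both encodings, which matches the observation that only the Chinese text was garbled.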

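The final fix is to get the CodePage for the system language, encode the garbled string back into bytes with it, and then decode those bytes as UTF-8. That this works suggests the DLL returned UTF-8 bytes that were decoded with the system code page somewhere along the way.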

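// Assumes .NET Framework, where Encoding.Default is the system ANSI code page
// (936 on a Simplified Chinese system). On .NET Core and .NET 5+, Encoding.Default is UTF-8.
readonly int CURRENT_CODE_PAGE = Encoding.Default.CodePage;
readonly int TARGET_CODE_PAGE = Encoding.UTF8.CodePage;

// Re-encode the garbled text into its original bytes, then decode them as UTF-8.
byte[] raw = Encoding.GetEncoding(CURRENT_CODE_PAGE).GetBytes(text);
string newText = Encoding.GetEncoding(TARGET_CODE_PAGE).GetString(raw);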
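One caveat with the fix above is that applying it to a string that is already correct would corrupt it. The sketch below is one possible way to guard against that, not part of the original solution: it uses exception fallbacks so that a string that cannot be round-tripped is returned unchanged. EncodingRepair and TryRepair are hypothetical names, the code page is passed in explicitly instead of being read from Encoding.Default, and the RegisterProvider call again assumes .NET Core or .NET 5+.

```csharp
using System;
using System.Text;

public static class EncodingRepair
{
    // Hypothetical helper: tries to undo "UTF-8 bytes decoded with an ANSI code page".
    // Returns the input unchanged when the round trip is not valid.
    public static string TryRepair(string text, int ansiCodePage)
    {
        // Required on .NET Core / .NET 5+ for code pages such as 936.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Throw instead of silently substituting '?' for unmappable characters.
        Encoding ansi = Encoding.GetEncoding(
            ansiCodePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        // Strict UTF-8: no byte order mark, throw on invalid byte sequences.
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        try
        {
            byte[] raw = ansi.GetBytes(text);
            return strictUtf8.GetString(raw);
        }
        catch (EncoderFallbackException)
        {
            return text; // text contains characters outside the ANSI code page
        }
        catch (DecoderFallbackException)
        {
            return text; // recovered bytes are not valid UTF-8, so text was likely fine
        }
    }
}
```

For example, EncodingRepair.TryRepair("hello 浣犲ソ", 936) should give back "hello 你好", while passing an already correct "hello 你好" leaves it untouched, because its code page 936 bytes are not valid UTF-8.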