utf 8 - Windows-1252 to UTF-8 encoding

Question

Welcome To Ask or Share your Answers For Others

utf 8 - Windows-1252 to UTF-8 encoding

posted Oct 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

utf 8 - Windows-1252 to UTF-8 encoding

I've copied certain files from a Windows machine to a Linux machine. So all the Windows encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I specify that the recode utility should only convert windows-1252 encoded files and not the UTF-8 files?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.

question from:https://stackoverflow.com/questions/2014069/windows-1252-to-utf-8-encoding

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-06T03:36:02+0000

How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.

One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.

Categories

utf 8 - Windows-1252 to UTF-8 encoding

utf 8 - Windows-1252 to UTF-8 encoding

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags