How can I test the encoding of a text file… Is it valid, and what is it?
I have several .htm files which open in gedit without any warning or error, but when I open the same files in jEdit, it warns me of invalid UTF-8 encoding...
The HTML meta tag states "charset=ISO-8859-1". jEdit allows a list of fallback encodings and a list of encoding auto-detectors (currently "BOM XML-PI"), so my immediate problem has been resolved. But this got me thinking: what if the metadata wasn't there?
When the encoding information is just not available, is there a CLI program which can make a "best guess" of which encodings may apply?
And, although it is a slightly different issue: is there a CLI program which tests the validity of a known encoding?
text-processing utilities character-encoding
edited 45 mins ago by Peter Mortensen
asked Apr 19 '11 at 7:16 by Peter.O
Similar to "How to auto detect text file encoding?" superuser.com/questions/301552/…
– buzz3791
Jun 16 '14 at 16:30
2 Answers
The file command makes "best-guesses" about the encoding. Use the -i parameter to force file to print information about the encoding.
Demonstration:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
Here is how I created the files:
$ echo ä > umlaut-utf8.txt
Nowadays nearly everything defaults to UTF-8, but you can check for yourself:
$ hexdump -C umlaut-utf8.txt
00000000 c3 a4 0a |...|
00000003
Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding
Convert to the other encodings:
$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt
Check the hex dump:
$ hexdump -C umlaut-iso88591.txt
00000000 e4 0a |..|
00000002
$ hexdump -C umlaut-utf16.txt
00000000 ff fe e4 00 0a 00 |......|
00000006
Create something "invalid" by mixing all three:
$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt
What file says:
$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt: application/octet-stream; charset=binary
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8
without -i:
$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt: data
umlaut-utf16.txt: Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt: UTF-8 Unicode text
The file command has no notion of "valid" or "invalid". It just sees some bytes and tries to guess what the encoding might be. A human might recognize that a file is a text file with some umlauts in the "wrong" encoding, but for a computer to do the same it would need some sort of artificial intelligence.
One might argue that the heuristics of file are some sort of artificial intelligence. Yet, even if they are, they are very limited.
Here is more information about the file command: http://www.linfo.org/file_command.html
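If you want to act on the guess in a script rather than read it by eye, here is a minimal sketch. The function names and the set of "expected" charsets are my own assumptions for illustration; the only part taken from file itself is the -b (brief) and --mime-encoding options, which print just the charset part of what -i shows.

```shell
# guess_charset FILE: print only the charset that file(1) guesses for FILE.
# -b drops the file name; --mime-encoding prints just the charset.
guess_charset() {
    file -b --mime-encoding "$1"
}

# report_unexpected FILE...: list files whose guessed charset is not one of
# us-ascii, utf-8 or iso-8859-1. That "expected" set is only an assumption.
report_unexpected() {
    for f in "$@"; do
        enc=$(guess_charset "$f")
        case $enc in
            us-ascii|utf-8|iso-8859-1) ;;
            *) printf '%s: %s\n' "$f" "$enc" ;;
        esac
    done
}
```

Something like report_unexpected *.htm would then print only the files that deserve a closer look. It is still just a guess by file, so treat the result as a hint.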
edited Mar 30 '18 at 13:06
answered Apr 19 '11 at 7:35 by lesmana
Thanks, that worked... I had tried file, but without any option :( ... I've now also tried a mix of UTF-16 and UTF-8 and ISO-8859-1. file -i reported unknown-8bit. So, this also seems to be the answer to: "How to detect an invalid/unknown encoding"
– Peter.O
Apr 19 '11 at 9:21
It isn't always possible to find out for sure what the encoding of a text file is. For example, the byte sequence \303\275 (c3 bd in hexadecimal) could be ý in UTF-8, or Ã½ in latin1, or Ă˝ in latin2, or 羸 in BIG-5, and so on.
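To see the ambiguity for yourself, here is a small sketch that decodes those same two bytes under each of the encodings named above. The helper name decode_as is made up for this example, and the encoding names use GNU iconv spelling; other iconv implementations may spell them differently or lack BIG5.

```shell
# decode_as ENCODING: read bytes on stdin, interpret them as ENCODING,
# and write the result as UTF-8. (Hypothetical helper name.)
decode_as() {
    iconv -f "$1" -t UTF-8 2>/dev/null
}

# Feed the same two bytes (octal \303\275, hex c3 bd) to each decoder.
for enc in UTF-8 ISO-8859-1 ISO-8859-2 BIG5; do
    printf '%-12s' "$enc:"
    printf '\303\275' | decode_as "$enc" ||
        printf '(not decodable or not supported here)'
    printf '\n'
done
```

On a UTF-8 terminal each line should show a different rendering of the same bytes, and every decoder that succeeds considers the input perfectly valid.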
Some encodings have invalid byte sequences, so it's possible to rule them out for sure. This is true in particular of UTF-8; most texts in most 8-bit encodings are not valid UTF-8. You can test for valid UTF-8 with isutf8 from moreutils or with iconv -f utf-8 -t utf-8 >/dev/null, amongst others.
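As a minimal sketch of the iconv test, wrapped so it can be used in a condition. The function name is hypothetical, and it relies on iconv exiting with a non-zero status when it meets an invalid input sequence.

```shell
# is_valid_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# (Function name is made up; the iconv call is the one quoted above.)
is_valid_utf8() {
    iconv -f utf-8 -t utf-8 "$1" >/dev/null 2>&1
}

# Demo on two throwaway files: "ä" in UTF-8 (c3 a4) and in ISO-8859-1 (e4).
tmpdir=$(mktemp -d)
printf '\303\244\n' > "$tmpdir/utf8.txt"
printf '\344\n'     > "$tmpdir/latin1.txt"

for f in "$tmpdir"/*.txt; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
rm -rf "$tmpdir"
```

Note that a pure ASCII file also passes, since ASCII is a subset of UTF-8.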
There are tools that try to guess the encoding of a text file. They can make mistakes, but they often work in practice as long as you don't deliberately try to fool them.
file
Perl Encode::Guess (part of the standard distribution) tries successive encodings on a byte string and returns the first encoding in which the string is valid text.
Enca is an encoding guesser and converter. You can give it a language name and text that you presume is in that language (the supported languages are mostly East European languages), and it tries to guess the encoding.
If there is metadata (HTML/XML charset=, TeX inputenc, emacs -*-coding-*-, …) in the file, advanced editors like Emacs or Vim are often able to parse that metadata. That's not easy to automate from the command line though.
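For the simple HTML/XML case, a rough approximation is to grep for the first charset= declaration. This is only a sketch under stated assumptions: the helper name is made up, it does not parse the markup, it will not see declarations in UTF-16 files, and it knows nothing about TeX or Emacs conventions.

```shell
# declared_charset FILE: print the first charset=... value found in FILE.
# Hypothetical helper; LC_ALL=C keeps grep from treating 8-bit text as binary.
declared_charset() {
    LC_ALL=C grep -i -o 'charset=["'\'']\{0,1\}[A-Za-z0-9._:-]*' "$1" |
        head -n 1 |
        sed 's/^[^=]*=//; s/^["'\'']//'
}
```

The printed value is whatever the file claims about itself, which may of course be wrong, so it is worth cross-checking against one of the guessers above.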
edited Apr 13 '17 at 12:36 by Community♦
answered Apr 19 '11 at 21:13 by Gilles
Thanks for the good overview... Yes, "best-guess" can be the only option when the encoding is not known... Using iconv, I just ran all 1168 encodings (including aliases) listed by iconv -l against one of my .htm files... There were 683 encodings which passed muster. The file's actual charset is ISO-8859-1, made up of ASCII-range values with one exception. The non-ASCII char was xA9.
– Peter.O
Apr 19 '11 at 23:02
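An experiment like the one in this comment can be sketched roughly as follows. This is not the commenter's actual script: the function name is made up, and the parsing assumes GNU-style iconv -l output (names separated by spaces or commas, often with trailing slashes), so other iconv implementations may need a different parser.

```shell
# candidate_encodings FILE: print every encoding name from `iconv -l`
# under which FILE converts to UTF-8 without error.
# Tokens that are not real encoding names simply fail and are skipped.
candidate_encodings() {
    iconv -l | tr -s ' ,' '\n\n' | sed 's|/*$||' | grep -v '^$' | sort -u |
    while IFS= read -r enc; do
        if iconv -f "$enc" -t UTF-8 "$1" >/dev/null 2>&1; then
            printf '%s\n' "$enc"
        fi
    done
}
```

Piping the output through wc -l gives the number of candidates. A long list is expected for mostly-ASCII files, which is why passing a validity check says little about which encoding was actually intended.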