How can I test the encoding of a text file… Is it valid, and what is it?












36















I have several .htm files which open in gedit without any warning or error, but when I open the same files in jEdit, it warns me of invalid UTF-8 encoding.

The HTML meta tag states "charset=ISO-8859-1". jEdit allows a list of fallback encodings and a list of encoding auto-detectors (currently "BOM XML-PI"), so my immediate problem has been resolved. But this got me thinking: what if the metadata wasn't there?

When the encoding information is just not available, is there a CLI program which can make a "best guess" of which encodings may apply?

And, although it is a slightly different issue, is there a CLI program which tests the validity of a known encoding?










  • Similar to "How to auto detect text file encoding?" superuser.com/questions/301552/…

    – buzz3791
    Jun 16 '14 at 16:30


















text-processing utilities character-encoding






edited 45 mins ago by Peter Mortensen
asked Apr 19 '11 at 7:16 by Peter.O













2 Answers


















48














The file command makes a "best guess" about the encoding. Use the -i option to make file print the MIME type and charset it guessed.



Demonstration:



$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8


Here is how I created the files:



$ echo ä > umlaut-utf8.txt 


Most current systems use a UTF-8 locale, so the shell writes ä as UTF-8. Check with a hex dump:



$ hexdump -C umlaut-utf8.txt 
00000000 c3 a4 0a |...|
00000003


Compare with https://en.wikipedia.org/wiki/Ä#Computer_encoding



Convert to the other encodings:



$ iconv -f utf8 -t iso88591 umlaut-utf8.txt > umlaut-iso88591.txt 
$ iconv -f utf8 -t utf16 umlaut-utf8.txt > umlaut-utf16.txt


Check the hex dump:



$ hexdump -C umlaut-iso88591.txt 
00000000 e4 0a |..|
00000002
$ hexdump -C umlaut-utf16.txt
00000000 ff fe e4 00 0a 00 |......|
00000006


Create something "invalid" by mixing all three:



$ cat umlaut-iso88591.txt umlaut-utf8.txt umlaut-utf16.txt > umlaut-mixed.txt 


What file says:



$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-mixed.txt: application/octet-stream; charset=binary
umlaut-utf16.txt: text/plain; charset=utf-16le
umlaut-utf8.txt: text/plain; charset=utf-8


Without -i:



$ file *
umlaut-iso88591.txt: ISO-8859 text
umlaut-mixed.txt: data
umlaut-utf16.txt: Little-endian UTF-16 Unicode text, with no line terminators
umlaut-utf8.txt: UTF-8 Unicode text


The file command has no notion of "valid" or "invalid". It just looks at the bytes and tries to guess what the encoding might be. A human might recognize that a file is text with some umlauts in a "wrong" encoding, but for a computer to do that would take some sort of artificial intelligence.

One might argue that the heuristics of file are some sort of artificial intelligence. Yet, even if they are, it is a very limited one.



Here is more information about the file command: http://www.linfo.org/file_command.html
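If you want to use file's guess in a script, a small wrapper can help. This is only a sketch: the function names guess_encoding and looks_like_text are made up here, and it assumes a file implementation that supports -b (brief output, no file name) and --mime-encoding (print only the charset part of what -i shows).

```shell
# guess_encoding FILE: print only file's charset guess (for example utf-8).
# The name is hypothetical; -b suppresses the file name and
# --mime-encoding restricts the output to the charset.
guess_encoding() {
    file -b --mime-encoding "$1"
}

# looks_like_text FILE: succeed unless file classifies the content as binary.
# Also a hypothetical helper, built on the guess above.
looks_like_text() {
    [ "$(guess_encoding "$1")" != "binary" ]
}
```

With the files created above, guess_encoding umlaut-utf8.txt should print the same charset that file -i reported for that file. It remains a guess, not a validity check.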






  • Thanks, that worked... I had tried file, but without any option :( ... I've now also tried a mix of UTF-16, UTF-8 and ISO-8859-1; file -i reported unknown-8bit. So this also seems to be the answer to: "How to detect an invalid/unknown encoding"

    – Peter.O
    Apr 19 '11 at 9:21



















19














It isn't always possible to find out for sure what the encoding of a text file is. For example, the byte sequence \303\275 in octal (c3 bd in hexadecimal) could be ý in UTF-8, or Ã½ in latin1, or Ă˝ in latin2, or a single Chinese character in BIG-5, and so on.
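To see the ambiguity concretely, here is a small sketch that decodes the same two bytes under several assumed encodings and re-encodes the result as UTF-8 for display. The helper name decode_as is made up for this example, and it assumes an iconv that knows the names UTF-8, ISO-8859-1 (latin1) and ISO-8859-2 (latin2), and a terminal that displays UTF-8.

```shell
# decode_as ENCODING: read bytes on stdin, interpret them as ENCODING,
# and write the resulting text as UTF-8 (hypothetical helper).
decode_as() {
    iconv -f "$1" -t UTF-8
}

# The same two bytes, c3 bd, read three different ways.
for enc in UTF-8 ISO-8859-1 ISO-8859-2; do
    printf '%s: ' "$enc"
    printf '\303\275' | decode_as "$enc"
    printf '\n'
done
```

All three conversions succeed, which is the point: nothing in the bytes themselves says which reading was intended.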



Some encodings have invalid byte sequences, so it's possible to rule them out for sure. This is true in particular of UTF-8; most texts in most 8-bit encodings are not valid UTF-8. You can test for valid UTF-8 with isutf8 from moreutils or with iconv -f utf-8 -t utf-8 >/dev/null, amongst others.
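The iconv approach generalizes to any encoding iconv knows, which also covers the "test the validity of a known encoding" part of the question. A minimal sketch, with a made-up helper name is_valid_as, relying only on iconv exiting with a non-zero status when the input contains a sequence that is invalid in the source encoding (as the common implementations do):

```shell
# is_valid_as ENCODING FILE: succeed (exit status 0) if FILE decodes
# cleanly as ENCODING, fail otherwise. Hypothetical helper name.
# The converted text is discarded; only iconv's exit status matters.
is_valid_as() {
    iconv -f "$1" -t UTF-8 < "$2" > /dev/null 2>&1
}
```

It could be used as is_valid_as UTF-8 page.htm && echo "valid UTF-8", where page.htm stands for whatever file you want to check. Keep in mind that a pass only means the bytes are legal in that encoding, not that the encoding is the right one.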



There are tools that try to guess the encoding of a text file. They can make mistakes, but they often work in practice as long as you don't deliberately try to fool them.




  • file


  • Perl Encode::Guess (part of the standard distribution) tries successive encodings on a byte string and returns the first encoding in which the string is valid text.


  • Enca is an encoding guesser and converter. You can give it a language name and text that you presume is in that language (the supported languages are mostly East European languages), and it tries to guess the encoding.
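The "try successive encodings and return the first that fits" idea described for Encode::Guess can be imitated in shell with iconv. This is not Encode::Guess itself, just a sketch of the same idea; the function name first_valid_encoding is hypothetical, and the candidate list is whatever you pass in.

```shell
# first_valid_encoding FILE ENC...: print the first ENC under which FILE
# decodes without error, or return 1 if none does (hypothetical helper).
# The order of candidates matters, because permissive 8-bit encodings
# such as ISO-8859-1 accept every byte sequence.
first_valid_encoding() {
    file_to_test=$1
    shift
    for enc in "$@"; do
        if iconv -f "$enc" -t UTF-8 < "$file_to_test" > /dev/null 2>&1; then
            printf '%s\n' "$enc"
            return 0
        fi
    done
    return 1
}
```

A call might look like first_valid_encoding page.htm UTF-8 ISO-8859-1, with page.htm as a placeholder. Put strict encodings such as UTF-8 first: ISO-8859-1 assigns a character to every byte value, so it never fails and only makes sense as a last fallback.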


If there is metadata (HTML/XML charset=, TeX inputenc, emacs -*-coding-*-, …) in the file, advanced editors like Emacs or Vim are often able to parse that metadata. That's not easy to automate from the command line though.






  • Thanks for the good overview... Yes, a "best guess" can be the only option when the encoding is not known. Using iconv, I just ran all 1168 encodings (including aliases) listed by iconv -l against one of my .htm files. 683 encodings passed muster. The file's actual charset is ISO-8859-1, and all but one of its bytes are in the ASCII range; the one non-ASCII byte was \xA9.

    – Peter.O
    Apr 19 '11 at 23:02













First answer (file -i): answered Apr 19 '11 at 7:35 by lesmana (edited Mar 30 '18 at 13:06)













Second answer (iconv, Encode::Guess, Enca): answered Apr 19 '11 at 21:13 by Gilles (edited Apr 13 '17 at 12:36 by Community)












