Why is it not possible to search through text file contents encoded in UTF-16?
I understand that tools such as catfish and gnome-search-utils can both search inside file contents that are UTF-8 encoded. To search for words or numbers within UTF-16 text files, one would first have to convert them to UTF-8 with iconv.
If the file is known, text editors like gedit or mousepad have no trouble with UTF-16.
Why is there no search tool (GUI or command-line) with any of the Linux distributions that can handle UTF-16 encoded txt files?
I'm on Xubuntu.
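The convert-then-search workaround described above amounts to decoding the UTF-16 bytes before matching. A minimal Python sketch of that idea, assuming the file starts with a BOM so that Python's `utf-16` codec can pick the byte order; the function name `search_utf16_file` and the sample content are made up for this illustration:

```python
import os
import tempfile


def search_utf16_file(path, needle):
    """Return (line_number, line) pairs for the lines containing needle.

    Assumes the file is UTF-16 with a BOM; the "utf-16" codec reads the
    BOM to choose the byte order and strips it from the decoded text.
    """
    hits = []
    with open(path, encoding="utf-16") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                hits.append((number, line.rstrip("\n")))
    return hits


# Build a small UTF-16 sample file (hypothetical content) and search it.
with tempfile.NamedTemporaryFile("w", encoding="utf-16", suffix=".txt",
                                 delete=False) as sample:
    sample.write("first line\ninvoice 4711\nlast line\n")

print(search_utf16_file(sample.name, "4711"))   # [(2, 'invoice 4711')]
os.remove(sample.name)
```

Unlike a separate iconv step, this decodes on the fly and leaves no converted copy on disk.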
search unicode text
ripgrep 0.5.0 supports UTF-16, but (rant) it is a terrible encoding that should never be used, as 1) a UTF-16 string cannot be a C string if it contains any ASCII characters, 2) it is just as much a variable-width encoding as UTF-8, and 3) many tools choke on the BOM, but it is necessary to disambiguate endianness.
– Fox
May 9 '17 at 15:52
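Points 2 and 3 of the comment above can be checked directly. A small Python sketch using only the standard library; the sample text is arbitrary and chosen for illustration:

```python
import codecs

text = "A\U0001F600"  # one ASCII letter plus one character outside the BMP

little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# 'A' takes one 16-bit unit, the emoji takes two (a surrogate pair).
print(len(little) // 2)   # 3 code units for 2 characters

# Same text, different byte order: without a BOM the bytes are ambiguous.
print(little[:2], big[:2])   # b'A\x00' b'\x00A'

# The BOM is U+FEFF written in the chosen byte order.
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)   # b'\xff\xfe' b'\xfe\xff'

# Decoding with the generic "utf-16" codec uses the BOM and drops it.
print((codecs.BOM_UTF16_BE + big).decode("utf-16") == text)   # True
```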
See also utf8everywhere.com
– tripleee
May 9 '17 at 18:40
@Fox: thanks. ripgrep seems powerful.
– Enteneller
May 9 '17 at 21:19
@Fox -- you would no more encode a user string in UTF-16 in C than you would encode it in UTF-8. C only handles ASCII, and you need library functions to convert strings to (or from) UTF-8 or UTF-16. However, I tend to agree UTF-16 is icky -- especially since it's often UCS-2 in disguise (no BOM, only supports up to Unicode-2) -- especially when talking about Windows files (log files and reg files, for example, may not have BOMs).
– Astara
Aug 25 '17 at 2:20
@Astara My statement about C strings was a quick summary of this: if a character is in the subset of Unicode that overlaps with ASCII, its encoding in UTF-16 (or UCS-2) contains a null byte. The only character containing a null byte in UTF-8 is NUL itself. This means that you can use functions from the standard C library to read, write, copy, etc. UTF-8 strings, but not UTF-16 strings. You won't get proper case-conversion support, of course, but the basics are free. In any case, this appears to be a digression from a digression.
– Fox
Aug 25 '17 at 2:38
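The null-byte argument in the comment above can be illustrated from Python. This is only a sketch: `ctypes.c_char_p(...).value` is used here as a stand-in for how C library functions read a string, stopping at the first zero byte, and the sample words are arbitrary.

```python
import ctypes

word = "grep"
as_utf8 = word.encode("utf-8")
as_utf16 = word.encode("utf-16-le")   # no BOM, little-endian

print(as_utf8)    # b'grep'
print(as_utf16)   # b'g\x00r\x00e\x00p\x00'

# A C string ends at the first zero byte; c_char_p(...).value reads
# the buffer the same way a function like strlen() would.
print(ctypes.c_char_p(as_utf8).value)    # b'grep'
print(ctypes.c_char_p(as_utf16).value)   # b'g'

# Non-ASCII text in UTF-8 still contains no zero byte.
print(b"\x00" in "Grüße €".encode("utf-8"))   # False
```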
edited May 9 '17 at 21:53 by Gilles
asked May 9 '17 at 15:33 by Enteneller
2 Answers
UTF-16 (or UCS-2) is highly unfriendly to the null-terminated strings used by the C standard library and the POSIX ABI. For example, command line arguments are terminated by NULs (bytes with value zero), and any UTF-16 character with a numerical value < 256 contains a zero byte, so strings of the usual English letters would be impossible to pass in UTF-16 as a command line argument.
That in turn means that the utilities would either need to take input in some other format (say UTF-8) and convert it to UTF-16, or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs with other utilities somewhat difficult.
Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, my guess is that few people care enough about UTF-16 to be motivated to create tools for it.
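The first option mentioned above (accept the pattern in an ordinary form, then convert it to UTF-16 before searching) might look roughly like this in Python. It is a sketch under stated assumptions: the data is UTF-16LE without a BOM, and the helper name `find_in_utf16le` is hypothetical.

```python
def find_in_utf16le(data, pattern):
    """Return code-unit offsets where pattern occurs in UTF-16LE bytes.

    `data` is raw UTF-16LE content without a BOM; `pattern` is an
    ordinary str, as it would arrive from a command line argument.
    """
    needle = pattern.encode("utf-16-le")
    offsets = []
    start = data.find(needle)
    while start != -1:
        # Only even byte offsets fall on a 16-bit code unit boundary;
        # a hit at an odd offset straddles two units and is not a match.
        if start % 2 == 0:
            offsets.append(start // 2)
        start = data.find(needle, start + 1)
    return offsets


data = "error: disk full; retry after error".encode("utf-16-le")
print(find_in_utf16le(data, "error"))   # [0, 30]
```

The alignment check matters because a plain byte search knows nothing about 16-bit units and can report hits that begin in the middle of a character.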
The null termination code in UTF-16 is two null bytes in a row -- which encodes a null character in UTF-16. If your command line handles UTF-16, then the ASCII (or Unicode) letter 'A' would be internally represented by 0x41 0x00 (on Windows x86, the lower byte always comes first, often called 'LSB' (vs. MSB)). The thing in C is that UTF-16 is an encoding, BELOW what the language uses. C uses user strings which are automatically converted to the platform's native encoding. So a C program printing "hello world\n" works on all C-supporting platforms.
– Astara
Aug 25 '17 at 2:15
@Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.
– ilkkachu
Aug 25 '17 at 15:53
We aren't talking '8-bit' interfaces between tools -- we are talking character interfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that was UTF-16 encoded. The included version of 'find.exe' in /windows/system32 does that.
– Astara
Aug 26 '17 at 0:32
@Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool.
– ilkkachu
Aug 26 '17 at 17:55
There are no read/write "system" calls on NT. On Win, there are 'read/write' library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8 to 16-bit when talking to the system.
– Astara
Aug 27 '17 at 15:44
Install the ripgrep utility, which supports UTF-16.
For example:
rg pattern filename
ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
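The general idea of detecting UTF-16 automatically and otherwise relying on an explicitly named encoding can be sketched in Python. This is not ripgrep's implementation, just an illustration of BOM sniffing with an override; the function `grep_bytes` and its `encoding` parameter are assumptions made for this example.

```python
import codecs


def grep_bytes(raw, needle, encoding=None):
    """Return the lines of `raw` (bytes) that contain `needle` (str).

    If `encoding` is None, a UTF-16 BOM selects UTF-16; otherwise the
    data is treated as UTF-8. Passing `encoding` overrides the guess.
    """
    if encoding is None:
        if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            encoding = "utf-16"   # this codec consumes the BOM itself
        else:
            encoding = "utf-8"
    text = raw.decode(encoding, errors="replace")
    return [line for line in text.splitlines() if needle in line]


with_bom = "alpha\nbeta gamma\n".encode("utf-16")        # BOM included
without_bom = "alpha\nbeta gamma\n".encode("utf-16-be")  # no BOM

print(grep_bytes(with_bom, "beta"))                            # ['beta gamma']
print(grep_bytes(without_bom, "beta"))                         # []
print(grep_bytes(without_bom, "beta", encoding="utf-16-be"))   # ['beta gamma']
```

The middle call shows why files without a BOM need the encoding spelled out: read as UTF-8, the interleaved zero bytes keep the pattern from matching.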
Answer on null-terminated strings: answered May 9 '17 at 17:56 by ilkkachu
Answer recommending ripgrep: answered 14 hours ago by kenorb