How to do a regex search in a UTF-16LE file while in a UTF-8 locale?
EDIT: A comment Warren Young made has made me realize that I was not clear on one quite relevant point. My search string is already in UTF-16LE byte order (not in Unicode code-point order, which matches UTF-16BE), so perhaps the Unicode issue is somewhat moot.
Perhaps my issue is really a question of how to grep for bytes (not characters) in groups of two, i.e. so that the UTF-16LE pair \x09\x0A is not treated as TAB,newline, but just as two bytes which happen to be UTF-16LE ऊ. (Note: I do not need to be concerned about UTF-16 surrogate pairs, so two-byte blocks are fine.)
Here is a sample pattern for the 3-character string ऊपर:
\x09\x0A\x09\x2A\x09\x30
but it returns nothing, though the string is in the file.
(here is the original post)
When searching a UTF-16LE file with a pattern in \x00\x01... format, I have encountered problems for some values. I've been using sed (and have experimented with grep), but in a UTF-8 locale they recognize some UTF-16LE byte values as ASCII characters. I'm locked in to using UTF-16, so recoding to UTF-8 is not an option.
e.g. Although ऊ (U+090A) is a single character, it is perceived as the two ASCII characters \x09 and \x0A.
grep has a -P (Perl-compatible) option which can search for \x00... patterns, but I'm getting the same ASCII interpretation.
Is there some way to use grep -P to search in a UTF-16 mode, or, perhaps better, how can this be done in perl or some other script? grep seems the most appealing because of its compactness, but whatever gets the job done will override that preference.
PS: My ऊ example uses a literal string, but my actual usage needs a regex-style search. So this perl example is not quite what I'm after, though it does process the file as UTF-16... I'd prefer to not have to open and close the file... I think perl has more compact ways for basic things like a regex search. I'm after something with that type of compact syntax.
text-processing grep regular-expression perl unicode
I'm not so sure regexp machinery is really up to snuff with respect to UTF-8, much less other Unicode encodings. They will mostly work on UTF-8, as long as characters that are represented by several bytes do not appear in character sets or as arguments to repetition. E.g., [ña-z] will probably do surprising stuff, and so will gñ* or g[ñn]u, but g(ñ)* and g(n|ñ)u should work fine (they just mean something different than you see ;-). The machinery is 8-bit clean nowadays, and swallows the UTF-8 bytes without complaint, but doesn't combine them up into characters.
– vonbrand Jan 23 '13 at 14:30
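A small sketch of the byte-splitting effect this comment describes (the input bytes are a contrived assumption): under the C locale the two UTF-8 bytes of ñ (\xC3 \xB1) are treated independently, so a bracket expression containing ñ matches a line that holds only the first of those bytes.

```shell
# Under LC_ALL=C the pattern [ñ] becomes a class of the two raw bytes
# \xC3 and \xB1, so a lone \xC3 in the input is enough to match.
printf '\xc3z\n' | LC_ALL=C grep -c '[ñ]'
```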
asked Jun 9 '12 at 10:44 by Peter.O (edited Apr 13 '17 at 12:36)
3 Answers
My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
– Warren Young, answered Jun 9 '12 at 13:12
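The full round trip this answer implies (the file names here are assumptions) recodes to UTF-8, searches with the native grep, then recodes the matching lines back to UTF-16LE:

```shell
# Round-trip sketch; sample16.txt and matches16.txt are assumed names.
# The sample file holds "abc\nxy\n" in UTF-16LE.
printf 'a\x00b\x00c\x00\x0a\x00x\x00y\x00\x0a\x00' > sample16.txt
iconv -f UTF-16LE -t UTF-8 sample16.txt | grep 'b' | iconv -f UTF-8 -t UTF-16LE > matches16.txt
```

matches16.txt then contains only the matching line ("abc"), still encoded as UTF-16LE.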
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in \xXX UTF-16 format and that would mean converting them too, plus I'd need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
– Peter.O Jun 9 '12 at 13:34
I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte-swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.
– Warren Young Jun 9 '12 at 14:50
As the code point for @ is 0x0040, the code point for ऊ is 0x090A (U+090A). My patterns are flipped into little-endian order, \x0A\x09, which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the code point(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with the \x0A\x09 combination, which I do encounter.
– Peter.O Jun 9 '12 at 16:38
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode the data. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet.
– Peter.O Jun 9 '12 at 16:39
I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.
– Warren Young
Jun 9 '12 at 20:56
I believe that Warren's answer is a better general *nix solution, but this perl script works exactly as I wanted (for my somewhat non-standard situation). It does require that I change the search pattern's current format slightly:
from \x09\x0A\x09\x2A\x09\x30\x00\x09
to \x{090A}\x{092A}\x{0930}\x{0009}
It does everything in one process, which is particularly what I was after.
#!/usr/bin/env perl
use strict;
use warnings;
die "3 args are required" if scalar @ARGV != 3;
my $if  = $ARGV[0];   # input file (UTF-16LE)
my $of  = $ARGV[1];   # output file (UTF-16LE)
my $pat = $ARGV[2];   # pattern, using \x{....} code-point escapes
open(my $ifh, '<:encoding(UTF-16LE)', $if) or die "Can't open $if: $!";
open(my $ofh, '>:encoding(UTF-16LE)', $of) or die "Can't open $of: $!";
while (<$ifh>) { print $ofh $_ if /^$pat/; }
Your main loop can be rewritten as while (<$ifh>) { print $ofh $_ if /^$pat/; }. You won't get the diagnostic on a bad readline, but that's not going to happen on a modern OS unless the hardware is failing while you read the file.
– Warren Young Jun 9 '12 at 23:52
@Warren, thanks for the help. I've changed the script to the simpler loop.
– Peter.O
Jun 10 '12 at 0:28
Install the ripgrep utility, which supports UTF-16.
For example:
rg pattern filename
ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specified explicitly with the -E/--encoding flag.)
To print all lines, run: rg -N . filename.
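A sketch with the encoding given explicitly (the file name sample16.txt is an assumption); ripgrep transcodes the input from UTF-16LE before matching, so the pattern is written as ordinary UTF-8 text:

```shell
# Assumed file name; -E/--encoding forces the input encoding, which is
# needed here because the file has no BOM for auto-detection.
printf 'a\x00b\x00c\x00\x0a\x00' > sample16.txt   # "abc\n" in UTF-16LE
rg --encoding utf-16le 'abc' sample16.txt
```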
My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
– Peter.O
Jun 9 '12 at 13:34
I think you may be borrowing trouble. If you providegrep
a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to howgrep
sees the data. Keep in mind that internally,grep
is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.
– Warren Young
Jun 9 '12 at 14:50
As the Codepoint for@
is 0x0040, the Codepoint forऊ
is 0x090A (U+090A). My patterns are flipped into Little-Endian orderx0Ax09
which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with thex0Ax09
combination, which I do encounter.
– Peter.O
Jun 9 '12 at 16:38
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grepraw byte data
, but I'm not sure yet.
– Peter.O
Jun 9 '12 at 16:39
I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.
– Warren Young
Jun 9 '12 at 20:56
add a comment |
My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
– Peter.O
Jun 9 '12 at 13:34
I think you may be borrowing trouble. If you providegrep
a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to howgrep
sees the data. Keep in mind that internally,grep
is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.
– Warren Young
Jun 9 '12 at 14:50
As the Codepoint for@
is 0x0040, the Codepoint forऊ
is 0x090A (U+090A). My patterns are flipped into Little-Endian orderx0Ax09
which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with thex0Ax09
combination, which I do encounter.
– Peter.O
Jun 9 '12 at 16:38
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grepraw byte data
, but I'm not sure yet.
– Peter.O
Jun 9 '12 at 16:39
I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.
– Warren Young
Jun 9 '12 at 20:56
add a comment |
My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
My answer is essentially the same as in your other question on this topic:
$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern
As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
edited Apr 13 '17 at 12:36
Community♦
1
1
answered Jun 9 '12 at 13:12
Warren YoungWarren Young
54.9k10143147
54.9k10143147
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
– Peter.O
Jun 9 '12 at 13:34
I think you may be borrowing trouble. If you providegrep
a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to howgrep
sees the data. Keep in mind that internally,grep
is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.
– Warren Young
Jun 9 '12 at 14:50
As the Codepoint for@
is 0x0040, the Codepoint forऊ
is 0x090A (U+090A). My patterns are flipped into Little-Endian orderx0Ax09
which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with thex0Ax09
combination, which I do encounter.
– Peter.O
Jun 9 '12 at 16:38
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grepraw byte data
, but I'm not sure yet.
– Peter.O
Jun 9 '12 at 16:39
I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.
– Warren Young
Jun 9 '12 at 20:56
add a comment |
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
– Peter.O
Jun 9 '12 at 13:34
I think you may be borrowing trouble. If you providegrep
a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to howgrep
sees the data. Keep in mind that internally,grep
is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.
– Warren Young
Jun 9 '12 at 14:50
As the Codepoint for@
is 0x0040, the Codepoint forऊ
is 0x090A (U+090A). My patterns are flipped into Little-Endian orderx0Ax09
which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with thex0Ax09
combination, which I do encounter.
– Peter.O
Jun 9 '12 at 16:38
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grepraw byte data
, but I'm not sure yet.
– Peter.O
Jun 9 '12 at 16:39
I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.
– Warren Young
Jun 9 '12 at 20:56
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
– Peter.O
Jun 9 '12 at 13:34
Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in 'xXX` UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
– Peter.O
Jun 9 '12 at 13:34
I think you may be borrowing trouble. If you provide
grep
a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep
sees the data. Keep in mind that internally, grep
is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.– Warren Young
Jun 9 '12 at 14:50
I think you may be borrowing trouble. If you provide
grep
a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep
sees the data. Keep in mind that internally, grep
is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.– Warren Young
Jun 9 '12 at 14:50
As the Codepoint for
@
is 0x0040, the Codepoint for ऊ
is 0x090A (U+090A). My patterns are flipped into Little-Endian order x0Ax09
which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with the x0Ax09
combination, which I do encounter.– Peter.O
Jun 9 '12 at 16:38
As the Codepoint for
@
is 0x0040, the Codepoint for ऊ
is 0x090A (U+090A). My patterns are flipped into Little-Endian order x0Ax09
which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with the x0Ax09
combination, which I do encounter.– Peter.O
Jun 9 '12 at 16:38
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep
raw byte data
, but I'm not sure yet.– Peter.O
Jun 9 '12 at 16:39
Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep
raw byte data
, but I'm not sure yet.– Peter.O
Jun 9 '12 at 16:39
I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.
– Warren Young
Jun 9 '12 at 20:56
I believe that Warren's answer is a better general *nix solution, but this perl script works exactly as I wanted (for my somewhat non-standard situation). It does require that I change the search pattern's current format slightly:
from \x09\x0A\x09\x2A\x09\x30\x00\x09
to \x{090A}\x{092A}\x{0930}\x{0009}
It does everything in one process, which is particularly what I was after.
#!/usr/bin/env perl
use strict;
use warnings;

die "3 args are required\n" if scalar @ARGV != 3;
my ($if, $of, $pat) = @ARGV;

# Decode the input and encode the output as UTF-16LE, so the
# regex operates on characters rather than on raw bytes.
open(my $ifh, '<:encoding(UTF-16LE)', $if) or die "Can't open $if: $!";
open(my $ofh, '>:encoding(UTF-16LE)', $of) or die "Can't open $of: $!";

while (<$ifh>) { print $ofh $_ if /^$pat/; }
Your main loop can be rewritten as while (<$ifh>) { print $ofh $_ if /^$pat/; }
You won't get the diagnostic on a bad readline, but that's not going to happen on a modern OS unless the hardware is failing while you read the file. – Warren Young
Jun 9 '12 at 23:52
@Warren, thanks for the help. I've changed the script to the simpler loop. – Peter.O
Jun 10 '12 at 0:28
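For comparison, the same decode-then-match approach can be sketched in Python (a sketch with hypothetical file names, not from the original thread; it relies on Python's built-in utf-16-le codec):

```python
import io
import re

def filter_utf16le(in_path, out_path, pattern):
    """Copy lines of in_path that match pattern to out_path.

    Both files are treated as UTF-16LE, so the regex operates on
    characters rather than raw bytes, mirroring the Perl script above.
    """
    pat = re.compile(pattern)
    with io.open(in_path, encoding="utf-16-le") as ifh, \
         io.open(out_path, "w", encoding="utf-16-le") as ofh:
        for line in ifh:
            if pat.match(line):  # match() anchors at line start, like /^.../
                ofh.write(line)
```

The pattern can then use ordinary character escapes such as "\u090a" rather than byte pairs, since matching happens after decoding.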
answered Jun 9 '12 at 23:19, edited Jun 10 '12 at 0:25 – Peter.O
Install the ripgrep utility, which supports UTF-16. For example:
rg pattern filename
From its documentation: "ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)"
To print all lines, run: rg -N . filename
answered 14 hours ago – kenorb
Thanks for contributing an answer to Unix & Linux Stack Exchange!
I'm not so sure regexp machinery is really up to snuff with respect to UTF-8, much less other Unicode encodings. They will mostly work on UTF-8, as long as characters that are represented by several bytes do not appear in character sets or as arguments to repetition. E.g., [ña-z] will probably do surprising stuff, and so will gñ* or g[ñn]u, but g(ñ)* and g(n|ñ)u should work fine (it just means something different than you see ;-). The machinery is 8-bit clean nowadays, and swallows the UTF-8 bytes without complaint, but doesn't combine them up to characters. – vonbrand
Jan 23 '13 at 14:30
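The pitfall in the comment above can be demonstrated concretely. This Python sketch (an illustration, not from the original thread) compiles the same gñ*u pattern once over bytes and once over decoded text, and shows where the two disagree:

```python
import re

# In UTF-8, 'ñ' is the two bytes C3 B1. At the byte level, the
# repetition ñ* applies only to the final byte (B1), so the C3
# byte becomes mandatory and the pattern's meaning changes.
byte_pat = re.compile(b"g\xc3\xb1*u")   # what a byte-oriented tool sees
char_pat = re.compile("g\u00f1*u")      # what the user meant: gñ*u

# 'gu' (zero occurrences of ñ) matches at the character level...
print(bool(char_pat.fullmatch("gu")))                   # True
# ...but not at the byte level, since the C3 byte is required there.
print(bool(byte_pat.fullmatch("gu".encode("utf-8"))))   # False

# With exactly one ñ, the two interpretations happen to agree.
print(bool(char_pat.fullmatch("g\u00f1u")))                   # True
print(bool(byte_pat.fullmatch("g\u00f1u".encode("utf-8"))))   # True
```

This is the same class of mismatch the question hits with UTF-16LE: a regex engine that counts bytes instead of characters quietly changes what repetition and character classes mean.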