How to split a CSV file per initial column (with headers)?
This is a combination of two other questions (how to split a file by each line prefix and how to split a file according to a column, including the header). I want to go from this content in input.csv:
id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
[more lines in the same format]
to this content in 1.csv:
id,first,second,third
1,a,b,c
1,d,e,f
1,c,d,e
this content in 2.csv:
id,first,second,third
2,d,e,f
and this content in 333.csv:
id,first,second,third
333,b,b,b
333,a,a,a
That is:
- Put all the lines with an ID of N into N.csv.
- Keep the sequence of lines as in the original.
- Include the header from the original file in all the output files.
- This must also be really fast, so a while read loop is not going to cut it (a sketch of that ruled-out approach is included below for reference).
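For reference, here is a minimal sketch of the kind of while read loop being ruled out (illustrative only; it assumes a simple CSV with no quoted commas, and it re-opens the output file for every single line, which is why it does not scale):
# slow per-line approach, shown only to illustrate what is being avoided
header=$(head -n 1 input.csv)
tail -n +2 input.csv | while IFS=, read -r id rest; do
    out="$id.csv"
    [ -e "$out" ] || printf '%s\n' "$header" > "$out"   # write the header on first sight of this ID
    printf '%s,%s\n' "$id" "$rest" >> "$out"            # append the original line
done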
shell text-processing
Could you please tell us what's wrong with applying the solution to the first link here? (And write the headers out first or later.)
– Sparhawk
Feb 9 '17 at 12:50
As a general rule (and my answer there is no exception) awk is fine if you're working with simple csv files (i.e. there are no delimiters embedded in the fields); if you're dealing with more complex csv files my advice is to use a proper tool (e.g. perl, python etc)
– don_crissti
Feb 9 '17 at 12:52
@Sparhawk Adding content at the top of files is at the very least tedious, especially if you have to do it thousands of times. Automation is best.
– l0b0
Feb 9 '17 at 12:55
@don_crissti Yes, definitely. I simply thought this would be a very common case where shell tools could be a simple solution. I've certainly had to do this several times in different contexts.
– l0b0
Feb 9 '17 at 12:55
@don_crissti Would you be interested in writing an answer with an example script which handles less regular input?
– l0b0
Feb 9 '17 at 18:20
4 Answers
This GNU awk command does the trick:
awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv
Caveat: This will not work if there are escaped commas in the first field. Commas in other fields should work fine.
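For instance (an illustrative snippet, not from the original answer), a quoted first field that contains a comma gets split at the embedded comma, so such a row would be routed to a wrongly named file:
# with -F ',' the quoted value "1,5" is cut at the embedded comma
printf '"1,5",a,b,c\n' | awk -F ',' '{print $1 ".csv"}'   # prints "1.csv rather than a name based on the whole field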
Explanation:
- -F ',' (field separator) ensures that $1 etc. refer to the CSV columns rather than space-separated values.
- NR==1{h=$0; next} treats the first line specially (NR==1) by storing the full header line in a variable h (h=$0) and skipping the line (next).
- !seen[$1]++{f=$1".csv"; print h > f} treats the first occurrence of any $1 specially (!seen[$1]) by storing $1 followed by .csv into a filename variable f and saving the header to that file (print h > f).
- {f=$1".csv"; print >> f; close(f)} adds the current line to the file (print >> f) and closes the file descriptor (close(f)) to avoid keeping it around once processing of all lines with a specific ID is done.
Bonus: If you replace $1 with another field it should do what you expect: create a file per unique value in that column, containing the lines with that value in the given column.
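As an untested illustration of that bonus, the same command with $1 swapped for $2 (the column headed first in the sample input) would create one file per unique value in the second column:
awk -F ',' 'NR==1{h=$0; next};!seen[$2]++{f=$2".csv"; print h > f};{f=$2".csv"; print >> f; close(f)}' input.csv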
There is f=$1".csv" missing before print >> f to let it work with unsorted input.
– rudimeier
Feb 9 '17 at 13:16
Using short option -F ',' would be more portable.
– rudimeier
Feb 9 '17 at 13:23
(Sorry to spam you all with another answer)
For many situations, the elegant awk versions presented are perfect.
But there is life outside one-liners -- we often need more:
- add extra code to cope with complex csv files;
- add extra steps for normalization, reformatting, processing.
In the following skeleton, we use a CSV parser (the Parse::CSV module).
This time we are avoiding one-liners and even strictly declare the variables!
#!/usr/bin/perl
use strict;
use Parse::CSV;

my %dict = ();
my $c = Parse::CSV->new(file => 'a1.csv');

while ( my $row = $c->fetch ) {                      ## for all records
    $dict{$row->[0]} .= join(" :: ", @$row) . "\n";  ## process and save, keyed by the first column
}

for my $k (keys %dict) {                             ## create the csv files, one per key
    open(F, ">", "$k.csv") or die;
    print F $dict{$k};
    close F;
}
- The main advantage is that we can deal with more complex csv files; this time the csv input can have quoted strings containing the separator and can include multiline fields (the csv specification is complex!):
1111,2,3
"3,3,3",a,"b, c, and d"
"a more, complex
multiline record",3,4
- To exemplify a processing step, the field separator was changed to " :: ".
- To exemplify extra steps, we added some optimization: because we use a dict cache, this script runs 100 times faster than my other solution. (A usage sketch follows below.)
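One possible way to run the skeleton (a hedged sketch: the script file name split_by_id.pl is an assumption, and Parse::CSV has to be installed from CPAN first):
# hypothetical invocation; adjust the input file name inside the script as needed
cpan Parse::CSV        # one-time install of the CSV parser module
perl split_by_id.pl    # reads a1.csv and writes one <key>.csv per first-column value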
looks good (can't test it right now); thanks for the effort !
– don_crissti
Feb 10 '17 at 12:10
This is not an answer but just an avoid-scrolling variant of l0b0's excellent answer...
awk -F, 'NR==1{h=$0; next} {print seen[$1]++ ? $0 : h "\n" $0 >$1 ".csv"}'
Nice solution, but if I understand correctly since you don't close the file descriptor after writing it will keep it open for the duration of the command, possibly running out of file descriptors in case of input with many unique IDs.
– l0b0
Feb 9 '17 at 18:18
I tested it with 300,000 shuffled lines and 100,000 unique IDs with no problems (15 minutes on my 8-year-old machine), but I am sure I can run out of ??? (memory / inodes / disc space) if I increase...
– JJoao
Feb 9 '17 at 18:38
@don_crissti, and l0b0, I added a new answer following the challenge. I believe you can do it better... Feel free to improve and comment.
– JJoao
Feb 10 '17 at 0:21
Old school version using only pipes and no awk:
Warning: it runs on average slower than the above awk solutions by a factor of the number of keys in the input file.
cut -d , -f 1 input.csv | fgrep -v id | sort | uniq | xargs -n1 sh -c '(head -n1 input.csv && egrep "^${0}," input.csv) > ${0}.csv'
which does:
- cut -d , -f 1 input.csv splits each line of the file on the , character and grabs the first column (-f 1) to keep only the keys;
- fgrep -v id skips the header (strictly, it drops any key containing the string id, which is fine for the numeric IDs here);
- sort | uniq sorts the keys and keeps only one of each;
- xargs -n1 sh -c '<sub shell>' executes a sub shell for each key;
- head -n1 input.csv, the first part of the sub shell, grabs the header of the input file;
- then egrep "^${0}," input.csv grabs the lines matching the key (it might not be obvious, but this rescans every line of the input once per key, which is why it is slow);
- and > ${0}.csv finally writes the output into a file named after the key (the expected result on the question's sample input is sketched below).
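For illustration only (a hedged mock-up based on the expected output stated in the question, not an actual transcript), running any of the above on the sample input.csv should leave:
ls *.csv
# 1.csv  2.csv  333.csv  input.csv
cat 333.csv
# id,first,second,third
# 333,b,b,b
# 333,a,a,a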