How to split a CSV file per initial column (with headers)?

This is a combination of two other questions (how to split a file by each line prefix and how to split a file according to a column, including the header). I want to go from this content in input.csv:



id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
[more lines in the same format]


to this content in 1.csv:



id,first,second,third
1,a,b,c
1,d,e,f
1,c,d,e


this content in 2.csv:



id,first,second,third
2,d,e,f


and this content in 333.csv:



id,first,second,third
333,b,b,b
333,a,a,a


that is:




  1. Put all the lines with ID of N into N.csv.


  2. Keep the sequence of lines as in the original.


  3. Include the header from the original file in all the output files.


This must also be really fast, so a while read loop is not going to cut it.
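
Not part of the original question, just a convenience sketch for testing the answers below: the sample input (minus the placeholder line) can be recreated with a here-document:

cat > input.csv <<'EOF'
id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
EOF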










  • Could you please tell us what's wrong with applying the solution to the first link here? (And write the headers out first or later.)

    – Sparhawk
    Feb 9 '17 at 12:50






  • As a general rule (and my answer there is no exception) awk is fine if you're working with simple csv files (i.e. there are no delimiters embedded in the fields); if you're dealing with more complex csv files my advice is to use a proper tool (e.g. perl, python etc)

    – don_crissti
    Feb 9 '17 at 12:52











  • @Sparhawk Adding content at the top of files is at the very least tedious, especially if you have to do it thousands of times. Automation is best.

    – l0b0
    Feb 9 '17 at 12:55













  • @don_crissti Yes, definitely. I simply thought this would be a very common case where shell tools could be a simple solution. I've certainly had to do this several times in different contexts.

    – l0b0
    Feb 9 '17 at 12:55













  • @don_crissti Would you be interested in writing an answer with an example script which handles less regular input?

    – l0b0
    Feb 9 '17 at 18:20


4 Answers


This GNU awk command does the trick:



awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv


Caveat: This will not work if there are escaped commas in the first field. Commas in other fields should work fine.



Explanation:





  • -F ',' (field separator) ensures that $1 etc. refer to the CSV columns rather than space-separated values.


  • NR==1{h=$0; next} treats the first line specially (NR==1), by storing the full header line in a variable h (h=$0) and skipping the line (next).


  • !seen[$1]++{f=$1".csv"; print h > f} treats the first occurrence of any $1 specially (!seen[$1]) by storing $1 followed by .csv into a filename variable f and saving the header to that file (print h > f).


  • {f=$1".csv"; print >> f; close(f)} adds the current line to the file (print >> f) and closes the file descriptor (close(f)) to avoid keeping it around once processing of all lines with a specific ID is done.


Bonus: If you replace $1 with another field it should do what you expect: Create a file per unique value in that column with the lines containing that value in the given column.
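
As a quick sanity check (not part of the original answer), running the command against the question's input.csv and inspecting one of the outputs should reproduce the expected per-ID file:

awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv
cat 333.csv
# id,first,second,third
# 333,b,b,b
# 333,a,a,a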






  • There is f=$1".csv" missing before print >> f to let it work with unsorted input.

    – rudimeier
    Feb 9 '17 at 13:16








  • Using short option -F ',' would be more portable.

    – rudimeier
    Feb 9 '17 at 13:23



















(Sorry to spam you all with another answer)
For many situations, the elegant awk versions presented are perfect.
But there is life outside one-liners -- we often need more:




  • add extra code to cope with complex csv files;

  • add extra steps for normalization, reformatting, processing.


In the following skeleton, we use a parser of CSV files.
This time we are avoiding one-liners and even strictly declare the variables!



#!/usr/bin/perl

use strict;
use Parse::CSV;
my %dict=();

my $c = Parse::CSV->new(file => 'a1.csv');

while ( my $row = $c->fetch ) {                      ## for all records
    $dict{$row->[0]} .= join(" :: ", @$row)."\n";    ## process and save
}

for my $k (keys %dict){                              ## create the csv files
    open(F, ">", "$k.csv") or die;
    print F $dict{$k};
    close F;
}
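
A side note, not from the original answer: Parse::CSV is a CPAN module rather than a core Perl module, so on most systems it has to be installed first, e.g. with the stock cpan client:

cpan Parse::CSV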





  • The main advantage is that we can deal with more complex csv files;
    this time the csv input can have quoted strings containing ",", and can include multiline fields (the csv specification is complex!):


 1111,2,3
"3,3,3",a,"b, c, and d"
"a more, complex
multiline record",3,4



  • to exemplify a processing step, the field separator was changed to " :: "

  • to exemplify extra steps we added some optimization: as we used a dict cache, this script runs 100 times faster than my other solution.






  • looks good (can't test it right now); thanks for the effort !

    – don_crissti
    Feb 10 '17 at 12:10



















This is not an answer but just an avoid-scrolling variant of l0b0's excellent answer...



awk -F, 'NR==1{h=$0; next} {print seen[$1]++ ? $0 : h "\n" $0 >$1 ".csv"}'
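
Not part of the original answer: if the number of distinct IDs is large enough to hit the per-process open-file limit (see the comments below), a hedged sketch that combines this variant with the close() from the answer above, trading some speed for a bounded number of open descriptors, could look like:

awk -F, 'NR==1 { h = $0; next }
         { f = $1 ".csv"
           if (seen[$1]++) print $0 >> f
           else            print h "\n" $0 > f
           close(f) }' input.csv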





  • Nice solution, but if I understand correctly, since you don't close the file descriptor after writing, it will keep it open for the duration of the command, possibly running out in case of input with many unique IDs.

    – l0b0
    Feb 9 '17 at 18:18








  • I tested it with 300,000 shuffled lines and 100,000 unique IDs with no problems (15 minutes on my 8-year-old machine), but I am sure I can run out of ??? (memory / inodes / disc space) if I increase...

    – JJoao
    Feb 9 '17 at 18:38













  • @don_crissti and l0b0, I added a new answer following the challenge. I believe you can do it better... Feel free to improve and comment.

    – JJoao
    Feb 10 '17 at 0:21



















An old-school version using only pipes and no awk:


Warning: on average it runs slower than the awk solutions above by a factor of the number of keys in the input file.



cut -d , -f 1 input.csv | fgrep -v id | sort | uniq | xargs -n1 sh -c '(head -n1 input.csv && egrep "^${0}," input.csv) > ${0}.csv'


which does:





  • cut -d , -f 1 input.csv splits each line of the file on the , character and grabs the first column (-f 1) to keep only the keys


  • fgrep -v id skips the header (see the note after this list)


  • sort | uniq sorts the keys and keeps only one of each


  • xargs -n1 sh -c '<sub shell>' executes a sub shell for each key


  • head -n1 input.csv, the first part of the sub shell, grabs the header of the input file

  • then egrep "^${0}," input.csv grabs the lines matching the key; it might not be obvious, but this scans the whole input file once per key, which is why it is slow

  • and > ${0}.csv finally writes the output into a file named after the key
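
A side note, not from the original answer: fgrep -v id drops any key that happens to contain the string id anywhere, so a slightly more robust (hedged) variation skips the header line positionally and folds sort | uniq into sort -u:

tail -n +2 input.csv | cut -d , -f 1 | sort -u | xargs -n1 sh -c '(head -n1 input.csv && egrep "^${0}," input.csv) > ${0}.csv'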





