How to split a CSV file per initial column (with headers)?

This is a combination of two other questions (how to split a file by each line prefix and how to split a file according to a column, including the header). I want to go from this content in input.csv:



id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
[more lines in the same format]


to this content in 1.csv:



id,first,second,third
1,a,b,c
1,d,e,f
1,c,d,e


this content in 2.csv:



id,first,second,third
2,d,e,f


and this content in 333.csv:



id,first,second,third
333,b,b,b
333,a,a,a


that is:




  1. Put all the lines with ID of N into N.csv.


  2. Keep the sequence of lines as in the original.


  3. Include the header from the original file in all the output files.


This must also be really fast, so a while read loop is not going to cut it.
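
Not part of the original question, just a convenience sketch for testing the answers below: the sample input (minus the placeholder line) can be recreated with a here-document:

cat > input.csv <<'EOF'
id,first,second,third
1,a,b,c
333,b,b,b
1,d,e,f
2,d,e,f
1,c,d,e
333,a,a,a
EOF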










  • Could you please tell us what's wrong with applying the solution to the first link here? (And write the headers out first or later.)

    – Sparhawk
    Feb 9 '17 at 12:50






  • As a general rule (and my answer there is no exception) awk is fine if you're working with simple csv files (i.e. there are no delimiters embedded in the fields); if you're dealing with more complex csv files my advice is to use a proper tool (e.g. perl, python etc)

    – don_crissti
    Feb 9 '17 at 12:52











  • @Sparhawk Adding content at the top of files is at the very least tedious, especially if you have to do it thousands of times. Automation is best.

    – l0b0
    Feb 9 '17 at 12:55













  • @don_crissti Yes, definitely. I simply thought this would be a very common case where shell tools could be a simple solution. I've certainly had to do this several times in different contexts.

    – l0b0
    Feb 9 '17 at 12:55













  • @don_crissti Would you be interested in writing an answer with an example script which handles less regular input?

    – l0b0
    Feb 9 '17 at 18:20


4 Answers


This GNU awk command does the trick:



awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv


Caveat: This will not work if there are escaped commas in the first field. Commas in other fields should work fine.



Explanation:





  • -F ',' (field separator) ensures that $1 etc. refer to the CSV columns rather than space-separated values.


  • NR==1{h=$0; next} treats the first line specially (NR==1), by storing the full header line in a variable h (h=$0) and skipping the line (next).


  • !seen[$1]++{f=$1".csv"; print h > f} treats the first occurrence of any $1 specially (!seen[$1]) by storing $1 followed by .csv into a filename variable f and saving the header to that file (print h > f).


  • {f=$1".csv"; print >> f; close(f)} adds the current line to the file (print >> f) and closes the file descriptor (close(f)) to avoid keeping it around once processing of all lines with a specific ID is done.


Bonus: If you replace $1 with another field it should do what you expect: Create a file per unique value in that column with the lines containing that value in the given column.
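
As a quick sanity check (not part of the original answer), running the command against the question's input.csv and inspecting one of the outputs should reproduce the expected per-ID file:

awk -F ',' 'NR==1{h=$0; next};!seen[$1]++{f=$1".csv"; print h > f};{f=$1".csv"; print >> f; close(f)}' input.csv
cat 333.csv
# id,first,second,third
# 333,b,b,b
# 333,a,a,a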






  • There is f=$1".csv" missing before print >> f to let it work with unsorted input.

    – rudimeier
    Feb 9 '17 at 13:16








  • Using short option -F ',' would be more portable.

    – rudimeier
    Feb 9 '17 at 13:23



















(Sorry to spam you all with another answer)
For many situations, the elegant awk versions presented are perfect.
But there is life outside one-liners -- we often need more:




  • add extra code to cope with complex csv files;

  • add extra steps for normalization, reformatting, processing.


In the following skeleton, we use a parser of CSV files.
This time we are avoiding one-liners and even strictly declare the variables!



#!/usr/bin/perl

use strict;
use Parse::CSV;
my %dict=();

my $c = Parse::CSV->new(file => 'a1.csv');

while ( my $row = $c->fetch ) {                      ## for all records
    $dict{$row->[0]} .= join(" :: ", @$row)."\n";    ## process and save
}

for my $k (keys %dict){                              ## create the csv files
    open(F, ">", "$k.csv") or die;
    print F $dict{$k};
    close F;
}
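
A side note, not from the original answer: Parse::CSV is a CPAN module rather than a core Perl module, so on most systems it has to be installed first, e.g. with the stock cpan client:

cpan Parse::CSV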





  • The main advantage is that we can deal with more complex csv files;
    this time the csv input can have quoted strings containing ",", and can include multiline fields (the csv specification is complex!):


 1111,2,3
"3,3,3",a,"b, c, and d"
"a more, complex
multiline record",3,4



  • to exemplify a processing step, the field separator was changed to " :: "

  • to exemplify extra steps we added some optimization: as we used a dict cache, this script runs 100 times faster than my other solution.






  • looks good (can't test it right now); thanks for the effort !

    – don_crissti
    Feb 10 '17 at 12:10



















This is not an answer but just an avoid-scrolling variant of l0b0's excellent answer...



awk -F, 'NR==1{h=$0; next} {print seen[$1]++ ? $0 : h "\n" $0 >$1 ".csv"}'
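
Not part of the original answer: if the number of distinct IDs is large enough to hit the per-process open-file limit (see the comments below), a hedged sketch that combines this variant with the close() from the answer above, trading some speed for a bounded number of open descriptors, could look like:

awk -F, 'NR==1 { h = $0; next }
         { f = $1 ".csv"
           if (seen[$1]++) print $0 >> f
           else            print h "\n" $0 > f
           close(f) }' input.csv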





  • Nice solution, but if I understand correctly, since you don't close the file descriptor after writing, it will keep it open for the duration of the command, possibly running out in case of input with many unique IDs.

    – l0b0
    Feb 9 '17 at 18:18








  • I tested it with 300,000 shuffled lines and 100,000 unique IDs with no problems (15 minutes on my 8-year-old machine), but I am sure I can run out of ??? (memory / inodes / disc space) if I increase...

    – JJoao
    Feb 9 '17 at 18:38













  • @don_crissti and l0b0, I added a new answer following the challenge. I believe you can do it better... Feel free to improve and comment.

    – JJoao
    Feb 10 '17 at 0:21



















An old-school version using only pipes and no awk:


Warning: on average it runs slower than the awk solutions above by a factor of the number of keys in the input file.



cut -d , -f 1 input.csv | fgrep -v id | sort | uniq | xargs -n1 sh -c '(head -n1 input.csv && egrep "^${0}," input.csv) > ${0}.csv'


which does:





  • cut -d , -f 1 input.csv splits each line of the file on the , character and grabs the first column (-f 1) to keep only the keys


  • fgrep -v id skips the header (see the note after this list)


  • sort | uniq sorts the keys and keeps only one of each


  • xargs -n1 sh -c '<sub shell>' executes a sub shell for each key


  • head -n1 input.csv, the first part of the sub shell, grabs the header of the input file

  • then egrep "^${0}," input.csv grabs the lines matching the key; it might not be obvious, but this scans the whole input file once per key, which is why it is slow

  • and > ${0}.csv finally writes the output into a file named after the key
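
A side note, not from the original answer: fgrep -v id drops any key that happens to contain the string id anywhere, so a slightly more robust (hedged) variation skips the header line positionally and folds sort | uniq into sort -u:

tail -n +2 input.csv | cut -d , -f 1 | sort -u | xargs -n1 sh -c '(head -n1 input.csv && egrep "^${0}," input.csv) > ${0}.csv'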





