Count letter frequency in word list, excluding duplicates in the same word












14















I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.










share|improve this question




















  • 6





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    14 hours ago











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    14 hours ago






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    11 hours ago











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    7 hours ago
















14















I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.










share|improve this question




















  • 6





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    14 hours ago











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    14 hours ago






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    11 hours ago











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    7 hours ago














14












14








14


3






I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.










share|improve this question
















I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because I need to count the letter frequency in a word only once skipping duplicates, so I need help finding a way to count the frequency of the letters in the entire list with only one occurrence per word, ignoring the second occurrence.



For example if i have:



words=["tree","bone","indigo","developer"]


The frequency will be:



letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}


As you can see from the letters dictionary: 'e' is 3 and not 5 because if 'e' repeats more than once in the same word it should be ignored.



This is the algorithm that I came up with, it's implemented in Python:



for word in words:
count=0;

for letter in word:
if(letter.isalpha()):
if((letters[letter.lower()] > 0 && count == 0) ||
(letters[letter.lower()] == 0 && count == 0)):

letters[letter.lower()]+=1
count=1

elif(letters[letter.lower()]==0 && count==1):
letters[letter.lower()]+=1


But it still requires work and I can't think about anything else, I'd be glad to anyone who will help me to think about a working solution.







python algorithm






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 19 hours ago









Prune

43.1k143456




43.1k143456










asked 19 hours ago









MattGeekMattGeek

1008




1008








  • 6





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    14 hours ago











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    14 hours ago






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    11 hours ago











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    7 hours ago














  • 6





    I would describe the requirement as counting "the number of words which include each letter".

    – Stobor
    14 hours ago











  • Related stackoverflow.com/questions/46486462/…

    – Kasrâmvd
    14 hours ago






  • 1





    @Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

    – mbj
    11 hours ago











  • @mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

    – Stobor
    7 hours ago








6




6





I would describe the requirement as counting "the number of words which include each letter".

– Stobor
14 hours ago





I would describe the requirement as counting "the number of words which include each letter".

– Stobor
14 hours ago













Related stackoverflow.com/questions/46486462/…

– Kasrâmvd
14 hours ago





Related stackoverflow.com/questions/46486462/…

– Kasrâmvd
14 hours ago




1




1





@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

– mbj
11 hours ago





@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.

– mbj
11 hours ago













@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

– Stobor
7 hours ago





@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...

– Stobor
7 hours ago












6 Answers
6






active

oldest

votes


















26














A variation on @Primusa answer without using update:



from collections import Counter

words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


Output



Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


Basically convert each word to a set and then iterate over each set.






share|improve this answer


























  • Thanks, this solution is most complete one.

    – MattGeek
    18 hours ago



















11














Create a counter object and then update it with sets for each word:



from collections import Counter
c = Counter()

for word in wordlist:
c.update(set(word.lower()))





share|improve this answer



















  • 2





    It would be helpful to compare the time complexity of this solution to the one provided by OP

    – Jordan Singer
    19 hours ago






  • 2





    @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

    – Primusa
    18 hours ago











  • Right, I suggested that because OP was interested in efficiency.

    – Jordan Singer
    18 hours ago











  • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

    – Walter Tross
    18 hours ago











  • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

    – Primusa
    18 hours ago



















9














One without Counter



words=["tree","bone","indigo","developer"]
d={}
for word in words: # iterate over words
for i in set(word): # to remove the duplication of characters within word
d[i]=d.get(i,0)+1


Output



{'b': 1,
'd': 2,
'e': 3,
'g': 1,
'i': 1,
'l': 1,
'n': 2,
'o': 3,
'p': 1,
'r': 2,
't': 1,
'v': 1}





share|improve this answer
























  • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

    – MattGeek
    18 hours ago



















3














Comparing speed of the solutions presented so far:



def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c

def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))

def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d


My timing function (using different sizes for the list of words):



word_list = [
'tree', 'bone', 'indigo', 'developer', 'python',
'language', 'timeit', 'xerox', 'printer', 'offset',
]

for exp in range(5):
words = word_list * 10**exp

result_list =
for i in range(1, 4):
t = timeit.timeit(
'f(words)',
'from __main__ import words, f{} as f'.format(i),
number=100)
result_list.append((i, t))

print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


The results:



        10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






share|improve this answer


























  • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

    – Ralf
    18 hours ago











  • Thanks for this good comparison.

    – MattGeek
    18 hours ago



















0














The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


which produces a dict like this:



{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



Here's my update of Ralf's timings:



        10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


based on the following code and the word list from https://github.com/dwyl/english-words/



import string
import timeit
import random
from collections import Counter

def f1(words):
c = Counter()
for word in words:
c.update(set(word.lower()))
return c

def f2(words):
return Counter(
c
for word in words
for c in set(word.lower()))

def f3(words):
d = {}
for word in words:
for i in set(word.lower()):
d[i] = d.get(i, 0) + 1
return d


def f4(words):
d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
return d


with open('words.txt') as word_file:
valid_words = set(word_file.read().split())

for exp in range(5):

result_list =
for i in range(1, 5):
t = timeit.timeit(
'f(words)',
'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
number=100)
result_list.append((i, t))

print('{:10,d} words | {}'.format(
len(words),
' | '.join(
'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))






share|improve this answer



















  • 1





    But this is ASCII only -- in this day and age?

    – Janne Karila
    7 hours ago



















0














Try using a dictionary comprehension:



import string
print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54223703%2fcount-letter-frequency-in-word-list-excluding-duplicates-in-the-same-word%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    6 Answers
    6






    active

    oldest

    votes








    6 Answers
    6






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    26














    A variation on @Primusa answer without using update:



    from collections import Counter

    words = ["tree", "bone", "indigo", "developer"]
    counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


    Output



    Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


    Basically convert each word to a set and then iterate over each set.






    share|improve this answer


























    • Thanks, this solution is most complete one.

      – MattGeek
      18 hours ago
















    26














    A variation on @Primusa answer without using update:



    from collections import Counter

    words = ["tree", "bone", "indigo", "developer"]
    counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


    Output



    Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


    Basically convert each word to a set and then iterate over each set.






    share|improve this answer


























    • Thanks, this solution is most complete one.

      – MattGeek
      18 hours ago














    26












    26








    26







    A variation on @Primusa answer without using update:



    from collections import Counter

    words = ["tree", "bone", "indigo", "developer"]
    counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


    Output



    Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


    Basically convert each word to a set and then iterate over each set.






    share|improve this answer















    A variation on @Primusa answer without using update:



    from collections import Counter

    words = ["tree", "bone", "indigo", "developer"]
    counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())


    Output



    Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})


    Basically convert each word to a set and then iterate over each set.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited 17 hours ago

























    answered 19 hours ago









    Daniel MesejoDaniel Mesejo

    16k21130




    16k21130













    • Thanks, this solution is most complete one.

      – MattGeek
      18 hours ago



















    • Thanks, this solution is most complete one.

      – MattGeek
      18 hours ago

















    Thanks, this solution is most complete one.

    – MattGeek
    18 hours ago





    Thanks, this solution is most complete one.

    – MattGeek
    18 hours ago













    11














    Create a counter object and then update it with sets for each word:



    from collections import Counter
    c = Counter()

    for word in wordlist:
    c.update(set(word.lower()))





    share|improve this answer



















    • 2





      It would be helpful to compare the time complexity of this solution to the one provided by OP

      – Jordan Singer
      19 hours ago






    • 2





      @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

      – Primusa
      18 hours ago











    • Right, I suggested that because OP was interested in efficiency.

      – Jordan Singer
      18 hours ago











    • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

      – Walter Tross
      18 hours ago











    • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

      – Primusa
      18 hours ago
















    11














    Create a counter object and then update it with sets for each word:



    from collections import Counter
    c = Counter()

    for word in wordlist:
    c.update(set(word.lower()))





    share|improve this answer



















    • 2





      It would be helpful to compare the time complexity of this solution to the one provided by OP

      – Jordan Singer
      19 hours ago






    • 2





      @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

      – Primusa
      18 hours ago











    • Right, I suggested that because OP was interested in efficiency.

      – Jordan Singer
      18 hours ago











    • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

      – Walter Tross
      18 hours ago











    • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

      – Primusa
      18 hours ago














    11












    11








    11







    Create a counter object and then update it with sets for each word:



    from collections import Counter
    c = Counter()

    for word in wordlist:
    c.update(set(word.lower()))





    share|improve this answer













    Create a counter object and then update it with sets for each word:



    from collections import Counter
    c = Counter()

    for word in wordlist:
    c.update(set(word.lower()))






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered 19 hours ago









    PrimusaPrimusa

    5,0551427




    5,0551427








    • 2





      It would be helpful to compare the time complexity of this solution to the one provided by OP

      – Jordan Singer
      19 hours ago






    • 2





      @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

      – Primusa
      18 hours ago











    • Right, I suggested that because OP was interested in efficiency.

      – Jordan Singer
      18 hours ago











    • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

      – Walter Tross
      18 hours ago











    • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

      – Primusa
      18 hours ago














    • 2





      It would be helpful to compare the time complexity of this solution to the one provided by OP

      – Jordan Singer
      19 hours ago






    • 2





      @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

      – Primusa
      18 hours ago











    • Right, I suggested that because OP was interested in efficiency.

      – Jordan Singer
      18 hours ago











    • I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

      – Walter Tross
      18 hours ago











    • @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

      – Primusa
      18 hours ago








    2




    2





    It would be helpful to compare the time complexity of this solution to the one provided by OP

    – Jordan Singer
    19 hours ago





    It would be helpful to compare the time complexity of this solution to the one provided by OP

    – Jordan Singer
    19 hours ago




    2




    2





    @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

    – Primusa
    18 hours ago





    @JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set

    – Primusa
    18 hours ago













    Right, I suggested that because OP was interested in efficiency.

    – Jordan Singer
    18 hours ago





    Right, I suggested that because OP was interested in efficiency.

    – Jordan Singer
    18 hours ago













    I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

    – Walter Tross
    18 hours ago





    I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower())) or something like that

    – Walter Tross
    18 hours ago













    @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

    – Primusa
    18 hours ago





    @WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters

    – Primusa
    18 hours ago











    9














    One without Counter



    words=["tree","bone","indigo","developer"]
    d={}
    for word in words: # iterate over words
    for i in set(word): # to remove the duplication of characters within word
    d[i]=d.get(i,0)+1


    Output



    {'b': 1,
    'd': 2,
    'e': 3,
    'g': 1,
    'i': 1,
    'l': 1,
    'n': 2,
    'o': 3,
    'p': 1,
    'r': 2,
    't': 1,
    'v': 1}





    share|improve this answer
























    • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

      – MattGeek
      18 hours ago
















    9














    One without Counter



    words=["tree","bone","indigo","developer"]
    d={}
    for word in words: # iterate over words
    for i in set(word): # to remove the duplication of characters within word
    d[i]=d.get(i,0)+1


    Output



    {'b': 1,
    'd': 2,
    'e': 3,
    'g': 1,
    'i': 1,
    'l': 1,
    'n': 2,
    'o': 3,
    'p': 1,
    'r': 2,
    't': 1,
    'v': 1}





    share|improve this answer
























    • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

      – MattGeek
      18 hours ago














    9












    9








    9







    One without Counter



    words=["tree","bone","indigo","developer"]
    d={}
    for word in words: # iterate over words
    for i in set(word): # to remove the duplication of characters within word
    d[i]=d.get(i,0)+1


    Output



    {'b': 1,
    'd': 2,
    'e': 3,
    'g': 1,
    'i': 1,
    'l': 1,
    'n': 2,
    'o': 3,
    'p': 1,
    'r': 2,
    't': 1,
    'v': 1}





    share|improve this answer













    One without Counter



    words=["tree","bone","indigo","developer"]
    d={}
    for word in words: # iterate over words
    for i in set(word): # to remove the duplication of characters within word
    d[i]=d.get(i,0)+1


    Output



    {'b': 1,
    'd': 2,
    'e': 3,
    'g': 1,
    'i': 1,
    'l': 1,
    'n': 2,
    'o': 3,
    'p': 1,
    'r': 2,
    't': 1,
    'v': 1}






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered 19 hours ago









    mad_mad_

    3,93511021




    3,93511021













    • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

      – MattGeek
      18 hours ago



















    • Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

      – MattGeek
      18 hours ago

















    Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

    – MattGeek
    18 hours ago





    Thanks, for your effort. This might be useful to people who want to implement the algorithm on their own.

    – MattGeek
    18 hours ago











    3














    Comparing speed of the solutions presented so far:



    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    My timing function (using different sizes for the list of words):



    word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
    ]

    for exp in range(5):
    words = word_list * 10**exp

    result_list =
    for i in range(1, 4):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import words, f{} as f'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


    The results:



            10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
    100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
    1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
    10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
    100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


    The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






    share|improve this answer


























    • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

      – Ralf
      18 hours ago











    • Thanks for this good comparison.

      – MattGeek
      18 hours ago
















    3














    Comparing speed of the solutions presented so far:



    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    My timing function (using different sizes for the list of words):



    word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
    ]

    for exp in range(5):
    words = word_list * 10**exp

    result_list =
    for i in range(1, 4):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import words, f{} as f'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


    The results:



            10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
    100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
    1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
    10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
    100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


    The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






    share|improve this answer


























    • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

      – Ralf
      18 hours ago











    • Thanks for this good comparison.

      – MattGeek
      18 hours ago














    3












    3








    3







    Comparing speed of the solutions presented so far:



    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    My timing function (using different sizes for the list of words):



    word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
    ]

    for exp in range(5):
    words = word_list * 10**exp

    result_list =
    for i in range(1, 4):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import words, f{} as f'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


    The results:



            10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
    100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
    1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
    10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
    100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


    The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).






    share|improve this answer















    Comparing speed of the solutions presented so far:



    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    My timing function (using different sizes for the list of words):



    word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
    ]

    for exp in range(5):
    words = word_list * 10**exp

    result_list =
    for i in range(1, 4):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import words, f{} as f'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))


    The results:



            10 words | f1   0.0028 sec | f2   0.0012 sec | f3   0.0011 sec
    100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
    1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
    10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
    100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec


    The Counter with list comprehension (here as f2()) seems to be the fastest. Using counter.update() seems to be a slow point (here as f1()).







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited 18 hours ago

























    answered 18 hours ago









    RalfRalf

    5,0734933




    5,0734933













    • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

      – Ralf
      18 hours ago











    • Thanks for this good comparison.

      – MattGeek
      18 hours ago



















    • @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

      – Ralf
      18 hours ago











    • Thanks for this good comparison.

      – MattGeek
      18 hours ago

















    @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

    – Ralf
    18 hours ago





    @Primusa ups, my bad. I updated with new results, but the conclusion is the same...

    – Ralf
    18 hours ago













    Thanks for this good comparison.

    – MattGeek
    18 hours ago





    Thanks for this good comparison.

    – MattGeek
    18 hours ago











    0














    The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



    import string
    counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


    which produces a dict like this:



    {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



    Here's my update of Ralf's timings:



            10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
    100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
    1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
    10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
    100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


    based on the following code and the word list from https://github.com/dwyl/english-words/



    import string
    import timeit
    import random
    from collections import Counter

    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    def f4(words):
    d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
    return d


    with open('words.txt') as word_file:
    valid_words = set(word_file.read().split())

    for exp in range(5):

    result_list =
    for i in range(1, 5):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

    print(f4(random.sample(valid_words, 10000)))
    print(f4(random.sample(valid_words, 1000)))
    print(f4(random.sample(valid_words, 100)))
    print(f4(random.sample(valid_words, 10)))






    share|improve this answer



















    • 1





      But this is ASCII only -- in this day and age?

      – Janne Karila
      7 hours ago
















    0














    The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



    import string
    counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


    which produces a dict like this:



    {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



    Here's my update of Ralf's timings:



            10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
    100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
    1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
    10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
    100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


    based on the following code and the word list from https://github.com/dwyl/english-words/



    import string
    import timeit
    import random
    from collections import Counter

    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    def f4(words):
    d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
    return d


    with open('words.txt') as word_file:
    valid_words = set(word_file.read().split())

    for exp in range(5):

    result_list =
    for i in range(1, 5):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

    print(f4(random.sample(valid_words, 10000)))
    print(f4(random.sample(valid_words, 1000)))
    print(f4(random.sample(valid_words, 100)))
    print(f4(random.sample(valid_words, 10)))






    share|improve this answer



















    • 1





      But this is ASCII only -- in this day and age?

      – Janne Karila
      7 hours ago














    0












    0








    0







    The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



    import string
    counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


    which produces a dict like this:



    {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



    Here's my update of Ralf's timings:



            10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
    100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
    1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
    10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
    100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


    based on the following code and the word list from https://github.com/dwyl/english-words/



    import string
    import timeit
    import random
    from collections import Counter

    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    def f4(words):
    d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
    return d


    with open('words.txt') as word_file:
    valid_words = set(word_file.read().split())

    for exp in range(5):

    result_list =
    for i in range(1, 5):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

    print(f4(random.sample(valid_words, 10000)))
    print(f4(random.sample(valid_words, 1000)))
    print(f4(random.sample(valid_words, 100)))
    print(f4(random.sample(valid_words, 10)))






    share|improve this answer













    The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.



    import string
    counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}


    which produces a dict like this:



    {'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}



    Here's my update of Ralf's timings:



            10 words | f1   0.0004 sec | f2   0.0004 sec | f3   0.0003 sec | f4   0.0010 sec
    100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
    1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
    10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
    100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec


    based on the following code and the word list from https://github.com/dwyl/english-words/



    import string
    import timeit
    import random
    from collections import Counter

    def f1(words):
    c = Counter()
    for word in words:
    c.update(set(word.lower()))
    return c

    def f2(words):
    return Counter(
    c
    for word in words
    for c in set(word.lower()))

    def f3(words):
    d = {}
    for word in words:
    for i in set(word.lower()):
    d[i] = d.get(i, 0) + 1
    return d


    def f4(words):
    d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
    return d


    with open('words.txt') as word_file:
    valid_words = set(word_file.read().split())

    for exp in range(5):

    result_list =
    for i in range(1, 5):
    t = timeit.timeit(
    'f(words)',
    'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
    number=100)
    result_list.append((i, t))

    print('{:10,d} words | {}'.format(
    len(words),
    ' | '.join(
    'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

    print(f4(random.sample(valid_words, 10000)))
    print(f4(random.sample(valid_words, 1000)))
    print(f4(random.sample(valid_words, 100)))
    print(f4(random.sample(valid_words, 10)))







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered 13 hours ago









    StoborStobor

    32.7k55357




    32.7k55357








    • 1





      But this is ASCII only -- in this day and age?

      – Janne Karila
      7 hours ago














    • 1





      But this is ASCII only -- in this day and age?

      – Janne Karila
      7 hours ago








    1




    1





    But this is ASCII only -- in this day and age?

    – Janne Karila
    7 hours ago





    But this is ASCII only -- in this day and age?

    – Janne Karila
    7 hours ago











    0














    Try using a dictionary comprehension:



    import string
    print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





    share|improve this answer




























      0














      Try using a dictionary comprehension:



      import string
      print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





      share|improve this answer


























        0












        0








        0







        Try using a dictionary comprehension:



        import string
        print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})





        share|improve this answer













        Try using a dictionary comprehension:



        import string
        print({k:max(i.count(k) for i in words) for k in string.ascii_lowercase})






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered 4 hours ago









        U9-ForwardU9-Forward

        14.1k21337




        14.1k21337






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54223703%2fcount-letter-frequency-in-word-list-excluding-duplicates-in-the-same-word%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Loup dans la culture

            How to solve the problem of ntp “Unable to contact time server” from KDE?

            Connection limited (no internet access)