Count letter frequency in word list, excluding duplicates in the same word
I'm trying to find the most frequent letter in a list of words. I'm struggling with the algorithm because each letter should be counted at most once per word: in effect, I need to count, across the entire list, how many words contain each letter, ignoring any repeated occurrences of a letter within the same word.
For example, if I have:
words=["tree","bone","indigo","developer"]
The frequency will be:
letters={a:0, b:1, c:0, d:2, e:3, f:0, g:1, h:0, i:1, j:0, k:0, l:1, m:0, n:2, o:3, p:1, q:0, r:2, s:0, t:1, u:0, v:1, w:0, x:0, y:0, z:0}
As you can see from the letters dictionary, 'e' is 3 and not 5, because repeated occurrences of 'e' within the same word are ignored.
This is the algorithm I came up with, implemented in Python:
for word in words:
    count = 0
    for letter in word:
        if letter.isalpha():
            if ((letters[letter.lower()] > 0 and count == 0) or
                    (letters[letter.lower()] == 0 and count == 0)):
                letters[letter.lower()] += 1
                count = 1
            elif letters[letter.lower()] == 0 and count == 1:
                letters[letter.lower()] += 1
But it still doesn't work, and I can't think of anything else; I'd be grateful to anyone who helps me think through a working solution.
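For reference, a minimal sketch (not part of the original question) of how the same nested loop could track per-word duplicates with a set instead of the single count flag; the name seen_in_word is illustrative:
letters = {chr(c): 0 for c in range(ord('a'), ord('z') + 1)}
for word in words:
    seen_in_word = set()              # letters already counted for this word
    for letter in word:
        if letter.isalpha():
            letter = letter.lower()
            if letter not in seen_in_word:
                letters[letter] += 1  # count each letter at most once per word
                seen_in_word.add(letter)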
python algorithm
I would describe the requirement as counting "the number of words which include each letter".
– Stobor
14 hours ago
Related stackoverflow.com/questions/46486462/…
– Kasrâmvd
14 hours ago
@Stobor: Yes, and your description of the requirement also hints at a much simpler solution: Just iterate over the entire alphabet, and for each letter count how many words contain that letter.
– mbj
11 hours ago
@mbj Yep - that's the basis for my solution below. :) It's simpler, but it's a little bit slower than the solutions here, mostly because it has to try all the letters which are not in the words, as well as the ones which are...
– Stobor
7 hours ago
6 Answers
A variation on @Primusa's answer, without using update:
from collections import Counter
words = ["tree", "bone", "indigo", "developer"]
counts = Counter(c for word in words for c in set(word.lower()) if c.isalpha())
Output
Counter({'e': 3, 'o': 3, 'r': 2, 'd': 2, 'n': 2, 'p': 1, 'i': 1, 'b': 1, 'v': 1, 'g': 1, 'l': 1, 't': 1})
Basically convert each word to a set and then iterate over each set.
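Since the original goal was the single most frequent letter, a short follow-up sketch (not from the answer) using the counts object built above:
most_frequent_letter, frequency = counts.most_common(1)[0]
print(most_frequent_letter, frequency)  # 'e' or 'o' with count 3: both appear in three of the sample words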
Thanks, this solution is the most complete one.
– MattGeek
18 hours ago
Create a counter object and then update it with sets for each word:
from collections import Counter
c = Counter()
for word in wordlist:
    c.update(set(word.lower()))
It would be helpful to compare the time complexity of this solution to the one provided by OP
– Jordan Singer
19 hours ago
@JordanSinger I think they're the same time complexity, both solutions iterate over every character in every word; mine just screens for duplicates using a set
– Primusa
18 hours ago
Right, I suggested that because OP was interested in efficiency.
– Jordan Singer
18 hours ago
I would rather c.update(filter(lambda x: x.isalpha(), set(word.lower()))) or something like that
– Walter Tross
18 hours ago
@WalterTross the question states that the input is a list of words so I didn't consider punctuation or spaces, but did consider capital letters
– Primusa
18 hours ago
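Spelled out, the filtered variant from the comment above would look roughly like this (a sketch, reusing the wordlist variable from the answer):
from collections import Counter
c = Counter()
for word in wordlist:
    # keep only alphabetic characters, counting each at most once per word
    c.update(filter(str.isalpha, set(word.lower())))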
One without Counter
words=["tree","bone","indigo","developer"]
d={}
for word in words: # iterate over words
    for i in set(word): # the set removes duplicated characters within a word
        d[i] = d.get(i, 0) + 1
Output
{'b': 1,
'd': 2,
'e': 3,
'g': 1,
'i': 1,
'l': 1,
'n': 2,
'o': 3,
'p': 1,
'r': 2,
't': 1,
'v': 1}
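To go from this dictionary to the single most frequent letter (the original goal), one option, assuming the d built above:
most_frequent = max(d, key=d.get)   # ties are broken arbitrarily ('e' and 'o' both have count 3 here)
print(most_frequent, d[most_frequent])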
Thanks for your effort. This might be useful to people who want to implement the algorithm on their own.
– MattGeek
18 hours ago
Comparing speed of the solutions presented so far:
import timeit
from collections import Counter

def f1(words):
    c = Counter()
    for word in words:
        c.update(set(word.lower()))
    return c

def f2(words):
    return Counter(
        c
        for word in words
        for c in set(word.lower()))

def f3(words):
    d = {}
    for word in words:
        for i in set(word.lower()):
            d[i] = d.get(i, 0) + 1
    return d
My timing function (using different sizes for the list of words):
word_list = [
    'tree', 'bone', 'indigo', 'developer', 'python',
    'language', 'timeit', 'xerox', 'printer', 'offset',
]

for exp in range(5):
    words = word_list * 10**exp
    result_list = []
    for i in range(1, 4):
        t = timeit.timeit(
            'f(words)',
            'from __main__ import words, f{} as f'.format(i),
            number=100)
        result_list.append((i, t))
    print('{:10,d} words | {}'.format(
        len(words),
        ' | '.join(
            'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))
The results:
10 words | f1 0.0028 sec | f2 0.0012 sec | f3 0.0011 sec
100 words | f1 0.0245 sec | f2 0.0082 sec | f3 0.0113 sec
1,000 words | f1 0.2450 sec | f2 0.0812 sec | f3 0.1134 sec
10,000 words | f1 2.4601 sec | f2 0.8113 sec | f3 1.1335 sec
100,000 words | f1 24.4195 sec | f2 8.1828 sec | f3 11.2167 sec
The Counter built from a generator expression (here as f2()) seems to be the fastest. Using counter.update() repeatedly (here as f1()) seems to be the slow point.
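Not part of the original benchmark, but an essentially equivalent spelling of f2() using itertools.chain could be timed with the same harness (the name f2b is illustrative):
from collections import Counter
from itertools import chain

def f2b(words):
    # one Counter call over a single chained stream of per-word letter sets
    return Counter(chain.from_iterable(set(word.lower()) for word in words))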
@Primusa oops, my bad. I updated with new results, but the conclusion is the same...
– Ralf
18 hours ago
Thanks for this good comparison.
– MattGeek
18 hours ago
The other solutions are good, but they specifically don't include the letters with zero frequency. Here's an approach which does, but is approximately 2-3 times slower than the others.
import string
counts = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
which produces a dict like this:
{'a': 4, 'b': 2, 'c': 2, 'd': 4, 'e': 7, 'f': 2, 'g': 2, 'h': 3, 'i': 7, 'j': 0, 'k': 0, 'l': 4, 'm': 5, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 3, 't': 3, 'u': 2, 'v': 0, 'w': 3, 'x': 0, 'y': 2, 'z': 1}
Here's my update of Ralf's timings:
10 words | f1 0.0004 sec | f2 0.0004 sec | f3 0.0003 sec | f4 0.0010 sec
100 words | f1 0.0019 sec | f2 0.0014 sec | f3 0.0013 sec | f4 0.0034 sec
1,000 words | f1 0.0180 sec | f2 0.0118 sec | f3 0.0140 sec | f4 0.0298 sec
10,000 words | f1 0.1960 sec | f2 0.1278 sec | f3 0.1542 sec | f4 0.2648 sec
100,000 words | f1 2.0859 sec | f2 1.3971 sec | f3 1.6815 sec | f4 3.5196 sec
based on the following code and the word list from https://github.com/dwyl/english-words/
import string
import timeit
import random
from collections import Counter

def f1(words):
    c = Counter()
    for word in words:
        c.update(set(word.lower()))
    return c

def f2(words):
    return Counter(
        c
        for word in words
        for c in set(word.lower()))

def f3(words):
    d = {}
    for word in words:
        for i in set(word.lower()):
            d[i] = d.get(i, 0) + 1
    return d

def f4(words):
    d = {c: len([w for w in words if c in w.lower()]) for c in string.ascii_lowercase}
    return d

with open('words.txt') as word_file:
    valid_words = set(word_file.read().split())

for exp in range(5):
    result_list = []
    for i in range(1, 5):
        t = timeit.timeit(
            'f(words)',
            'from __main__ import f{} as f, valid_words, exp; import random; words = random.sample(valid_words, 10**exp)'.format(i),
            number=100)
        result_list.append((i, t))
    print('{:10,d} words | {}'.format(
        10**exp,
        ' | '.join(
            'f{} {:8.4f} sec'.format(i, t) for i, t in result_list)))

print(f4(random.sample(valid_words, 10000)))
print(f4(random.sample(valid_words, 1000)))
print(f4(random.sample(valid_words, 100)))
print(f4(random.sample(valid_words, 10)))
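A possible middle ground, not from the answer above: seed a Counter with zeros for every ASCII letter and keep the single pass over the words, so zero-frequency letters still appear in the result (the name f5 is illustrative):
import string
from collections import Counter

def f5(words):
    counts = Counter(dict.fromkeys(string.ascii_lowercase, 0))  # pre-seed every letter with 0
    for word in words:
        counts.update(c for c in set(word.lower()) if c.isalpha())
    return counts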
But this is ASCII only -- in this day and age?
– Janne Karila
7 hours ago
Try using a dictionary comprehension:
import string
print({k: sum(k in i.lower() for i in words) for k in string.ascii_lowercase})
For each lowercase letter this counts how many of the words contain it, which reproduces the letters dictionary from the question (letters absent from every word get 0).