Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

java - How to count the number of occurrences of each word?

If I have an article or a novel in English and I want to count how many times each word appears, what is the fastest algorithm written in Java?

Some people said you can use a Map<String, Integer> for this, but how do I know what the keys are? Every article has different words, so how do you know the "key" words and then add one to their counts?


1 Reply


Here is another way to do it, using features introduced in Java 8:

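private void countWords(final Path file) throws IOException {
    // Read the whole file, split it on runs of non-word characters, then count each word.
    // counting() is a static import of java.util.stream.Collectors.counting.
    Arrays.stream(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split("\\W+"))
        .collect(Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()))
        .entrySet()
        .forEach(System.out::println);
}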

So what is it doing?

  1. It reads a text file completely into memory, into a byte array to be more precise: Files.readAllBytes(file). This method arrived in Java 7 and loads files very fast, at the price that the whole file sits in memory, which costs a lot of memory. For speed, however, this is a good approach.
  2. The byte[] is converted to a String: new String(Files.readAllBytes(file), StandardCharsets.UTF_8), assuming that the file is UTF-8 encoded. Change this as needed. The price is a full copy of the already huge piece of data in memory. It may be faster to work with a memory-mapped file instead.
  3. The string is split at non-word characters: ...split("\\W+"), which creates an array of strings with all your words.
  4. We create a stream from that array: Arrays.stream(...). This by itself does not do very much, but we can do a lot of fun things with the stream.
  5. We group all the words together: Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()). This means:
    • We want to group the words by the words themselves (identity()). We could also, for example, lowercase the string here first if grouping should be case insensitive. This ends up being the key in a map.
    • As the container for storing the grouped values we want a TreeMap (TreeMap::new). TreeMaps are sorted by their keys, so we can easily output in alphabetical order at the end. If you do not need sorting, you could also use a HashMap here.
    • As the value for each group we want the number of occurrences of each word (counting()). Behind the scenes that means that for each word we add to a group, we increase the counter by one.
  6. From step 5 we are left with a Map that maps words to their counts. Now we just want to print them, so we access a collection with all the key/value pairs in this map (.entrySet()).
  7. Finally the actual printing. We say that each element should be passed to the println method: .forEach(System.out::println). And now you are left with a nice list.
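
Step 5 mentions lowercasing the words before grouping. A minimal sketch of that variation follows. The class and method names are made up for illustration, and it takes a String rather than a file so that it is easy to try out. The filter for empty strings is my own addition, because split can return an empty first element when the text starts with punctuation.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original answer.
final class CaseInsensitiveWordCount {

    // Counts words in the given text, treating "The" and "the" as the same word.
    static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.split("\\W+"))
            .filter(word -> !word.isEmpty())             // drop a possible empty leading token
            .map(word -> word.toLowerCase(Locale.ROOT))  // normalise case before grouping
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // TreeMap iterates in key order, so this prints and=1, cat=1, hat=1, the=2 (one per line).
        countWords("The cat and the hat")
            .forEach((word, count) -> System.out.println(word + "=" + count));
    }
}
```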

So how good is this answer? The upside is that it is very short and thus highly expressive. It also gets by with only a single system call, hidden behind Files.readAllBytes (or at least a fixed number of them; I am not sure whether it really is a single call), and system calls can be a bottleneck. For example, if you read a file from a stream, each call to read may trigger a system call. A BufferedReader, which as the name suggests buffers its input, reduces this significantly, but readAllBytes should still be fastest. The price is that it consumes a large amount of memory. However, Wikipedia claims that a typical English book has 500 pages with 2,000 characters per page, which means roughly 1 megabyte. That should not be a problem in terms of memory consumption, even on a smartphone, a Raspberry Pi or a really, really old computer.
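
If memory ever does become a concern, the BufferedReader route mentioned above could look roughly like the following. This is only a sketch under my own assumptions: the class name and the temporary demo file are invented, and I have not benchmarked it against readAllBytes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class BufferedWordCount {

    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    // Reads the file line by line, so only the current line and the map are held in memory.
    static Map<String, Integer> countWords(Path file) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : NON_WORD.split(line)) {
                    if (!word.isEmpty()) {            // skip the empty token from leading punctuation
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("words", ".txt"); // throwaway demo file
        Files.write(file, "one two\ntwo three\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countWords(file));
        Files.delete(file);
    }
}
```

The trade-off is the one described above: more read calls than a single readAllBytes, in exchange for not keeping the whole text in memory twice.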

This solution does involve some optimizations that were not possible prior to Java 8. For example, the idiom map.put(word, map.get(word) + 1) requires the word to be looked up twice in the map, which is unnecessary work.
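
To make the difference concrete, here is a small sketch that puts the two idioms side by side. The class and method names are invented for illustration. Note that the callback passed to Map.merge receives the existing value and the new value, not the key, which is why Integer::sum fits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
final class MergeCountSketch {

    // Pre-Java-8 style: one lookup in get() and a second one in put().
    static Map<String, Integer> countWithGetPut(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Java 8 style: merge() stores 1 for a new key, otherwise combines the old value with 1.
    static Map<String, Integer> countWithMerge(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "a", "a"};
        // Both variants produce the same counts, so this prints "true".
        System.out.println(countWithGetPut(words).equals(countWithMerge(words)));
    }
}
```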

On the other hand, a simple loop might be easier for the compiler to optimize and might save a number of method calls. I wanted to know, so I put this to a test. I generated a file using:

[ -f /tmp/random.txt ] && rm /tmp/random.txt; for i in {1..15}; do head -n 10000 /usr/share/dict/american-english >> /tmp/random.txt; done; perl -MList::Util -e 'print List::Util::shuffle <>' /tmp/random.txt > /tmp/random.tmp; mv /tmp/random.tmp /tmp/random.txt

This gives me a file of about 1.3 MB, so not untypical for a book, with most words repeated 15 times but in random order, so that this does not end up being a branch prediction test. Then I ran the following tests:

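import static java.nio.file.Files.readAllBytes;
import static java.util.stream.Collectors.counting;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class WordCountTest {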

    @Test(dataProvider = "provide_description_testMethod")
    public void test(String description, TestMethod testMethod) throws Exception {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100_000; i++) {
            testMethod.run();
        }
        System.out.println(description + " took " + (System.currentTimeMillis() - start) / 1000d + "s");
    }

    @DataProvider
    public Object[][] provide_description_testMethod() {
        Path path = Paths.get("/tmp/random.txt");
        return new Object[][]{
            {"classic", (TestMethod)() -> countWordsClassic(path)},
            {"mixed", (TestMethod)() -> countWordsMixed(path)},
            {"mixed2", (TestMethod)() -> countWordsMixed2(path)},
            {"stream", (TestMethod)() -> countWordsStream(path)},
            {"stream2", (TestMethod)() -> countWordsStream2(path)},
        };
    }

    private void countWordsClassic(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            Integer oldCount = wordCounts.get(word);
            if (oldCount == null) {
                wordCounts.put(word, 1);
            } else {
                wordCounts.put(word, oldCount + 1);
            }
        }
    }

    private void countWordsMixed(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            // merge passes (existing value, new value) to the callback, not the key
            wordCounts.merge(word, 1, (oldCount, increment) -> oldCount + increment);
        }
    }

    private void countWordsMixed2(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        Pattern.compile("\\W+")
            .splitAsStream(new String(readAllBytes(path), StandardCharsets.UTF_8))
            // merge passes (existing value, new value) to the callback, not the key
            .forEach(word -> wordCounts.merge(word, 1, (oldCount, increment) -> oldCount + increment));
    }

    private void countWordsStream2(final Path tmpFile) throws IOException {
        Pattern.compile("\\W+").splitAsStream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8))
            .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    private void countWordsStream(final Path tmpFile) throws IOException {
        Arrays.stream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8).split("\\W+"))
            .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    interface TestMethod {
        void run() throws Exception;
    }
}

The results were:

approach   time     slower than fastest
classic    4665 s   +9%
mixed      4273 s   +0%
mixed2     4833 s   +13%
stream     4868 s   +14%
stream2    5070 s   +19%

Note that I previously also tested with TreeMaps, but found that HashMaps were much faster, even if I sorted the output afterwards. I also changed the tests above after Tagir Valeev told me in the comments about the Pattern.splitAsStream() method. Since I got strongly varying results, I let the tests run for quite a while, as you can see from the times in seconds above, to get meaningful results.
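
For reference, "count into a HashMap and sort afterwards" can be sketched like this. The names are invented, and copying into a TreeMap is just one way to do the final sort, not necessarily the one I measured.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original answer.
final class SortAfterCounting {

    // Counts with a HashMap, then sorts the keys once at the end.
    static Map<String, Integer> countSorted(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("\\W+")) {
            if (!word.isEmpty()) {                 // skip the empty token from leading punctuation
                counts.merge(word, 1, Integer::sum);
            }
        }
        return new TreeMap<>(counts);              // one-off copy into a key-sorted map
    }

    public static void main(String[] args) {
        System.out.println(countSorted("pear apple pear fig")); // prints {apple=1, fig=1, pear=2}
    }
}
```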

How I judge the results:

  1. The "mixed" approach, which does not use streams at all but uses the merge method with a callback introduced in Java 8, does improve performance. I expected this, because the classic get/put approach requires the key to be looked up twice in the HashMap, which the merge approach no longer needs.

  2. To my surprise, the Pattern.splitAsStream() approach is actually slower than Arrays.stream(...split()). I had a look at the source code of both implementations and noticed that the split() call saves the results in an ArrayList, which starts with a size of zero and is enlarged as needed. This requires many copy operations, and in the end another copy operation to copy the ArrayList to an array. splitAsStream, in contrast, creates an iterator, which I thought could be queried as needed, avoiding these copy operations completely. I did not look through all the source that converts the iterator to a stream object, but it seems to be slow and I don't know why. In theory it could have to do with CPU memory caches: if exactly the same code is executed over and over again, it is more likely to be in the cache than when running through long function chains, but this is very wild speculation on my side. It may also be something completely different. splitAsStream MIGHT have a better memory footprint, or it might not; I did not profile that.

  3. The stream approach in general is pretty slow. This is not totally unexpected, because quite a number of method invocations take place, including something as pointless as Function.identity. However, I did not expect a difference of this magnitude.
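
To illustrate the two splitting variants compared in point 2, here is a small sketch that shows only that they produce the same tokens. It says nothing about their speed. The class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original answer.
final class SplitVariants {

    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    // Eager: split() builds the complete String[] first, then we stream over it.
    static List<String> viaSplit(String text) {
        return Arrays.stream(text.split("\\W+")).collect(Collectors.toList());
    }

    // Lazy: splitAsStream() produces the tokens while the stream is consumed.
    static List<String> viaSplitAsStream(String text) {
        return NON_WORD.splitAsStream(text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "the cat, the hat";
        System.out.println(viaSplit(text));         // prints [the, cat, the, hat]
        System.out.println(viaSplitAsStream(text)); // prints [the, cat, the, hat]
    }
}
```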

As an interesting side note, I find the mixed approach, which was fastest, quite easy to read and understand. The effect of the call to merge is not the most obvious to me, but once you know what this method does it seems the most readable, whereas the groupingBy call is harder for me to understand. One might be tempted to say that groupingBy is so specialised and highly optimised that it makes sense to use it for performance, but as demonstrated here, this is not the case.
