Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
164 views
in Technique[技术] by (71.8m points)

How does sorting with an index work in MongoDB?

I'm wondering how sorting with an index actually works in MongoDB. There are a couple articles in the MongoDB documentation, but they don't actually describe how the sort proceeds or the time complexity. Searches of SO and the interweb in general so far haven't turned up anything relevant.

Let's assume there are a documents in a collection, the find() clause matches b documents, there's a limit of c documents returned, a >> b >> c, and c is some suitably large number such that the returned set cannot fit in memory - let's say 1M documents, for example.

At the start of the operation, there exist b documents that needs to be sorted and a sorted tree index of size a for the feature the documents will be sorted by.

I can imagine:

A) traverse the index in order, and for each ObjectID traverse the list of b documents. Return matches until c is reached. This would be O(ab).

B) as A), but build a hashset of the ObjectIDs in the b documents first. This is O(a), but takes O(b) memory.

I've tried to consider sorts based on traversing the set of b documents, but can't seem to come up with anything faster than O(b log b), which is no better than sorting without an index.

I assume (but maybe I'm wrong) that every sort doesn't require an index scan, so how does the sort actually work?

Update:

Kevin's answer and provided link narrow down the question a lot, but I'd like to confirm/clarify a few points:

  1. As I understand it, you cannot use different indexes for the query and the sort if you want to avoid an in-memory sort. When I read this page it appeared as though you could (or at least, it didn't specify one way or the other), but that seems to be incorrect. Essentially, the documents are sorted because they're looked up in the order of the index during the query and therefore returned in the order of the index. Right?
  2. When querying on a compound index, the the sorting index must be the first index in the compound index, except for indexes where the query is an equality. If not, the sort is performed in memory. Right?
  3. How does sorting work with $in or $or queries? For example, assume the query is

    {a: {$in: [4, 6, 2, 1, 3, 10]}, b: {$gt: 1, $lt: 6}}

... and there's a compound index on a and b in that order. How would the sort work in the cases the sort is on a or b? $or is even more complicated since, as I understand it, $or queries are essentially split into multiple separate queries. Are $or queries always an in-memory sort, at least for merging the results of the separate queries?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Indexes in MongoDB are stored in a B-tree structure, where each index entry points to a specific location on-disk. Using a B-tree structure also means that a MongoDB index is stored in a sorted order, always traversed in-order, and is cheap for MongoDB to fetch a series of documents in a sorted order via indexes.

Update: The B-tree structure is true for the MMAPv1 storage engine, but is implemented slightly differently by the WiredTiger storage engine (default since MongoDB 3.2). The basic idea remains the same, where it's cheap to traverse the index in a sorted order.

A SORT stage (i.e. in-memory sort) in a query is limited to 32MB of memory use. A query will fail if the SORT stage exceeds this limit. This limit can be sidestepped by utilizing the sorted nature of indexes, so that MongoDB can return a query with a sort() parameter without performing an in-memory sort.

Let us assume that the query is of the shape:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort(...)

with collection a having an index of:

    db.a.createIndex({b:1,c:1})

There are two possible scenarios when a sort() stage is specified in the query:

1. MongoDB cannot use the sorted nature of the index and must perform an in-memory SORT stage.

This is the outcome if the query cannot use the "index prefix". For example:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({c:1})

In the query above, the index {b:1,c:1} can be used to:

  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • However, there is no guarantee that the returned documents are sorted in terms of c.

Therefore, MongoDB has no choice but to perform an in-memory sort. The explain() output of this query will have a SORT stage. This SORT stage would be limited to 32MB of memory use.

2. MongoDB can use the sorted nature of the index.

This is the outcome if the query uses:

  • Sort keys that matches the order of the index, and
  • Specifies the same ordering as the index (i.e. the index {b:1,c:1} can be used for sort({b:1,c:1}) or sort({b:-1,c:-1}) but not sort({b:1,c:-1}))

For example:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({b:1})

In the query above, the index {b:1,c:1} can be used to:

  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • In this case, MongoDB can guarantee that the returned documents are sorted in terms of b.

The explain() output of the query above will not have a SORT stage. Also, the explain() output of the query with and without sort() are identical. In essence, we are getting the sort() for free.

A worthwhile resource to understand this subject is Optimizing MongoDB Compound Indexes. Please note that this blog post was written way back in 2012. Although some of the terminology may be outdated, the technicality of the post is still relevant.

Update on follow-up questions

  1. MongoDB uses only one index for most queries. So for example, to avoid an in-memory SORT stage in the query

    db.a.find({a:1}).sort({b:1})
    

    the index must cover both a and b fields at the same time; e.g. a compound index such as {a:1,b:1} is required. You cannot have two separate indexes {a:1} and {b:1}, and expect the {a:1} index to be used for the equality part, and the {b:1} index to be used for the sort part. In this case, MongoDB will choose one of the two indexes.

    Therefore, it is correct that the results are sorted because they are looked up and returned in the order of the index.

  2. To avoid having an in-memory sort using a compound index, the first part of the index must cater to the equality part of the query, and the second part must cater to the sort part of the query (as shown in the explanation for (1) above).

    If you have a query like this:

    db.a.find({}).sort({a:1})
    

    the index {a:1,b:1} can be used for the sort part (since you're basically returning the whole collection). And if your query looks like this:

    db.a.find({a:1}).sort({b:1})
    

    the same index {a:1,b:1} can also be used for both parts of the query. Also:

    db.a.find({a:1,b:1})
    

    can also use the same index {a:1,b:1}

    Notice the pattern here: the find() followed by sort() parameters follow the index order {a:1,b:1}. Therefore a compound index must be ordered by equality -> sort.

Update regarding sorting of different types

If a field has different types between documents (e.g. if a is string in one document, number in others, boolean in yet another), how do the sort proceed?

The answer is MongoDB BSON type comparison order. To paraphrase the manual page, the order is:

  1. MinKey (internal type)
  2. Null
  3. Numbers (ints, longs, doubles, decimals)
  4. Symbol, String
  5. Object
  6. Array
  7. BinData
  8. ObjectId
  9. Boolean
  10. Date
  11. Timestamp
  12. Regular Expression
  13. MaxKey (internal type)

So from the example above using ascending order, documents containing numbers will appear first, then strings, then boolean.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...