How to Use the Aggregation Pipeline in MongoDB


The aggregation pipeline is the recommended way to run complex queries in MongoDB. If you've been using MongoDB's MapReduce, you should switch to the aggregation pipeline for more efficient computations.

What Is Aggregation in MongoDB and How Does It Work?

The aggregation pipeline is a multi-stage process for running advanced queries in MongoDB. It processes data through a sequence of stages that together form a pipeline. The documents that one stage outputs become the input of the next stage.

For instance, you can pass the result of a match operation to another stage for sorting, and continue in that order until you get the desired output.

Each stage of an aggregation pipeline features a MongoDB operator and passes the transformed documents it generates to the next stage. Depending on your query, a stage can appear multiple times in the pipeline. For example, you might need to use the $match or $sort stage more than once across the aggregation pipeline.
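To make the stage-to-stage flow concrete, here is a minimal plain-JavaScript sketch. The match, sortBy, and runStages helpers and the sample documents are hypothetical stand-ins invented for illustration; they are not MongoDB's implementation, only a picture of how each stage's output feeds the next.

```javascript
// Hypothetical in-memory stand-ins for two stages (not MongoDB's implementation).
const match = (predicate) => (docs) => docs.filter(predicate);
const sortBy = (field, direction) => (docs) =>
  [...docs].sort((a, b) => direction * (a[field] - b[field]));

// Feed the output of each stage into the next one, in order.
const runStages = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Hypothetical sample documents.
const docs = [
  { name: "a", score: 3 },
  { name: "b", score: 9 },
  { name: "c", score: 6 },
];

const result = runStages(docs, [
  match((doc) => doc.score >= 5), // comparable to { $match: { score: { $gte: 5 } } }
  sortBy("score", -1),            // comparable to { $sort: { score: -1 } }
]);

console.log(result.map((doc) => doc.name)); // → [ 'b', 'c' ]
```

In a real deployment the same idea is expressed as an array of stage documents passed to aggregate(), and the server does the work.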

The Stages of the Aggregation Pipeline

The aggregation pipeline passes data through multiple stages in a single query. There are several stages and you can find their details in the MongoDB documentation.

Let's define some of the most commonly used ones below.

The $match Stage

This stage helps you define specific filtering conditions before starting the other aggregation stages. You can use it to select the matching data you want to include in the aggregation pipeline.
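As a rough illustration, a $match stage might look like the sketch below. The Sold field and the threshold of 5 are borrowed from the combined query later in this article; the sales collection name in the comment and the two sample documents are assumptions made up for the example.

```javascript
// A $match stage is a plain document holding ordinary query conditions.
const matchStage = { $match: { Sold: { $gte: 5 } } };

// In mongosh, against an assumed `sales` collection, this would run as:
//   db.sales.aggregate([matchStage])

// Plain-JavaScript equivalent of the same filter on hypothetical documents:
const sampleDocs = [
  { Section: "A", Sold: 2 },
  { Section: "B", Sold: 7 },
];
const threshold = matchStage.$match.Sold.$gte;
const kept = sampleDocs.filter((doc) => doc.Sold >= threshold);

console.log(kept); // → [ { Section: 'B', Sold: 7 } ]
```

Placing $match early keeps later stages from processing documents you never needed.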

The $group Stage

The $group stage separates data into groups based on a key you specify in the _id field. Each group becomes one document in the output, identified by its _id value.

For example, consider the following sales sample data:

Using the aggregation pipeline, you can compute the total sales count and top sales for each product section:

 {
   $group: {
     _id: "$Section",
     total_sales_count: { $sum: "$Sold" },
     top_sales: { $max: "$Amount" }
   }
 }

The _id: "$Section" pair groups the output documents by section. By specifying the total_sales_count and top_sales fields, MongoDB creates new keys whose values are computed by the accumulator operator you assign; this can be $sum, $min, $max, or $avg.
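Since the sample data isn't reproduced here, the following plain-JavaScript sketch uses hypothetical documents with the same Section, Sold, and Amount fields to show what the $group stage computes. It is an in-memory imitation for illustration, not how MongoDB executes the stage.

```javascript
// Hypothetical sales documents; only the field names come from the article.
const sales = [
  { Section: "Electronics", Sold: 4, Amount: 300 },
  { Section: "Electronics", Sold: 6, Amount: 450 },
  { Section: "Grocery", Sold: 10, Amount: 80 },
];

// One accumulator record per distinct Section value, like _id: "$Section".
const groups = new Map();
for (const doc of sales) {
  const group = groups.get(doc.Section) ?? {
    _id: doc.Section,
    total_sales_count: 0,
    top_sales: -Infinity,
  };
  group.total_sales_count += doc.Sold;                     // { $sum: "$Sold" }
  group.top_sales = Math.max(group.top_sales, doc.Amount); // { $max: "$Amount" }
  groups.set(doc.Section, group);
}

const grouped = [...groups.values()];
console.log(grouped);
// → [ { _id: 'Electronics', total_sales_count: 10, top_sales: 450 },
//     { _id: 'Grocery', total_sales_count: 10, top_sales: 80 } ]
```

Three input documents collapse into two output documents, one per section.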

The $skip Stage

You can use the $skip stage to omit a specified number of documents in the output. It usually comes after the group stage. For example, if you expect two output documents but skip one, the aggregation will only output the second document.

To add a skip stage, insert the $skip operation into the aggregation pipeline:

 ...,
{
    $skip: 1
  },
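The two-document case described above can be pictured with this small sketch. The grouped documents are hypothetical, and Array.prototype.slice only imitates the effect of $skip.

```javascript
// Hypothetical output of a preceding $group stage: two documents.
const groupedDocs = [
  { _id: "Electronics", top_sales: 450 },
  { _id: "Grocery", top_sales: 80 },
];

// { $skip: 1 } drops the first document and passes the rest along.
const skipCount = 1;
const afterSkip = groupedDocs.slice(skipCount);

console.log(afterSkip); // → [ { _id: 'Grocery', top_sales: 80 } ]
```

Which document counts as "first" depends on the order produced by the earlier stages, so $skip is most predictable after a $sort.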

The $sort Stage

The sorting stage lets you arrange data in descending or ascending order. For instance, we can further sort the data in the previous query example in descending order to determine which section has the highest sales.

Add the $sort operator to the previous query:

 ...,
{
    $sort: {top_sales: -1}
  },
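As a sketch of what the sort specification means, the snippet below applies { top_sales: -1 } to hypothetical grouped documents in plain JavaScript. In a sort specification, -1 means descending and 1 means ascending.

```javascript
// Hypothetical output of the $group stage, in arbitrary order.
const groupedDocs = [
  { _id: "Grocery", top_sales: 80 },
  { _id: "Electronics", top_sales: 450 },
  { _id: "Clothing", top_sales: 120 },
];

// The same specification you would pass to $sort.
const sortSpec = { top_sales: -1 };
const [field, direction] = Object.entries(sortSpec)[0];

// Copy before sorting so the input array stays untouched.
const sorted = [...groupedDocs].sort(
  (a, b) => direction * (a[field] - b[field])
);

console.log(sorted.map((doc) => doc._id));
// → [ 'Electronics', 'Clothing', 'Grocery' ]
```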

The $limit Stage

The $limit stage caps the number of documents the aggregation pipeline passes on or returns. For example, use the $limit operator to get the section with the highest sales returned by the previous stage:

 ...,
 {
   $sort: { top_sales: -1 }
 },
 {
   $limit: 1
 }

The above returns only the first document; this is the section with the highest sales, as it appears at the top of the sorted output.
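The combination of $sort and $limit is a common "top N" pattern. Here is a minimal in-memory sketch of it, using hypothetical grouped documents; the sort and slice calls only imitate what the two stages do.

```javascript
// The two stages as they would appear in a pipeline.
const stages = [{ $sort: { top_sales: -1 } }, { $limit: 1 }];

// Hypothetical output of the $group stage.
const groupedDocs = [
  { _id: "Grocery", top_sales: 80 },
  { _id: "Electronics", top_sales: 450 },
];

// Imitate $sort (descending), then keep only the first $limit documents.
const sorted = [...groupedDocs].sort((a, b) => b.top_sales - a.top_sales);
const limited = sorted.slice(0, stages[1].$limit);

console.log(limited); // → [ { _id: 'Electronics', top_sales: 450 } ]
```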

The $project Stage

The $project stage allows you to shape the output document as you like. Using the $project operator, you can specify which field to include in the output and customize its key name.

For instance, a sample output without the $project stage looks like so:

[Image: Sample unarranged data for aggregation pipeline]

Let's see what it looks like with the $project stage. To add the $project to the pipeline:

 ...,
 {
   "$project": {
     "_id": 0,
     "Section": "$_id",
     "TotalSold": "$total_sales_count",
     "TopSale": "$top_sales"
   }
 }

Since we've previously grouped the data based on product sections, the above includes each product section in the output document. It also ensures that the aggregated sales count and top sales feature in the output as TotalSold and TopSale.

The final output is a lot cleaner compared to the previous one:

[Image: Sample output for aggregation pipeline stages]
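To show the reshaping in text form, this sketch applies the same renaming to hypothetical grouped documents in plain JavaScript. The values are invented; only the field names follow the $project stage above.

```javascript
// Hypothetical output of the $group stage, before projection.
const groupedDocs = [
  { _id: "Electronics", total_sales_count: 10, top_sales: 450 },
  { _id: "Grocery", total_sales_count: 10, top_sales: 80 },
];

// Imitates: { $project: { _id: 0, Section: "$_id",
//             TotalSold: "$total_sales_count", TopSale: "$top_sales" } }
const projected = groupedDocs.map((doc) => ({
  Section: doc._id,                 // "$_id" renamed to Section
  TotalSold: doc.total_sales_count, // renamed field
  TopSale: doc.top_sales,           // renamed field; _id itself is left out
}));

console.log(projected);
// → [ { Section: 'Electronics', TotalSold: 10, TopSale: 450 },
//     { Section: 'Grocery', TotalSold: 10, TopSale: 80 } ]
```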

The $unwind Stage

The $unwind stage breaks down an array within a document into individual documents. Take the following Orders data, for example:

Use the $unwind stage to deconstruct the items array before applying other aggregation stages. For example, unwinding the items array makes sense if you want to compute the total revenue for each product:

 db.Orders.aggregate([
   {
     "$unwind": "$items"
   },
   {
     "$group": {
       "_id": "$items.product",
       "total_revenue": { "$sum": { "$multiply": ["$items.quantity", "$items.price"] } }
     }
   },
   {
     "$sort": { "total_revenue": -1 }
   },
   {
     "$project": {
       "_id": 0,
       "Product": "$_id",
       "TotalRevenue": "$total_revenue"
     }
   }
 ])

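The result of the above aggregation query is one document per product, each with a Product and a TotalRevenue field, sorted from the highest revenue to the lowest.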
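The Orders data and its output aren't reproduced here, so the following sketch imitates the same four stages in plain JavaScript on hypothetical order documents. Only the items, product, quantity, and price field names come from the query; the values are invented for illustration.

```javascript
// Hypothetical Orders documents.
const orders = [
  {
    _id: 1,
    items: [
      { product: "Pen", quantity: 2, price: 3 },
      { product: "Notebook", quantity: 1, price: 5 },
    ],
  },
  { _id: 2, items: [{ product: "Pen", quantity: 4, price: 3 }] },
];

// $unwind: one document per array element, with `items` replaced by that element.
const unwound = orders.flatMap((order) =>
  order.items.map((item) => ({ ...order, items: item }))
);

// $group with $sum of $multiply: revenue per product.
const revenue = new Map();
for (const doc of unwound) {
  const lineTotal = doc.items.quantity * doc.items.price;
  revenue.set(doc.items.product, (revenue.get(doc.items.product) ?? 0) + lineTotal);
}

// $project (rename fields) and $sort (descending by revenue).
const result = [...revenue]
  .map(([product, total]) => ({ Product: product, TotalRevenue: total }))
  .sort((a, b) => b.TotalRevenue - a.TotalRevenue);

console.log(unwound.length); // → 3
console.log(result);
// → [ { Product: 'Pen', TotalRevenue: 18 }, { Product: 'Notebook', TotalRevenue: 5 } ]
```

Two orders become three unwound documents, which is what lets $group see each item on its own.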

How to Create an Aggregation Pipeline in MongoDB

The aggregation pipeline offers many more stages, but the ones featured above show how each is applied and what its basic query looks like.

Using the previous sales data sample, let's combine some of the stages discussed above into a single query for a broader view of the aggregation pipeline:

 db.sales.aggregate([
   {
     "$match": {
       "Sold": { "$gte": 5 }
     }
   },
   {
     "$group": {
       "_id": "$Section",
       "total_sales_count": { "$sum": "$Sold" },
       "top_sales": { "$max": "$Amount" }
     }
   },
   {
     "$sort": { "top_sales": -1 }
   },
   { "$skip": 0 },
   {
     "$project": {
       "_id": 0,
       "Section": "$_id",
       "TotalSold": "$total_sales_count",
       "TopSale": "$top_sales"
     }
   }
 ])

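The final output resembles what you've seen previously: one document per section containing the Section, TotalSold, and TopSale fields, ordered by TopSale from highest to lowest.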

Aggregation Pipeline vs. MapReduce

Until its deprecation in MongoDB 5.0, the conventional way to aggregate data in MongoDB was MapReduce. Although MapReduce has broader applications beyond MongoDB, it's less efficient than the aggregation pipeline and requires you to write the map and reduce functions separately as custom JavaScript.

The aggregation pipeline, on the other hand, is specific to MongoDB. But it provides a cleaner and more efficient way to execute complex queries. Besides simplicity and query scalability, the featured pipeline stages make the output more customizable.

There are many more differences between the aggregation pipeline and MapReduce. You'll see them as you switch from MapReduce to the aggregation pipeline.
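To give a feel for the difference in style, here is a simplified, self-contained sketch of the map and reduce functions you would write for a per-section total, next to the equivalent $group stage. The emit helper and the in-memory loop are hypothetical stand-ins for what the server does, and the sales documents are invented.

```javascript
// Hypothetical sales documents.
const sales = [
  { Section: "Electronics", Sold: 4 },
  { Section: "Electronics", Sold: 6 },
  { Section: "Grocery", Sold: 10 },
];

// Stand-in for the emit() function MongoDB provides inside a map function.
const emitted = new Map();
const emit = (key, value) => {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
};

// MapReduce style: two separate JavaScript functions.
function mapFn() {
  emit(this.Section, this.Sold); // `this` is the current document
}
function reduceFn(key, values) {
  return values.reduce((sum, value) => sum + value, 0);
}

// Simplified in-memory run: map every document, then reduce each key.
for (const doc of sales) mapFn.call(doc);
const totals = [...emitted].map(([key, values]) => ({
  _id: key,
  value: reduceFn(key, values),
}));

// Aggregation pipeline style: the same computation as one declarative stage.
const groupStage = { $group: { _id: "$Section", value: { $sum: "$Sold" } } };

console.log(totals);
// → [ { _id: 'Electronics', value: 10 }, { _id: 'Grocery', value: 10 } ]
console.log(JSON.stringify(groupStage));
```

The pipeline version contains no custom functions at all, which is a large part of why it's easier to write and to optimize.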

Make Big Data Queries Efficient in MongoDB

Your query must be as efficient as possible if you want to run in-depth calculations on complex data in MongoDB. The aggregation pipeline is ideal for advanced querying. Rather than manipulating data in separate operations, which often reduces performance, aggregation allows you to pack them all inside a single performant pipeline and execute them once.

While the aggregation pipeline is more efficient than MapReduce, you can make aggregation faster still by indexing your data. An index limits the amount of data MongoDB needs to scan, particularly when a $match or $sort stage at the start of the pipeline can use it.
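As a minimal sketch of that advice, assuming the sales collection from earlier and an index on the Sold field (both are assumptions for illustration), you might define the index and keep the filtering stage first so it can benefit from it. The mongosh calls are shown as comments because they need a running server.

```javascript
// Assumed single-field ascending index on the field the pipeline filters by.
const indexSpec = { Sold: 1 };

const pipeline = [
  { $match: { Sold: { $gte: 5 } } }, // first stage, so it can use the index on Sold
  { $group: { _id: "$Section", total_sales_count: { $sum: "$Sold" } } },
];

// In mongosh, against the assumed `sales` collection:
//   db.sales.createIndex(indexSpec)
//   db.sales.explain("executionStats").aggregate(pipeline)

// Quick sanity check that the filter really is the leading stage.
const firstStage = Object.keys(pipeline[0])[0];
console.log(`Index: ${JSON.stringify(indexSpec)}; first stage: ${firstStage}`);
// → Index: {"Sold":1}; first stage: $match
```

The explain output is where you would confirm whether the index was actually used.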