Map-Reduce vs. Aggregation Pipeline in MongoDB

MapReduce and the aggregation pipeline are the two methods you can use to deal with complex data processing in MongoDB. The aggregation framework is newer and known for its efficiency. But some developers still prefer to stick to MapReduce, which they consider more comfortable.

Practically, you want to pick one of these complex query methods since they achieve the same goal. But how do they work? How are they different, and which should you use?

How MapReduce Works in MongoDB

MapReduce in MongoDB allows you to run complex calculations on a large volume of data and condense the results into a smaller, summarized output. The MapReduce method features two functions: map and reduce.

While working with MapReduce in MongoDB, you'll write the map and the reduce functions separately in JavaScript and pass each to the built-in mapReduce command.

The map function first turns the incoming data into key-value pairs, where the key is usually the field you want to group by. This is where you specify how you want to group the data. The reduce function then runs custom calculations on the values in each data group and writes the aggregated result into a separate collection stored in the database.
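To make the two steps concrete, here is a minimal pure-Python sketch of the same idea, run outside MongoDB. It is only an analogy for how emitting and reducing fit together, not how the database implements it; the documents, the category and amount field names, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical documents, used only for illustration.
documents = [
    {"category": "fruit", "amount": 5},
    {"category": "dairy", "amount": 4},
    {"category": "fruit", "amount": 7},
    {"category": "dairy", "amount": 5},
]

def map_document(doc):
    """Map step: emit one (key, value) pair for a document."""
    return doc["category"], doc["amount"]

def reduce_values(key, values):
    """Reduce step: collapse every value emitted for one key into a single result."""
    return sum(values)

def map_reduce(docs):
    # Collect the emitted values under their keys, then reduce each group.
    grouped = defaultdict(list)
    for doc in docs:
        key, value = map_document(doc)
        grouped[key].append(value)
    return {key: reduce_values(key, values) for key, values in grouped.items()}

totals = map_reduce(documents)
print(totals)
```

In MongoDB, the grouping between the two steps happens inside the database, and you only supply the two functions.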

How the Aggregation Pipeline Works in MongoDB

The aggregation pipeline in MongoDB is an improved alternative to MapReduce. Like MapReduce, it allows you to perform complex calculations and data transformations directly inside the database. But aggregation doesn't require writing dedicated JavaScript functions that can reduce query performance.

Instead, it uses built-in MongoDB operators to manipulate, group, and compute data. A pipeline is a sequence of stages, and each stage passes its results on to the next one. This makes the aggregation pipeline more customizable, since you can structure the output as you like.
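As a rough illustration of how stages chain together, the sketch below builds a three-stage pipeline from the built-in $match, $group, and $sort operators. The Section and Sold field names come from the example used later in this article; the build_pipeline helper and its min_sold threshold are hypothetical and exist only for this sketch.

```python
def build_pipeline(min_sold=1):
    """Return a list of stages; each stage receives the previous stage's output."""
    return [
        # Stage 1: keep only documents with at least min_sold items sold
        {"$match": {"Sold": {"$gte": min_sold}}},
        # Stage 2: group the remaining documents by Section and sum Sold
        {"$group": {"_id": "$Section", "totalSold": {"$sum": "$Sold"}}},
        # Stage 3: order the groups from the highest total to the lowest
        {"$sort": {"totalSold": -1}},
    ]

pipeline = build_pipeline(min_sold=2)

# Each stage is a single-key dict whose key names the operator.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)
```

With PyMongo, you would pass a list like this to a collection's aggregate() method. The order of the list is the order in which MongoDB applies the stages.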

How Queries Differ Between MapReduce and Aggregation

Assume you want to calculate the total sales of items based on product categories. With both MapReduce and aggregation, the product categories become the keys, while the sums of the items under each category become the corresponding values.

The raw data for this problem lives in a sales collection. Each document records a product category in a Section field and the number of items sold in a Sold field.
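As a stand-in for that data, here is a hypothetical set of documents with this shape. Only the Section and Sold field names and the per-section totals (Nike 12, Adidas 9) come from the examples in this article; the individual rows and the Item field are assumptions made for illustration.

```python
# Hypothetical stand-in documents. Only the Section and Sold field names
# and the resulting totals are taken from the article's examples.
sample_sales = [
    {"Item": "Running shoes", "Section": "Nike", "Sold": 7},
    {"Item": "Track jacket", "Section": "Nike", "Sold": 5},
    {"Item": "Football boots", "Section": "Adidas", "Sold": 6},
    {"Item": "Gym bag", "Section": "Adidas", "Sold": 3},
]

# Quick sanity check in plain Python: total items sold per section.
section_totals = {}
for doc in sample_sales:
    section = doc["Section"]
    section_totals[section] = section_totals.get(section, 0) + doc["Sold"]

print(section_totals)
```

To load documents like these into MongoDB with PyMongo, you could call insert_many(sample_sales) on the sales collection.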

Let's solve this problem scenario using MapReduce and an aggregation pipeline to differentiate between their queries and problem-solving methods.

The MapReduce Method

Using Python as the base programming language, the mapReduce query of the previously described problem scenario looks like this:

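import pymongo

# Connect to a local MongoDB instance
client = pymongo.MongoClient("mongodb://localhost/")
db = client.my_database
sales = db["sales"]

# Map step: emit one (Section, Sold) pair per document
map_function = """
function() {
    emit(this.Section, this.Sold);
}
"""

# Reduce step: sum the Sold values emitted for each Section
reduce_function = """
function(key, values) {
    return Array.sum(values);
}
"""

# Run the mapReduce command and write the totals to section_totals
result = db.command(
    "mapReduce",
    "sales",
    map=map_function,
    reduce=reduce_function,
    out="section_totals"
)

# Read the totals back from the output collection
docs = list(db.section_totals.find())
print(docs)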

If you run this against the original sample data, you'll see output like this:

[{
  '_id': 'Adidas',
  'value': 9.0
},{
  '_id': 'Nike',
  'value': 12.0
}]

Look closely, and you should see that the map and reduce functions are JavaScript code stored as strings in Python variables. The code passes these to the mapReduce command, which writes its results to a dedicated output collection (section_totals).

Using an Aggregation Pipeline

In addition to giving cleaner output, the aggregation pipeline query is more direct. Here's what the previous operation looks like with the aggregation pipeline:

import pymongo

# Connect to the same database and collection used in the MapReduce example
client = pymongo.MongoClient("mongodb://localhost/")
db = client.my_database
sales = db["sales"]

pipeline = [
    {
        "$group": {
            "_id": "$Section",
            "totalSold": { "$sum": "$Sold" }
        }
    },
    {
        "$project": {
            "_id": 0,
            "Section": "$_id",
            "TotalSold": "$totalSold"
        }
    }
]

result = list(sales.aggregate(pipeline))
print(result)

Running this aggregation query will give the following results, which are similar to the results from the MapReduce approach:

[{
  'Section': 'Nike',
  'TotalSold': 12
},{
  'Section': 'Adidas',
  'TotalSold': 9
}]

Query Performance and Speed

The aggregation pipeline is the newer alternative to MapReduce. MongoDB recommends using the aggregation pipeline instead of MapReduce, as the former is more efficient, and it has deprecated map-reduce as of MongoDB 5.0.

We tested this claim by running the queries from the previous section. When executed side by side on a machine with 12GB of RAM, the aggregation pipeline appeared to be faster, averaging 0.014 seconds per execution. The same machine took an average of 0.058 seconds to run the MapReduce query.

That's not a rigorous benchmark, but it appears to back up MongoDB's recommendation. You might consider this time difference insignificant, but it will add up considerably across thousands or millions of queries.
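If you want to run a similar comparison yourself, one simple approach is to time each query over several runs and average the results. The sketch below is an assumption about how such a measurement could be set up, not the method used for the figures above; the average_runtime helper and the placeholder workload are hypothetical.

```python
import time

def average_runtime(run_query, runs=10):
    """Return the mean wall-clock time, in seconds, of calling run_query."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Placeholder workload so the sketch runs without a database.
def placeholder_query():
    return sum(range(10_000))

mean_seconds = average_runtime(placeholder_query, runs=5)
print(f"average: {mean_seconds:.6f} s")
```

To compare the two methods, you would replace the placeholder with a function that runs the mapReduce command and another that runs the aggregation pipeline, then compare the two averages.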

The Pros and Cons of MapReduce

Consider the upsides and downsides of MapReduce to determine where it excels in data processing.

Pros

It gives more flexibility for customization since you write the map and reduce functions separately.
You can easily save the output into a new MongoDB collection inside the database.
You can use MapReduce in distributed file systems like Hadoop, which easily integrates with MongoDB.
Its JavaScript-based scripting makes it easy to extend and easier to learn than the aggregation pipeline for someone with a JavaScript development background.

Cons

It requires custom JavaScript functions, which contributes to its lower performance compared with the aggregation pipeline.
MapReduce can be memory inefficient, requiring several nodes, especially when dealing with overly complex data.
It's not suitable for real-time data processing since querying can be slow.

Pros and Cons of the Aggregation Pipeline

How about the aggregation pipeline? Considering its strengths and weaknesses provides more insight.

Pros

The query is multistage, usually shorter, more concise, and more readable.
The aggregation pipeline is more efficient, offering a significant improvement over MapReduce.
It supports built-in MongoDB operators that let you design your query flexibly.
It supports real-time data processing.
The aggregation pipeline is native to MongoDB and doesn't require custom scripting.
You can create a new MongoDB collection for the outputs if you need to save them.
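As a sketch of that last point, MongoDB's $out stage writes a pipeline's results to a collection and must be the final stage. The with_output_collection helper below is hypothetical; the section_totals name mirrors the output collection used in the MapReduce example.

```python
def with_output_collection(stages, collection_name):
    """Return a copy of stages with a final $out stage appended."""
    return list(stages) + [{"$out": collection_name}]

# Grouping stage from the earlier example: total Sold per Section
group_by_section = [
    {"$group": {"_id": "$Section", "totalSold": {"$sum": "$Sold"}}},
]

saving_pipeline = with_output_collection(group_by_section, "section_totals")
print(saving_pipeline[-1])
```

Running a pipeline like this with aggregate() stores the grouped totals in section_totals, replacing that collection if it already exists.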

Cons

It may not be as flexible as MapReduce when dealing with more complex data structures. Since it doesn't use custom scripting, it limits you to the aggregation methods its operators provide.
Its implementation and learning curve can be challenging for developers with little or no experience with MongoDB.

When Should You Use MapReduce or Aggregation Pipeline?

Generally, it's best to consider your data processing requirements when choosing between MapReduce and the aggregation pipeline.

Ideally, if your data is more complex, requiring advanced logic and algorithms in a distributed file system, MapReduce can come in handy. This is because you can easily customize map-reduce functions and inject them into several nodes. Go for MapReduce if your data processing task requires horizontal scalability over efficiency.

On the other hand, the aggregation pipeline is more suitable for computing complex data that doesn't require custom logic or algorithms. If your data resides in MongoDB only, it makes sense to use the aggregation pipeline since it features many built-in operators.

The aggregation pipeline is also best for real-time data processing. If your computation requirement prioritizes efficiency over other factors, you want to opt for the aggregation pipeline.

Run Complex Computations in MongoDB

Both MongoDB methods process large volumes of data, but they differ in many ways. What they have in common is that, instead of retrieving data before performing calculations, which can be slower, both perform calculations directly on the data stored in the database, making queries more efficient.

However, one outperforms the other, and you guessed right: the aggregation pipeline beats MapReduce in efficiency and performance. But while you might want to replace MapReduce with the aggregation pipeline at all costs, there are still specific areas of application where using MapReduce makes more sense.

FAQ

Q: What Else Should I Know About the Aggregation Pipeline?

MongoDB's aggregation pipeline is a multi-step process that includes matching data, grouping it, and sorting it.

Q: What Queries and Commands Can I Use With MongoDB?

Although MongoDB is a NoSQL database, it still supports many of the operations you’ll be familiar with from traditional RDBMS programs.

Q: How Do the Map and Reduce Functions Work in JavaScript?

In JavaScript, map and reduce are methods of the Array class. They are higher-order functions that you can use to build new functions for highly flexible, reusable code.