David Durika

Embedding vs Referencing in MongoDB: When to Use Which

The #1 MongoDB schema design question, answered. Learn when to embed documents, when to use references, and how to avoid the common mistakes that lead to performance problems and documents that hit the 16MB limit.

The Schema Design Question That Never Goes Away

If you've spent any time with MongoDB, you've hit this wall: should I embed this data inside the document or store it separately and reference it?

Developers coming from SQL default to normalizing everything into separate collections with references. Developers who've fully bought into MongoDB's document model embed everything and eventually hit the 16MB document limit at 3am. Neither extreme is right, and the usual answer is "it depends on your access patterns," which is true but unhelpful without specifics.

So let's get specific.

A Quick Refresher

If you're clear on what embedding and referencing look like in MongoDB, skip ahead. For everyone else, here's the short version.

Embedding

You store related data directly inside the parent document:

// An order with embedded line items
{
  "_id": ObjectId("..."),
  "customer": "alice",
  "date": ISODate("2026-02-20"),
  "items": [
    { "product": "Mechanical Keyboard", "price": 149.99, "qty": 1 },
    { "product": "USB-C Cable", "price": 12.99, "qty": 2 }
  ],
  "total": 175.97
}

Everything lives in one document. One read gets you the full order.

Referencing

You store related data in a separate collection and link them by ID:

// orders collection
{
  "_id": ObjectId("order_1"),
  "customer": "alice",
  "date": ISODate("2026-02-20"),
  "total": 175.97
}

// order_items collection
{
  "_id": ObjectId("item_1"),
  "orderId": ObjectId("order_1"),
  "product": "Mechanical Keyboard",
  "price": 149.99,
  "qty": 1
}
{
  "_id": ObjectId("item_2"),
  "orderId": ObjectId("order_1"),
  "product": "USB-C Cable",
  "price": 12.99,
  "qty": 2
}

Two queries (or a $lookup) to get the full picture. More flexibility, more round-trips.
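
To make the $lookup option concrete, here is a minimal sketch of the pipeline that stitches the two collections back together. The helper name buildOrderWithItemsPipeline is made up for illustration; the collection and field names come from the example documents above.

```javascript
// Builds an aggregation pipeline that attaches order_items to a single order.
// The function name is hypothetical; collection and field names follow the
// example documents in this section.
function buildOrderWithItemsPipeline(orderId) {
  return [
    // Narrow to the one order first so the join only runs for that document.
    { $match: { _id: orderId } },
    {
      $lookup: {
        from: "order_items",     // the referenced collection
        localField: "_id",       // field on the order
        foreignField: "orderId", // field on each line item
        as: "items",             // output array added to the order
      },
    },
  ];
}

// Usage in mongosh, assuming `orderId` holds a real ObjectId:
//   db.orders.aggregate(buildOrderWithItemsPipeline(orderId))
```

The result has the same shape as the embedded version, but the server does the join work on every read, and an index on order_items.orderId is what keeps that affordable.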

When to Embed

Embedding is MongoDB's default strength. Use it when:

The relationship is one-to-few

A user has a couple of shipping addresses. A product has a list of specs. A blog post has a set of tags. These are small, bounded sets of data that belong to the parent and rarely stand alone.

{
  "name": "Alice",
  "addresses": [
    { "label": "Home", "city": "Prague", "street": "Vinohradská 12" },
    { "label": "Work", "city": "Prague", "street": "Karlova 8" }
  ]
}

Nobody queries addresses independently. They're always accessed with the user. Embed them.

You always read the data together

If your application loads an order and always needs the line items, embedding saves you a second query. This is the access pattern argument, and it's the most important factor in the decision.

Think about it from the read side: if you'd always $lookup the related data anyway, you're paying for the join on every read. Just embed it.

The data doesn't change independently

Embedded data gets updated through its parent document. If the child data is mostly written once and read many times (order line items, form submissions, a short and bounded event history), embedding works cleanly. If the child data changes frequently on its own, every change has to locate the right element inside the parent's array, and updates get awkward.
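
Here is roughly what "updating through the parent" looks like, as a sketch using the positional $ operator. The helper name is an assumption for illustration; the field names match the embedded order example from earlier.

```javascript
// Builds the filter and update needed to change the quantity of one embedded
// line item. The helper name is hypothetical.
function buildLineItemQtyUpdate(orderId, productName, newQty) {
  return {
    // The array condition in the filter is what the positional `$` binds to.
    filter: { _id: orderId, "items.product": productName },
    // `items.$` refers to the first array element matched by the filter.
    update: { $set: { "items.$.qty": newQty } },
  };
}

// Usage in mongosh, assuming `orderId` holds a real ObjectId:
//   const { filter, update } = buildLineItemQtyUpdate(orderId, "USB-C Cable", 3);
//   db.orders.updateOne(filter, update);
// Note: this does not recompute the order's `total`; that is a second change
// you would have to make yourself.
```

For a one-off correction this is fine. If you find yourself writing updates like this all day, that is a hint the child data wants its own collection.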

When to Reference

Referencing adds complexity but solves real problems. Use it when:

The relationship is one-to-many (with a large or unknown "many") or many-to-many

A blog post might have 5 comments or 5,000. An author writes for multiple publications. A product appears in multiple orders. These are cases where embedding creates unbounded growth or data duplication.

// comments collection - referenced by postId
{
  "_id": ObjectId("..."),
  "postId": ObjectId("post_1"),
  "author": "bob",
  "body": "Great article!",
  "date": ISODate("2026-02-25")
}

Keeping comments in their own collection means the post document stays lean regardless of how popular it gets.

The sub-documents are large or growing

If your embedded array can grow without a clear upper bound, you're on a path to the 16MB document limit. Social media posts with comments, IoT devices with sensor readings, users with activity logs: these are all cases where the child data can grow indefinitely. Reference them.
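
A quick back-of-the-envelope calculation shows how far away the ceiling is. This sketch assumes you already know (or can estimate) the average size of one embedded item in bytes; the function name and the sample sizes are illustrative only.

```javascript
// MongoDB's maximum BSON document size: 16 MiB.
const MAX_BSON_BYTES = 16 * 1024 * 1024; // 16,777,216 bytes

// Rough upper bound on how many embedded items fit in one document.
// Both inputs are estimates you supply; real BSON overhead per array
// element will make the true number somewhat lower.
function maxEmbeddedItems(baseDocBytes, avgItemBytes) {
  if (avgItemBytes <= 0) {
    throw new RangeError("avgItemBytes must be positive");
  }
  const remaining = MAX_BSON_BYTES - baseDocBytes;
  return Math.max(0, Math.floor(remaining / avgItemBytes));
}

// Example: a 2 KiB post with 512-byte comments tops out around 32,764 comments.
console.log(maxEmbeddedItems(2048, 512));
```

That number sounds large, but it is a hard failure point, and reads and writes on a multi-megabyte document get painful long before you reach it.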

The data is shared across multiple parents

A tag applied to 500 blog posts. A product in 200 orders. If you embed shared data, you store duplicates everywhere, and updating it means updating every parent document that contains a copy. References give you a single source of truth.
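
To see the difference in update cost, compare what renaming a tag takes under each model. This is a sketch under assumptions: the posts and tags collection names, the tags.tagId and name fields, and the helper itself are all hypothetical.

```javascript
// Describes the write needed to rename a tag, depending on how tags are stored.
// Collection and field names are assumptions for illustration.
function buildTagRename(tagId, newName, embedded) {
  if (embedded) {
    // Embedded copies: every post carrying the tag must be rewritten.
    return {
      collection: "posts",
      method: "updateMany",
      filter: { "tags.tagId": tagId },
      update: { $set: { "tags.$[t].name": newName } },
      // arrayFilters picks which array elements `$[t]` applies to.
      options: { arrayFilters: [{ "t.tagId": tagId }] },
    };
  }
  // Referenced: one document in one collection.
  return {
    collection: "tags",
    method: "updateOne",
    filter: { _id: tagId },
    update: { $set: { name: newName } },
    options: {},
  };
}

// Usage in mongosh:
//   const op = buildTagRename(tagId, "databases", true);
//   db[op.collection][op.method](op.filter, op.update, op.options);
```

With 500 posts sharing the tag, the embedded version touches 500 documents and the referenced version touches one.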

The data needs independent updates

If comments need moderation, if inventory counts change every minute, if a referenced entity has its own lifecycle separate from the parent, references make updates simpler. You update one document in one collection instead of reaching into a nested array inside a parent.
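
The inventory case is a good illustration. With products in their own collection, a stock change is a small, targeted update. The stock field and the helper name below are assumptions, not something from a real schema.

```javascript
// Builds a guarded stock decrement for a product stored in its own collection.
// The `stock` field name is hypothetical.
function buildStockDecrement(productId, qty) {
  if (!Number.isInteger(qty) || qty <= 0) {
    throw new RangeError("qty must be a positive integer");
  }
  return {
    // Only match if enough stock remains, so the count never goes negative.
    filter: { _id: productId, stock: { $gte: qty } },
    update: { $inc: { stock: -qty } },
  };
}

// Usage in mongosh:
//   const { filter, update } = buildStockDecrement(productId, 2);
//   const res = db.products.updateOne(filter, update);
//   // res.modifiedCount === 0 means the product was missing or out of stock.
```

No parent documents are involved, so orders that mention the product are untouched.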

The Hybrid Approach: Partial Embedding

This is where experienced MongoDB developers land, and it's underused. You embed a summary and reference the full data.

// orders collection - with a partial embed of product info
{
  "_id": ObjectId("..."),
  "customer": "alice",
  "items": [
    {
      "productId": ObjectId("prod_1"),
      "name": "Mechanical Keyboard",  // denormalized summary
      "price": 149.99,                // snapshot at time of order
      "qty": 1
    }
  ],
  "total": 149.99
}

The order has enough product info to display without a join, but the productId reference is there when you need the full product details. This pattern works because the embedded data is a snapshot: the product name and price at the time of purchase, which shouldn't change even if the product itself gets updated later.
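
In application code, the snapshot is usually just a small mapping step at order time. A minimal sketch, with hypothetical helper names and a made-up product document:

```javascript
// Copies only the fields an order needs from a full product document.
// The copy is a snapshot: later changes to the product do not affect it.
function toOrderLineItem(product, qty) {
  return {
    productId: product._id, // reference for when full details are needed
    name: product.name,     // denormalized summary
    price: product.price,   // price at time of order
    qty,
  };
}

// Sums line items in integer cents to avoid floating-point drift.
function orderTotal(items) {
  const cents = items.reduce(
    (sum, item) => sum + Math.round(item.price * 100) * item.qty,
    0
  );
  return cents / 100;
}

// Example with a made-up product document:
const keyboard = {
  _id: "prod_1",
  name: "Mechanical Keyboard",
  price: 149.99,
  description: "A long description the order does not need",
};
const lineItem = toOrderLineItem(keyboard, 1);
keyboard.price = 159.99; // a later price change leaves the snapshot alone
console.log(lineItem.price, orderTotal([lineItem]));
```

The long description, images, and stock data stay in the products collection, where they can change freely.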

Other examples of partial embedding:

  • User profiles in comments: Embed { userId, displayName, avatar } instead of the full user document
  • Category breadcrumbs in products: Embed { categoryId, name, path } for display
  • Latest N items: Embed the 5 most recent comments, reference the rest
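
The "latest N" variant can be maintained with a single update on the parent, using $push with the $each, $sort, and $slice modifiers. This is a sketch: the recentComments and commentCount field names and the helper are assumptions.

```javascript
// Builds an update that adds a comment summary to a capped, newest-first
// array on the post and bumps a counter. Field names are hypothetical.
function buildRecentCommentsPush(comment, keep = 5) {
  return {
    $push: {
      recentComments: {
        $each: [comment],    // $sort and $slice require $each
        $sort: { date: -1 }, // newest first
        $slice: keep,        // keep only the first `keep` elements
      },
    },
    $inc: { commentCount: 1 },
  };
}

// Usage in mongosh, after inserting the full comment into its own collection:
//   db.posts.updateOne({ _id: postId }, buildRecentCommentsPush(summary));
```

The embedded array can never exceed `keep` elements, so the post document stays bounded no matter how many comments exist in the comments collection.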

Real-World Decision Matrix

Here's how this plays out in practice:

E-commerce: Orders and Line Items

Embed. An order's line items are created once, read together, and don't change. They're bounded (nobody has 10,000 items in a single order). This is textbook embedding.

Blog: Posts and Comments

Reference (or hybrid). Comments can be unbounded, they're often loaded separately (paginated), and they have their own lifecycle (moderation, editing, deletion). Embed a comment count on the post for display, reference the actual comments.

Social: Users and Followers

Reference. A user can have millions of followers. Embedding them would obliterate the document size limit. Store follow relationships in a separate collection with indexes on both followerId and followingId.
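
One possible shape for that collection, as a sketch. The follows collection name, the createdAt field, and the exact index layout are assumptions; the article only specifies that both followerId and followingId should be indexed.

```javascript
// One document per follow relationship.
function followDoc(followerId, followingId) {
  return { followerId, followingId, createdAt: new Date() };
}

// Index specs covering both directions of the relationship.
const followIndexes = [
  // "Who does X follow?" and prevents duplicate follows.
  { keys: { followerId: 1, followingId: 1 }, options: { unique: true } },
  // "Who follows X?"
  { keys: { followingId: 1 }, options: {} },
];

// Usage in mongosh:
//   followIndexes.forEach((ix) => db.follows.createIndex(ix.keys, ix.options));
//   db.follows.insertOne(followDoc(aliceId, bobId));
//   db.follows.countDocuments({ followingId: bobId }); // follower count
```

Each follow is a tiny document, so a user with millions of followers is just millions of small index entries rather than one enormous array.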

CMS: Articles and Authors

Reference (with partial embed). An author writes many articles. Store the full author in an authors collection, embed { authorId, name } on each article for display.

Performance Implications

The tradeoff is straightforward:

Embedding means fewer queries but larger documents. Reads are fast because everything is in one place. Writes to embedded arrays can be slower, especially with $push on large arrays. Large documents are also more expensive to rewrite, cache, and send over the network, even when a query only needs a few of their fields.

Referencing means more queries but smaller, focused documents. Updates are simpler. Documents don't grow unexpectedly. But you pay for the join, either in application code or with $lookup, which isn't free.

For read-heavy workloads where related data is always accessed together, embedding usually wins. For write-heavy workloads with independently evolving data, referencing is cleaner.

If you're unsure where your bottleneck is, tools like Mingo help you explore your real document structures and see what's happening in your collections. You can spot bloated documents, inspect nested arrays that have grown out of control, and check whether your current schema still matches your access patterns.

Common Mistakes

Embedding unbounded arrays

The most frequent mistake. An array that can grow without limit will eventually hit the 16MB ceiling, and performance degrades long before that. If you can't put a reasonable upper bound on an array, don't embed it.
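
If you suspect an array is already growing out of hand, an aggregation can surface the worst offenders. A minimal sketch; the helper name is made up, and the array field name is whatever your schema uses.

```javascript
// Builds a pipeline that lists the documents with the longest `arrayField`.
// Assumes the field is an array when present; $size raises an error on
// non-array values, so missing fields are treated as empty arrays here.
function buildLargestArraysPipeline(arrayField, limit = 10) {
  return [
    {
      $project: {
        length: { $size: { $ifNull: ["$" + arrayField, []] } },
      },
    },
    { $sort: { length: -1 } },
    { $limit: limit },
  ];
}

// Usage in mongosh:
//   db.posts.aggregate(buildLargestArraysPipeline("comments"))
```

This scans the whole collection, so it is better suited to an occasional check than to a hot path.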

Referencing everything like SQL

Coming from PostgreSQL and normalizing your MongoDB schema into 15 collections with foreign keys everywhere defeats the purpose. You end up with the join overhead of SQL without the join optimization of a relational database. MongoDB's $lookup performs a left outer join, but it is not a drop-in replacement for a SQL JOIN: it is generally slower and less flexible, and it should be the exception rather than the rule.

Ignoring access patterns

This is the root cause of most schema problems. People design their schema based on the data structure rather than how the application reads and writes it. Start with your queries: what data do you need together? How often? That drives the schema, not the entity-relationship diagram.

Never revisiting the decision

Your access patterns change as your product evolves. A blog whose posts averaged 10 comments worked fine with an embedded array; now some posts have 2,000. Mingo makes it easy to periodically check your collections and see whether your original schema decisions still hold up, before they become production incidents.
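
You can also run this kind of check from the shell. The sketch below relies on the $bsonSize aggregation operator, which is available in MongoDB 4.4 and later; the helper name is hypothetical.

```javascript
// Builds a pipeline that lists the largest documents in a collection by
// BSON size in bytes. Requires MongoDB 4.4+ for $bsonSize.
function buildLargestDocsPipeline(limit = 10) {
  return [
    { $project: { bytes: { $bsonSize: "$$ROOT" } } },
    { $sort: { bytes: -1 } },
    { $limit: limit },
  ];
}

// Usage in mongosh:
//   db.posts.aggregate(buildLargestDocsPipeline(5))
```

Running it on a schedule and watching how the top few sizes trend over time gives early warning well before anything approaches the 16MB limit.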

Making the Decision

When you're staring at a new relationship and wondering "embed or reference?", walk through these questions:

  1. How many related items will there be? Fewer than ~100 with a clear cap? Embed. Unbounded? Reference.
  2. Do you always read them together? Yes? Embed. Sometimes or rarely? Reference.
  3. Does the child data change independently? Rarely? Embed. Frequently? Reference.
  4. Is the data shared across multiple parents? No? Embed. Yes? Reference.
  5. Could the embedded data grow large enough to matter? No? Embed. Maybe? Hybrid or reference.

If you get mixed signals, lean toward the hybrid approach. Embed the summary data you need for display, keep a reference for full access, and revisit as your application evolves.
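
The checklist can be written down as a tiny function, which is handy for design reviews. This is a deliberate simplification: it treats all five questions as equally weighted votes, and the property names are invented for this sketch.

```javascript
// Turns the five checklist answers into a recommendation.
// Unanimous answers give "embed" or "reference"; anything mixed gives "hybrid".
function recommendModel(answers) {
  const votes = [
    answers.boundedAndSmall ? "embed" : "reference",      // question 1
    answers.alwaysReadTogether ? "embed" : "reference",   // question 2
    answers.changesIndependently ? "reference" : "embed", // question 3
    answers.sharedAcrossParents ? "reference" : "embed",  // question 4
    answers.mayGrowLarge ? "reference" : "embed",         // question 5
  ];
  const embedVotes = votes.filter((v) => v === "embed").length;
  if (embedVotes === votes.length) return "embed";
  if (embedVotes === 0) return "reference";
  return "hybrid";
}

// Order line items: small, read together, static, not shared, bounded.
console.log(
  recommendModel({
    boundedAndSmall: true,
    alwaysReadTogether: true,
    changesIndependently: false,
    sharedAcrossParents: false,
    mayGrowLarge: false,
  })
);
```

In practice some questions matter more than others (an unbounded array is close to a veto on embedding), so treat the output as a conversation starter, not a verdict.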

There's no permanent right answer here, only the right answer for your current access patterns. The important thing is to make a deliberate choice, understand the tradeoffs, and keep an eye on how your data actually behaves in production.