MongoDB Schema Design: Embedding vs Referencing (2026)

Every MongoDB schema decision comes down to one fork: do you embed the related data as a subdocument or array inside one document, or do you reference it by storing its _id (or another key) and reading the related document separately? Embedding gives you a single-document read that is fast and atomic. Referencing keeps documents small and avoids duplicating data that lives in many places, at the cost of a second query or a $lookup join. Below is the decision table, the same relationship modeled both ways, a $lookup example, and the established patterns that sit between the two extremes.

Short answer: embed when the related data is small, bounded, and almost always read together with its parent (one-to-one or one-to-few). Reference when the relationship is one-to-many with high or unbounded cardinality, the related entity is large or shared across many parents, or it changes on its own and you do not want to duplicate it. The hard ceiling that forces a reference is the 16 MB BSON document limit: an unbounded embedded array eventually blows past it.

Embed or reference: the core decision

MongoDB stores data as BSON documents, and a document can nest other documents and arrays deeply (MongoDB allows up to 100 levels of nesting). That nesting is the whole point of the document model, so the first instinct should be to embed. The MongoDB manual puts it plainly: data that is accessed together should be stored together, and you model for your application's query patterns, not for a normalized textbook schema.

This is the opposite instinct from relational design, and it is worth saying out loud. In SQL you normalize first (split everything into its own table, join on read) and denormalize only when the joins hurt. In MongoDB you start denormalized (embed the related data in the document that reads it) and reach for references only when embedding stops paying off. If you carry relational habits into MongoDB you end up with a pile of tiny collections and a $lookup on every query, which is the slow path the document model was built to avoid.

Two mechanisms, two trade-offs:

Embedding nests the related data inside the parent document. One read returns everything. A single-document write is atomic in MongoDB, so updating the parent and its embedded children together is all-or-nothing with no transaction needed.
Referencing stores a pointer (_id, slug, or any key) and leaves the related data in its own collection. You resolve it with a second query or an aggregation $lookup. Documents stay small, shared data lives in one place, but you give up the single-read and the free atomicity, and MongoDB enforces no foreign keys, so referential integrity is your application's job.

When to embed

Embed when most of these hold; the more that hold, the stronger the case:

The related data is read together with the parent almost every time.
The relationship is one-to-one or one-to-few (a handful, not thousands).
The embedded data is bounded in size and does not grow without limit.
You do not need to query or update the child independently of the parent.

A user with one address, an order with its line items, a blog post with its tags: these are textbook embeds. You fetch the order, you get the line items in the same read, and you update them atomically.

A useful gut-check for the array size is the "rule of thousands": if the embedded array stays in the dozens or low hundreds and has a real upper bound, embedding is safe. Once it can run into the thousands with no ceiling, treat it as unbounded and reference (or bucket) it instead. The number is a heuristic, not a hard threshold; what actually matters is the total size of the array and its document, not the element count alone.

javascript

// Order with embedded line items: one read, atomic writes
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f10001"),
  customerId: ObjectId("6512a1f0c3b9a2e4d8f10999"),
  status: "shipped",
  placedAt: ISODate("2026-05-28T14:02:00Z"),
  items: [
    { sku: "TE-MUG-01", name: "Field Notes Mug", qty: 2, unitPrice: 1400 },
    { sku: "TE-CAP-03", name: "Logo Cap",        qty: 1, unitPrice: 2200 }
  ],
  shippingAddress: {
    line1: "12 Bayswater Rd", city: "Colombo", postcode: "00300", country: "LK"
  }
}

Figure: a live mongosh session in MongoDB 7 showing the embedded document as inserted and returned (the real stored document, not an illustration).

The line items have no meaning outside this order and the address is a single subdocument, so both belong inside the parent.
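
To make the atomicity point concrete, here is a minimal sketch of a single write that appends a line item and touches a parent field at the same time. The helper name buildAddItemUpdate and the updatedAt field are assumptions for illustration; neither appears in the order document above.

```javascript
// Sketch: build the arguments for ONE updateOne call on an order document.
// buildAddItemUpdate and updatedAt are hypothetical names, not part of the
// article's schema. Both changes land in the same document, so MongoDB
// applies them together or not at all.
function buildAddItemUpdate(orderId, item, now = new Date()) {
  return {
    filter: { _id: orderId },
    update: {
      $push: { items: item },   // append to the embedded array
      $set: { updatedAt: now }  // change a parent field in the same write
    }
  };
}

// Usage in mongosh, assuming an orders collection shaped like the example:
// const { filter, update } = buildAddItemUpdate(
//   orderId,
//   { sku: "TE-PIN-07", name: "Enamel Pin", qty: 3, unitPrice: 500 }
// );
// db.orders.updateOne(filter, update);
```

Because the array and the parent field live in one document, no multi-document transaction is needed for this change.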

When to reference

Reference when embedding would force you to duplicate data, grow a document without bound, or drag along data you rarely need:

The relationship is one-to-many or many-to-many with high or unbounded cardinality.
The related entity is large, or shared across many parents (the same author on a thousand posts).
The related entity changes on its own and you do not want to update copies scattered across documents.

A blog post referencing an author is the canonical case. The author is one record shared by every post they wrote; if you embedded the full author object into each post, a name change means rewriting every post.

When you reference a one-to-many relationship, put the link on the many side: each post carries the authorId, each comment carries a postId. The parent stays small and fixed in size no matter how many children it gains, and you index the child collection on the parent key for fast lookups. An array of child _ids on the parent only works when that array is genuinely bounded, otherwise you have reinvented the unbounded-array problem the reference was meant to dodge.

javascript

// Post references its author by _id; author lives once in its own collection
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f20001"),
  title: "MongoDB Schema Design: Embedding vs Referencing",
  slug: "mongodb-embed-vs-reference-schema",
  authorId: ObjectId("6512a1f0c3b9a2e4d8f30007"),
  body: "Every MongoDB schema decision comes down to one fork...",
  publishedAt: ISODate("2019-11-19T09:00:00Z")
}

// authors collection
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f30007"),
  name: "Ishan Karunaratne",
  bio: "Tech architect, 20+ years shipping real systems.",
  twitter: "@techearl"
}
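
Resolving the reference with a second query might look like the sketch below. It assumes db behaves like the mongosh db object (a synchronous findOne that returns null when nothing matches); loadPostWithAuthor is a hypothetical helper name.

```javascript
// Sketch: application-level "join" using two reads.
// Assumption: `db` is mongosh-like, so db.posts.findOne(...) returns the
// document directly, or null when there is no match.
function loadPostWithAuthor(db, slug) {
  const post = db.posts.findOne({ slug });
  if (post === null) return null;

  // Second query: follow the stored authorId into the authors collection.
  const author = db.authors.findOne({ _id: post.authorId });

  // author is null when the reference dangles; nothing in the database
  // prevents that, so the caller has to handle it.
  return { ...post, author };
}
```

With the Node.js driver the same two reads would be awaited, but the shape of the logic is the same.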

The 16 MB limit is the hard line

The constraint that turns "should I embed?" into "I cannot embed" is the 16 MB maximum BSON document size. Every document, with all its embedded subdocuments and arrays, has to fit under that ceiling. An unbounded embedded array (comments on a viral post, events on a busy device, "massive arrays" in MongoDB's own anti-pattern language) will eventually hit it.

Long before you reach 16 MB the performance cost shows up. A large document is more expensive to read, ship over the wire, and hold in the working set, and growing an array means MongoDB rewrites the whole document. So the rule for any unbounded one-to-many is: do not embed the many side without a cap. Reference it, or use one of the patterns below (subset, bucket) that embeds a bounded slice and references the rest.
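
A back-of-the-envelope calculation shows how quickly an unbounded array approaches the ceiling. This is a rough sketch: the function name and the example sizes are assumptions, and it ignores BSON's small per-element array overhead, so the real limit is somewhat lower.

```javascript
// 16 MB BSON limit expressed in bytes (16 * 1024 * 1024 = 16,777,216).
const MAX_BSON_BYTES = 16 * 1024 * 1024;

// Rough upper bound on how many embedded elements fit in one document.
// avgElementBytes and parentBytes are estimates you supply; this helper
// is illustrative and not a MongoDB API.
function maxEmbeddedElements(avgElementBytes, parentBytes = 0) {
  if (!(avgElementBytes > 0)) {
    throw new RangeError("avgElementBytes must be positive");
  }
  const room = MAX_BSON_BYTES - parentBytes;
  return Math.max(0, Math.floor(room / avgElementBytes));
}

// Example: ~300-byte comments on a post whose other fields take ~2,000 bytes.
console.log(maxEmbeddedElements(300, 2000)); // 55917
```

A post that can plausibly attract more than a few tens of thousands of comments of that size cannot embed them all, and the read and write cost becomes painful well before that point.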

Joining referenced data with $lookup

When you reference, you eventually need the related data back. The aggregation $lookup stage performs a left outer join from one collection into another:

javascript

// Join each post to its author document
db.posts.aggregate([
  { $match: { slug: "mongodb-embed-vs-reference-schema" } },
  {
    $lookup: {
      from: "authors",
      localField: "authorId",
      foreignField: "_id",
      as: "author"
    }
  },
  { $unwind: "$author" },               // turn the 1-element array into a subdocument
  { $project: { title: 1, "author.name": 1, "author.bio": 1 } }
])

$lookup writes the matched documents into an array field (author here), which is why a single-match join is usually followed by $unwind to flatten it. Two things to keep in mind. First, a $lookup is less efficient than an embedded read (it is a join, with all the work a join implies). Second, there is no enforced referential integrity: a dangling authorId that points at a deleted author just produces an empty author array, and a plain $unwind like the one above then drops that post from the results unless you set preserveNullAndEmptyArrays: true. The application has to keep references honest.
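
If posts with a missing author should still be returned, the document form of $unwind with preserveNullAndEmptyArrays does that. The sketch below builds such a pipeline; postWithAuthorPipeline is a hypothetical helper name, and the collection and field names follow the example above.

```javascript
// Sketch: the same join, but tolerant of dangling authorId values.
// With preserveNullAndEmptyArrays: true, a post whose $lookup matched
// nothing is kept (it simply has no author field) instead of being dropped.
function postWithAuthorPipeline(slug, keepPostsWithoutAuthor = true) {
  return [
    { $match: { slug } },
    {
      $lookup: {
        from: "authors",
        localField: "authorId",
        foreignField: "_id",
        as: "author"
      }
    },
    {
      $unwind: {
        path: "$author",
        preserveNullAndEmptyArrays: keepPostsWithoutAuthor
      }
    },
    { $project: { title: 1, "author.name": 1, "author.bio": 1 } }
  ];
}

// Usage in mongosh:
// db.posts.aggregate(postWithAuthorPipeline("mongodb-embed-vs-reference-schema"));
```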

Embed vs reference decision table

Factor	Lean embed	Lean reference
Relationship cardinality	One-to-one, one-to-few	One-to-many, many-to-many, unbounded
Access pattern	Parent and child read together	Child queried or used on its own
Data size / growth	Small, bounded	Large, or grows without limit
Update frequency	Child changes with the parent	Child changes independently
Atomicity	Need a single atomic write	Separate writes (or a multi-document transaction) are fine
Shared across parents	No, owned by one parent	Yes, one entity many parents
Query cost	Single-document read (fast)	Extra query or $lookup join

No single row decides it. Weigh them together: a one-to-few relationship that is read together and updated together is an obvious embed even if the child could in theory be queried alone.
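
One way to read the table is as a set of votes with a single hard veto. The sketch below is only an illustration of that weighing: the function name, the input shape, and the simple-majority threshold are all assumptions, not a rule from MongoDB's documentation.

```javascript
// Sketch: turn the decision table into a rough heuristic.
// `rel` describes one relationship; every field name here is hypothetical.
function leanEmbedOrReference(rel) {
  // Hard veto: an unbounded many side must not be embedded (16 MB limit).
  // In practice "reference" here may also mean a subset or bucket design.
  if (!rel.bounded) return "reference";

  const embedSignals = [
    rel.cardinality === "one-to-one" || rel.cardinality === "one-to-few",
    rel.readTogether,
    !rel.changesIndependently,
    rel.needsAtomicWrite,
    !rel.sharedAcrossParents
  ];
  const votes = embedSignals.filter(Boolean).length;

  // Simple majority of the five signals; the threshold is arbitrary.
  return votes >= 3 ? "embed" : "reference";
}

// Order line items: few, read with the order, owned by it.
console.log(leanEmbedOrReference({
  cardinality: "one-to-few", bounded: true, readTogether: true,
  changesIndependently: false, needsAtomicWrite: true, sharedAcrossParents: false
})); // embed
```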

The same relationship, both ways

Take a blog post and its comments. Embedded, the comments live in the post and one read renders the whole page:

javascript

// Embedded: fine while comment count stays bounded
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f20001"),
  title: "MongoDB Schema Design: Embedding vs Referencing",
  body: "Every MongoDB schema decision...",
  comments: [
    { author: "dev_anna", text: "Saved me a refactor, thanks.", at: ISODate("2026-05-28T10:00:00Z") },
    { author: "kp_ops",   text: "The 16 MB note is the key bit.", at: ISODate("2026-05-28T11:30:00Z") }
  ]
}

That is the right model for a post that gets a few dozen comments. It is the wrong model for a post that might get fifty thousand, because the array is unbounded and marches toward the 16 MB wall. There, reference the comments into their own collection:

javascript

// Referenced: comments scale independently, queried and paged on their own
// posts
{ _id: ObjectId("6512a1f0c3b9a2e4d8f20001"), title: "MongoDB Schema Design: Embedding vs Referencing" }

// comments (one document per comment, indexed on postId)
{ _id: ObjectId("..."), postId: ObjectId("6512a1f0c3b9a2e4d8f20001"), author: "dev_anna", text: "Saved me a refactor, thanks.", at: ISODate("2026-05-28T10:00:00Z") }
{ _id: ObjectId("..."), postId: ObjectId("6512a1f0c3b9a2e4d8f20001"), author: "kp_ops",   text: "The 16 MB note is the key bit.",   at: ISODate("2026-05-28T11:30:00Z") }

Now comments page and sort on their own, and the post document stays tiny. The cost is the second query (or a $lookup) to pull comments when you render the page. Same relationship, two schemas, and the deciding factor is purely whether the comment count is bounded.
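
Paging the referenced comments could be sketched as follows. It assumes the comments collection shown above (postId and at fields) and uses range-based paging on at; commentsPageQuery is a hypothetical helper, and comments sharing the exact same timestamp would need a tie-breaker such as _id.

```javascript
// Sketch: build a "newest first" page query for one post's comments.
// `before` is the `at` value of the last comment on the previous page,
// or null for the first page.
function commentsPageQuery(postId, pageSize = 20, before = null) {
  const filter = { postId };
  if (before !== null) {
    filter.at = { $lt: before }; // only comments older than the last one seen
  }
  return { filter, sort: { at: -1 }, limit: pageSize };
}

// Usage in mongosh. The compound index lets the filter and the sort
// both be served from the index:
// db.comments.createIndex({ postId: 1, at: -1 });
// const q = commentsPageQuery(postId, 20);
// db.comments.find(q.filter).sort(q.sort).limit(q.limit);
```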

The patterns between the extremes

Embed-or-reference is not binary. A set of named patterns, popularized in MongoDB's own schema-design material, covers the common middle ground:

One-to-few: embed. A person's few addresses, a product's handful of variants. Just nest the array.
One-to-many: reference (or subset). Reference the many side into its own collection. If you almost always need only the most recent few, use the subset pattern: embed the hot slice (last 5 comments, top 3 reviews) in the parent and reference the full set elsewhere.
One-to-squillions: reference, often with the bucket pattern. For unbounded high-volume data like IoT readings or log lines, store one document per bucket of events (an hour of readings, say) rather than one per event, with each bucket referencing its device. This keeps the number of documents and the size of the indexes manageable.
Extended reference: duplicate a few hot fields. When a join just to grab a name and avatar is wasteful, copy those one or two fields onto the referencing document alongside the _id. You accept a little duplication to skip the $lookup, and you update the copies when the source changes.
Computed pattern: store the rolled-up result. When reads recompute the same aggregate constantly (a running total, an average rating), compute it on write and store it, rather than recomputing on every read.

These are not exotic. Most production schemas are a mix: embed the one-to-few, reference the one-to-many, extended-reference the hot fields, and bucket the firehose.
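
Two of these patterns can be sketched as write-side update specs. The field names recentComments, commentCount, readings, count, and hour, and both helper names, are assumptions for illustration. The first combines the subset pattern with a stored counter (a small instance of the computed pattern); the second is an hourly bucket.

```javascript
// Subset + computed: keep only the newest `keep` comments on the post and
// maintain a running count. The full comment would also be inserted into
// the comments collection in a separate write.
function subsetCommentUpdate(postId, comment, keep = 5) {
  return {
    filter: { _id: postId },
    update: {
      // $each with a negative $slice trims the array to its last `keep` items
      $push: { recentComments: { $each: [comment], $slice: -keep } },
      $inc: { commentCount: 1 }
    }
  };
}

const HOUR_MS = 60 * 60 * 1000;

// Bucket: one document per device per hour. The upsert creates the bucket
// on the first reading of the hour and appends to it afterwards.
function bucketReadingUpsert(deviceId, at, value) {
  const hour = new Date(Math.floor(at.getTime() / HOUR_MS) * HOUR_MS);
  return {
    filter: { deviceId, hour },
    update: { $push: { readings: { at, value } }, $inc: { count: 1 } },
    options: { upsert: true }
  };
}

// Usage in mongosh:
// const s = subsetCommentUpdate(postId, { author: "dev_anna", text: "..." });
// db.posts.updateOne(s.filter, s.update);
// const b = bucketReadingUpsert(deviceId, new Date(), 21.5);
// db.readings.updateOne(b.filter, b.update, b.options);
```

Each bucket stays bounded as long as the reading rate per hour is bounded; a busier device would call for a smaller time window.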

What to do next

For picking an identifier type when you reference (ObjectId vs UUID), the trade-offs around storing a UUID in MongoDB are worth a read before you commit to a key.
For modeling currency on an embedded line item or a referenced product, the right way to store money in MongoDB (integer minor units or Decimal128, not floats) avoids a class of rounding bugs.

FAQ

Should I embed or reference in MongoDB?

Embed when the related data is small, bounded, and almost always read together with its parent (one-to-one or one-to-few). Reference when the relationship is one-to-many or many-to-many with high cardinality, the related entity is large or shared across many parents, or it changes independently and you do not want duplicate copies.

The guiding rule from MongoDB's data-modeling docs: data accessed together should be stored together, and you model for your application's query patterns, not for normalization.

What is the maximum size of a MongoDB document?

A single BSON document cannot exceed 16 MB, including every embedded subdocument and array. This is the hard limit that rules out embedding an unbounded array: comments on a viral post, events on a busy sensor, anything that grows without a cap will eventually hit it. Those relationships must be referenced into their own collection (or bucketed).

How do I join two collections in MongoDB?

Use the $lookup stage in an aggregation pipeline. It performs a left outer join from one collection into another, matching localField against foreignField and writing the matches into an array field you name with as. For a single-match join you usually follow it with $unwind to flatten that array into a subdocument. Note that a $lookup is slower than reading embedded data, and MongoDB enforces no foreign keys, so a reference to a deleted document just returns no matches.

Does MongoDB enforce foreign keys between referenced documents?

No. MongoDB does not enforce referential integrity. If you store an authorId on a post and then delete that author, the post still holds the now-dangling reference, and a $lookup against it returns an empty result. Keeping references valid (cascading deletes, cleanup jobs, or validation on write) is the application's responsibility, not the database's.

Is embedding always faster than referencing?

For reads, usually yes: embedded data comes back in a single document read with no join, and a single-document write is atomic. But embedding stops paying off when the embedded data grows without bound (you risk the 16 MB limit and pay to rewrite a large document on every change) or when the same entity is duplicated across many parents (a single update becomes many). At that point a reference plus a $lookup is the faster and cleaner model overall.

When I reference a one-to-many relationship, which side stores the link?

Store the reference on the many side: each child holds the parent's _id (a postId on every comment, a customerId on every order). That keeps the parent document small and fixed in size no matter how many children exist, and it lets you index the child collection on the parent key for fast, paged lookups.

Putting an array of child _ids on the parent instead works only when that array is bounded; otherwise it reintroduces the unbounded-array problem the reference was meant to avoid. The exception is the subset pattern: keep a short array of the hot few on the parent and the full set on the many side.

When should I duplicate (denormalize) data across MongoDB documents?

Duplicate a field when the read cost of joining for it outweighs the write cost of keeping copies in sync, and when that field rarely changes. The extended reference pattern is exactly this: copy one or two hot fields (an author name, a product title) onto the referencing document next to the _id, so the common read skips the $lookup. You take on the job of updating the copies when the source changes, which is cheap if the field is near-static and expensive if it churns. Do not duplicate volatile data; reference it.

How is MongoDB schema design different from relational design?

It inverts the default. In a relational database you normalize first (one entity per table, join on read) and denormalize only when joins hurt. In MongoDB you start by embedding related data in the document that reads it, and reach for references only when cardinality, duplication, or the 16 MB limit force it. You model around how the application queries the data, not around a normalized schema.

Ishan Karunaratne
