TechEarl

MongoDB Schema Design: Embedding vs Referencing

When to embed related data as subdocuments and when to reference it by _id in MongoDB. The 16 MB document limit, $lookup joins, access patterns, and the established schema-design patterns.

Ishan Karunaratne · 12 min read

Every MongoDB schema decision comes down to one fork: do you embed the related data as a subdocument or array inside one document, or do you reference it by storing its _id (or another key) and reading the related document separately? Embedding gives you a single-document read that is fast and atomic. Referencing keeps documents small and avoids duplicating data that lives in many places, at the cost of a second query or a $lookup join. Below is the decision table, the same relationship modeled both ways, a $lookup example, and the established patterns that sit between the two extremes.

Short answer: embed when the related data is small, bounded, and almost always read together with its parent (one-to-one or one-to-few). Reference when the relationship is one-to-many with high or unbounded cardinality, the related entity is large or shared across many parents, or it changes on its own and you do not want to duplicate it. The hard ceiling that forces a reference is the 16 MB BSON document limit: an unbounded embedded array eventually blows past it.

Embed or reference: the core decision

MongoDB stores data as BSON documents, and a document can nest other documents and arrays (MongoDB allows up to 100 levels of nesting, far deeper than a sane schema needs). That nesting is the whole point of the document model, so the first instinct should be to embed. The MongoDB manual puts it plainly: data that is accessed together should be stored together, and you model for your application's query patterns, not for a normalized textbook schema.

This is the opposite instinct from relational design, and it is worth saying out loud. In SQL you normalize first (split everything into its own table, join on read) and denormalize only when the joins hurt. In MongoDB you start denormalized (embed the related data in the document that reads it) and reach for references only when embedding stops paying off. If you carry relational habits into MongoDB you end up with a pile of tiny collections and a $lookup on every query, which is the slow path the document model was built to avoid.

Two mechanisms, two trade-offs:

  • Embedding nests the related data inside the parent document. One read returns everything. A single-document write is atomic in MongoDB, so updating the parent and its embedded children together is all-or-nothing with no transaction needed.
  • Referencing stores a pointer (_id, slug, or any key) and leaves the related data in its own collection. You resolve it with a second query or an aggregation $lookup. Documents stay small and shared data lives in one place, but you give up the single read and the free atomicity. MongoDB also enforces no foreign keys, so referential integrity is your application's job.

When to embed

Embed when all of these hold, and the more that hold the stronger the case:

  • The related data is read together with the parent almost every time.
  • The relationship is one-to-one or one-to-few (a handful, not thousands).
  • The embedded data is bounded in size and does not grow without limit.
  • You do not need to query or update the child independently of the parent.

A user with one address, an order with its line items, a blog post with its tags: these are textbook embeds. You fetch the order, you get the line items in the same read, and you update them atomically.

A useful gut-check for the array size is the "rule of thousands": if the embedded array stays in the dozens or low hundreds and has a real upper bound, embedding is safe. Once it can run into the thousands with no ceiling, treat it as unbounded and reference (or bucket) it instead. The number is a heuristic, not a hard threshold; what actually matters is the total size of the array and the document that holds it, not the element count alone.

javascript
// Order with embedded line items: one read, atomic writes
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f10001"),
  customerId: ObjectId("6512a1f0c3b9a2e4d8f10999"),
  status: "shipped",
  placedAt: ISODate("2026-05-28T14:02:00Z"),
  items: [
    { sku: "TE-MUG-01", name: "Field Notes Mug", qty: 2, unitPrice: 1400 },
    { sku: "TE-CAP-03", name: "Logo Cap",        qty: 1, unitPrice: 2200 }
  ],
  shippingAddress: {
    line1: "12 Bayswater Rd", city: "Colombo", postcode: "00300", country: "LK"
  }
}

The line items have no meaning outside this order and the address is a single subdocument, so both belong inside the parent.
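To make the atomicity point concrete, here is a minimal sketch of one write that appends a line item and changes the order status together. The helper name buildAddItemUpdate, the extra SKU, the "amended" status, and the "orders" collection name are all assumptions for illustration; the code only builds the arguments you would hand to updateOne and never touches a database.

```javascript
// Hypothetical helper: builds the filter and update for a single-document write.
// Both operators target the same order document, so MongoDB applies both or neither.
function buildAddItemUpdate(orderId, item, newStatus) {
  const filter = { _id: orderId };
  const update = {
    $push: { items: item },       // append one embedded line item
    $set: { status: newStatus }   // change a parent field in the same write
  };
  return { filter, update };
}

const { filter, update } = buildAddItemUpdate(
  "6512a1f0c3b9a2e4d8f10001",     // a string stand-in; in mongosh this would be an ObjectId
  { sku: "TE-PIN-02", name: "Enamel Pin", qty: 3, unitPrice: 500 }, // made-up item
  "amended"
);

// In mongosh, assuming a collection named "orders":
// db.orders.updateOne(filter, update)
console.log(JSON.stringify({ filter, update }, null, 2));
```

With the items in their own collection, the same change would be two writes to two collections, and you would need a transaction to get the all-or-nothing behavior back.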

When to reference

Reference when embedding would force you to duplicate data, grow a document without bound, or drag along data you rarely need:

  • The relationship is one-to-many or many-to-many with high or unbounded cardinality.
  • The related entity is large, or shared across many parents (the same author on a thousand posts).
  • The related entity changes on its own and you do not want to update copies scattered across documents.

A blog post referencing an author is the canonical case. The author is one record shared by every post they wrote; if you embedded the full author object into each post, a name change would mean rewriting every one of those posts.

javascript
// Post references its author by _id; author lives once in its own collection
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f20001"),
  title: "MongoDB Schema Design: Embedding vs Referencing",
  slug: "mongodb-embed-vs-reference-schema",
  authorId: ObjectId("6512a1f0c3b9a2e4d8f30007"),
  body: "Every MongoDB schema decision comes down to one fork...",
  publishedAt: ISODate("2019-11-19T09:00:00Z")
}

// authors collection
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f30007"),
  name: "Ishan Karunaratne",
  bio: "Tech architect, 20+ years shipping real systems.",
  twitter: "@techearl"
}

The 16 MB limit is the hard line

The constraint that turns "should I embed?" into "I cannot embed" is the 16 MB maximum BSON document size. Every document, with all its embedded subdocuments and arrays, has to fit under that ceiling. An unbounded embedded array (comments on a viral post, events on a busy device, "massive arrays" in MongoDB's own anti-pattern language) will eventually hit it.

Long before you reach 16 MB the performance cost shows up. A large document is more expensive to read, ship over the wire, and hold in the working set, and growing an array means MongoDB rewrites the whole document. So the rule for any unbounded one-to-many is: do not embed the many side without a cap. Reference it, or use one of the patterns below (subset, bucket) that embeds a bounded slice and references the rest.
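A back-of-the-envelope way to see how far an array is from the ceiling is to divide the remaining headroom by the size of a typical element. The sketch below does that arithmetic. It uses JSON byte length as a stand-in for BSON size, which is only an approximation (BSON encodes types and field names differently), and the function names and sample documents are assumptions for illustration.

```javascript
// The 16 MB BSON limit expressed in bytes (16 * 1024 * 1024).
const MAX_BSON_BYTES = 16 * 1024 * 1024;

// Rough estimate only: JSON byte length approximates, but does not equal, BSON size.
function approxBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), "utf8");
}

// Roughly how many elements of this size fit before the document reaches the limit?
function maxElementsBeforeLimit(parentWithoutArray, sampleElement) {
  const headroom = MAX_BSON_BYTES - approxBytes(parentWithoutArray);
  return Math.floor(headroom / approxBytes(sampleElement));
}

// Sample documents, with the date kept as a string so JSON can measure it.
const post = {
  title: "MongoDB Schema Design: Embedding vs Referencing",
  body: "Every MongoDB schema decision..."
};
const comment = {
  author: "dev_anna",
  text: "Saved me a refactor, thanks.",
  at: "2026-05-28T10:00:00Z"
};

console.log("approx. comments before the limit:", maxElementsBeforeLimit(post, comment));
```

Treat the result as an order-of-magnitude figure. The practical cap should sit far below it, because the read, network, and cache costs arrive long before the hard limit does.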

Joining referenced data with $lookup

When you reference, you eventually need the related data back. The aggregation $lookup stage performs a left outer join from one collection into another:

javascript
// Join each post to its author document
db.posts.aggregate([
  { $match: { slug: "mongodb-embed-vs-reference-schema" } },
  {
    $lookup: {
      from: "authors",
      localField: "authorId",
      foreignField: "_id",
      as: "author"
    }
  },
  { $unwind: "$author" },               // turn the 1-element array into a subdocument
  { $project: { title: 1, "author.name": 1, "author.bio": 1 } }
])

$lookup writes the matched documents into an array field (author here), which is why a single-match join is usually followed by $unwind to flatten it. Two things to keep in mind. First, a $lookup is less efficient than an embedded read (it is a join, with all the work a join implies). Second, there is no enforced referential integrity, so a dangling authorId that points at a deleted author produces an empty array, and a plain $unwind then drops that post from the results entirely unless you set preserveNullAndEmptyArrays: true. The application has to keep references honest.
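The behavior of the join is easier to see on a tiny in-memory dataset. The sketch below imitates the semantics of $lookup and $unwind with plain arrays; it is an illustration of what the stages return, not how MongoDB executes them, and it only handles the array case that $lookup produces. The function names, the short string ids, and the second post are assumptions made up for the example.

```javascript
// In-memory imitation of $lookup: a left outer join that always adds an array field.
function lookup(docs, foreignDocs, { localField, foreignField, as }) {
  return docs.map((doc) => ({
    ...doc,
    // Every input document survives; unmatched ones get an empty array
    [as]: foreignDocs.filter((f) => f[foreignField] === doc[localField])
  }));
}

// In-memory imitation of $unwind on an array field.
function unwind(docs, field, { preserveNullAndEmptyArrays = false } = {}) {
  return docs.flatMap((doc) => {
    const values = doc[field];
    if (Array.isArray(values) && values.length > 0) {
      return values.map((v) => ({ ...doc, [field]: v }));
    }
    // Empty array: dropped by default, kept without the field when preserving
    if (!preserveNullAndEmptyArrays) return [];
    const { [field]: _omitted, ...rest } = doc;
    return [rest];
  });
}

const authors = [{ _id: "a7", name: "Ishan Karunaratne" }];
const posts = [
  { _id: "p1", title: "Embedding vs Referencing", authorId: "a7" },
  { _id: "p2", title: "Orphaned post", authorId: "a-deleted" } // dangling reference
];

const joined = lookup(posts, authors, {
  localField: "authorId", foreignField: "_id", as: "author"
});
const strict = unwind(joined, "author");
const lenient = unwind(joined, "author", { preserveNullAndEmptyArrays: true });

console.log(joined.map((d) => d.author.length));
console.log(strict.map((d) => d._id));
console.log(lenient.map((d) => d._id));
```

The post with the dangling authorId survives the join with an empty author array, disappears after the default unwind, and comes back (without an author field) when empty arrays are preserved.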

Embed vs reference decision table

| Factor | Lean embed | Lean reference |
| --- | --- | --- |
| Relationship cardinality | One-to-one, one-to-few | One-to-many, many-to-many, unbounded |
| Access pattern | Parent and child read together | Child queried or used on its own |
| Data size / growth | Small, bounded | Large, or grows without limit |
| Update frequency | Child changes with the parent | Child changes independently |
| Atomicity | Need a single atomic write | Per-collection writes are fine |
| Shared across parents | No, owned by one parent | Yes, one entity many parents |
| Query cost | Single-document read (fast) | Extra query or $lookup join |

No single row decides it. Weigh them together: a one-to-few relationship that is read together and updated together is an obvious embed even if the child could in theory be queried alone.

The same relationship, both ways

Take a blog post and its comments. Embedded, the comments live in the post and one read renders the whole page:

javascript
// Embedded: fine while comment count stays bounded
{
  _id: ObjectId("6512a1f0c3b9a2e4d8f20001"),
  title: "MongoDB Schema Design: Embedding vs Referencing",
  body: "Every MongoDB schema decision...",
  comments: [
    { author: "dev_anna", text: "Saved me a refactor, thanks.", at: ISODate("2026-05-28T10:00:00Z") },
    { author: "kp_ops",   text: "The 16 MB note is the key bit.", at: ISODate("2026-05-28T11:30:00Z") }
  ]
}

That is the right model for a post that gets a few dozen comments. It is the wrong model for a post that might get fifty thousand, because the array is unbounded and marches toward the 16 MB wall. There, move the comments into their own collection, with each comment referencing its post by postId:

javascript
// Referenced: comments scale independently, queried and paged on their own
// posts
{ _id: ObjectId("6512a1f0c3b9a2e4d8f20001"), title: "MongoDB Schema Design: Embedding vs Referencing" }

// comments (one document per comment, indexed on postId)
{ _id: ObjectId("..."), postId: ObjectId("6512a1f0c3b9a2e4d8f20001"), author: "dev_anna", text: "Saved me a refactor, thanks.", at: ISODate("2026-05-28T10:00:00Z") }
{ _id: ObjectId("..."), postId: ObjectId("6512a1f0c3b9a2e4d8f20001"), author: "kp_ops",   text: "The 16 MB note is the key bit.",   at: ISODate("2026-05-28T11:30:00Z") }

Now comments page and sort on their own, and the post document stays tiny. The cost is the second query (or a $lookup) to pull comments when you render the page. Same relationship, two schemas, and the deciding factor is purely whether the comment count is bounded.
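Paging the referenced comments might look like the sketch below. It builds a newest-first query for one post and uses the timestamp of the last comment already shown as the cursor for the next page. The helper name buildCommentPage, its option names, the default page size, and the "comments" collection name are assumptions for illustration; the code only assembles the query pieces.

```javascript
// Hypothetical helper: builds a newest-first page query for one post's comments.
// Pass the `at` value of the last comment already shown to fetch the next page.
function buildCommentPage(postId, { before = null, pageSize = 20 } = {}) {
  const filter = { postId };
  if (before !== null) filter.at = { $lt: before }; // only comments older than the cursor
  return { filter, sort: { at: -1 }, limit: pageSize };
}

const postId = "6512a1f0c3b9a2e4d8f20001"; // a string stand-in for the ObjectId
const firstPage = buildCommentPage(postId);
const nextPage = buildCommentPage(postId, {
  before: new Date("2026-05-28T10:00:00Z")
});

// In mongosh, assuming a collection named "comments":
// db.comments.createIndex({ postId: 1, at: -1 })
// db.comments.find(nextPage.filter).sort(nextPage.sort).limit(nextPage.limit)
console.log(firstPage, nextPage);
```

A compound index on postId and at lets the filter and the sort share one index. If two comments can carry the same timestamp, a tiebreaker such as _id in both the sort and the cursor avoids skipping one at a page boundary.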

The patterns between the extremes

Embed-or-reference is not binary. A set of named patterns, popularized in MongoDB's own schema-design material, covers the common middle ground:

  • One-to-few: embed. A person's few addresses, a product's handful of variants. Just nest the array.
  • One-to-many: reference (or subset). Move the many side into its own collection and have each child reference the parent. If you almost always need only the most recent few, use the subset pattern: embed the hot slice (last 5 comments, top 3 reviews) in the parent and keep the full set in its own collection.
  • One-to-squillions: reference, often with the bucket pattern. For unbounded high-volume data like IoT readings or log lines, store one document per bucket of events (an hour of readings, say) rather than one per event, and reference the device. This keeps the number of documents and index entries manageable.
  • Extended reference: duplicate a few hot fields. When a join just to grab a name and avatar is wasteful, copy those one or two fields onto the referencing document alongside the _id. You accept a little duplication to skip the $lookup, and you update the copies when the source changes.
  • Computed pattern: store the rolled-up result. When reads recompute the same aggregate constantly (a running total, an average rating), compute it on write and store it, rather than recomputing on every read.
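As a rough sketch of the bucket pattern from the list above, the helper below maps a reading to its hourly bucket and builds an upsert that appends the reading and maintains running totals. The helper name, the field names (deviceId, bucketStart, readings, count, sum), and the "readings" collection name are assumptions for illustration.

```javascript
// Hypothetical helper: one bucket document per device per hour.
function buildBucketUpsert(deviceId, reading) {
  const bucketStart = new Date(reading.at);
  bucketStart.setUTCMinutes(0, 0, 0);          // truncate to the start of the hour (UTC)
  return {
    filter: { deviceId, bucketStart },
    update: {
      $push: { readings: { at: reading.at, value: reading.value } },
      $inc: { count: 1, sum: reading.value }   // running totals kept alongside the bucket
    },
    options: { upsert: true }                  // create the bucket on its first reading
  };
}

const op = buildBucketUpsert("device-42", {
  at: new Date("2026-05-28T14:37:12Z"),
  value: 21.5
});

// In mongosh, assuming a collection named "readings":
// db.readings.updateOne(op.filter, op.update, op.options)
console.log(op.filter.bucketStart.toISOString());
```

Because the bucket covers a fixed window, each document has a natural upper bound as long as the per-device write rate is bounded. A unique index on deviceId and bucketStart is one way to guard against two concurrent upserts creating duplicate buckets.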

These are not exotic. Most production schemas are a mix: embed the one-to-few, reference the one-to-many, extended-reference the hot fields, and bucket the firehose.
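The subset and computed patterns often travel together. Here is a minimal sketch of the update you might apply to a post when a new comment arrives: it keeps only the newest few comments embedded and maintains a stored count. The helper name and the field names recentComments and commentCount are assumptions; $push with $each, $sort, and $slice, and $inc, are standard update operators.

```javascript
// Hypothetical helper: update for the parent post when a new comment arrives.
function buildPostUpdateForComment(comment, keep = 5) {
  return {
    $push: {
      recentComments: {
        $each: [comment],
        $sort: { at: -1 },     // newest first
        $slice: keep           // keep only the first `keep` entries after sorting
      }
    },
    $inc: { commentCount: 1 }  // computed pattern: maintain the count on write
  };
}

const postUpdate = buildPostUpdateForComment({
  author: "dev_anna",
  text: "Saved me a refactor, thanks.",
  at: new Date("2026-05-28T10:00:00Z")
});

// In mongosh, assuming a collection named "posts" and a postId variable in scope:
// db.posts.updateOne({ _id: postId }, postUpdate)
console.log(JSON.stringify(postUpdate, null, 2));
```

The full comment still goes into its own collection with a separate insert. Those two writes are not atomic together, so either wrap them in a transaction or accept that the embedded slice and the count can briefly lag behind the source.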

What to do next

  • For picking an identifier type when you reference (ObjectId vs UUID), the trade-offs around storing a UUID in MongoDB are worth a read before you commit to a key.
  • For modeling currency on an embedded line item or a referenced product, the right way to store money in MongoDB (integer minor units or Decimal128, not floats) avoids a class of rounding bugs.


Tags: MongoDB, Schema Design, Document Modeling, Embedding, Referencing, NoSQL, Database Design


Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years building software, Linux systems, and DevOps infrastructure, and lately working AI into the stack. Currently Chief Technology Officer at a healthcare tech startup, which is where most of these field notes come from.
