Learnitweb

Protocol Buffers and Uniqueness in Collections

When developers start using repeated fields in Protocol Buffers, a very natural and frequently asked question appears sooner or later: “If repeated behaves like a list, what should I do when I want only unique items like a Set?”

This question is completely valid, but it reveals a deeper conceptual confusion about what Protocol Buffers is designed to do and what it is deliberately not designed to do. Therefore, before talking about uniqueness, we must first build a correct mental model of Protobuf itself, because once that model is clear, the answer becomes almost obvious.

Understanding the Real Purpose of Protobuf

Protocol Buffers is fundamentally a data serialization and data interchange format. Its primary responsibility is to describe the structure of data transmitted between systems, services, or components, especially in distributed and microservices architectures where efficient, structured communication is important.

In other words, a .proto file describes:

  • What data exists
  • What fields are present
  • What types those fields have
  • How that data can be serialized and deserialized

But a .proto file does not describe:

  • Business rules
  • Domain constraints
  • Validation logic
  • Collection semantics like “set” vs “list”
  • Behavioral rules about how data should be used

A very helpful way to think about Protobuf is to compare it to JSON, because both serve a very similar role in representing data for communication.

So the correct comparison is:

Protobuf  ↔  JSON

and not:

Protobuf  ↔  Java Collection Framework

Once you internalize this comparison, many doubts around repeated fields disappear.

How JSON Handles Collections (and Why That Matters)

Consider a typical JSON structure:

{
  "items": ["A", "B", "C", "A"]
}

Now ask yourself carefully:

  • Does JSON enforce uniqueness?
  • Does JSON distinguish between a List and a Set?
  • Does JSON prevent duplicates automatically?

The answer to all of these is no.

JSON arrays are simply ordered collections of values; whether those values are unique or duplicated is entirely the responsibility of the application producing or consuming the data.

Protocol Buffers follows exactly the same philosophy, because it also focuses on representing data, not enforcing business rules.

What repeated Actually Means

When you define a field like this:

repeated string items = 1;

you are only declaring that the field can hold multiple values of the same type; nothing more should be inferred from the declaration.
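To see the declaration in context, here is a minimal complete schema (the message name Inventory is hypothetical, chosen only for illustration):

```proto
syntax = "proto3";

// A hypothetical message: 'items' may hold any number of values,
// including duplicates such as ["A", "B", "C", "A"].
message Inventory {
  repeated string items = 1;
}
```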

The repeated keyword does not imply:

  • Uniqueness
  • Deduplication
  • Special ordering guarantees
  • Set-like semantics
  • Validation rules

It simply means that multiple values can be stored and transmitted.

From Protobuf’s perspective, this is just a collection of values that will be serialized and sent over the wire, and how those values are chosen or filtered is outside its scope.

Where Uniqueness Should Be Enforced

If your system requires uniqueness, that requirement belongs to your application logic, not to the Protobuf schema, because uniqueness is a business or domain rule rather than a data representation rule.

For example, in Java you can enforce uniqueness before data even reaches Protobuf by using a Set:

Set<String> uniqueItems = new HashSet<>(List.of("A", "B", "C", "A"));

messageBuilder.addAllItems(uniqueItems);

Since addAllItems() accepts an Iterable, it does not care whether the source is a List, a Set, or any other iterable structure, so if you pass a Set, duplicates are already removed before serialization happens. One caveat: HashSet does not preserve insertion order, so if the order of the repeated field matters, use a LinkedHashSet instead.

This means uniqueness is handled at the application layer, which is exactly where such rules belong.
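As a self-contained sketch of this application-layer step, the snippet below deduplicates with a LinkedHashSet, which keeps first-seen order (relevant because a repeated field is serialized in order). The messageBuilder call is shown only as a comment, since it assumes a generated builder class that is not part of this example:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DedupBeforeBuild {
    public static void main(String[] args) {
        List<String> raw = List.of("A", "B", "C", "A");

        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> uniqueItems = new LinkedHashSet<>(raw);
        System.out.println(uniqueItems); // prints [A, B, C]

        // messageBuilder.addAllItems(uniqueItems); // hypothetical generated builder
    }
}
```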

Why Protobuf Does Not Provide a Set Type

Many developers wonder why Protobuf does not include a native set type, and the reason is actually rooted in good design principles.

First, Protobuf is language-neutral, and not all programming languages implement sets in the same way or with the same guarantees, so enforcing set semantics at the schema level would introduce inconsistencies across languages.

Second, Protobuf deliberately avoids embedding business logic into the schema, because its role is to describe data structure, not behavioral constraints.

Third, keeping the model simple ensures portability, predictability, and performance, all of which are core goals of Protobuf’s design.

Practical Strategies for Handling Uniqueness

If uniqueness is important in your system, there are several sensible places to enforce it.

You can enforce uniqueness before building the message, which is often the cleanest approach because it guarantees clean data from the start.

You can validate and deduplicate before sending data to another service, which is useful when inputs come from multiple sources.

You can also validate after receiving the data, especially if you do not fully trust the sender, which is common in distributed systems.

In all of these scenarios, the responsibility stays with the application logic rather than the Protobuf definition.
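The receive-side case can be sketched the same way. In the snippet below, the received list stands in for what a generated Java accessor such as getItemsList() would return for a repeated string field; the helper reports whether the sender transmitted duplicates and then deduplicates:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ReceiverValidation {

    // Deduplicates a received repeated field, keeping first-seen order,
    // and warns when the sender transmitted duplicates.
    static Set<String> deduplicate(List<String> received) {
        Set<String> unique = new LinkedHashSet<>(received);
        if (unique.size() != received.size()) {
            System.err.println("Warning: received duplicate items from sender");
        }
        return unique;
    }

    public static void main(String[] args) {
        // In real code this list would come from the deserialized message.
        List<String> received = List.of("A", "B", "A");
        System.out.println(deduplicate(received)); // prints [A, B]
    }
}
```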

The Correct Mental Model

A .proto file should be viewed as a contract for data exchange, similar to a JSON schema, where the goal is to define the shape of data and not the rules governing that data.

Therefore, comparing Protobuf to Java’s Queue, Stack, PriorityQueue, or Set leads to confusion, because those are in-memory data structures with behavioral semantics, while Protobuf is a transport format.

When you keep this distinction clear, your Protobuf designs become simpler, cleaner, and more maintainable.

Key Takeaways

A repeated field in Protobuf represents a collection of values in the same way a JSON array does, without implying any uniqueness or special semantics.

Uniqueness is a business rule and must be enforced by the application layer rather than the Protobuf schema.

If you want unique items, you can use a Set in your application before passing data to Protobuf.

Protobuf is about describing and transporting data, not about enforcing domain logic.