String Encode and Decode
🔗 NeetCode Problem: https://neetcode.io/problems/encode-and-decode-strings 🔗 LeetCode Problem: https://leetcode.com/problems/encode-and-decode-stringsProblem Statement
Design an algorithm to encode a list of strings to a single string, and decode the encoded string back to the original list of strings. Constraints:- The string may contain any possible characters.
- Strings can be empty.
- You cannot use external libraries like base64 encoding.
- Input:
["Hello","World"] - Encoded:
5#Hello5#World - Decoded:
["Hello","World"]
Approach 1: Brute Force
Idea
Use a simple delimiter (like comma) to separate strings. However, this fails if the delimiter appears in the string itself.Pseudocode
C++ Implementation
Complexity Analysis
- Time Complexity: O(n) → Where n is the total number of characters across all strings
- Space Complexity: O(n) → For storing the encoded result
Why it’s not optimal?
If any string contains the delimiter character (comma in this case), the decoding will break. For example:- Input:
["a,b", "c"] - Encoded:
a,b,c, - Decoded:
["a", "b", "c"]❌ (Wrong!)
Optimized Approach: Length-Prefixed Encoding
Idea
Instead of using a delimiter, encode the length of each string followed by a separator, then the string itself. This way, we know exactly how many characters belong to each string, making it impossible for content to interfere with the format.Key Insight
By storing the length of each string before the string itself, we can unambiguously decode any string, regardless of its content. The format becomes:length#string where # is a separator that guides our parsing, not a content delimiter.
How it works:
- Encoding: For each string, append its length, followed by
#, followed by the string itself - Decoding: Parse the length until we hit
#, then read exactly that many characters for the string - Repeat until we’ve decoded all strings
C++ Implementation
Complexity Analysis
- Time Complexity: O(n) → Where n is the total number of characters. We process each character exactly once.
- Space Complexity: O(n) → For storing the encoded result (or decoded result in space-efficient implementation)
Why it’s optimal?
- Works with any character in the string, including
#, commas, or any special character - Linear time complexity - we never backtrack or revisit characters
- Linear space complexity - only storing the output
- Simple and elegant implementation
- No edge cases that break the algorithm
Comparison: Brute Force vs Optimized
| Approach | Time Complexity | Space Complexity | Pros | Cons |
|---|---|---|---|---|
| Brute Force (Delimiter) | O(n) | O(n) | Simple to understand | Fails with delimiter in string |
| Length-Prefixed Encoding | O(n) | O(n) | Handles all characters, Unambiguous decoding | Slightly more complex parsing |
Why Length-Prefixed Encoding is Most Optimal
- Universal Delimiter Independence - No character can break the format since length dictates how many chars to read
- Linear Efficiency - Single pass through the data for both encoding and decoding
- Handles All Edge Cases - Empty strings, special characters, very long strings all work perfectly
- Deterministic Parsing - No ambiguity; each byte of input maps to a unique output
- Industry Standard - Used in protocols like gRPC, Protobuf, and network communication
- No External Dependencies - Pure algorithm, no external libraries needed
Example Walkthroughs
Example 1: Simple Strings
Input:["Hello", "World"]
Using Length-Prefixed Encoding:
Example 2: Strings with Special Characters
Input:["a#b", "c,d", "e"]
Using Length-Prefixed Encoding:
Example 3: Empty Strings
Input:["", "abc", ""]
Using Length-Prefixed Encoding:
Key Takeaways
- Length-Prefixed Encoding is the standard solution for this problem, used in real-world protocols
- Never rely solely on delimiters when the delimited content is untrusted or arbitrary
- Separating metadata (length) from content allows safe serialization of any data
- Linear time and space is optimal for serialization problems
- Test edge cases: empty strings, special characters, very long strings
- This pattern appears everywhere: network protocols, binary serialization, and data encoding