Skip to main content

String Encode and Decode

🔗 NeetCode Problem: https://neetcode.io/problems/encode-and-decode-strings 🔗 LeetCode Problem: https://leetcode.com/problems/encode-and-decode-strings

Problem Statement

Design an algorithm to encode a list of strings to a single string, and decode the encoded string back to the original list of strings. Constraints:
  • The string may contain any possible characters.
  • Strings can be empty.
  • You cannot use external libraries like base64 encoding.
Example:
  • Input: ["Hello","World"]
  • Encoded: 5#Hello5#World
  • Decoded: ["Hello","World"]

Approach 1: Brute Force

Idea

Use a simple delimiter (like comma) to separate strings. However, this fails if the delimiter appears in the string itself.

Pseudocode

encode(strs):
    result = ""
    for each str in strs:
        result += str + ","
    return result

decode(s):
    return s.split(",")

C++ Implementation

class Solution {
public:
    string encode(vector<string>& strs) {
        string result = "";
        for(string str : strs)
            result += str + ",";
        return result;
    }

    vector<string> decode(string s) {
        vector<string> result;
        string current = "";
        for(char c : s) {
            if(c == ',') {
                result.push_back(current);
                current = "";
            } else {
                current += c;
            }
        }
        return result;
    }
};

Complexity Analysis

  • Time Complexity: O(n) → Where n is the total number of characters across all strings
  • Space Complexity: O(n) → For storing the encoded result

Why it’s not optimal?

If any string contains the delimiter character (comma in this case), the decoding will break. For example:
  • Input: ["a,b", "c"]
  • Encoded: a,b,c,
  • Decoded: ["a", "b", "c"] ❌ (Wrong!)
This approach cannot handle strings with delimiters in them.

Optimized Approach: Length-Prefixed Encoding

Idea

Instead of using a delimiter, encode the length of each string followed by a separator, then the string itself. This way, we know exactly how many characters belong to each string, making it impossible for content to interfere with the format.

Key Insight

By storing the length of each string before the string itself, we can unambiguously decode any string, regardless of its content. The format becomes: length#string where # is a separator that guides our parsing, not a content delimiter.

How it works:

  1. Encoding: For each string, append its length, followed by #, followed by the string itself
  2. Decoding: Parse the length until we hit #, then read exactly that many characters for the string
  3. Repeat until we’ve decoded all strings

C++ Implementation

class Solution {
public:
    string encode(vector<string>& strs) {
        string result = "";
        for(string str : strs)
            result += to_string(str.size()) + "#" + str;
        return result;
    }

    vector<string> decode(string s) {
        vector<string> result;
        int i = 0;
        while(i < s.size()) {
            int j = i;
            // Find the '#' that separates length from string
            while(s[j] != '#')
                j++;
            
            // Extract the length
            int len = stoi(s.substr(i, j - i));
            
            // Extract the string of that length
            string word = s.substr(j + 1, len);
            result.push_back(word);
            
            // Move to next string
            i = j + 1 + len;
        }
        return result;
    }
};

Complexity Analysis

  • Time Complexity: O(n) → Where n is the total number of characters. We process each character exactly once.
  • Space Complexity: O(n) → For storing the encoded result (or decoded result in space-efficient implementation)

Why it’s optimal?

  1. Works with any character in the string, including #, commas, or any special character
  2. Linear time complexity - we never backtrack or revisit characters
  3. Linear space complexity - only storing the output
  4. Simple and elegant implementation
  5. No edge cases that break the algorithm

Comparison: Brute Force vs Optimized

ApproachTime ComplexitySpace ComplexityProsCons
Brute Force (Delimiter)O(n)O(n)Simple to understandFails with delimiter in string
Length-Prefixed EncodingO(n)O(n)Handles all characters, Unambiguous decodingSlightly more complex parsing

Why Length-Prefixed Encoding is Most Optimal

  1. Universal Delimiter Independence - No character can break the format since length dictates how many chars to read
  2. Linear Efficiency - Single pass through the data for both encoding and decoding
  3. Handles All Edge Cases - Empty strings, special characters, very long strings all work perfectly
  4. Deterministic Parsing - No ambiguity; each byte of input maps to a unique output
  5. Industry Standard - Used in protocols like gRPC, Protobuf, and network communication
  6. No External Dependencies - Pure algorithm, no external libraries needed

Example Walkthroughs

Example 1: Simple Strings

Input: ["Hello", "World"] Using Length-Prefixed Encoding:
Encoding:
- "Hello" has length 5 → "5#Hello"
- "World" has length 5 → "5#World"
- Result: "5#Hello5#World"

Decoding "5#Hello5#World":
- i=0: Find '#' at j=1
  - Length = 5
  - Read 5 chars from j+1: "Hello"
  - i = 1 + 1 + 5 = 7
- i=7: Find '#' at j=8
  - Length = 5
  - Read 5 chars from j+1: "World"
  - i = 8 + 1 + 5 = 14 (end)
- Result: ["Hello", "World"] ✓
Strings successfully encoded and decoded!

Example 2: Strings with Special Characters

Input: ["a#b", "c,d", "e"] Using Length-Prefixed Encoding:
Encoding:
- "a#b" has length 3 → "3#a#b"
- "c,d" has length 3 → "3#c,d"
- "e" has length 1 → "1#e"
- Result: "3#a#b3#c,d1#e"

Decoding "3#a#b3#c,d1#e":
- i=0: Find '#' at j=1
  - Length = 3
  - Read 3 chars: "a#b" (the # inside doesn't matter!)
  - i = 1 + 1 + 3 = 5
- i=5: Find '#' at j=6
  - Length = 3
  - Read 3 chars: "c,d" (the comma is just data)
  - i = 6 + 1 + 3 = 10
- i=10: Find '#' at j=11
  - Length = 1
  - Read 1 char: "e"
  - i = 11 + 1 + 1 = 13 (end)
- Result: ["a#b", "c,d", "e"] ✓
Notice how special characters don’t break the decoding!

Example 3: Empty Strings

Input: ["", "abc", ""] Using Length-Prefixed Encoding:
Encoding:
- "" has length 0 → "0#"
- "abc" has length 3 → "3#abc"
- "" has length 0 → "0#"
- Result: "0#3#abc0#"

Decoding "0#3#abc0#":
- i=0: Find '#' at j=1
  - Length = 0
  - Read 0 chars: ""
  - i = 1 + 1 + 0 = 2
- i=2: Find '#' at j=3
  - Length = 3
  - Read 3 chars: "abc"
  - i = 3 + 1 + 3 = 7
- i=7: Find '#' at j=8
  - Length = 0
  - Read 0 chars: ""
  - i = 8 + 1 + 0 = 9 (end)
- Result: ["", "abc", ""] ✓
Empty strings are handled gracefully!

Key Takeaways

  • Length-Prefixed Encoding is the standard solution for this problem, used in real-world protocols
  • Never rely solely on delimiters when the delimited content is untrusted or arbitrary
  • Separating metadata (length) from content allows safe serialization of any data
  • Linear time and space is optimal for serialization problems
  • Test edge cases: empty strings, special characters, very long strings
  • This pattern appears everywhere: network protocols, binary serialization, and data encoding