Extract JSON Objects from Bulk Text

The utility function below extracts every valid JSON object or array embedded in noisy or mixed-content text, such as LLM responses.

Dependencies

pip install jsonfinder

⚙️ Function Definition

utils.py
import re
import jsonfinder

def extract_json_objects(text: str, sanitize_text: bool = False):
    """
    Extracts valid JSON objects from a text string.

    Args:
        text (str): Input text that may contain embedded JSON objects.
        sanitize_text (bool): If True, removes control characters often introduced by LLMs (e.g., Gemini).

    Yields:
        Any: Each parsed JSON value (object or array) found in the text, in order of appearance.

    Examples:
        >>> list(extract_json_objects('Text: {"a": 1}'))
        [{'a': 1}]
    """
    if sanitize_text:
        text = re.sub(r'[\x00-\x08\x0B\x0C\x0E-\x1F]+', '', text)

    # jsonfinder yields (start, end, obj) tuples; obj is None for non-JSON spans.
    for _start, _end, obj in jsonfinder.jsonfinder(text):
        if obj is not None:
            yield obj
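
To make the scanning idea concrete, here is a minimal stdlib-only sketch of the same technique. It is not jsonfinder's actual implementation; the name `extract_json_stdlib` is hypothetical, and the approach (try `json.JSONDecoder.raw_decode` at each `{` or `[`) is an assumption about how such a scanner can reasonably be built.

```python
import json
import re
from typing import Any, Iterator

# Same control-character class as the sanitize_text branch above.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F]+')
# Characters that can open a JSON object or array.
_OPENERS = re.compile(r'[{\[]')


def extract_json_stdlib(text: str, sanitize_text: bool = False) -> Iterator[Any]:
    """Hypothetical stdlib-only variant: yield each JSON object or array in text."""
    if sanitize_text:
        text = _CONTROL_CHARS.sub('', text)

    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        match = _OPENERS.search(text, pos)
        if match is None:
            return  # no more candidate start positions
        try:
            # raw_decode parses one JSON value starting at the given index
            # and returns (value, index just past the value).
            obj, end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            # Not valid JSON here; resume scanning one character later.
            pos = match.start() + 1
            continue
        yield obj
        pos = end
```

This variant can be handy when adding a third-party dependency is not an option, at the cost of maintaining the scanning loop yourself.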

✅ When to Use

  • When parsing JSON embedded in LLM outputs.
  • When processing logs, responses, or raw strings that contain JSON but are not purely JSON.
  • When cleaning or extracting structured data from mixed-format text blobs.

📝 Usage Example

1. Basic JSON object extraction

text = 'Text before {"key": "value"} text after'
result = next(extract_json_objects(text))
"""
{"key": "value"}
"""

2. Multiple JSON objects in one string

text = '{"a": 1} some text {"b": 2}'
result = list(extract_json_objects(text))
"""
[{"a": 1}, {"b": 2}]
"""

3. Malformed JSON should not be returned

text = 'Invalid: {key: 1}, valid: {"c": 3}, and valid: {"q": "b"}'
for json_obj in extract_json_objects(text):
    print(json_obj)
"""
{'c': 3}
{'q': 'b'}
"""

4. Input with invalid control characters, no cleaning

text = '{"d": 4\x01}'
result = next(extract_json_objects(text), None)
"""
None
"""

5. Input with control characters and cleaning enabled

text = '{"d": 4\x01}'
result = next(extract_json_objects(text, sanitize_text=True), None)
"""
{"d": 4}
"""
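
The two examples above differ only in whether the control character is stripped before parsing. The sketch below isolates that step using only the standard library, so the effect of the pattern can be checked on its own; `sanitize` is a hypothetical helper that mirrors the `sanitize_text=True` branch.

```python
import json
import re

# Same pattern as in extract_json_objects: strips ASCII control characters
# but keeps tab (\x09), newline (\x0A), and carriage return (\x0D).
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F]+')


def sanitize(text: str) -> str:
    """Hypothetical helper mirroring the sanitize_text=True branch."""
    return CONTROL_CHARS.sub('', text)


raw = '{"d": 4\x01}'
cleaned = sanitize(raw)

# The raw string is rejected because \x01 is neither whitespace nor a delimiter.
try:
    json.loads(raw)
    raw_is_valid = True
except json.JSONDecodeError:
    raw_is_valid = False

# After sanitizing, the same payload parses normally.
parsed = json.loads(cleaned)

# Ordinary whitespace survives; only the NUL byte is removed here.
preserved = sanitize('line1\nline2\tend\x00')
```

Because tabs and newlines are kept, formatting between tokens is unaffected. Raw newlines inside string values are still rejected by Python's default strict decoder, so this pattern does not repair those.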

6. No JSON present

text = 'Just some plain text.'
result = list(extract_json_objects(text))
"""
[]
"""

7. Nested JSON object

text = 'Here is a nested one: {"outer": {"inner": "value"}}'
result = next(extract_json_objects(text))
"""
{"outer": {"inner": "value"}}
"""

8. JSON array

text = 'This is an array: [{"x": 1}, {"y": 2}]'
result = list(extract_json_objects(text))
"""
[
    [{"x": 1}, {"y": 2}]
]
"""
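
As the array example shows, a top-level array comes back as a single list rather than as its individual elements. If only dict objects are wanted, one option is to post-process the parsed values. The helper below is a hypothetical sketch (`iter_dicts` is not part of jsonfinder or the utility above) and works on any iterable of already-parsed values.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_dicts(values: Iterable[Any]) -> Iterator[Dict[str, Any]]:
    """Yield dicts from parsed JSON values, descending into lists (hypothetical helper)."""
    for value in values:
        if isinstance(value, dict):
            yield value
        elif isinstance(value, list):
            # Recurse so dicts inside (possibly nested) arrays are also yielded.
            yield from iter_dicts(value)
        # Scalars (numbers, strings, None) are skipped.


# Stand-in for the values a call to extract_json_objects might produce.
parsed_values = [[{"x": 1}, {"y": 2}], {"z": 3}, [1, "skip", [{"w": 4}]]]
dicts = list(iter_dicts(parsed_values))
```

In practice this would be chained as `list(iter_dicts(extract_json_objects(text)))`. Note that the sketch does not descend into dict values, so nested objects stay inside their parent.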