Using Regular Expressions to Parse PDF Test Data into JSON

Published in Technical Blog, 2025

Recommended citation: Amalie Shi. (2025). Using Regular Expressions to Parse PDF Test Data into JSON. Technical Blog.

🔍 Using Regular Expressions to Parse PDF Test Data into JSON (with Data Validation)

Working with test reports stored in PDF files can be frustrating, especially when you need to extract structured data for automated processing. In this guide, we’ll walk through how to use Python’s regular expressions (the re module) to:

  • Parse test results from PDF files using PyPDF2
  • Extract patterns using readable and maintainable named capture groups
  • Validate structured output using jsonschema
  • Understand common regex syntax, flags, and best practices

🧾 The Challenge: Test Reports in PDF Format

Imagine you’re given a batch of PDFs containing test results in this format:

Test1234    2025-05-10    PASS
Test1235    2025-05-11    FAIL

The goal is to extract these results into structured JSON. Regular expressions let us define patterns to find these entries reliably.


🛠️ Step 1: Extracting Text with PyPDF2

Use PyPDF2 to read PDF text:

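import PyPDF2  # deprecated in favor of pypdf, which provides the same PdfReader API

def extract_text_from_pdf(file_path):
    """Return the text of every page in the PDF, joined by newlines."""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Pages with no extractable text (e.g. scanned images) contribute an empty string
        return "\n".join(page.extract_text() or "" for page in reader.pages)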

🧠 Step 2: Using Regex with Named Groups for Readability

Use named capture groups ((?P<name>...)) to improve readability:

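import re

pattern = re.compile(
    r'(?P<test_id>Test\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<result>PASS|FAIL)',
    re.IGNORECASE  # also accepts lowercase input such as "pass" or "test1234"
)

text = extract_text_from_pdf("test_report.pdf")  # placeholder path; function from Step 1

results = []
for match in pattern.finditer(text):
    record = match.groupdict()  # {'test_id': ..., 'date': ..., 'result': ...}
    record["result"] = record["result"].upper()  # normalize case for the schema's enum
    results.append(record)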
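
As a quick check, the same pattern can be run against the two sample lines from the report format shown earlier, and the records serialized with the standard json module. This is only a minimal sketch: sample_text is a stand-in for real extracted PDF text (with one result lowercased on purpose), and upper-casing the result is an assumption about how case-insensitive matches should be normalized.

```python
import json
import re

pattern = re.compile(
    r'(?P<test_id>Test\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<result>PASS|FAIL)',
    re.IGNORECASE,
)

# Stand-in for text extracted from a PDF (illustrative sample, not real report data).
sample_text = "Test1234    2025-05-10    PASS\nTest1235    2025-05-11    fail\n"

records = []
for match in pattern.finditer(sample_text):
    record = match.groupdict()
    record["result"] = record["result"].upper()  # "fail" becomes "FAIL"
    records.append(record)

# Serialize the list of dictionaries as a JSON array.
json_output = json.dumps(records, indent=2)
print(json_output)
```

Because each record is a plain dictionary keyed by group name, json.dumps needs no custom encoder.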

Step 3: Validate Extracted Data with jsonschema

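The regex only guarantees the shape of each field, so validate every record against a JSON Schema. Note that jsonschema ignores "format" keywords unless a format checker is supplied:

from jsonschema import FormatChecker, validate

schema = {
    "type": "object",
    "properties": {
        "test_id": {"type": "string"},
        "date": {"type": "string", "format": "date"},
        "result": {"type": "string", "enum": ["PASS", "FAIL"]}
    },
    "required": ["test_id", "date", "result"]
}

for result in results:
    # Raises jsonschema.ValidationError for the first record that violates the schema
    validate(instance=result, schema=schema, format_checker=FormatChecker())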
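
One thing the pattern itself cannot catch is an impossible date: \d{4}-\d{2}-\d{2} happily matches 2025-13-45. As a dependency-free complement to schema validation, a sketch like the following uses the standard library's date.fromisoformat to separate records with real calendar dates from the rest. The helper name split_by_valid_date and the sample records are illustrative assumptions, not part of the original pipeline.

```python
from datetime import date

def split_by_valid_date(records):
    """Return (valid, invalid) lists based on whether 'date' is a real calendar date."""
    valid, invalid = [], []
    for record in records:
        try:
            date.fromisoformat(record["date"])  # raises ValueError for e.g. month 13
        except ValueError:
            invalid.append(record)
        else:
            valid.append(record)
    return valid, invalid

# Hypothetical records shaped like the output of Step 2.
sample_records = [
    {"test_id": "Test1234", "date": "2025-05-10", "result": "PASS"},
    {"test_id": "Test9999", "date": "2025-13-45", "result": "FAIL"},
]

valid, invalid = split_by_valid_date(sample_records)
print(len(valid), "valid,", len(invalid), "invalid")
```

Collecting the invalid records instead of stopping at the first failure makes it easier to report every problem in a batch of PDFs at once.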

🔤 Regex Syntax Reference

🔠 Basic Elements

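| Pattern | Description | Example Match |
| --- | --- | --- |
| . | Any character except newline | a.c → abc, a-c |
| \d | Digit (0–9; also other Unicode digits unless re.ASCII is set) | \d+ → 123 |
| \D | Non-digit | \D+ → abc |
| \w | Word character (a-zA-Z0-9_; also other Unicode letters and digits unless re.ASCII is set) | \w+ → hello123 |
| \W | Non-word character | \W → #, . |
| \s | Whitespace (space, tab, newline) | \s → ' ', \n |
| \S | Non-whitespace | \S → a, 1, _ |
| [] | Character set | [aeiou] → a, e |
| [^] | Negated set | [^0-9] → a, B |
| ^ | Start of string (or of each line with re.MULTILINE) | ^Hello matches Hello |
| $ | End of string (or of each line with re.MULTILINE) | world$ matches world |
| \b | Word boundary | \bword\b matches word as a whole word only |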
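
To see a few of these elements in action, here is a small sketch that applies them to one line in the sample report format. The variable names are illustrative only.

```python
import re

line = "Test1234    2025-05-10    PASS"

# \d+ finds every run of digits, including the ones inside the test ID and the date.
digits = re.findall(r"\d+", line)

# \w+ finds runs of word characters; the hyphens in the date split it into three pieces.
words = re.findall(r"\w+", line)

# \s+ treats any run of spaces or tabs as a single separator.
fields = re.split(r"\s+", line)

# A negated set: everything up to (but not including) the first digit.
prefix = re.match(r"[^0-9]+", line).group()

# \b keeps PASS from matching inside the longer word PASSED.
whole_word = re.search(r"\bPASS\b", "PASSED PASS")

print(digits, words, fields, prefix, whole_word.start())
```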

🔁 Quantifiers

| Pattern | Description | Example Match |
| --- | --- | --- |
| * | 0 or more repetitions | lo*l → ll, lool |
| + | 1 or more repetitions | lo+l → lol, lool |
| ? | 0 or 1 repetition | colou?r → color, colour |
| {m} | Exactly m repetitions | a{3} → aaa |
| {m,n} | Between m and n repetitions | a{2,4} → aa, aaa, aaaa |
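
The quantifiers in the table can be checked directly. The following is a minimal sketch with made-up candidate strings; re.fullmatch is used so that a candidate counts only if the whole string fits the pattern.

```python
import re

candidates = ["ll", "lol", "lool", "lxl"]

# * allows zero or more "o" characters between the two "l"s.
star = [w for w in candidates if re.fullmatch(r"lo*l", w)]

# + requires at least one "o", so "ll" no longer qualifies.
plus = [w for w in candidates if re.fullmatch(r"lo+l", w)]

# ? makes the "u" optional, but does not allow two of them.
optional = re.findall(r"colou?r", "color colour colouur")

# {m} demands an exact count; {m,n} accepts a range and is greedy within it.
exact = re.fullmatch(r"\d{4}", "2025") is not None
ranged = re.findall(r"a{2,4}", "a aa aaa aaaaa")

print(star, plus, optional, exact, ranged)
```

In the last example, the run of five a characters yields aaaa: the quantifier takes as many repetitions as the upper bound allows, and the single leftover a is too short to match.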

🧩 Grouping and Capturing

| Pattern | Description | Example |
| --- | --- | --- |
| (...) | Capture group | (abc)+ → captures abc |
| (?:...) | Non-capturing group | (?:abc)+ |
| (?P<name>...) | Named capturing group | (?P<year>\d{4}) |
| (?P=name) | Backreference to named group | (?P<ch>\w)(?P=ch) → oo, 11 |
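
Here is a short sketch of how the four grouping forms differ in practice. The input strings are invented for illustration.

```python
import re

# Capturing group: the group repeats, and group(1) holds its last repetition.
capturing = re.fullmatch(r"(abc)+", "abcabc")
captured = capturing.group(1)

# Non-capturing group: same matching behavior, but nothing is recorded.
non_capturing = re.fullmatch(r"(?:abc)+", "abcabc")
recorded_groups = non_capturing.groups()

# Named groups: fields are retrieved by name instead of by position.
dated = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Run on 2025-05-10")
date_parts = dated.groupdict()

# Named backreference: (?P=word) must repeat exactly what (?P<word>...) matched,
# which makes it handy for spotting accidentally doubled words.
doubled = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "the test test passed")
repeated_word = doubled.group("word")

print(captured, recorded_groups, date_parts, repeated_word)
```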

⚙️ Common Regex Functions in Python

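| Function | Purpose |
| --- | --- |
| re.findall() | Return all non-overlapping matches as a list |
| re.search() | Return the first match found anywhere in the string |
| re.match() | Match only at the beginning of the string |
| re.sub() | Replace matches with a replacement string |
| re.split() | Split the string wherever the pattern matches |
| re.compile() | Compile a regex, with optional flags, for reuse |
| re.escape() | Escape regex special characters in a literal string |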
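
The following sketch runs each of these functions on a two-line sample in the report format used throughout this guide. The sample string and variable names are illustrative.

```python
import re

report = "Test1234 2025-05-10 PASS\nTest1235 2025-05-11 FAIL"

# findall: every non-overlapping match, returned as a list of strings.
all_ids = re.findall(r"Test\d+", report)

# search: the first match anywhere in the string.
first_date = re.search(r"\d{4}-\d{2}-\d{2}", report).group()

# match: only succeeds if the pattern matches at position 0.
starts_with_id = re.match(r"Test\d+", report) is not None
no_match = re.match(r"PASS", report)  # PASS is not at the start of the string

# sub: replace every match with a fixed string.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", report)

# split: break the string wherever the pattern matches.
lines = re.split(r"\n", report)

# compile: build the pattern once, with flags, and reuse it.
id_pattern = re.compile(r"test\d+", re.IGNORECASE)
compiled_ids = id_pattern.findall(report)

# escape: make a literal string safe to embed in a pattern.
literal = re.escape("Test(123)")

print(all_ids, first_date, starts_with_id, no_match, lines, compiled_ids, literal)
```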

🚩 Regex Flags

| Flag | Description |
| --- | --- |
| re.IGNORECASE | Case-insensitive matching |
| re.MULTILINE | ^ and $ match at line boundaries |
| re.DOTALL | . matches newline too |
| re.VERBOSE | Allows whitespace and comments in pattern |
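
To make the effect of each flag concrete, the sketch below compares results with and without it, and then rewrites the test-result pattern in re.VERBOSE form. The sample text is invented, with mixed casing on purpose.

```python
import re

text = "test1234 2025-05-10 pass\nTest1235 2025-05-11 FAIL"

# IGNORECASE: the lowercase "test1234" is only found when the flag is set.
case_sensitive = re.findall(r"Test\d+", text)
case_insensitive = re.findall(r"Test\d+", text, re.IGNORECASE)

# MULTILINE: without it, ^ anchors only at the very start of the string.
first_only = re.findall(r"^\w+", text)
each_line = re.findall(r"^\w+", text, re.MULTILINE)

# DOTALL: without it, . stops at the newline between the two lines.
single_line = re.search(r"pass.*FAIL", text)
across_lines = re.search(r"pass.*FAIL", text, re.DOTALL)

# VERBOSE: whitespace in the pattern is ignored and # starts a comment.
verbose_pattern = re.compile(
    r"""
    (?P<test_id>Test\d+)          \s+   # test identifier
    (?P<date>\d{4}-\d{2}-\d{2})   \s+   # ISO-style date
    (?P<result>PASS|FAIL)               # outcome
    """,
    re.VERBOSE | re.IGNORECASE,
)
rows = [m.groupdict() for m in verbose_pattern.finditer(text)]

print(case_sensitive, case_insensitive, first_only, each_line, rows)
```

Flags combine with the | operator, as in the last pattern, so a verbose layout does not rule out case-insensitive matching.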

💡 Tips for Readable Regex

  • Use re.compile() for clarity and reuse.
  • Label parts of your regex with (?P<name>...) to make results self-descriptive.
  • Use re.escape() when searching for literal strings that may include regex characters:
import re

term = "Test(123)"
escaped = re.escape(term)  # parentheses are escaped so they match literally
pattern = re.compile(escaped)
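
To see why escaping matters, compare what the raw and escaped terms actually match. This is a small sketch with an invented log line.

```python
import re

term = "Test(123)"
log = "Ran Test(123) and Test123"

# Unescaped, the parentheses form a capture group, so the pattern means "Test123".
unescaped_hit = re.search(term, log).group()

# Escaped, the parentheses are literal characters and the intended text is found.
escaped_hit = re.search(re.escape(term), log).group()

print(unescaped_hit)
print(escaped_hit)
```

The unescaped search silently matches the wrong text instead of raising an error, which is what makes this kind of bug easy to miss.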

✅ Conclusion

By combining PDF parsing, regex with named groups, and JSON Schema validation, you can automate test data extraction and verification cleanly and reliably.

Regex can be intimidating at first, but with practice and smart organization (especially named groups and a clear structure), it becomes a powerful ally in any data extraction task.