[Image: Terminal window showing a development environment populated with synthetic user data]

A fresh development database starts empty. The developer runs a seed script, and ten minutes later the users table contains fifty rows of "John Doe" at "123 Main Street" with the email "test1@test.com" through "test50@test.com." The application looks functional. Forms render, dashboards populate, and the test suite passes. Then the application reaches staging, and a QA engineer enters a name with an accented character. The profile page breaks because the template assumed ASCII. The address field truncates a German street name because someone hardcoded a 30-character limit. The phone validation rejects a UK number because it was tested exclusively with US formats.

Every one of these bugs existed in the development environment. The placeholder data just couldn't trigger them. "John Doe" doesn't test Unicode handling. "123 Main Street" doesn't test address-field length. "555-0100" doesn't test international phone parsing. The seed data was optimised for developer convenience rather than for exposing the application's actual behaviour under realistic conditions. That gap between what the seed data exercises and what production traffic exercises is where bugs live undiscovered until they cost the most to fix.

The Categories of Bugs That Placeholder Data Hides

Character encoding issues are the most common. Names with accents (Renée, Müller, Søren), names with apostrophes (O'Brien, D'Angelo), names with non-Latin characters (Japanese kanji, Arabic script, Cyrillic), and names with diacritics that look identical in some fonts but are different Unicode codepoints. A database that contains only ASCII names will never surface a truncation bug in a VARCHAR column that's too short, a rendering issue in a template that doesn't handle multibyte characters, or a search function that can't match accented characters against their unaccented equivalents.
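The codepoint distinction is easy to demonstrate with Python's `unicodedata` module: two renderings of "Renée" that display identically compare unequal until normalised, which is exactly the kind of mismatch that breaks naive search and deduplication.

```python
import unicodedata

# "Renée" written two ways: precomposed é (U+00E9) vs. plain e followed
# by a combining acute accent (U+0301). They render identically in most
# fonts but are different byte sequences.
composed = "Ren\u00e9e"
decomposed = "Rene\u0301e"

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

A search or uniqueness check that compares raw strings treats these as two different users; normalising to NFC (or NFD) on input avoids the split.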

Layout bugs are the second category. A dashboard that looks correct with five-character first names breaks when someone named "Bartholomew Worthington-Smythe III" appears. An address card that fits neatly with "123 Oak St" overflows when it encounters "Friedrichswerdersche Kirchstraße 42, Hinterhaus, 3. Obergeschoss" in Berlin. Price displays that work for "$9.99" break when the currency symbol goes after the number (9,99 EUR) or uses a different decimal separator. These are CSS and template bugs, but they're invisible when every record in the database uses short, American-formatted strings.

Validation logic bugs emerge when input data varies. A phone field that works for 10-digit US numbers rejects UK numbers (which can be 10 or 11 digits depending on the area code). A postal code field that expects five digits fails silently for Canadian alphanumeric codes (A1A 1A1) or UK postcodes with variable lengths (E1 6AN vs EC1A 1BB). An age calculation that works for birth dates in the 1990s produces wrong results for dates before 1970 because of Unix timestamp edge cases.
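A sketch of per-country postal-code validation illustrates why a US-only check silently rejects valid international input. The patterns below are rough approximations for illustration, not the full national specifications.

```python
import re

# Illustrative per-country postal-code patterns (simplified; real
# validation libraries cover many more formats and edge cases).
POSTAL_PATTERNS = {
    "US": r"^\d{5}(-\d{4})?$",                     # 12345 or 12345-6789
    "CA": r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$",            # A1A 1A1
    "GB": r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$",   # E1 6AN, EC1A 1BB
}

def valid_postcode(country: str, code: str) -> bool:
    return bool(re.match(POSTAL_PATTERNS[country], code))
```

A seed dataset that includes Canadian and UK records would trip a hardcoded `^\d{5}$` check immediately, in development rather than in production.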

Query performance issues only appear at scale. A query that executes in 2ms against fifty rows might take 30 seconds against fifty thousand. Index effectiveness, join strategies, and pagination logic can't be evaluated with a toy dataset. The developer who profiled the application against placeholder data gets a false sense of performance that evaporates the moment real traffic hits the system.

Building a Seed Dataset

A useful seed dataset isn't random. It's designed to exercise the application's constraints. The records should include a mix of countries (to test internationalisation), a range of name lengths (to test layout boundaries), various special characters (to test encoding), and enough volume to reveal performance characteristics.

The minimum useful size depends on the application. For a CRUD app with simple views, a few hundred records might suffice. For an application with search, filtering, reporting, or analytics features, the dataset should be large enough to produce meaningful results in those contexts. A search engine that searches three records doesn't reveal anything about relevance ranking. A reporting dashboard with five users doesn't show whether the charts handle overlapping labels or whether the pagination controls behave correctly when there are multiple pages of results.

The data needs to be internally consistent. A record with a German name, a German address, a German phone number, and a German postal code will exercise the application differently than a record where those fields come from different countries. The German address uses "Straße" in the street field, places the house number after the street name, and uses a five-digit postal code. A generator that assembles random fields from different locales produces records that test nothing useful because no real user would produce that combination.

Generators like Faker (the Python/Ruby/JS library) can generate locale-specific fields individually, but combining them into consistent records requires additional logic to ensure the fields agree with each other. Tools like Another.IO take a different approach and produce internally consistent profiles by country: the name matches the locale's naming conventions, the address follows the country's format, the phone number uses the correct prefix, and the postal code matches the declared city.

Integrating Synthetic Data into the Development Workflow

The seed script should run as part of the development environment setup. A new developer clones the repository, runs the setup command, and gets a populated database without needing access to production data, a database dump, or manual data entry. The fewer steps between "clone" and "working app," the faster onboarding goes.

Django and Rails both support fixture-based seeding, but fixtures are static. A JSON or YAML file with 500 records is large, hard to review in pull requests, and impossible to parameterise. A seed script that programmatically generates records is more flexible: it can accept a count parameter, can be re-run with different seeds to produce different datasets, and can be updated when the schema changes without manually editing hundreds of fixture entries.

Deterministic seeding is worth implementing. If the seed script uses a fixed random seed, it generates the same dataset every time. This makes tests reproducible: a test that fails against a specific record can be debugged because re-running the seed produces the same records. Rotating the seed periodically introduces new data patterns and catches bugs that only appear with specific combinations.
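A minimal, dependency-free sketch of deterministic seeding: the `seed` parameter pins the random generator, so the same call always yields the same records. The name pool and field names are illustrative.

```python
import random

# Small illustrative pools; a real seed script would draw from
# locale-aware generators.
NAMES = ["Renée Dubois", "Søren Kjær", "Miyuki Tanaka", "Patrick O'Brien"]
DOMAINS = ["example.org", "example.net"]

def make_records(count: int, seed: int = 42) -> list[dict]:
    """Generate `count` reproducible user records."""
    rng = random.Random(seed)  # fixed seed => identical dataset every run
    return [
        {
            "name": rng.choice(NAMES),
            "email": f"user{i}@{rng.choice(DOMAINS)}",
            "age": rng.randint(18, 90),
        }
        for i in range(count)
    ]

# Same seed, same dataset -- a failing test can always be reproduced.
assert make_records(10) == make_records(10)
```

Passing a different seed rotates the dataset, which is how periodic rotation can be wired into CI without losing reproducibility for any given run.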

The seed script should also create records that are known to be edge cases. At least one record with the maximum-length name the schema allows. At least one with non-Latin characters. At least one with a negative or zero value where the application expects positive numbers. At least one with an empty optional field. These deliberate edge cases are as important as the statistically representative bulk data, because they test the boundaries that random generation might not reach for thousands of iterations.
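These deliberate boundary records can live alongside the generated bulk data. The field names and the 255-character limit below are assumptions about a hypothetical schema, not a prescription.

```python
# Hand-written edge cases appended after the bulk generation
# (hypothetical schema: name has a 255-character limit, email is optional).
EDGE_CASES = [
    {"name": "X" * 255, "email": "max@example.org"},    # maximum-length name
    {"name": "山田太郎", "email": "kanji@example.org"},   # non-Latin characters
    {"name": "Zero Qty", "email": "zero@example.org",
     "quantity": 0},                                     # zero where positive expected
    {"name": "No Email", "email": None},                 # empty optional field
]
```

Because these records are fixed rather than drawn at random, every developer's database contains them, and a regression against any boundary shows up on the first run.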

Synthetic Data vs. Anonymised Production Data

The alternative to synthetic data is copying production data and anonymising it. This sounds simpler, but it has a significant drawback: it starts with real personal data and tries to strip the identity, rather than starting with fabricated data that never contained real identity.

Anonymisation under GDPR has a high bar. Data is anonymised only if it cannot be used to single out an individual, link records to an individual, or infer information about an individual. Simple field-level masking (replacing names with "User_123") often fails this test because the combination of remaining fields (purchase history, address, account age) may be unique enough to identify someone. The UK's ICO has published guidance noting that a dataset can be personal data even after names are removed, if the remaining attributes are distinctive enough.

Pseudonymisation is explicitly not anonymisation under GDPR. It reduces risk but doesn't eliminate the data's status as personal data. Using pseudonymised production data in development still means processing personal data in the development environment, with all the compliance obligations that entails.

Synthetic data has none of these complications. The data was never real. There's no data subject. No purpose-limitation analysis. No security obligation tied to protecting someone's personal information. The development environment can be shared freely, backed up without concern, and discussed in Slack channels without compliance risks.

Financial-Data Seeding

Applications that handle financial data have additional requirements. Credit card numbers need to pass Luhn validation. BIN prefixes need to correspond to real card networks. Expiry dates need to be in the future (for active-card testing) or in the past (for expired-card testing). CVV lengths need to match the card network (3 digits for Visa/Mastercard, 4 for Amex).

The placeholder card number "4111 1111 1111 1111" passes Luhn and is recognised as a Visa test number, but it doesn't test the application's handling of different card networks. An Amex number is 15 digits with a 4-digit CVV. A Maestro number can be 12 to 19 digits. A form that hardcodes a 16-digit input mask breaks for both. Testing with only the Visa test number misses these code paths entirely.
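The Luhn checksum itself is straightforward to implement, which makes it easy to verify that generated card numbers in the seed data are structurally valid:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right;
    if doubling exceeds 9, subtract 9; the total must be divisible by 10."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True  (Visa test number)
print(luhn_valid("378282246310005"))      # True  (15-digit Amex test number)
print(luhn_valid("4111 1111 1111 1112"))  # False (checksum broken)
```

Note that Luhn validity says nothing about the number of digits or the CVV length; those network-specific rules need their own checks in the generator.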

Currency formatting also needs variety in the seed data. Not every currency uses two decimal places. The Kuwaiti dinar uses three. The Japanese yen uses zero. A price display function that assumes two decimal places will render incorrectly for both. The seed data should include records with different currencies to verify that the formatting logic handles the variation.
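One way to handle this is to store amounts in minor units and look up the currency's decimal exponent at display time. The table below is a small illustrative subset of the ISO 4217 minor-unit exponents:

```python
# ISO 4217 minor-unit exponents (subset for illustration).
CURRENCY_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0, "KWD": 3}

def format_amount(minor_units: int, currency: str) -> str:
    """Render an amount stored in minor units (cents, fils, etc.)."""
    exp = CURRENCY_EXPONENT[currency]
    if exp == 0:
        return f"{minor_units} {currency}"  # yen has no minor unit
    major, minor = divmod(minor_units, 10 ** exp)
    return f"{major}.{minor:0{exp}d} {currency}"

print(format_amount(999, "USD"))  # 9.99 USD
print(format_amount(999, "KWD"))  # 0.999 KWD
print(format_amount(999, "JPY"))  # 999 JPY
```

Seed records priced in JPY and KWD will immediately expose any display code that hardcodes two decimal places.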

Handling Email-Dependent Features

Many applications send emails: welcome messages, password resets, order confirmations, notification digests. Testing these features in development requires email addresses that can receive mail, or at least an email capture mechanism that intercepts outgoing messages.

MailHog, MailCatcher, and similar tools capture all outgoing email in a local web interface. The seed data's email addresses don't need to be real; they just need to look real enough that the email-sending logic doesn't reject them. "user@example.com" works for basic testing but doesn't exercise email validation that checks for valid MX records on the domain.
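If the application happens to be Django, pointing development email at a local MailHog instance is a few settings (MailHog's default SMTP port is 1025):

```python
# Django development settings: route all outgoing mail to MailHog's
# local SMTP listener instead of a real mail server.
EMAIL_BACKEND = "django.core.mail.backends.smtp.EmailBackend"
EMAIL_HOST = "localhost"
EMAIL_PORT = 1025
EMAIL_USE_TLS = False
```

Every message the seed-driven app sends then appears in MailHog's web UI (port 8025 by default) rather than reaching a real mailbox.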

For applications that validate email domains (checking that the domain has MX records, that the domain isn't a known disposable provider, or that the domain matches the user's declared country), the seed data needs to use email addresses on real domains. Synthetic profiles that include email addresses on country-appropriate providers (gmx.de for German profiles, laposte.net for French ones) pass these validation checks without involving real mailboxes.

Scaling Up for Load Testing

Load testing requires a different scale than development seeding. Where development might need hundreds or thousands of records, load testing might need millions. The seed generation needs to be fast enough that producing this volume is practical.

Bulk database inserts are dramatically faster than individual ORM saves. A Django seed script that calls Model.objects.create() in a loop for a million records will take hours. The same script using Model.objects.bulk_create() with batch sizes of 5,000-10,000 will take minutes. The data generation itself should be the bottleneck, not the database insertion.
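A small batching helper makes the pattern reusable regardless of ORM. The Django usage shown in the trailing comment assumes a `User` model and an iterable of record dicts; the helper itself is plain Python:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`.

    Each yielded list is one bulk-insert batch.
    """
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Django usage (sketch -- assumes a User model and a record generator):
# for chunk in batched(record_iter, 5000):
#     User.objects.bulk_create([User(**r) for r in chunk])
```

Batch sizes around 5,000-10,000 usually balance memory use against round-trip overhead; far larger batches can hit query-size limits on some databases.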

The load-testing dataset should include realistic distributions. Not every user has a shopping cart. Not every product has reviews. Not every order is completed. The distribution of data across states and categories should resemble production patterns, even if the individual records are synthetic. A load test against a database where every user has exactly five orders doesn't test the same query paths as a database where order counts follow a realistic power-law distribution.
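One way to sketch such a distribution with the standard library is a Pareto draw for order counts, with a fraction of users carrying no orders at all. The 30% zero-order share and the shape parameter 1.5 are illustrative assumptions, not production measurements:

```python
import random

def order_count(rng: random.Random) -> int:
    """Heavy-tailed order count: some users have none, most have a few,
    and a long tail has many (Pareto distribution, illustrative params)."""
    if rng.random() < 0.30:
        return 0
    return int(rng.paretovariate(1.5))  # paretovariate() returns >= 1.0

rng = random.Random(7)
counts = [order_count(rng) for _ in range(10_000)]
```

A load test seeded this way exercises both the empty-cart code path and the pagination and aggregation behaviour for outlier accounts with many orders.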

The cost of discovering a layout bug, a validation gap, or a query-performance issue shifts leftward with every stage of the pipeline that uses realistic data. Development is cheaper than staging. Staging is cheaper than production. Production bugs that reach end users carry the highest cost in engineering time, customer trust, and support overhead. Replacing placeholder records with synthetic data at the earliest stage is the most efficient way to shift that cost curve. The fifty rows of "John Doe" were always a liability disguised as convenience.