A name and an address do not make a realistic profile. Plenty of generators will spit out a random first name, a random street, a random phone number, and call it done. The result looks plausible for about three seconds. Then someone notices the German first name paired with a Brazilian postal code, the UK phone number formatted with a US area code, and a credit card BIN that belongs to a bank that doesn't operate in the declared country. Each field might be valid in isolation. Together they form something no real person's data would ever look like.
The problem isn't generating data. The problem is generating data where every field reinforces every other field. Internal consistency is what separates a synthetic profile that passes automated validation from one that gets flagged as test data on first contact.
The Dependency Chain
Think of a synthetic profile as a directed graph where each node constrains the nodes downstream. Country is the root. Everything branches from it. The country determines the name pool, the address format, the phone number structure, the postal code pattern, the national ID type, the tax identification format, and the available card networks. Change the country, and every downstream field has to change with it.
This is where most generators fall apart. They treat each field as independent. A random name function pulls from a global list. A random address function generates a plausible-looking street in a plausible-looking city. A random phone function produces digits in roughly the right count. The fields don't talk to each other. The profile that comes out is a collage of fragments from different countries, different conventions, different systems.
Realistic generation works the other way around. Start with a country. Derive the name from that country's naming conventions (patronymic rules in Iceland, family name first in Hungary, compound surnames in Spain). Derive the address from that country's structure. Derive the phone number from the correct international prefix and area code rules. Derive the card details from BIN ranges assigned to issuers in that country. Every decision flows from the one before it.
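The chain can be sketched as a generator in which every field is derived from the country chosen first. A minimal sketch; the data tables and field set here are hypothetical placeholders, not any particular product's implementation:

```python
import random

# Hypothetical per-country tables; a real generator would load far larger
# pools. The point is the shape: every field derives from the country.
COUNTRY_DATA = {
    "DE": {
        "first_names": ["Lukas", "Leon", "Anna"],
        "last_names": ["Müller", "Schmidt"],
        "phone_prefix": "+49",
        "postal_digits": 5,
    },
    "FR": {
        "first_names": ["Camille", "Louis"],
        "last_names": ["Martin", "Bernard"],
        "phone_prefix": "+33",
        "postal_digits": 5,
    },
}

def generate_profile(country: str, rng: random.Random) -> dict:
    """Derive every downstream field from the country so they cannot disagree."""
    data = COUNTRY_DATA[country]
    return {
        "country": country,
        "first_name": rng.choice(data["first_names"]),
        "last_name": rng.choice(data["last_names"]),
        "phone": f"{data['phone_prefix']} {rng.randint(10**8, 10**9 - 1)}",
        "postal_code": "".join(
            str(rng.randint(0, 9)) for _ in range(data["postal_digits"])
        ),
    }

profile = generate_profile("DE", random.Random(42))
```

Because the country is the single input, a German profile can never acquire a French phone prefix: the mismatch is structurally impossible rather than merely unlikely.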
Geographic Coherence in Detail
Country-level consistency is the baseline. City-level consistency is where things get interesting and where most synthetic data quietly breaks.
A French profile with a Paris address should have a postal code starting with 75; one with a Marseille address should have a postal code starting with 13. A postal code of 75001 paired with a city of Lyon is wrong. No real person's records contain that combination. An automated system checking address coherence will flag it.
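A coherence check for this is a straightforward prefix lookup, since French postal codes begin with the two-digit department number. The city table below is an illustrative three-entry sample:

```python
# Illustrative sample; a production table would cover every commune.
FR_CITY_DEPARTMENT = {"Paris": "75", "Marseille": "13", "Lyon": "69"}

def postal_matches_city(postal_code: str, city: str) -> bool:
    """True if the postal code's department prefix matches the declared city."""
    prefix = FR_CITY_DEPARTMENT.get(city)
    return prefix is not None and postal_code.startswith(prefix)
```

Here postal_matches_city("75001", "Paris") holds, while the 75001/Lyon pairing fails, which is exactly the check address verification services run.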
Phone area codes have the same issue. In the US, a New York City address paired with a 310 area code (Los Angeles) isn't impossible - people move and keep their numbers - but it's a signal. For testing purposes, a profile where the area code matches the city is more useful because it doesn't introduce confounding variables. The test should be exercising the payment form or the registration flow, not accidentally triggering the fraud detection layer.
Time zones add another layer. A profile claiming to be in Berlin should have timestamps consistent with CET/CEST. If the profile is used to simulate user activity, login times at 3 AM local time every day look automated. A profile that generates activity during plausible waking hours for its declared timezone behaves the way a real user's data would.
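One way to get that behaviour is to pin activity timestamps to waking hours in the profile's declared zone. A minimal sketch using the standard-library zoneinfo module; the 07:00-23:00 window is an assumption, not a universal rule:

```python
import random
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

def plausible_login(year: int, month: int, day: int,
                    tz_name: str, rng: random.Random) -> datetime:
    """Return a timezone-aware login timestamp during plausible waking hours."""
    hour = rng.randint(7, 22)      # 07:00-22:59 local time
    minute = rng.randint(0, 59)
    return datetime(year, month, day, hour, minute, tzinfo=ZoneInfo(tz_name))

ts = plausible_login(2024, 3, 15, "Europe/Berlin", random.Random(1))
```

A more realistic model would weight the distribution (evening peaks, weekend shifts), but even this flat window avoids the 3 AM regularity that marks a profile as automated.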
Then there's the subtlety of address formatting itself. In Germany, the house number comes after the street name (Hauptstraße 42, not 42 Hauptstraße). In Japan, addresses go from largest to smallest: prefecture, city, ward, block, building. In the Netherlands, postal codes follow a four-digit-two-letter pattern (1234 AB). Getting the content right but the format wrong is a giveaway. Systems that parse addresses by position will misread a Japanese address formatted in Western order.
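A formatter that respects these conventions only needs to branch on country. This simplified sketch covers two of the orderings mentioned above; Japanese large-to-small ordering would need its own code path:

```python
def format_street_line(country: str, street: str, number: str) -> str:
    """House number after the street name in German-style addresses,
    before it in US/UK-style ones. A deliberately minimal two-way split."""
    if country in {"DE", "AT", "CH"}:
        return f"{street} {number}"
    return f"{number} {street}"
```

So format_street_line("DE", "Hauptstraße", "42") yields "Hauptstraße 42", while the same inputs under "US" yield "42 Hauptstraße" ordering.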
Demographic Consistency
Names carry demographic signals that need to align with the rest of the profile. A first name can suggest approximate generation (the most popular baby names in 1985 differ from those in 2005), gender in languages with gendered names, and cultural or ethnic background within a country. None of these need to be deterministic - real populations are diverse - but statistical plausibility matters.
A profile for a 65-year-old Japanese woman named "Yuki" is plausible. A profile for a 65-year-old Japanese woman named "Brooklyn" is not, unless the profile specifically represents someone raised abroad. A profile for a 22-year-old German man named "Wolfgang" is technically possible but statistically unusual - that name peaked decades ago. "Lukas" or "Leon" would be more typical for that birth year.
Date of birth interacts with other fields too. The birth date constrains the age, which constrains the expiry date on a driving licence (some countries issue them for fixed terms based on age), which constrains the national ID number format (some countries encode the birth year in the ID). A profile that claims a birth year of 1990 but has a national ID encoding 1975 has a detectable inconsistency. Most humans wouldn't notice. Automated verification systems will.
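Catching that inconsistency means decoding the ID. Using the French NIR layout as the example (digit 1 is sex, digits 2-3 the two-digit birth year, digits 4-5 the birth month), a simplified coherence check looks like:

```python
def nir_matches_birth(nir: str, birth_year: int, birth_month: int) -> bool:
    """Verify the birth year and month encoded in a French NIR against the
    profile's declared date of birth. Simplified: ignores the special month
    codes used when an exact birth date is unknown."""
    return (nir[1:3] == f"{birth_year % 100:02d}"
            and nir[3:5] == f"{birth_month:02d}")
```

A NIR beginning "1 90 03" is consistent with a March 1990 birth date; pair it with a declared 1975 birth year and the check fails, just as an automated verifier would.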
Gender markers create their own consistency requirements. In countries where national IDs encode gender (many do, through odd/even digit rules or explicit gender fields), the ID number has to match the declared gender. In countries with gendered name endings (Russian, Czech, Polish), the surname suffix has to match. Aleksandra Novák in a Czech profile needs the feminine form: Nováková. Getting this wrong doesn't just look fake - it looks like someone doesn't understand the language.
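The Czech case can be sketched with two common patterns; this is a heavy simplification, since real feminisation rules have many more exceptions and some surnames do not change at all:

```python
def feminize_czech_surname(surname: str) -> str:
    """Two common patterns only: adjectival surnames in -ý take -á
    (Veselý -> Veselá); most others take the -ová suffix
    (Novák -> Nováková). Real rules are considerably messier."""
    if surname.endswith("ý"):
        return surname[:-1] + "á"
    return surname + "ová"
```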
Financial Data Integrity
Credit card numbers aren't random digit strings. They follow a structure: a BIN (Bank Identification Number) prefix that identifies the issuer and network, a variable-length account number, and a check digit calculated via the Luhn algorithm. A realistic synthetic card number needs all three layers to be correct.
The BIN prefix has to correspond to a real issuer in the profile's country. A German profile with a Visa card should use a BIN assigned to a German bank. A Brazilian profile with an Elo card should use an Elo BIN range. The BIN prefix also determines the card network, which affects the number length (15 for Amex, 16 for most Visa/Mastercard, up to 19 for some Maestro) and the CVV length (4 for Amex, 3 for everyone else).
The Luhn check digit is a simple algorithm, but skipping it produces card numbers that fail the first validation step in any payment form. Every real card number passes Luhn, and every synthetic card number should too; otherwise the test never reaches the actual payment processing logic - it gets stopped at client-side validation, which is not what's being tested.
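The algorithm itself fits in a few lines. This sketch computes the check digit and assembles a full number from a BIN prefix; the prefix used in the example is the well-known 424242... test-Visa pattern, not a real issuer assignment:

```python
import random

def luhn_check_digit(partial: str) -> int:
    """Check digit that makes `partial + digit` pass the Luhn test."""
    total = 0
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:          # doubled positions, counting from the check digit
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def luhn_valid(number: str) -> bool:
    return luhn_check_digit(number[:-1]) == int(number[-1])

def card_number(bin_prefix: str, length: int, rng: random.Random) -> str:
    """Fill the account-number positions randomly, then append the check digit."""
    body = bin_prefix + "".join(
        str(rng.randint(0, 9)) for _ in range(length - len(bin_prefix) - 1)
    )
    return body + str(luhn_check_digit(body))

number = card_number("424242", 16, random.Random(7))
```

Every number card_number produces passes luhn_valid by construction; choosing a BIN that actually belongs to an issuer in the profile's country is a separate lookup-table problem.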
Expiry dates have their own constraints. A card that expired two years ago is useful for testing expired-card handling, but the default case should be a future date. The expiry month and year should be plausible - cards are typically issued for three to five years, so an expiry date 15 years in the future looks wrong. Small details. Collectively, they're the difference between synthetic data that blends in and synthetic data that screams "test."
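Generating a plausible expiry is a matter of bounding the term. A sketch, assuming the three-to-five-year issuance window described above:

```python
import random
from datetime import date

def plausible_expiry(issue_date: date, rng: random.Random) -> str:
    """Return an MM/YY expiry three to five years after the issue date,
    matching typical card issuance terms."""
    term_years = rng.randint(3, 5)
    return f"{issue_date.month:02d}/{(issue_date.year + term_years) % 100:02d}"

expiry = plausible_expiry(date(2024, 6, 1), random.Random(0))
```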
National Identification Documents
National ID numbers aren't arbitrary. Most countries embed structured information in them. The US Social Security Number has an area-group-serial structure (though the 2011 randomisation change loosened the geographic meaning of the area number). The UK National Insurance Number has a specific letter-digit pattern with certain prefix combinations reserved or invalid. The Brazilian CPF includes two check digits calculated from the preceding nine. The French NIR (numéro de sécurité sociale) encodes gender, birth year, birth month, department of birth, and commune of birth in a 13-digit sequence plus a two-digit key.
A synthetic French profile with a NIR that encodes a different birth year than the profile's declared date of birth has a detectable inconsistency. A synthetic Brazilian profile with a CPF that fails the check-digit calculation will be rejected by any system that validates CPFs. A synthetic German profile with a Steueridentifikationsnummer (tax ID) that doesn't follow the 11-digit, one-repeated-digit rule will fail automated verification.
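The CPF calculation is representative of how these validations work: weighted sums modulo 11 over the preceding digits. A sketch of both the generator side and the validator side:

```python
def cpf_check_digits(base9: str) -> str:
    """Compute the two CPF check digits from the first nine digits.
    Each is a weighted sum mod 11; remainders below 2 map to 0."""
    digits = [int(d) for d in base9]
    for start_weight in (10, 11):
        total = sum(d * w for d, w in zip(digits, range(start_weight, 1, -1)))
        remainder = total % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return f"{digits[9]}{digits[10]}"

def cpf_valid(cpf: str) -> bool:
    """Validate an 11-digit CPF string (no punctuation)."""
    return cpf_check_digits(cpf[:9]) == cpf[9:]
```

The base "111444777" yields check digits "35", the often-cited example CPF 111.444.777-35; flip the final digit and cpf_valid rejects it, which is the rejection described above.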
Getting these right requires country-specific generation logic. There's no universal formula. Each country's ID system has its own rules, its own validation algorithms, and its own edge cases. That's partly why so many synthetic data generators skip national IDs entirely or generate obviously fake ones - the implementation effort is significant. But for any testing scenario that involves identity verification, incomplete national IDs are a gap in coverage.
Email Address Realism
Email addresses carry more information than people assume. A first name and last name concatenated with a dot at gmail.com is plausible. A string of random characters at a domain that doesn't exist is not. A realistic synthetic email should use the profile's name (or a recognisable derivative), append a plausible domain, and avoid patterns that spam filters or manual reviewers would flag.
Domain selection matters. Free providers (Gmail, Yahoo, Hotmail) are common for personal profiles. Corporate domains are appropriate for business profiles. Country-specific providers add realism: mail.ru for Russian profiles, web.de or gmx.de for German profiles, laposte.net for French profiles. A Japanese profile with a @yahoo.co.jp address is more plausible than one with @yahoo.com. The distinction is minor but cumulative. Enough minor wrong details and the profile stops passing the smell test.
Username patterns should match the cultural context. Western email addresses commonly use firstname.lastname or firstnamelastname formats. Japanese addresses might use romanised names or initials. Profiles for younger demographics might include birth years or short handles rather than full names. A 60-year-old German professor with the email "xXDarkWolf2003Xx@gmail.com" breaks the suspension of disbelief, even if the address is technically valid.
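Generation follows directly from the profile: fold the name to ASCII, build a local part, and draw the domain from a country-appropriate pool. The provider lists here are illustrative samples, not market-share data:

```python
import random
import unicodedata

# Illustrative provider pools per country; a real generator would
# weight these by actual usage.
DOMAINS = {
    "DE": ["web.de", "gmx.de", "gmail.com"],
    "FR": ["laposte.net", "orange.fr", "gmail.com"],
    "JP": ["yahoo.co.jp", "gmail.com"],
}

def ascii_fold(s: str) -> str:
    """Strip accents so 'Müller' becomes 'Muller' rather than mojibake."""
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()

def make_email(first: str, last: str, country: str, rng: random.Random) -> str:
    """firstname.lastname at a provider plausible for the country."""
    local = f"{ascii_fold(first)}.{ascii_fold(last)}".lower()
    return f"{local}@{rng.choice(DOMAINS[country])}"

email = make_email("Lukas", "Müller", "DE", random.Random(3))
```

The firstname.lastname pattern is only one option; a fuller generator would vary the local-part format by the profile's age and country, as described above.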
Testing for Edge Cases
Beyond baseline consistency, realistic profiles need to handle edge cases that exist in real populations. Hyphenated surnames. Apostrophes in names (O'Brien, D'Angelo). Names with non-ASCII characters (ñ, ü, ø, ß). Addresses with apartment numbers, building names, floor numbers. Postal codes with spaces (Canadian format: A1A 1A1). Phone numbers with extensions.
These edge cases break systems that weren't designed for them. A name field that strips apostrophes turns "O'Brien" into "OBrien." A phone number parser that doesn't handle the +44 prefix for UK numbers will misparse the entire number. An address field with a 30-character limit silently truncates German street names like "Friedrichswerdersche Kirchstraße." Each of these is a real bug in a real system, and each is caught by generating profiles that include these patterns.
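A practical way to exercise these patterns is a fixture list fed through the fields under test. The values below are drawn from the examples above; the 30-character limit is the hypothetical constraint being probed:

```python
# Edge-case values real populations contain; each pattern has broken a
# real system somewhere.
EDGE_CASE_VALUES = [
    "O'Brien",                            # apostrophe
    "Müller-Lüdenscheid",                 # hyphen plus non-ASCII
    "Friedrichswerdersche Kirchstraße",   # long German street name
    "A1A 1A1",                            # Canadian postal code with space
    "+44 20 7946 0958",                   # UK number, international prefix
]

def fits_field(value: str, max_len: int = 30) -> bool:
    """Does the value survive a length-limited field without truncation?"""
    return len(value) <= max_len

truncated = [v for v in EDGE_CASE_VALUES if not fits_field(v)]
```

Only the street name exceeds the 30-character field here, which is exactly the silent-truncation bug described above; real test suites would check encoding and parsing survival too, not just length.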
Multi-nationality profiles are another edge case worth generating. People who live in one country but have identity documents from another. A Brazilian citizen living in Portugal with a Portuguese phone number, a Brazilian CPF, and a Portuguese NIF (tax number). These profiles test whether systems can handle mixed-country data without rejecting it as inconsistent. The answer, frequently, is that they can't - which is precisely why testing for it matters.
Why All of This Matters
A profile where every field is independently valid but collectively incoherent fails in three predictable ways.
Automated validation catches it. Address verification services compare the postal code against the city. Phone validation APIs check the area code against the country. Payment processors compare the card BIN against the billing address country. Any mismatch triggers a flag, a rejection, or a manual review queue. If the test data itself triggers these flags, the test isn't exercising the application logic - it's exercising the fraud detection layer. That's a different test entirely.
Manual review spots it. Researchers investigating a platform with a thin, contradictory profile risk having the account suspended before the investigation produces results.
And it contaminates the test environment itself. QA testers with obviously fake data fill staging environments with records that don't resemble production data, which means the staging environment stops being a useful predictor of production behaviour.
Generators like Another.IO build profiles where the dependency chain is maintained end to end: country to name to address to phone to card to national ID. The profile doesn't need to belong to a real person. It needs to be indistinguishable from one in every system that processes it. That's a higher bar than random generation, and it's the bar that matters.