First names and regional patterns in UK universities

research data bite 20.

Jun 05, 2025

Key takeaways
Even simple metadata fields, like first names, can yield rich insights when seen in a geographic and statistical context. Bibliometrics is not just about citation counts, but also about tracing the social fabric of the research system.
Despite the mobility and internationalism of academia, regional cultural signals persist; first names still carry the imprint of place, identity, and local academic ecosystems.

Dimensions’ approach to researcher disambiguation is deliberately conservative. We would rather split a researcher’s portfolio in two than mistakenly merge two distinct individuals. That is often the right call, but it does mean that I occasionally have to curate researcher lists manually to ensure we have the most complete profiles.

That is what happened recently while compiling profiles for over 300 researchers involved in the UKRI-funded Trustworthy Autonomous Systems programme (our report is on figshare). Somewhere in the middle of reviewing names, I started wondering: could a researcher’s first name reveal something about their location, despite researchers’ known mobility?

To explore this, I used the first name stored in each researcher’s aggregated Dimensions profile. Surprisingly, researchers’ names are not always consistent. English names often have a wide range of casual, shortened, or initial-only variants (see this arXiv preprint by Simon and Daniel here at DS for an investigation of the initial era), and middle names may appear or disappear without warning. Many researchers change how they present their name over time, going from Nicholas to Nick, or from Samantha J. to just Sam. However, when Dimensions aggregates profiles, it serves the cleanest version of the name it has access to.
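The query at the end of this post deals with some of this variation by keeping only the first space-separated token of the profile's first name and matching it case-insensitively. Here is a minimal Python sketch of that idea. The helper name `first_token` and the inputs are made up for illustration, and it only loosely mirrors the SQL (it strips leading whitespace before splitting):

```python
def first_token(first_name: str) -> str:
    """Return the first whitespace-separated token of a first-name field, lowercased.

    Hypothetical helper: drops middle names and initials, ignores case.
    """
    parts = first_name.strip().split()
    return parts[0].lower() if parts else ""


# Made-up inputs covering a middle initial, stray whitespace, upper case, and an empty field.
examples = ["Samantha J.", "  Nicholas", "SIAN", ""]
normalised = [first_token(name) for name in examples]
print(normalised)
```

Note that this does nothing to reconcile Nick with Nicholas; short forms are still treated as separate names.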

Geographic patterns are best spotted with maps, but plotting every institution in the UK would have made London institutions unreadable. So instead, I grouped researchers by the 12 NUTS1 UK regions (already available via GRID metadata) and asked a simpler question: which names are disproportionately common in which regions?
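As a rough illustration of the grouping step, the sketch below counts distinct researchers per region from authorship records. The institution identifiers, the lookup table, and the function name are all invented for the example; in the real analysis the region comes from GRID metadata.

```python
from collections import defaultdict

# Hypothetical lookup from institution identifier to NUTS1 region.
GRID_TO_REGION = {
    "grid.a": "Scotland",
    "grid.b": "Wales",
    "grid.c": "London",
}

# Hypothetical (researcher_id, institution_id) pairs, one per authorship.
authorships = [
    ("r1", "grid.a"),
    ("r1", "grid.a"),  # same researcher, second paper: counted once
    ("r2", "grid.b"),
    ("r3", "grid.c"),
    ("r3", "grid.a"),  # a researcher seen in two regions counts in both
    ("r4", "grid.x"),  # institution with no known region: dropped
]


def researchers_per_region(pairs, lookup):
    """Count distinct researchers per region, skipping unmapped institutions."""
    seen = defaultdict(set)
    for researcher_id, institution_id in pairs:
        region = lookup.get(institution_id)
        if region is not None:
            seen[region].add(researcher_id)
    return {region: len(ids) for region, ids in seen.items()}


region_totals = researchers_per_region(authorships, GRID_TO_REGION)
print(region_totals)
```

Because a mobile researcher can appear in more than one region, the regional totals can add up to more than the national total of distinct researchers.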

To answer that, I calculated a z-score for each name–region pair. This compares the number of researchers with a given name in a region against the number expected if the name were spread in line with its national share, and scales the difference by the square root of the expected count. A high z-score suggests local overrepresentation; a low or negative one, relative scarcity.
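In the query at the end of this post, the z-score works out as (observed − expected) / √expected, where expected is the region's researcher total multiplied by the name's national share. A small Python sketch of the same arithmetic follows; the function name is mine and the counts are made up purely to show the calculation.

```python
import math


def z_score(name_count, region_total, national_name_total, national_total):
    """Standardised gap between observed and expected name count in a region.

    expected = region_total * (national_name_total / national_total)
    Returns None when the expected count is not positive.
    """
    expected = region_total * national_name_total / national_total
    if expected <= 0:
        return None
    return (name_count - expected) / math.sqrt(expected)


# Made-up counts: 100,000 researchers nationally, 400 of them with the name;
# the region has 10,000 researchers, 64 of them with the name.
# Expected regional count is 10,000 * 400 / 100,000 = 40.
z = z_score(name_count=64, region_total=10_000,
            national_name_total=400, national_total=100_000)
print(round(z, 2))
```

With these invented numbers the name appears 64 times where 40 would be expected, which gives a z-score of roughly 3.8.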

There are nearly 80,000 first names among researchers who published with a UK affiliation between 2015 and 2024, so mapping all of them would have overwhelmed the visualisation. With the help of ChatGPT and Gemini, I selected a diverse sample of 40: a mix of masculine, feminine, unisex, culturally distinctive, and regionally evocative names. You will find the resulting map below; click a name at the top to explore how its regional prevalence varies.

Regional prevalence summary

The table below summarises how specific first names stand out across UK regions based on their relative prevalence. Using z-scores as a standardised measure, names are grouped into three categories per region:

Strong prevalence: z-scores greater than 6, indicating names that are highly overrepresented locally.
Mild prevalence: z-scores between 2 and 6, suggesting moderate regional distinctiveness.
Low prevalence: z-scores below 2, included here only when they’re the highest for that name.
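The three bands can be written down as a tiny function. This is only a sketch: the function name is hypothetical, and treating scores of exactly 2 and exactly 6 as mild is my assumption, since the bands above do not say which side the boundaries fall on.

```python
def prevalence_category(z: float) -> str:
    """Map a z-score to one of the three bands used in the summary table.

    Assumption: exactly 2 and exactly 6 both count as "mild".
    """
    if z > 6:
        return "strong"
    if z >= 2:
        return "mild"
    return "low"


# Made-up scores, including the two boundary values and a negative score.
categories = [prevalence_category(z) for z in [7.5, 6, 2, 1.9, -3]]
print(categories)
```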

This does not reflect where a name is most common in absolute terms, but where it is most distinctive relative to national patterns. For instance, a name might have moderate presence across many regions but only appear here where it stands out most. As such, the table should be read as a guide to localised naming patterns, not total name counts.
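One way to read that inclusion rule is: keep every name–region pair with a z-score of at least 2, and for names that never reach 2, keep only the region where they score highest. The sketch below implements that reading with placeholder names, regions, and scores; it is my interpretation of the rule, not necessarily how the table was actually assembled.

```python
def table_rows(scores):
    """Select (name, region, z) rows for a summary table.

    scores: dict mapping (name, region) -> z-score.
    A row is kept if z >= 2, or if the region is the name's highest-scoring one.
    """
    best = {}
    for (name, region), z in scores.items():
        if name not in best or z > best[name][1]:
            best[name] = (region, z)

    rows = []
    for (name, region), z in scores.items():
        if z >= 2 or best[name][0] == region:
            rows.append((name, region, z))
    return sorted(rows)


# Placeholder data, invented for illustration only.
scores = {
    ("NameA", "North"): 9.0,
    ("NameA", "South"): -2.0,
    ("NameB", "North"): 0.4,
    ("NameB", "South"): 1.2,   # below 2, but NameB's highest score
    ("NameC", "North"): 2.5,
    ("NameC", "South"): 7.0,
}
rows = table_rows(scores)
for row in rows:
    print(row)
```

Under this reading every name shows up at least once, even if it is not especially distinctive anywhere.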

Conclusion

What is striking about this small experiment is how much can be inferred from just one field in a researcher’s publication profile. Dimensions aggregates researcher-level data across outputs, and while the focus is often on citations, collaboration, or funders, something as simple as a first name and a location can reveal unexpected sociocultural patterns.

Researchers are often portrayed as highly mobile: the academic career path encourages geographic movement, especially across countries and regions. And yet, in the data, we still observe clusters of names that resonate with local or regional identity. Welsh names like Lowri and Rhys are indeed most overrepresented in Wales; Fiona and Euan stand out in Scotland. Despite the global nature of academic publishing, place still shows up in subtle but persistent ways.

This suggests that even in a system designed to flatten borders through common publication platforms, citation networks, and international collaborations, some regional patterns endure. And thanks to researcher profiles like those in Dimensions, these patterns are now easier than ever to explore.

Code

Code written by DimQuery Assistant, GPT for Dimensions on GBQ. Contact me to test it.

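-- One row per authorship with a UK-affiliated institution, 2015 to 2024.
WITH author_data AS (
  SELECT
    -- Keep only the first token of the first name (drops middle names and initials).
    TRIM(SPLIT(r.first_name, ' ')[OFFSET(0)]) AS first_name,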
    r.id AS researcher_id,
    aff.grid_id AS grid_id
  FROM `dimensions-ai.data_analytics.publications` p
  CROSS JOIN UNNEST(p.authors) AS a
  INNER JOIN `dimensions-ai.data_analytics.researchers` r
    ON a.researcher_id = r.id
  CROSS JOIN UNNEST(a.affiliations_address) AS aff
  WHERE
    p.year BETWEEN 2015 AND 2024
    AND aff.country_code = 'GB'
    AND aff.grid_id IS NOT NULL
),

grid_regions AS (
  SELECT
    id AS grid_id,
    address.geonames_city.nuts_level1.name AS nuts1_region,
    types
  FROM `dimensions-ai.data_analytics.grid`
  WHERE address.geonames_city.nuts_level1.name IS NOT NULL
),

selected_names AS (
  SELECT * FROM UNNEST([
    STRUCT("David" AS first_name, "male" AS gender),
    STRUCT("John", "male"),
    STRUCT("James", "male"),
    STRUCT("Michael", "male"),
    STRUCT("Robert", "male"),
    STRUCT("Mohammed", "male"),
    STRUCT("Andrew", "male"),
    STRUCT("Alistair", "male"),
    STRUCT("Rhys", "male"),
    STRUCT("Hamza", "male"),
    STRUCT("Nigel", "male"),
    STRUCT("Euan", "male"),
    STRUCT("Kieran", "male"),
    STRUCT("Rajesh", "male"),
    STRUCT("Callum", "male"),
    STRUCT("Gareth", "male"),
    STRUCT("Omar", "male"),
    STRUCT("Ramesh", "male"),
    STRUCT("Abdul", "male"),
    STRUCT("Aidan", "male"),

    STRUCT("Emma", "female"),
    STRUCT("Sarah", "female"),
    STRUCT("Elizabeth", "female"),
    STRUCT("Catherine", "female"),
    STRUCT("Rebecca", "female"),
    STRUCT("Emily", "female"),
    STRUCT("Aisha", "female"),
    STRUCT("Yasmin", "female"),
    STRUCT("Claire", "female"),
    STRUCT("Olivia", "female"),
    STRUCT("Anna", "female"),
    STRUCT("Fiona", "female"),
    STRUCT("Sian", "female"),
    STRUCT("Bronwyn", "female"),
    STRUCT("Mei", "female"),
    STRUCT("Sinead", "female"),
    STRUCT("Priya", "female"),
    STRUCT("Iona", "female"),
    STRUCT("Lowri", "female"),
    STRUCT("Zara", "female")
  ]) AS selected_names
),

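researcher_with_region AS (
  SELECT
    -- Use the canonical spelling from selected_names so that case variants
    -- matched by the case-insensitive join below are counted as one name.
    sn.first_name,
    sn.gender,
    gr.nuts1_region,
    ad.researcher_id
  FROM author_data ad
  JOIN grid_regions gr
    ON ad.grid_id = gr.grid_id
  JOIN selected_names sn
    ON LOWER(ad.first_name) = LOWER(sn.first_name)
  -- Restrict to institutions typed as "Education" in GRID.
  WHERE "Education" IN UNNEST(gr.types)
),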

all_researchers_with_region AS (
  SELECT
    gr.nuts1_region,
    ad.researcher_id
  FROM author_data ad
  JOIN grid_regions gr
    ON ad.grid_id = gr.grid_id
  WHERE "Education" IN UNNEST(gr.types)
),

name_region_counts AS (
  SELECT
    first_name,
    gender,
    nuts1_region,
    COUNT(DISTINCT researcher_id) AS name_count
  FROM researcher_with_region
  GROUP BY first_name, gender, nuts1_region
),

region_totals AS (
  SELECT
    nuts1_region,
    COUNT(DISTINCT researcher_id) AS region_total
  FROM all_researchers_with_region
  GROUP BY nuts1_region
),

name_totals AS (
  SELECT
    first_name,
    gender,
    COUNT(DISTINCT researcher_id) AS national_name_total
  FROM researcher_with_region
  GROUP BY first_name, gender
),

national_total AS (
  SELECT COUNT(DISTINCT researcher_id) AS total_uk_researchers
  FROM all_researchers_with_region
)

SELECT
  nrc.first_name,
  nrc.gender,
  nrc.nuts1_region,
  nrc.name_count,
  rt.region_total,
  nt.national_name_total,
  nat.total_uk_researchers,
  SAFE_DIVIDE(nrc.name_count, rt.region_total) AS regional_share,
  SAFE_DIVIDE(nt.national_name_total, nat.total_uk_researchers) AS national_share,
  SAFE_DIVIDE(
    SAFE_DIVIDE(nrc.name_count, rt.region_total),
    SAFE_DIVIDE(nt.national_name_total, nat.total_uk_researchers)
  ) AS skew_ratio,
  SAFE_MULTIPLY(
    SAFE_DIVIDE(
      SAFE_DIVIDE(nrc.name_count, rt.region_total),
      SAFE_DIVIDE(nt.national_name_total, nat.total_uk_researchers)
    ),
    LOG(nrc.name_count + 1)
  ) AS weighted_skew,
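  -- z-score: (observed - expected) / SQRT(expected),
  -- where expected = region_total * national share of the name.
  SAFE_DIVIDE(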
    (nrc.name_count - SAFE_MULTIPLY(rt.region_total, SAFE_DIVIDE(nt.national_name_total, nat.total_uk_researchers))),
    SQRT(SAFE_MULTIPLY(rt.region_total, SAFE_DIVIDE(nt.national_name_total, nat.total_uk_researchers)))
  ) AS z_score
FROM name_region_counts nrc
JOIN region_totals rt USING (nuts1_region)
JOIN name_totals nt USING (first_name)
CROSS JOIN national_total nat
ORDER BY gender, first_name, skew_ratio DESC
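
Alongside the z-score, the query returns a skew ratio (regional share divided by national share) and a weighted skew (the skew ratio multiplied by the natural log of the name count plus one). For readers who prefer to check the arithmetic outside BigQuery, here is a short Python sketch of those two columns. The function name is mine and the counts are made up.

```python
import math


def skew_metrics(name_count, region_total, national_name_total, national_total):
    """Return (skew_ratio, weighted_skew) following the arithmetic in the query.

    skew_ratio    = regional share of the name / national share of the name
    weighted_skew = skew_ratio * ln(name_count + 1)
    """
    regional_share = name_count / region_total
    national_share = national_name_total / national_total
    skew_ratio = regional_share / national_share
    weighted_skew = skew_ratio * math.log(name_count + 1)
    return skew_ratio, weighted_skew


# Made-up counts: 64 of 10,000 regional researchers carry the name,
# against 400 of 100,000 nationally.
skew, weighted = skew_metrics(64, 10_000, 400, 100_000)
print(round(skew, 2), round(weighted, 2))
```

The log weighting dampens ratios that rest on only a handful of researchers, so a name held by two people in a small region does not outrank a well-populated one.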
