Blooming research profiles: cultivating identity with ORCiD and Dimensions
Research data bites: 16.
Key takeaways:
It is now easy to access ORCID data with Google BigQuery; we can now enhance Dimensions data.
High engagement in Finland, Poland, South Africa, Zimbabwe, and Ghana; keywords show global interest in Machine Learning and Education.
In small groups, first names are enough for identification, but as populations grow, additional details—last names, middle names, birthdates, and birthplaces—become necessary. To simplify this complexity, unique identifiers like national health or insurance numbers have been introduced. The same logic applies to academia, where researcher profiles have emerged from three primary needs: publishers tracking manuscripts, digital databases aggregating multiple sources, and user-curated profiles.
Although ORCID was not the first attempt at creating cross-platform identifiers, it was the first interoperable researcher persistent identifier—it is researcher controlled, open access, tracks works (publications, grants, peer reviews, datasets, and more added recently) as well as employment, and education history.
Launched in late 2012 by Will Simpson, Laurel Haak, and Laura Paglione, ORCID grew to 21 million records by September 2024. Despite being open-access, the size of the data file made it challenging to work with. Now, with ORCID data in Google BigQuery—where Dimensions also resides—the potential for analysis is nearly limitless. You can find samples here, credit to Simon, and here is a short introduction to the data, some quick analyses and their GBQ code.
ORCID and Dimensions are both truth in their own rights: ORCID provides highly accurate researcher profiles curated by individuals, while Dimensions offers broader coverage of global research trends by aggregating outputs across multiple sources.
The data in GBQ is organised into five key features:
orcid_identifier: the unique identifier.
preferences: language.
history: anything related to the profile creation, access, and verification.
person: name, bio, websites, emails, addresses, and keywords.
activities: educations, employments, fundings, peer_reviews, and works.
Profile creation
We first looked at the number of profiles created each month, presenting both a monthly and cumulative view. The data includes total profiles, profiles with last names, profiles with works, profiles with employment history, and profiles with keywords.
The monthly trend shows dips during the December/January and July/August months (holiday periods in both hemispheres), as well as some jump from March 2024 for data with employment information. The decline in profiles with works suggest that most profiles added are new researchers and will progressively be filled up.
A fun way to look at the data was to explore the time at which the profile based on the country of employment. First, the same way profiles were created during working months (as well as during the working week, not shown here), the same could be said of the time of creation. We see many countries with clearly defined lunch breaks.
Person
3.5 million profiles included at least one keyword. We mapped the most frequent keyword based on the last employment country. Numbers were quite low for most countries in Sept. 2024, so there is a large bias towards those who filled metadata; machine learning and education experts seem to be keen on filling carefully their profiles.
We identified key trends based on employment location:
Machine learning is a clear winner in most countries; 'artificial intelligence' is usually just behind.
Education dominates in Indonesia and The Philipines, and its Spanish and Portuguese versions in Central and South America.
Medicina general integral is prominent in Cuba.
Climate change appears the most frequently in South Africa; public health in Ghana.
Central Europe shows a diverse mix with molecular biology in Czechia, management in Ukraine and Romania, veterinary (ветеринария) in Belarus.
Some researchers entered university names rather than research topics: Circle U in Serbia, SZAOK (University of Szeged) in Hungary, national chengchi university in Taiwan. Or even a country with panamá.
Activities—works
With ORCID data now in BigQuery, integrating it with Dimensions and other datasets is seamless. We compared researcher profiles across both platforms, using World Bank data (also in GBQ) to contextualize results by region and income level. Instead of relying on employment records, we identified researchers through their ORCID-listed works, matched them in Dimensions, and approximated their country. Given the high values of China and the US, we used log scales for the axes.
Based on this data, we calculated the ORCID adoption ratio, and found that:
Bhutan, Finland, Poland, and South Africa stand out for their exceptionally high ORCID adoption rates; as well as Zimbabwe and Ghana, both Lower Middle Income countries.
China, Brazil, and Andorra underperform in ORCID adoption relative to their research output; adoption in Japan and Spain is low considering these are High Income countries.
Some small nations, like Malta and Iceland, have surprisingly high ORCID engagement.
Conclusion
With ORCID now in Google BigQuery, interoperability across research datasets is seamless, enabling large-scale analysis by integrating it with Dimensions or open datasets such as World Bank data. This allows for tracking researcher profiles, affiliations, and works across platforms, uncovering trends in ORCID adoption, research mobility, and metadata completeness.
Our analysis revealed unexpected adoption patterns—high engagement in Finland, Poland, and South Africa, and lower uptake in China, Japan, and Spain despite their research output. Linking ORCID works to Dimensions data also highlighted national research focuses, from machine learning dominance in Europe to education and public health priorities in Southeast Asia and Africa.
With ORCID's extensive researcher profiles, we can now enhance Dimensions data for relevant projects.
Code
Profile creation
WITH submissiondate AS (
SELECT
orcid_identifier.path AS orcid_id,
MIN(TIMESTAMP(history.submission_date)) AS submission_date
FROM `ds-open-datasets.orcid.summaries_2024`
WHERE history.submission_date IS NOT NULL
GROUP BY orcid_id
),
monthly_data AS (
SELECT
EXTRACT(YEAR FROM sd.submission_date) AS year,
EXTRACT(MONTH FROM sd.submission_date) AS month,
COUNT(*) AS orcid_count,
COUNTIF(ARRAY_LENGTH(o.activities.employments.groups) > 0) AS employment_not_null,
COUNTIF(o.person.name.family_name IS NOT NULL) AS last_name_not_null,
COUNTIF(o.person.keywords IS NOT NULL) AS keywords_not_null,
COUNTIF(ARRAY_LENGTH(o.activities.works.groups) > 0) AS works_not_null
FROM submissiondate sd
JOIN `ds-open-datasets.orcid.summaries_2024` o
ON o.orcid_identifier.path = sd.orcid_id
GROUP BY year, month
)
SELECT
CONCAT(CAST(year AS STRING), "-", CAST(month AS STRING)) AS yearmonth,
orcid_count,
employment_not_null,
last_name_not_null,
keywords_not_null,
works_not_null,
SUM(orcid_count) OVER (ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS orcid_count_cumu,
SUM(employment_not_null) OVER (ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS employment_not_null_cumu,
SUM(last_name_not_null) OVER (ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_name_not_null_cumu,
SUM(keywords_not_null) OVER (ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS keywords_not_null_cumu,
SUM(works_not_null) OVER (ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS works_not_null_cumu
FROM monthly_data
ORDER BY year, month
Person—keywords
We use the country of the last employment found in the `activities` feature.
WITH latest_employment AS (
-- For each ORCID profile, get the latest employment record
SELECT
orcid_id,
country
FROM (
SELECT
orcid_identifier.path AS orcid_id,
rec.organization.address.country AS country,
ROW_NUMBER() OVER (
PARTITION BY orcid_identifier.path
ORDER BY TIMESTAMP(rec.last_modified_date) DESC
) AS rn
FROM `ds-open-datasets.orcid.summaries_2024`,
UNNEST(activities.employments.groups) AS grp,
UNNEST(grp.records) AS rec
WHERE rec.organization.address.country IS NOT NULL
)
WHERE rn = 1
),
keywords_by_profile AS (
-- Extract keywords from each ORCID profile
SELECT
orcid_identifier.path AS orcid_id,
LOWER(kw.content) AS keyword
FROM `ds-open-datasets.orcid.summaries_2024`,
UNNEST(person.keywords.keywords) AS kw
),
joined AS (
-- Join the latest employment country to the keywords using the profile ORCID id
SELECT
le.country,
kp.keyword
FROM latest_employment le
JOIN keywords_by_profile kp
ON le.orcid_id = kp.orcid_id
),
keyword_counts AS (
-- Count the frequency of each keyword per country
SELECT
country,
keyword,
COUNT(*) AS keyword_count
FROM joined
GROUP BY country, keyword
),
ranked AS (
-- Rank the keywords per country based on their counts (highest first)
SELECT
country,
keyword,
keyword_count,
ROW_NUMBER() OVER (PARTITION BY country ORDER BY keyword_count DESC) AS rn
FROM keyword_counts
)
SELECT
country AS country_code,
short_name AS country_name,
keyword,
keyword_count
FROM ranked
LEFT JOIN `bigquery-public-data.world_bank_wdi.country_summary` wdi
ON ranked.country = wdi.two_alpha_code
WHERE rn <= 1
AND keyword_count > 20
ORDER BY country_name, keyword_count DESC;
Activities
In Dimensions, we could either use the researchers who have been active since 2012 (last_publication_year > 2012), but here we use the
WITH dimensions_profiles AS (
SELECT affadd.country_code AS CountryCode, COUNT(DISTINCT researcher_id) AS count_dim_res
FROM `dimensions-ai.data_analytics.publications`
CROSS JOIN UNNEST(authors) auth
CROSS JOIN UNNEST(auth.affiliations_address) affadd
WHERE year > 2012
GROUP BY CountryCode
),
ORCID_employment AS (
SELECT
orcid_identifier.path AS orcid_id,
rec.organization.address.country AS CountryCode
FROM `ds-open-datasets.orcid.summaries_2024`,
UNNEST(activities.employments.groups) AS grp,
UNNEST(grp.records) AS rec
WHERE rec.organization.address.country IS NOT NULL
),
ORCID_profiles AS (SELECT
CountryCode,
COUNT(DISTINCT orcid_id) AS orcid_count
FROM ORCID_employment
GROUP BY CountryCode
ORDER BY orcid_count DESC
)
SELECT d.CountryCode, count_dim_res, orcid_count
FROM dimensions_profiles d
FULL OUTER JOIN ORCID_profiles o
ON d.CountryCode = o.CountryCode