Why we merge OpenAlex author profiles

OpenAlex sometimes splits the same researcher across two or more author IDs. Author Trends tries to detect those duplicates and combine them into one profile so the trend charts reflect a full career, not a sliced one. Here's how the merge works and where it can still get things wrong.

Image: the classic three-Spider-Men-pointing meme, with three identical figures pointing at each other.
OpenAlex sometimes gives one researcher several IDs. Each one swears it's the real Spider-Man.

The problem in one example

Imagine a fictional aerospace researcher named John Doe. He published his first paper in 2002 at the Indian Institute of Technology Bombay, then continued at the same institution. Around 2012 he moved to MIT and started using a slightly different byline: "J. Doe" rather than "John Doe". He also registered a second ORCID along the way without realising the first existed.

From OpenAlex's perspective there are now two distinct author records:

Neither profile alone tells the full story. A11111111 misses the MIT years; A22222222 misses the IIT Bombay years. Both profiles have low h-indices because their citation history is split. If a user clicks either profile, the country / institution / collaborator charts will be incomplete and the timeline will look like a researcher who suddenly stopped or suddenly started, with no continuity.

The heuristic

For every pair of search results returned by OpenAlex, Author Trends asks: do these look like the same person? We answer "yes" only when all three of these signals agree:

  1. Name tokens overlap. We tokenize each display name, drop initials (any token shorter than 3 characters) and check how many remaining tokens are shared. We need at least 2.
    John Doe → {john, doe}
    J. Doe → {doe}: only one shared token, so this pair falls short. Hmm.
    But if both bylines had been John Doe vs John A. Doe: {john, doe} ∩ {john, doe} has 2 tokens. Pass.
  2. Affiliations overlap. Normalize each institution name (lowercase, drop "the", "of", "and", "university", etc.) and count how many are shared. We need at least one. In our John Doe example: both profiles list "Indian Institute of Technology Bombay" during the 2012 overlap year. Pass.
  3. Research concepts overlap. OpenAlex tags each work with concepts (Aerospace engineering, Mechanics, …). We need at least one shared concept across the two profiles. Both Does publish in Aerospace engineering. Pass.
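
The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not the code Author Trends actually runs: the profile dictionaries (display_name, affiliations, concepts), the function names, and the exact stop-word list are assumptions made for the example. The article only names four stop words and says "etc."

```python
import re

# Assumed stop-word list: the article names these four and adds "etc."
AFFILIATION_STOPWORDS = {"the", "of", "and", "university"}


def name_tokens(display_name: str) -> set:
    """Lowercase the name, split on non-letters, drop initials (< 3 chars)."""
    tokens = re.findall(r"[^\W\d_]+", display_name.lower())
    return {t for t in tokens if len(t) >= 3}


def normalize_affiliation(name: str) -> str:
    """Lowercase an institution name and drop the stop words."""
    words = re.findall(r"[^\W\d_]+", name.lower())
    return " ".join(w for w in words if w not in AFFILIATION_STOPWORDS)


def looks_like_same_person(a: dict, b: dict) -> bool:
    """Return True only when all three signals agree (hypothetical profile shape)."""
    shared_names = name_tokens(a["display_name"]) & name_tokens(b["display_name"])

    affs_a = {normalize_affiliation(x) for x in a["affiliations"]}
    affs_b = {normalize_affiliation(x) for x in b["affiliations"]}
    # Ignore names that normalize to nothing (e.g. "The University").
    shared_affs = (affs_a & affs_b) - {""}

    shared_concepts = set(a["concepts"]) & set(b["concepts"])

    return (
        len(shared_names) >= 2
        and len(shared_affs) >= 1
        and len(shared_concepts) >= 1
    )
```

With this sketch, "John Doe" and "John A. Doe" pass the name check, while "John Doe" and "J. Doe" share only the token doe and are rejected, which matches the walkthrough in rule 1.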

Two profiles can also be merged by an identical ORCID alone: that's the strongest possible signal, and it short-circuits the rules above.
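
As a rough sketch of that short-circuit, the ORCID comparison can run before any other test. The function names, the "orcid" dictionary key, and the normalization step (stripping a URL prefix and upper-casing) are assumptions for illustration; the article does not say how Author Trends compares ORCIDs. The heuristic argument stands in for whatever pairwise test is used otherwise.

```python
from typing import Callable, Optional


def normalize_orcid(orcid: Optional[str]) -> Optional[str]:
    """Reduce an ORCID to its bare ID so URL and bare forms compare equal (assumed step)."""
    if not orcid:
        return None
    bare = orcid.strip().rsplit("/", 1)[-1].upper()
    return bare or None


def should_merge(a: dict, b: dict, heuristic: Callable[[dict, dict], bool]) -> bool:
    """Merge on an identical ORCID; otherwise defer to the pairwise heuristic."""
    orcid_a = normalize_orcid(a.get("orcid"))
    orcid_b = normalize_orcid(b.get("orcid"))
    # Two missing ORCIDs are not a match: both sides must actually have one.
    if orcid_a is not None and orcid_a == orcid_b:
        return True
    return heuristic(a, b)
```

Note the guard against two empty ORCIDs: without it, every pair of profiles lacking an ORCID would be treated as identical.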

Clustering is greedy single-linkage: if profile A merges with B and B merges with C, the three of them collapse into one cluster even if A and C don't independently pass the test.
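
One common way to get this transitive behaviour is a union-find (disjoint-set) structure over the profile IDs. The sketch below is an assumption about how such clustering could be written, not a description of the actual Author Trends code; cluster_profiles and its should_merge callback are hypothetical names.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def cluster_profiles(
    ids: Sequence[str],
    should_merge: Callable[[str, str], bool],
) -> List[List[str]]:
    """Group IDs so that any chain of pairwise merges ends up in one cluster."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Test every pair once; a passing pair links the two clusters.
    for a, b in combinations(ids, 2):
        if should_merge(a, b):
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

If A links to B and B links to C, all three end up under the same root even though A and C were never linked directly. That is also the known weakness of single-linkage: one wrong link can chain two unrelated people together, which is one reason to keep the pairwise test strict.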

What the merged profile looks like

Once a cluster is formed, the profile with the highest works_count becomes the "primary": that ID is what the URL and the cache key use. Then:

Where the merge can still fail

The heuristic is intentionally conservative: we'd rather show two separate profiles than wrongly fuse two different people who share a common name. But that means we sometimes miss real duplicates, especially when:

If you spot a case where the merge should have happened (or shouldn't have), please tell us through the feedback form. We use these reports to tune the heuristic.

And: if your profile wrongly includes papers you did not publish, or if other OpenAlex profiles contain your papers, please contact OpenAlex directly using this form to have it corrected. Author Trends just renders what OpenAlex returns, so fixing the attribution at the source is the cleanest path.

A note on transparency

Every merged card carries an expandable "Source profiles" section listing each member ID, its ORCID, and its individual works count. Nothing is hidden behind the merge: if the result looks wrong, you can always click through to the original OpenAlex records and verify.
