Releases: wcmc-its/ReCiter

ReCiter 3.0

13 Mar 16:05
0cc74bc

ReCiter 3.0 Release Notes

Enhanced Scoring Methodology

In previous versions (ReCiter 2.0 and earlier), publication scoring relied heavily on identity-based methods and straightforward weighting, which occasionally failed to reflect nuanced affiliations or the importance signaled by feedback. This limited our ability to dynamically prioritize publications based on user-submitted feedback.

In version 3.0, we've introduced a significant enhancement by employing sigmoid functions to calculate attribute subscores dynamically based on user feedback. For example, if an author has even a small number of accepted publications with a particular affiliation not listed in institutional source systems, subsequent candidate publications with that affiliation will receive higher weighting. The more publications accepted with the same affiliation, the higher the weighting.
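As a rough illustration of the idea, the sketch below maps the number of accepted publications sharing an attribute value (for example, an affiliation) to a bounded subscore with a logistic curve. The function name and the `midpoint`, `steepness`, and `max_score` parameters are assumptions made for this example; the release notes do not state the actual curve parameters ReCiter uses.

```python
import math


def sigmoid_subscore(accepted_count: int,
                     midpoint: float = 3.0,
                     steepness: float = 1.0,
                     max_score: float = 1.0) -> float:
    """Map a count of accepted publications that share an attribute value
    to a subscore between 0 and max_score.

    midpoint, steepness and max_score are hypothetical tuning values.
    """
    return max_score / (1.0 + math.exp(-steepness * (accepted_count - midpoint)))


# More accepted publications with the same affiliation give a higher subscore.
for count in (0, 1, 3, 6):
    print(count, round(sigmoid_subscore(count), 3))
```

With a curve like this, the subscore rises quickly over the first few accepted publications and then levels off, so additional feedback keeps helping without letting one attribute dominate.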

Attributes scored via sigmoid functions now include:

  • Target Author Name

  • Email

  • Institution

  • Organization

  • ORCID

  • ORCID Co-author

  • Co-author ORCID

  • Journal

  • Keyword

New Signals Incorporated:

  • Year of Publication: Candidate articles published before the earliest accepted article will now be increasingly penalized.

  • Count of Accepted Publications: Enhances relevance scoring based on previously accepted articles.

  • Count of Rejected Publications: Improves accuracy by considering articles previously rejected.

  • Author Count: Adjusts scoring by accounting for the increased uncertainty associated with publications having a higher number of authors.

  • Relationship Scoring: Enhances the accuracy by better utilizing the number of known relationships compared to the total number of co-authors. Additionally, first name matching is now required to be explicit and detailed.

  • Penalty for Inferred Target Authors: Added a penalty in cases where there have been 0 or 2+ target authors inferred, addressing a common source of false positives.
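Two of the signals above can be sketched as simple functions. This is a minimal illustration under assumed values: the per-year penalty, the cap, and the fixed penalty size are hypothetical, and the function names are not taken from the ReCiter code base.

```python
def year_penalty(candidate_year: int,
                 earliest_accepted_year: int,
                 per_year: float = 0.1,
                 cap: float = 1.0) -> float:
    """Return a non-positive adjustment that grows the further a candidate
    article predates the earliest accepted article.

    per_year and cap are hypothetical values chosen for illustration.
    """
    years_before = earliest_accepted_year - candidate_year
    if years_before <= 0:
        return 0.0  # published in or after the earliest accepted year
    return -min(cap, per_year * years_before)


def inferred_target_author_penalty(inferred_count: int,
                                   penalty: float = 0.5) -> float:
    """Penalize articles where 0 or 2+ target authors were inferred.

    Exactly one inferred target author is the unambiguous case.
    """
    return 0.0 if inferred_count == 1 else -penalty
```

The cap keeps a very old candidate article from receiving an unbounded penalty; whether the real scoring uses a cap is not stated in these notes.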

Neural Network Integration:

All attribute subscores, along with legacy identity-based scores, now feed into an advanced neural network model, significantly enhancing system accuracy. We have developed two distinct neural network models:

  • Feedback-Driven Model: Activated when feedback is available.

  • No-Feedback Model: Engaged when no prior feedback exists for the author.

These neural networks were fine-tuned through iterative experimentation, leading to an optimized model configuration delivering superior accuracy compared to previous methods.
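The choice between the two models can be pictured as a small dispatch step. The sketch below assumes that "feedback is available" means the author has at least one accepted or rejected publication; that reading, along with the function and parameter names, is an assumption for illustration rather than the documented rule.

```python
from typing import Callable, Sequence

# A model here is any callable that turns a feature vector into a score.
Model = Callable[[Sequence[float]], float]


def select_model(accepted_count: int,
                 rejected_count: int,
                 feedback_model: Model,
                 no_feedback_model: Model) -> Model:
    """Pick the feedback-driven model when any prior feedback exists,
    otherwise fall back to the no-feedback model."""
    has_feedback = (accepted_count + rejected_count) > 0
    return feedback_model if has_feedback else no_feedback_model


# Stand-in models for demonstration only; they are not real networks.
def with_feedback(features: Sequence[float]) -> float:
    return sum(features) / len(features)


def without_feedback(features: Sequence[float]) -> float:
    return max(features)


model = select_model(4, 1, with_feedback, without_feedback)
print(model([0.2, 0.8]))
```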

Additional Improvements:

  • Improved Performance: Enhanced overall system performance by optimizing lookup processes and addressing inefficiencies.

  • No Results Fix: Previously, if a user's name did not exist in the eSearch API, results incorrectly defaulted to the first initial search (e.g., "M[au]"). This issue has been resolved for strict searches, lenient searches, and searches involving compound names.

  • Identity Checks: Added checks that the mandatory fields firstName, lastName, and firstInitial are present in the identity object.

  • Docker Hub Credentials: Included Docker Hub credentials in the Dockerfile to avoid the "image pull limit" error.

  • Degree Year Discrepancy Score: Improved the logic and effectiveness of the Degree Year Discrepancy scoring.
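The identity check described above could look roughly like the following. This sketch treats the identity as a flat dictionary, which is a simplification assumed for the example; the actual identity model may nest these fields differently.

```python
from typing import Any, Dict, List

# Field names come from the release notes; the flat layout is an assumption.
REQUIRED_FIELDS = ("firstName", "lastName", "firstInitial")


def missing_identity_fields(identity: Dict[str, Any]) -> List[str]:
    """Return the mandatory fields that are absent, None, or blank."""
    return [
        field for field in REQUIRED_FIELDS
        if not str(identity.get(field) or "").strip()
    ]


identity = {"firstName": "Ada", "lastName": " "}
problems = missing_identity_fields(identity)
if problems:
    print("Identity rejected, missing:", ", ".join(problems))
```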

Related Repositories:

To fully utilize ReCiter 3.0, you must update the following related repositories:

This update marks a major step forward in refining publication matching accuracy and significantly boosts the effectiveness of user feedback within ReCiter.

ReCiter 2.1.5

04 Apr 02:57
dd525c3

Added ORCID ID to ReCiter Identity Model (wcmc-its/ReCiter-Identity-Model#7)
Fixed issue #527

ReCiter 2.1.4

01 Sep 11:22
6630751

Outputs the "Equal Contribution" attribute (equalContrib) at the author level. When set to "yes", this attribute indicates that the authors carrying that designation should share credit. We intend to use it to identify co-senior and co-first authors in publication reporting.
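One possible way a reporting tool might consume this attribute is sketched below. It assumes that co-first authors are the run of flagged authors at the start of the author list and co-senior authors are the run of flagged authors at the end; that interpretation, and the "name" key, are assumptions for illustration and not part of the ReCiter output specification.

```python
from typing import Dict, List, Tuple


def co_first_and_co_senior(authors: List[Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Split flagged authors into co-first and co-senior groups.

    authors is ordered as on the publication. Each entry has a hypothetical
    "name" key and may carry equalContrib set to "yes".
    """
    def leading_run(seq) -> List[str]:
        run = []
        for author in seq:
            if author.get("equalContrib") != "yes":
                break
            run.append(author["name"])
        return run

    co_first = leading_run(authors)
    # Walk from the last author backwards, then restore publication order.
    co_senior = list(reversed(leading_run(reversed(authors))))
    return co_first, co_senior


authors = [
    {"name": "A", "equalContrib": "yes"},
    {"name": "B", "equalContrib": "yes"},
    {"name": "C"},
    {"name": "D", "equalContrib": "yes"},
    {"name": "E", "equalContrib": "yes"},
]
print(co_first_and_co_senior(authors))
```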

ReCiter 2.1.3

06 Apr 19:23
133c502

ReCiter 2.1.2

15 Dec 20:43
02f0476
  • #485 Fix log4j vulnerability
  • #486 Fix squiggly filters
  • #484 Bug fixes for feature generator by group. The feature generator by group API now accepts a list of unique IDs as a parameter. When this parameter is supplied, all other filtering parameters are ignored. A new property in application.properties sets the maximum allowed number of UIDs so that the performance of the API is not impacted.
  • Suppress antlr runtime warnings
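The UID-list behavior in #484 can be pictured with the sketch below: when a list of UIDs is supplied it takes precedence over every other filter, and a configurable ceiling guards against oversized requests. The function name, the `max_uids` parameter, and its default are hypothetical; the actual property name in application.properties is not given in these notes.

```python
from typing import Dict, List, Optional


def resolve_group_filters(uids: Optional[List[str]],
                          other_filters: Dict[str, str],
                          max_uids: int = 100) -> Dict[str, object]:
    """Return the filters that should actually be applied to a group request.

    max_uids stands in for the configurable limit; 100 is an assumed default.
    """
    if uids:
        unique_uids = list(dict.fromkeys(uids))  # de-duplicate, keep order
        if len(unique_uids) > max_uids:
            raise ValueError(
                f"{len(unique_uids)} UIDs supplied; the limit is {max_uids}"
            )
        return {"uids": unique_uids}  # all other filters are ignored
    return dict(other_filters)


print(resolve_group_filters(["abc1001", "xyz2002", "abc1001"], {"department": "Surgery"}))
```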

ReCiter 2.1.1

23 Aug 13:57
1e1c871

This release includes a number of bug fixes and enhancements, especially improvements to the nameScoring strategy.

  • #474 Name scoring strategy bug fix for mismatched names
  • #473 Addition of more meshMajor Terms
  • #455 Capture lookup_type in esearchresults
  • #370 Fix nameScoring bugs
  • #322 Output email even if it's not a match
  • #454 Candidate article count is wrong
  • #444 Update Feature Generator API so it returns count of pending publications for a scholar

ReCiter 2.1.0

04 Feb 17:29
e2b5966
  • The Esearchresults table now includes lookupType. This allows us to more reliably identify the count of candidate articles for the articleCountStrategy in cases where ONLY_NEWLY_ADDED_PUBLICATIONS is used. #455
  • For articleCountStrategy, candidate article count now relies on distinct count of all retrieved publications except those from the gold standard retrieval strategy. #454
  • Time-based lookups against PubMed were only looking for articles based on date added to Entrez. This caused some publications to be missed. Now we're searching for that or date added to PubMed. #450
  • Update Swagger from 2.0 → 3.0. #447
  • Update Java 8 → 11. #446
  • Environment variable JAVA_OPTS was added to the Docker image to specify the Java heap size (see https://github.com/wcmc-its/ReCiter/blob/a3d5d4665e8692853ca69f2db0caba0eb56f557d/kubernetes/k8-deployment.yaml#L81-L82) and also to the Dockerfile (see https://github.com/wcmc-its/ReCiter/blob/a3d5d4665e8692853ca69f2db0caba0eb56f557d/Dockerfile#L8).
  • Output the top keywords and their counts for accepted publications. This will be used in Publication Manager. #442
  • Output count of pubs where userAssertion = NULL as attribute enhancement. This will be used in Publication Manager. #399
  • ReCiter Identity data model was updated to v2.0.8 (wcmc-its/ReCiter-Identity-Model#3) to include primaryOrganizationalUnit, primaryInstitution, startDate, and endDate
  • ReCiter Article data model was updated to v2.0.16. This adds the ORCID identifier, affiliations, and emails for authors, as well as countOfPendingPubs and topArticleKeywords
  • Fixed error running DynamoDb locally in Docker. #452
  • Added a health check path for the application: use <protocol>://<host>:<port>/reciter/ping
  • Upgraded all dependencies to the latest stable releases
  • AWS Codebuild images were also updated to use Java 11 and latest release
  • Docker image was updated to use adoptopenjdk/openjdk11:alpine-jre for security
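The revised candidate article count (#454) amounts to a distinct count across retrieval strategies with the gold standard strategy left out. The sketch below shows that idea; the strategy names used as dictionary keys are placeholders assumed for the example, not the identifiers ReCiter stores.

```python
from typing import Dict, Iterable


def candidate_article_count(pmids_by_strategy: Dict[str, Iterable[int]],
                            gold_standard_key: str = "GoldStandardRetrievalStrategy") -> int:
    """Count distinct PMIDs retrieved by every strategy except the gold standard.

    gold_standard_key is a hypothetical label for the excluded strategy.
    """
    distinct_pmids = set()
    for strategy, pmids in pmids_by_strategy.items():
        if strategy != gold_standard_key:
            distinct_pmids.update(pmids)
    return len(distinct_pmids)


retrieved = {
    "LastNameFirstInitial": [101, 102, 103],
    "Email": [103, 104],
    "GoldStandardRetrievalStrategy": [101, 999],
}
print(candidate_article_count(retrieved))  # 4 distinct PMIDs: 101, 102, 103, 104
```

Using a set means an article found by several strategies is counted once, which is what "distinct count" implies.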

ReCiter 2.0.0

21 Jul 19:24
0ec26df
  • Create a Multi-User Feature Generator API, which outputs pending articles for groups of scholars. This can be used in Publication Manager to quickly review pending publications for large groups of people. #330
  • Feature Generator API now outputs:
    • ORCID identifiers associated with authors #336
    • an identifier associated with each cluster #365
    • MeSH terms #402
  • More powerful use of the year when scholars received their degree. #391
  • Identity API returns list of scholars via S3-based cache, significantly improving performance of Publication Manager. #400
  • Support for Kubernetes, an open-source system for automating deployment, scaling, and management of containerized applications
  • Bug fix: Analysis objects are in both DynamoDB Analysis table and s3, and should only be in s3 #392
  • Bug fix: incremental lookup
  • Updated timeout settings
  • Add performance metrics for s3 caching
  • Updated article and identity models in Maven Central

ReCiter 1.2

01 Sep 14:35
e4e339b
  • Evidence weights in application.properties are now optimized according to a support vector machine analysis
  • Created a userFeedback service for feedback from Publications Manager
  • Added an API controller in Swagger for ReCiter Publications Manager
  • Fixed a bug in common affiliation strategy
  • Bucket names in S3 are dynamically created
  • Fixed affiliation count of non-target authors. #361

ReCiter 1.1

12 Jun 20:51
f9e2c33

Release notes for ReCiter 1.1

  • Use name to infer gender of targetAuthor and identity. Downweight cases where there's a difference in inferred gender. #357
  • Tracks a person’s original name as recorded in a source system and outputs it in the feature generator as opposed to using the sanitized/standardized version of that name. #317
  • Tracks an organization’s original name as recorded in a source system and outputs it in the feature generator as opposed to using the standardized version and/or synonym of that name. #356
  • Single matching departmental affiliation, no matter the synonyms, should only count once. #326
  • Update articleCountScoringStrategy so it better accounts for retrieval counts in strict mode. This way people with more common names get lower scores for articleCountScoringStrategy, even though their lookups are done in strict mode. #278
  • Penalize relationship scores for each non-match. This will address cases where there are a lot of co-authors and just by sheer chance some of them have a known relationship match. #341
  • Added ScienceMetrix journalDepartmentCategory scores. This covers the 250+ most common organizational affiliations in PubMed and their scores for all 180 subfields. #352
  • The number of organizational unit synonyms has been expanded. In many cases, it includes common translations, e.g., Cirugia (Surgery). This expands the coverage of journalDepartmentCategory scoring. #354
  • journalDepartmentCategory scoring should pick most favorable match. This is useful in cases where a person has multiple organizational affiliations, one of which scores highly. #355
  • Improved method for identifying the target author. It turns out an author’s email is often not assigned to the person behind that email. #185
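Two of the scoring changes in this release (#341 and #355) are easy to picture in code. The sketch below is illustrative only: the weights are hypothetical, and the function names are not taken from ReCiter.

```python
from typing import Dict


def relationship_score(match_count: int,
                       coauthor_count: int,
                       match_weight: float = 1.0,
                       non_match_penalty: float = 0.1) -> float:
    """Reward known-relationship matches and subtract a small penalty for
    every co-author who is not a match (weights are assumed values)."""
    non_matches = max(0, coauthor_count - match_count)
    return match_count * match_weight - non_matches * non_match_penalty


def best_category_score(scores_by_affiliation: Dict[str, float]) -> float:
    """Pick the most favorable journalDepartmentCategory score across a
    person's organizational affiliations; 0.0 when there are none."""
    return max(scores_by_affiliation.values(), default=0.0)


# Two matches among 3 co-authors score higher than two among 50.
print(relationship_score(2, 3) > relationship_score(2, 50))
print(best_category_score({"Surgery": 0.2, "Medicine": 0.9}))
```

The per-non-match penalty is what keeps a chance match on a paper with many co-authors from counting as much as the same match on a small paper.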