Machine Learning Highlights for Rich Context

Paco Xander Nathan
Jan 6, 2020

We’ve had a busy 2019 at the NYU Coleridge Initiative and at other organizations partnering on Rich Context. Now we’re announcing a machine learning competition, with full details at the https://github.com/Coleridge-Initiative/rclc repo on GitHub. This competition focuses on entity linking: inferring, from the open access PDF of a research publication, which datasets were used in that research.

The following provides background about how the ML models from the competition fit into the broader scope of Rich Context, as well as how the corpus gets developed, plus ways in which the competition has been improved.

network diagram for RCLC corpus v.1.0.8

Machine Learning Competition

We’ve gathered substantial feedback and key learnings about Rich Context through:

Plus lots of networking within the research community: presenting at conferences, giving guest lectures and seminars at universities, meeting privately with AI start-ups, and so on.

Important takeaways from the first ML competition included suggestions about how to improve it. Consequently, we have reworked the competition; see the newly relaunched site at https://github.com/Coleridge-Initiative/rclc

In this iteration of the competition we’re focusing on the supervised learning problem first, specifically entity linking into the knowledge graph (KG). Datasets are described more generally, and the training/testing data is carefully curated, with several unit tests added to improve consistency and data quality.
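To make the task concrete, here is a minimal sketch of entity linking as naive alias matching against known dataset names. The dataset IDs and aliases are hypothetical stand-ins for KG entries, and real competition entries use far richer models than string matching:

```python
import re

# Hypothetical fragment of the KG: dataset IDs mapped to known name aliases.
KNOWN_DATASETS = {
    "dataset-042": ["Survey of Income and Program Participation", "SIPP"],
    "dataset-077": ["National Longitudinal Survey of Youth", "NLSY"],
}

def link_datasets(pub_text: str) -> set:
    """Return the IDs of datasets whose aliases appear in the publication text."""
    linked = set()
    for dataset_id, aliases in KNOWN_DATASETS.items():
        for alias in aliases:
            if re.search(r"\b" + re.escape(alias) + r"\b", pub_text):
                linked.add(dataset_id)
                break
    return linked

print(link_datasets("We analyze the 2014 SIPP panel to estimate program take-up."))
# {'dataset-042'}
```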

Change: We follow the contemporary pattern of state-of-the-art (SOTA) leaderboards, such as NLP-Progress by Sebastian Ruder.

  • Results are now easier to view and competing teams can learn from each other over time.

Change: Instead of reserving a private test holdout, the entire corpus is published as a gold standard, based on publications that have open-access PDFs available, and hosted publicly on GitHub.

  • This allows more flexibility for people to engage, plus a broader range of people giving feedback and suggestions.
  • We re-run previously trained ML models on subsequent releases to check for overfitting against the larger, previously unseen data.
  • Teams cooperate via GitHub issues to flag problems in the corpus, identify metadata errors that require correction, share utility code, and help refine the rules of the ongoing competition.

Change: The corpus is represented in JSON-LD format, using open standards for metadata to represent the public version of our KG (a brief sketch follows the list below).

  • This allows us to leverage open source tools for validation, visualization, etc.
  • Teams spend less time on coding for data access, and get to spend more time on ML modeling.
  • There is substantially less overhead required to evaluate newly submitted ML models.
  • The corpus can be used directly with the new Rich Context features in JupyterLab.
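For a rough sense of the representation, here is a hedged sketch (written as Python for consistency with our tooling) of what a JSON-LD record linking a publication to a dataset might look like. The identifiers, field choices, and vocabulary here are simplified illustrations, not the exact rclc schema:

```python
import json

# A simplified, hypothetical JSON-LD record; the actual rclc corpus schema may differ.
record = {
    "@context": {"@vocab": "https://schema.org/"},
    "@id": "https://example.org/publication/pub-00123",
    "@type": "ScholarlyArticle",
    "name": "An example study of program participation",
    "isBasedOn": [
        {
            "@id": "https://example.org/dataset/dataset-042",
            "@type": "Dataset",
            "name": "Survey of Income and Program Participation",
        }
    ],
}

print(json.dumps(record, indent=2))
```

Because such a record is plain JSON-LD, off-the-shelf tools can parse, validate, and visualize it without custom loaders.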

Change: The evaluation metric now focuses on precision, using an adaptation of Top K that allows for multiple datasets per publication (a scoring sketch follows below).

  • In contrast to the relatively noisy results (high recall) from the first RCC competition, higher precision in the ML models helps produce better recommendations for end users.
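As one plausible reading of that metric (a sketch, not the official evaluation script), precision can be computed per publication over its top-K predicted datasets and then averaged across the corpus:

```python
def precision_at_k(predicted: list, gold: set, k: int = 5) -> float:
    """Fraction of the top-k predicted dataset IDs that appear in the gold set."""
    top_k = predicted[:k]
    if not top_k:
        return 0.0
    return sum(1 for d in top_k if d in gold) / len(top_k)

def corpus_precision(predictions: dict, gold_links: dict, k: int = 5) -> float:
    """Average precision@k across all publications in the corpus."""
    scores = [
        precision_at_k(predictions.get(pub_id, []), gold, k)
        for pub_id, gold in gold_links.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical example: one publication linked to two gold datasets.
gold_links = {"pub-00123": {"dataset-042", "dataset-077"}}
predictions = {"pub-00123": ["dataset-042", "dataset-999"]}
print(corpus_precision(predictions, gold_links))  # 0.5
```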

As a result, the initial leaderboard entries improved substantially:

  • 78% precision for a baseline entry (LARC — supervised learning)
  • 63% precision for a model which can identify new datasets outside of the training set (KAIST — unsupervised learning)

Instead of mixing several goals in the competition, we now structure the overall process as a workflow: separation of concerns allows the results of one stage of ML modeling to feed into subsequent stages, while the performance of each stage can be measured and analyzed relatively independently. The identification of fine-grained details gets deferred to later stages of ML modeling: author reconciliation, keyword mesh reconciliation, distinguishing dataset distributions (point releases, year ranges, geographic distributions, etc.), link prediction to impute missing metadata, and so on.

This workflow architecture allows for feedback loops, e.g., opportunities to incorporate multiple forms of human-in-the-loop (HITL) interaction to improve the KG. The workflow architecture is also much better suited to leverage distributed systems at scale, as the scope of Rich Context expands to include multiple agencies and research areas.

Outcomes from the Rich Context Workshop

We were fortunate to have several ML practitioners with the appropriate expertise participating in the workshop and finding ways to collaborate on Rich Context. The following people have already been developing models, reviewing the corpus, helping refine evaluation metrics and the testing process, contributing code to support the competition, and so on: Philips Kokoh Prasetyo @ LARC; Haritz Puerto and Giwon Hong @ KAIST; Daniel Vila Suero @ Recognai; John Kaufhold @ Deep Learning Analytics; John Bohannon @ Primer AI. We will be working closely with them, notifying each of them as new entities get added into our KG and the corpus size and dimension ramp up, so that we continue to adapt and evaluate new ML models for Rich Context needs.

Another point the teams stressed is the importance of structuring the text extracted from publication PDFs. Whether text comes from a “Methods” section versus a “Conclusion” section has significant bearing on feature engineering for the ML models, i.e., ultimately how well datasets can be identified. We’re working with these teams to develop a common open source library for extracting semi-structured text from publication PDFs to improve this feature engineering; meanwhile we are using SPv2.
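As a rough illustration of why section structure matters (a hypothetical sketch, not the shared library or SPv2), candidate dataset mentions can be tagged with the section they occur in, so that a model can weight a mention in a “Data” or “Methods” passage differently from one in a “Conclusion”:

```python
import re

# Crude section segmentation: header keywords at the start of a line.
SECTION_HEADERS = re.compile(
    r"^(abstract|introduction|data|methods|results|conclusion)s?\b",
    re.IGNORECASE | re.MULTILINE,
)

def tag_sections(text: str):
    """Yield (section_name, passage) pairs from crudely segmented publication text."""
    matches = list(SECTION_HEADERS.finditer(text))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        yield m.group(1).lower(), text[start:end].strip()

doc = """Introduction
We study program take-up.
Data
We use the SIPP 2014 panel.
Conclusion
Take-up is higher than expected."""

for section, passage in tag_sections(doc):
    print(section, "->", passage)
```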

The first component of our ML leaderboard competition has been to rework the challenge of identifying datasets from publications. Other upcoming components will evaluate different approaches for recommender systems, leveraging our KG to produce results for the end users of Rich Context. Initially those recsys results will be handled as cached, pre-computed recommendations imported into the Data Stewardship module of the ADRF, closely followed by support for the automated data inventories that agencies provide to the general public.

Note that research in natural language processing has undergone significant transformations since early 2018, due to the rise of embedded language models and “transformers”, e.g., the popularity of ELMo, BERT, XLNet, GPT-2, etc. At the time of the first RCC competition, these deep learning model architectures had been available for less than a year and were not yet widely understood. We hope that by hosting an ongoing public leaderboard competition with an open corpus we can engage a broader range of researchers working with state-of-the-art ML modeling, as advances in natural language continue to accelerate. In particular, the people noted above are experts in this area, and their contributions will help initiate more advanced approaches to modeling.

Knowledge Graph

Ultimately our work developing a knowledge graph feeds into recommender systems to produce results for the end users of Rich Context. One side benefit is that this process also helps identify metadata errors in other systems used along the way, through which we can suggest corrections to benefit the research community in general.

Platforms exist which provide discovery services and metadata related to datasets. For example, Google Dataset Search helps researchers locate online datasets and offers metadata such as dataset provider links. However, that approach does not surface the links between datasets and publications, which is foundational for our KG work in Rich Context.

Other discovery services such as the Scholix initiative (including Crossref, DataCite, OpenAIRE) provide metadata exchange about research objects (e.g., some links between scholarly literature and data), although their coverage is limited. Note that even in cases where there is good coverage for the published research in a particular field, these APIs do not provide the same coverage as each other — such that one must mix and match results. To date there is no single discovery service for searching the links among datasets, publications, people, topics, etc., in general.

Representation Strategy

A more robust approach for identifying metadata about linked data blends four complementary strategies:

  1. Running a small, well-trained team to perform manual review of publications, extracting and curating metadata about datasets, authors, research areas, etc., to supplement the metadata available via discovery services. This approach establishes ground truth for the KG, and also allows us to incorporate HITL approaches to resolve metadata errors (manual override).
  2. Federating searches across multiple discovery service APIs, applying custom business logic to reconcile disagreements among their results. These results can then augment and expand upon the manual curation efforts described above.
  3. Leveraging the public ML competition to research how to use ML models to infer metadata from research publications, and other ML challenges identified in our KG workflow.
  4. Using human-in-the-loop (HITL) feedback from authors and other experts to confirm/reject/augment the metadata which ML models have inferred.

Manual Curation Team

Our focus throughout 2019 Q3-Q4 has been to build a small team at NYU-CI responsible for ingesting metadata into the KG and performing the required manual curation.

  • We’ve established an internal process for 5–10 people working simultaneously, plus training for new staff.
  • They leverage discovery services to search for publications associated with specific datasets, and confirm by reading the papers.
  • They use Git to manage their collaborative process (updates managed in private branches, pull requests, code reviews, GitHub issues, tagged releases, etc.) and develop unit tests for coverage of edge cases and identified failure modes, to ensure data quality (a sketch of one such check appears after this list).
  • We’ve started a more formal, longer-term review process for dataset definitions, names, etc., since that remains somewhat of an art and not fully a science.
  • We’re beginning to leverage crosswalks and other resources that become available, especially as researchers share their materials (e.g., postdocs who have personal “maps” of key datasets).
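The following is a hedged sketch of the kind of data-quality check involved, written in pytest style with a hypothetical corpus file name and field names rather than the team’s actual test suite:

```python
# Hypothetical pytest-style checks; the file name "corpus.jsonld" and the
# field names are illustrative only, not the team's actual test suite.
import json

def load_corpus(path="corpus.jsonld"):
    with open(path) as f:
        return json.load(f)

def test_publications_have_required_fields():
    corpus = load_corpus()
    for pub in corpus.get("publications", []):
        assert pub.get("title"), f"missing title: {pub.get('@id')}"
        assert pub.get("open_access_url"), f"missing PDF link: {pub.get('@id')}"
        assert pub.get("datasets"), f"no linked datasets: {pub.get('@id')}"

def test_dataset_links_resolve():
    corpus = load_corpus()
    known_ids = {d["@id"] for d in corpus.get("datasets", [])}
    for pub in corpus.get("publications", []):
        for dataset_id in pub.get("datasets", []):
            assert dataset_id in known_ids, f"dangling dataset link: {dataset_id}"
```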

Sources this team uses for building out the KG include:

  • New agency datasets as they get ingested into the ADRF.
  • Metadata catalogs incorporated from agency library systems.
  • Linked data from the first RCC competition (5000 publications), which has been validated manually.

Federated APIs

To federate searches across multiple discovery service APIs we’ve developed the open source richcontext.scholapi Python library. This library fills a gap to help automate metadata exchange among the platforms for scholarly infrastructure and their APIs. See the https://github.com/NYU-CI/RCApi repo. This open source project serves multiple functions:

  • We use this library at several points within our workflow to expand our KG.
  • We invite collaborations with the discovery services, other developers, and so on.
  • It establishes a location for collecting custom business logic to reconcile disagreements among API results.
  • Potentially, we can use this to provide feedback and metadata updates/corrections to the different discovery services.

Current API integrations in the library include: Unpaywall, dissemin, Crossref, PubMed, EuropePMC, OpenAIRE, RePEc, Dimensions, and Semantic Scholar. We plan to integrate several others soon, including DataCite and Scholexplorer — plus parsing the structured metadata that NIH and others include in their search results.
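To illustrate the kind of reconciliation logic this involves, here is a hedged sketch of a federated lookup. The wrapper functions and the precedence rule are hypothetical stand-ins, not the actual richcontext.scholapi API:

```python
# Hypothetical sketch of federated metadata lookup and reconciliation;
# these stubs stand in for real API wrappers.

def lookup_crossref(doi: str) -> dict:
    # stub: a real wrapper would call the Crossref REST API
    return {"title": "An Example Study", "journal": "J. Example Studies", "year": 2019}

def lookup_openaire(doi: str) -> dict:
    # stub: a real wrapper would call the OpenAIRE API
    return {"title": "An example study", "open_access_url": "https://example.org/paper.pdf"}

def reconcile(doi: str) -> dict:
    """Merge per-service results, preferring earlier services when fields disagree."""
    merged: dict = {}
    for source in (lookup_crossref, lookup_openaire):
        result = source(doi)
        for field, value in result.items():
            merged.setdefault(field, value)  # first non-missing value wins
    return merged

print(reconcile("10.9999/example.doi"))
```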

Through regular use of this library, we have been collecting telemetry about the performance of these discovery services, such as search success rates, query response times, etc. We are developing a paper about comparative analysis and benchmarking for these APIs.

Workflow Architecture

The manual and automated steps in our process have been integrated into a workflow architecture. This allows for separation of concerns, better use of distributed systems to scale the required processing, plus more effective software engineering practices such as continuous integration (CI). Each of the primary entities in the KG is managed within its own Git repository. Then a centralized graph/workflow Git repository coordinates those other repositories as submodules. See https://github.com/NYU-CI/RCGraph

One of the more important aspects of this workflow architecture is to provide affordances for manual override. Given that, we can integrate the APIs for several less-than-perfect discovery services and still produce reliable results. Also, we’re identifying what to send back to the discovery services as suggested corrections when they are re-publishing metadata with errors.

In terms of leveraging distributed systems for parallel processing, our work has benefited greatly from integrating Ray. It fits well into our Python-based tech stack, requiring only a few additional lines of code for integration. The unification of the actor abstraction with the task-parallel abstraction provides a powerful yet succinct means for parallelizing Python workflows for machine learning. See the excellent article “Ray for the Curious” by Dean Wampler for more details and coding examples.
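For example, a metadata-enrichment step can be fanned out across cores with Ray in just a few lines. This is a minimal sketch, and `enrich_publication` is a hypothetical stand-in for an actual workflow step:

```python
import ray

ray.init()

@ray.remote
def enrich_publication(pub_id: str) -> dict:
    # Hypothetical stand-in for a workflow step that queries discovery
    # services and returns enriched metadata for one publication.
    return {"id": pub_id, "status": "enriched"}

pub_ids = ["pub-00123", "pub-00124", "pub-00125"]
futures = [enrich_publication.remote(pub_id) for pub_id in pub_ids]
results = ray.get(futures)  # runs the tasks in parallel across available workers
print(results)
```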

Current Corpus

Our current KG has incorporated metadata from nearly 100 different projects at over a dozen participating agencies, initially as preparations for the ADRF classes:

  • ~4000 publications linked to datasets
  • ~600 datasets formally described
  • ~300 providers formally described
  • ~1000 journals formally described

As of 2020 Q1 we’re now beginning to include authors, keywords, projects, stewards, and other entities in this graph. Whenever possible, we leverage persistent identifiers for these entities.

See github.com/Coleridge-Initiative/rclc/wiki/Corpus-Description for more details.

The latest release of the public version of our corpus (v.0.1.8, 2020–01–03) includes ~1500 publications linked to datasets which have open access PDFs available. By definition, the public version is a subset of the overall KG, since it has the additional constraint that each publication must have an open access PDF available.

Human-In-The-Loop

A partnership with RePEc is in progress, where metadata inferred by ML models can be confirmed, rejected, or augmented directly by the authors of specific research publications. See the “Human-in-the-loop AI for Scholarly Infrastructure” article and also “New initiative to help with discovery of dataset use in scholarly work” by Christian Zimmerman on the RePEc blog for details.

This work is currently collecting responses from authors, which we will use in a semi-supervised learning feedback loop to improve the corpus and the ML models in the competition.
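As a rough sketch of how that feedback could flow back into the corpus (hypothetical field names and label scheme, not the actual RePEc integration), confirmed links can be promoted to gold-standard entries while rejected links become negative examples:

```python
# Hypothetical sketch of folding author feedback into the corpus;
# field names and the label scheme are illustrative only.

def apply_author_feedback(inferred_links: list, responses: dict):
    """Split ML-inferred publication->dataset links by author response."""
    confirmed, rejected, unreviewed = [], [], []
    for link in inferred_links:
        response = responses.get((link["pub_id"], link["dataset_id"]))
        if response == "confirm":
            confirmed.append(link)      # promote to gold-standard corpus
        elif response == "reject":
            rejected.append(link)       # keep as a negative training example
        else:
            unreviewed.append(link)     # leave for later review
    return confirmed, rejected, unreviewed

links = [{"pub_id": "pub-00123", "dataset_id": "dataset-042"}]
responses = {("pub-00123", "dataset-042"): "confirm"}
print(apply_author_feedback(links, responses))
```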

Other aspects of HITL also fit into Rich Context. These include:

  • Our manual curation team and manual override of metadata from discovery services.
  • ML competition teams identifying and suggesting corrections to the corpus via GitHub.
  • Feedback and annotations collected from researchers using ADRF.
  • The public engaged through automated data inventories for agencies (“citizen scientists”).

We will continue to adapt other HITL feedback mechanisms into the KG work for Rich Context, and meanwhile collaborate with the discovery services to help update their metadata sources.

Other Integrations

We are evaluating whether we could provide a Scholix hub for social science. While there are already Scholix hubs for life sciences, physical science, etc., a gap exists for social science research and Rich Context could fill that. Mostly this requires:

  • publishing our KG according to the Scholix standard for metadata exchange
  • participation in the Scholix community and evolving standards

To that end, at the Rich Context Workshop we hosted several of the key experts involved in discovery services and scholarly infrastructure. We will be collaborating with them to:

  • Integrate with more discovery services via our open source library, to obtain better metadata for enriching our KG.
  • Establish means for sending metadata corrections into other discovery services.

To provide feedback on the outcomes from the Rich Context Workshop, suggest research topics, or propose potential collaborations, please use this Google Form: https://forms.gle/AhAtHrzBNNBoGVeM9

Recent Coverage

Meanwhile, here are some recent articles and presentations about Rich Context:

Stay tuned for more updates about Rich Context throughout 2020!
