Searching Related Mathematical Expressions with Machine Learning

We scrape mathematical expressions from arxiv.org to train a machine learning model that can find similar formulas in millions of scientific articles.

View the project on GitHub: Whadup/arxiv_learning

Arxiv Learning

We propose to use unsupervised representation learning techniques to search for related mathematical expressions on arxiv.org. A demo is running at:
https://heureka2.azurewebsites.net.
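At search time, related formulas are found by embedding the query expression and ranking the corpus by similarity. The following is a minimal sketch of cosine-similarity top-k retrieval over precomputed formula embeddings (illustrative only, not the demo's actual code; names and shapes are assumptions):

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return indices of the k corpus formulas most similar to the query.

    Uses cosine similarity; `corpus_vecs` is an (n, d) matrix of formula
    embeddings produced by any representation model.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]

# Toy example with random "embeddings":
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
query = corpus[42] + 0.01 * rng.normal(size=64)  # near-duplicate of item 42
print(top_k_similar(query, corpus, k=3))
```

In practice the corpus embeddings are computed once offline, so each query costs a single matrix-vector product plus a partial sort.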

Preprocessing

We have published our data processing pipeline as a standalone python library at https://github.com/Whadup/arxiv_library.

Datasets

Preprocessed Data

The preprocessed data is password protected but available here: Sciebo-Link. Just reach out to us for the password.

Keyword-Annotated Formulas

In this shared LaTeX document, we collect keyword-annotated formulas: Overleaf. We can query these formulas in a large collection of papers and check whether the keywords appear in the context of the search results. A processed version of this document is available here: eval.json.
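The exact schema of eval.json is not spelled out here, so the snippet below only sketches the final check: given the text contexts surrounding a query's search results, it measures how often any of the query's annotated keywords appears. The function name and inputs are illustrative assumptions:

```python
def keyword_hit_rate(result_contexts, keywords):
    """Fraction of retrieved contexts that mention at least one keyword.

    `result_contexts` are the text snippets surrounding each search
    result; `keywords` are the annotations of the query formula.
    """
    if not result_contexts:
        return 0.0
    hits = sum(
        any(kw.lower() in ctx.lower() for kw in keywords)
        for ctx in result_contexts
    )
    return hits / len(result_contexts)

contexts = [
    "We minimize the cross-entropy loss over the training set.",
    "The gradient of the objective is computed via backpropagation.",
]
print(keyword_hit_rate(contexts, ["cross-entropy", "softmax"]))  # → 0.5
```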

We also have a small set of machine-learning-related formulas labeled into categories: Labeled Data. Warning: these formulas are copied from arXiv papers, often multiple formulas belong to the same paper, and we do not have the metadata to reconstruct the source. When pre-training on a large collection, it is thus likely that these test formulas have been seen during training, possibly even as positive pairs in contrastive learning tasks.

Finetuning Data

We have automatically identified equalities and inequalities on arXiv. The machine learning task is then to match the left-hand sides and right-hand sides of these (in-)equalities. We provide three different fine-tuning datasets, each split into train and test files. See finetune_model.py#L72 for an example of how to evaluate a model with these datasets.

The fine-tuning data is based on the arXiv publications listed in the metadata file. When using the fine-tuning data in a downstream evaluation, make sure that your base model is not trained on the same papers. For the sake of completeness, we have included the list of papers we used for pretraining.
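As a rough illustration of how an (in-)equality yields a matching pair, here is a naive splitter that cuts a LaTeX string at its first top-level relation symbol. This is a hypothetical sketch, not the pipeline's actual extraction code, which has to handle braces, environments, and multi-relation formulas:

```python
import re

def split_equality(latex_eq):
    """Split a LaTeX (in-)equality into left- and right-hand side.

    Only handles a single relation symbol (=, \\leq, \\geq, <, >);
    real arXiv formulas need a proper parser.
    """
    parts = re.split(r"(?<!\\)(=|\\leq|\\geq|<|>)", latex_eq, maxsplit=1)
    if len(parts) == 3:
        lhs, _, rhs = parts
        return lhs.strip(), rhs.strip()
    return None  # no relation symbol found

print(split_equality(r"\|x+y\| \leq \|x\|+\|y\|"))
```

Each resulting (LHS, RHS) pair then serves as a positive example for the matching task.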

Finetuning Results

We report finetuning results for different kinds of models, measuring recall@K.

| Model              | Equalities R@1 / R@10 / R@100 | Mixed Operators R@1 / R@10 / R@100 | Inequalities R@1 / R@10 / R@100 |
|--------------------|-------------------------------|------------------------------------|---------------------------------|
| FastText           | 0.46 / 0.64 / 0.73            | 0.47 / 0.63 / 0.73                 | 0.48 / 0.70 / 0.80              |
| GraphCNN           | 0.51 / 0.83 / 0.88            | 0.51 / 0.83 / 0.88                 | 0.50 / 0.87 / 0.92              |
| Transformer small  | 0.54 / 0.77 / 0.90            | 0.54 / 0.76 / 0.87                 | 0.50 / 0.82 / 0.96              |
| Transformer medium | 0.53 / 0.76 / 0.90            | 0.54 / 0.80 / 0.95                 | 0.52 / 0.87 / 0.98              |
| Transformer large  | 0.58 / 0.82 / 0.94            | 0.59 / 0.81 / 0.94                 | 0.55 / 0.90 / 0.99              |
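For reference, recall@K for this matching task can be computed as follows, assuming each model maps left- and right-hand sides to row-aligned embedding matrices. This is a generic sketch under those assumptions, not the repository's finetune_model.py:

```python
import numpy as np

def recall_at_k(lhs_emb, rhs_emb, k):
    """Recall@k for matching each LHS embedding to its RHS embedding.

    Row i of `lhs_emb` belongs to row i of `rhs_emb`. A query counts
    as a hit when its true partner ranks in the top-k by cosine
    similarity among all candidates.
    """
    l = lhs_emb / np.linalg.norm(lhs_emb, axis=1, keepdims=True)
    r = rhs_emb / np.linalg.norm(rhs_emb, axis=1, keepdims=True)
    scores = l @ r.T                   # (n, n) similarity matrix
    ranks = (-scores).argsort(axis=1)  # candidates sorted per query
    hits = (ranks[:, :k] == np.arange(len(l))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
lhs = rng.normal(size=(100, 32))
rhs = lhs + 0.05 * rng.normal(size=(100, 32))  # noisy copies as "matches"
print(recall_at_k(lhs, rhs, k=1))
```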
