Searching Related Mathematical Expressions with Machine Learning

We scrape mathematical expressions from arxiv.org to train a machine learning model that can find similar formulas in millions of scientific articles.

View the project on GitHub: Whadup/arxiv_learning

Arxiv Learning

We propose to use unsupervised representation learning techniques to search for related mathematical expressions on arxiv.org. A demo is running at:
https://heureka2.azurewebsites.net.
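At search time, related formulas are found by embedding the query expression and ranking the corpus by similarity. The following is a minimal sketch of cosine-similarity top-k retrieval over precomputed formula embeddings (illustrative only, not the demo's actual code; names and shapes are assumptions):

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return indices of the k corpus formulas most similar to the query.

    Uses cosine similarity; `corpus_vecs` is an (n, d) matrix of formula
    embeddings produced by any representation model.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]

# Toy example with random "embeddings":
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))
query = corpus[42] + 0.01 * rng.normal(size=64)  # near-duplicate of item 42
print(top_k_similar(query, corpus, k=3))
```

In practice the corpus embeddings are computed once offline, so each query costs a single matrix-vector product plus a partial sort.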

Preprocessing

We have published our data processing pipeline as a standalone python library at https://github.com/Whadup/arxiv_library.

Datasets

Preprocessed Data

The preprocessed data is password protected but available here: Sciebo-Link. Just reach out to us for the password.

Keyword-Annotated Formulas

In this shared LaTeX document, we collect keyword-annotated formulas: Overleaf. We can query these formulas in a large collection of papers and check whether the keywords appear in the context of the search results. A processed version of this document is available here: eval.json.
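The exact schema of eval.json is not spelled out here, so the snippet below only sketches the final check: given the text contexts surrounding a query's search results, it measures how often any of the query's annotated keywords appears. The function name and inputs are illustrative assumptions:

```python
def keyword_hit_rate(result_contexts, keywords):
    """Fraction of retrieved contexts that mention at least one keyword.

    `result_contexts` are the text snippets surrounding each search
    result; `keywords` are the annotations of the query formula.
    """
    if not result_contexts:
        return 0.0
    hits = sum(
        any(kw.lower() in ctx.lower() for kw in keywords)
        for ctx in result_contexts
    )
    return hits / len(result_contexts)

contexts = [
    "We minimize the cross-entropy loss over the training set.",
    "The gradient of the objective is computed via backpropagation.",
]
print(keyword_hit_rate(contexts, ["cross-entropy", "softmax"]))  # → 0.5
```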

We also have a small set of machine-learning-related formulas labeled into categories: Labeled Data. Warning: these formulas are copied from arXiv papers, often multiple formulas belong to the same paper, and we do not have the metadata to reconstruct the source. When pre-training on a large collection, it is thus likely that these test formulas have been seen during training, possibly even as positive pairs in contrastive learning tasks.

Finetuning Data

We have automatically identified equalities and inequalities on arXiv. The machine learning task is then to match the left-hand sides and right-hand sides of these (in-)equalities. We provide three different fine-tuning datasets, each split into train and test files. See finetune_model.py#L72 for an example of how to evaluate a model with these datasets.

The fine-tuning data is based on the arXiv publications listed in the metadata file. When using the fine-tuning data in a downstream evaluation, make sure that your base model is not trained on the same papers. For the sake of completeness, we have included the list of papers we used for pretraining.
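As a rough illustration of how an (in-)equality yields a matching pair, here is a naive splitter that cuts a LaTeX string at its first top-level relation symbol. This is a hypothetical sketch, not the pipeline's actual extraction code, which has to handle braces, environments, and multi-relation formulas:

```python
import re

def split_equality(latex_eq):
    """Split a LaTeX (in-)equality into left- and right-hand side.

    Only handles a single relation symbol (=, \\leq, \\geq, <, >);
    real arXiv formulas need a proper parser.
    """
    parts = re.split(r"(?<!\\)(=|\\leq|\\geq|<|>)", latex_eq, maxsplit=1)
    if len(parts) == 3:
        lhs, _, rhs = parts
        return lhs.strip(), rhs.strip()
    return None  # no relation symbol found

print(split_equality(r"\|x+y\| \leq \|x\|+\|y\|"))
```

Each resulting (LHS, RHS) pair then serves as a positive example for the matching task.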

Finetuning Results

We report finetuning results for different kinds of models, measuring recall@K.

| Model              | Equalities R@1 / R@10 / R@100 | Mixed Operators R@1 / R@10 / R@100 | Inequalities R@1 / R@10 / R@100 |
|--------------------|-------------------------------|------------------------------------|---------------------------------|
| FastText           | 0.46 / 0.64 / 0.73            | 0.47 / 0.63 / 0.73                 | 0.48 / 0.70 / 0.80              |
| GraphCNN           | 0.51 / 0.83 / 0.88            | 0.51 / 0.83 / 0.88                 | 0.50 / 0.87 / 0.92              |
| Transformer small  | 0.54 / 0.77 / 0.90            | 0.54 / 0.76 / 0.87                 | 0.50 / 0.82 / 0.96              |
| Transformer medium | 0.53 / 0.76 / 0.90            | 0.54 / 0.80 / 0.95                 | 0.52 / 0.87 / 0.98              |
| Transformer large  | 0.58 / 0.82 / 0.94            | 0.59 / 0.81 / 0.94                 | 0.55 / 0.90 / 0.99              |
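For reference, recall@K for this matching task can be computed as follows, assuming each model maps left- and right-hand sides to row-aligned embedding matrices. This is a generic sketch under those assumptions, not the repository's finetune_model.py:

```python
import numpy as np

def recall_at_k(lhs_emb, rhs_emb, k):
    """Recall@k for matching each LHS embedding to its RHS embedding.

    Row i of `lhs_emb` belongs to row i of `rhs_emb`. A query counts
    as a hit when its true partner ranks in the top-k by cosine
    similarity among all candidates.
    """
    l = lhs_emb / np.linalg.norm(lhs_emb, axis=1, keepdims=True)
    r = rhs_emb / np.linalg.norm(rhs_emb, axis=1, keepdims=True)
    scores = l @ r.T                   # (n, n) similarity matrix
    ranks = (-scores).argsort(axis=1)  # candidates sorted per query
    hits = (ranks[:, :k] == np.arange(len(l))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
lhs = rng.normal(size=(100, 32))
rhs = lhs + 0.05 * rng.normal(size=(100, 32))  # noisy copies as "matches"
print(recall_at_k(lhs, rhs, k=1))
```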
