giantsteps_tempo
This is the tempo_eval report for the ‘giantsteps_tempo’ corpus.
Reports for other corpora are available from the same tempo_eval site.
Table of Contents
- References for ‘giantsteps_tempo’
- Estimates for ‘giantsteps_tempo’
- Estimators
- Basic Statistics
- Smoothed Tempo Distribution
- Accuracy
- Accuracy Results for 1.0
- Accuracy1 for 1.0
- Accuracy2 for 1.0
- Accuracy Results for 2.0
- Accuracy1 for 2.0
- Accuracy2 for 2.0
- Differing Items
- Significance of Differences
- Accuracy1 on Tempo-Subsets
- Accuracy2 on Tempo-Subsets
- Estimated Accuracy1 for Tempo
- Estimated Accuracy2 for Tempo
- Accuracy1 for ‘tag_open’ Tags
- Accuracy2 for ‘tag_open’ Tags
- MIREX-Style Evaluation
- MIREX Results for 2.0
- P-Score for 2.0
- One Correct for 2.0
- Both Correct for 2.0
- P-Score on Tempo-Subsets
- One Correct on Tempo-Subsets
- Both Correct on Tempo-Subsets
- Estimated P-Score for Tempo
- Estimated One Correct for Tempo
- Estimated Both Correct for Tempo
- P-Score for ‘tag_open’ Tags
- One Correct for ‘tag_open’ Tags
- Both Correct for ‘tag_open’ Tags
- OE1 and OE2
- Mean OE1/OE2 Results for 1.0
- OE1 distribution for 1.0
- OE2 distribution for 1.0
- Mean OE1/OE2 Results for 2.0
- OE1 distribution for 2.0
- OE2 distribution for 2.0
- Significance of Differences
- OE1 on Tempo-Subsets
- OE2 on Tempo-Subsets
- Estimated OE1 for Tempo
- Estimated OE2 for Tempo
- OE1 for ‘tag_open’ Tags
- OE2 for ‘tag_open’ Tags
- AOE1 and AOE2
- Mean AOE1/AOE2 Results for 1.0
- AOE1 distribution for 1.0
- AOE2 distribution for 1.0
- Mean AOE1/AOE2 Results for 2.0
- AOE1 distribution for 2.0
- AOE2 distribution for 2.0
- Significance of Differences
- AOE1 on Tempo-Subsets
- AOE2 on Tempo-Subsets
- Estimated AOE1 for Tempo
- Estimated AOE2 for Tempo
- AOE1 for ‘tag_open’ Tags
- AOE2 for ‘tag_open’ Tags
References for ‘giantsteps_tempo’
1.0
Attribute | Value |
---|---|
Corpus | GiantSteps Tempo |
Version | 1.0 |
Curator | Peter Knees |
Validation | http://www.cp.jku.at/datasets/giantsteps/ |
Data Source | P. Knees, Á. Faraldo, P. Herrera, R. Vogl, S. Böck, F. Hörschläger, and M. Le Goff.: ‘Two Data Sets for Tempo Estimation and Key Detection in Electronic Dance Music Annotated from User Corrections.’ In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), Málaga, Spain, 2015. |
Annotation Tools | Beatport Forum |
2.0
Attribute | Value |
---|---|
Corpus | GiantSteps Tempo |
Version | 2.0 |
Curator | Hendrik Schreiber |
Data Source | crowdsourced |
Annotation Tools | crowdsourced, web-based experiment |
Annotator, bibtex | Schreiber2018b |
Annotator, ref_url | http://www.tagtraum.com/tempo_estimation.html |
Basic Statistics
Reference | Size | Min (BPM) | Max (BPM) | Avg (BPM) | Stdev (BPM) | Sweet Oct. Start (BPM) | Sweet Oct. Coverage |
---|---|---|---|---|---|---|---|
1.0 | 664 | 53.00 | 200.00 | 136.66 | 28.35 | 91.00 | 0.88 |
2.0 | 661 | 64.00 | 197.00 | 138.58 | 25.91 | 90.00 | 0.92 |
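The ‘Sweet Oct.’ columns describe the single BPM octave [t, 2t) that covers the largest share of a reference’s annotations; ‘Coverage’ is that share. A minimal sketch of this computation (illustrative only — the function name and search range are assumptions, not the tempo_eval implementation):

```python
def sweet_octave(tempi, start_range=(40, 250)):
    """Return (start, coverage) of the BPM octave [t, 2t) that covers
    the largest fraction of the given tempo annotations.

    Sketch of how the 'Sweet Oct.' columns can be computed; the exact
    tempo_eval implementation may differ.
    """
    best_start, best_cov = start_range[0], 0.0
    for t in range(*start_range):
        # fraction of annotations falling into the octave [t, 2t)
        cov = sum(t <= bpm < 2 * t for bpm in tempi) / len(tempi)
        if cov > best_cov:
            best_start, best_cov = t, cov
    return best_start, best_cov
```

For example, the annotations [100, 120, 140, 190] are fully covered by the octave starting at 96 BPM, since all four values lie in [96, 192).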
Smoothed Tempo Distribution
Figure 1: Percentage of values in tempo interval.
Tag Distribution for ‘tag_open’
Figure 2: Percentage of tracks tagged with tags from namespace ‘tag_open’. Annotations are from reference 1.0.
Estimates for ‘giantsteps_tempo’
Estimators
boeck2015/tempodetector2016_default
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.17.dev0 |
Annotation Tools | TempoDetector.2016, madmom, https://github.com/CPJKU/madmom |
Annotator, bibtex | Boeck2015 |
boeck2019/multi_task
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.1 |
Annotation Tools | model=multi_task, https://github.com/superbock/ISMIR2019 |
Annotator, bibtex | Boeck2019 |
boeck2019/multi_task_hjdb
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.1 |
Annotation Tools | model=multi_task_hjdb, https://github.com/superbock/ISMIR2019 |
Annotator, bibtex | Boeck2019 |
boeck2020/dar
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.1 |
Annotation Tools | https://github.com/superbock/ISMIR2020 |
Annotator, bibtex | Boeck2020 |
davies2009/mirex_qm_tempotracker
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 1.0 |
Annotation Tools | QM Tempotracker, Sonic Annotator plugin. https://code.soundsoftware.ac.uk/projects/mirex2013/repository/show/audio_tempo_estimation/qm-tempotracker Note that the current macOS build of ‘qm-vamp-plugins’ was used. |
Annotator, bibtex | Davies2009, Davies2007 |
percival2014/stem
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 1.0 |
Annotation Tools | percival 2014, ‘tempo’ implementation from Marsyas, http://marsyas.info, git checkout tempo-stem |
Annotator, bibtex | Percival2014 |
schreiber2014/default
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.1 |
Annotation Tools | schreiber 2014, http://www.tagtraum.com/tempo_estimation.html |
Annotator, bibtex | Schreiber2014 |
schreiber2017/ismir2017
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.4 |
Annotation Tools | schreiber 2017, model=ismir2017, http://www.tagtraum.com/tempo_estimation.html |
Annotator, bibtex | Schreiber2017 |
schreiber2017/mirex2017
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.4 |
Annotation Tools | schreiber 2017, model=mirex2017, http://www.tagtraum.com/tempo_estimation.html |
Annotator, bibtex | Schreiber2017 |
schreiber2018/cnn
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.2 |
Data Source | Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018. |
Annotation Tools | schreiber tempo-cnn (model=cnn), https://github.com/hendriks73/tempo-cnn |
schreiber2018/fcn
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.2 |
Data Source | Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018. |
Annotation Tools | schreiber tempo-cnn (model=fcn), https://github.com/hendriks73/tempo-cnn |
schreiber2018/ismir2018
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.2 |
Data Source | Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018. |
Annotation Tools | schreiber tempo-cnn (model=ismir2018), https://github.com/hendriks73/tempo-cnn |
sun2021/default
Attribute | Value |
---|---|
Corpus | giantsteps_tempo |
Version | 0.0.2 |
Data Source | Xiaoheng Sun, Qiqi He, Yongwei Gao, Wei Li. Musical Tempo Estimation Using a Multi-scale Network. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 2021. |
Annotation Tools | https://github.com/Qqi-HE/TempoEstimation_MGANet |
Annotator, bibtex | Sun2021 |
Basic Statistics
Estimator | Size | Min (BPM) | Max (BPM) | Avg (BPM) | Stdev (BPM) | Sweet Oct. Start (BPM) | Sweet Oct. Coverage |
---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 664 | 41.10 | 214.29 | 112.49 | 37.34 | 74.00 | 0.70 |
boeck2019/multi_task | 664 | 57.49 | 201.91 | 118.32 | 26.41 | 76.00 | 0.92 |
boeck2019/multi_task_hjdb | 664 | 58.62 | 201.92 | 124.61 | 27.13 | 83.00 | 0.89 |
boeck2020/dar | 664 | 47.65 | 201.50 | 134.27 | 28.64 | 91.00 | 0.86 |
davies2009/mirex_qm_tempotracker | 664 | 63.80 | 191.41 | 115.46 | 27.09 | 84.00 | 0.89 |
percival2014/stem | 664 | 54.98 | 160.25 | 106.59 | 27.22 | 75.00 | 0.85 |
schreiber2014/default | 664 | 63.03 | 171.91 | 112.05 | 27.01 | 78.00 | 0.87 |
schreiber2017/ismir2017 | 664 | 63.50 | 176.01 | 118.57 | 27.31 | 80.00 | 0.88 |
schreiber2017/mirex2017 | 664 | 69.00 | 190.05 | 132.45 | 26.88 | 88.00 | 0.86 |
schreiber2018/cnn | 664 | 67.00 | 216.00 | 135.25 | 27.24 | 90.00 | 0.88 |
schreiber2018/fcn | 664 | 63.00 | 201.00 | 138.38 | 26.85 | 90.00 | 0.90 |
schreiber2018/ismir2018 | 664 | 69.00 | 186.00 | 129.66 | 26.96 | 89.00 | 0.84 |
sun2021/default | 664 | 58.00 | 211.00 | 139.16 | 27.57 | 94.00 | 0.90 |
Smoothed Tempo Distribution
Figure 3: Percentage of values in tempo interval.
Accuracy
Accuracy1 is the percentage of estimates that match the reference BPM within a 4% tolerance.
Accuracy2 additionally counts estimates as correct when they are off by a factor of 2, 3, 1/2, or 1/3 (so-called octave errors).
See [Gouyon2006].
Note: When comparing accuracy values for different algorithms, keep in mind that an algorithm may have been trained on the test set, or that the test set may even have been created using one of the tested algorithms.
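With BPM values e (estimate) and r (reference), the two per-item decisions can be sketched as follows (illustrative Python; tempo_eval’s own implementation may differ in detail). The values reported in the tables below are means of these booleans over all items.

```python
def accuracy1(estimate, reference, tolerance=0.04):
    """True if the estimated BPM is within the tolerance (default 4%)
    of the reference BPM."""
    return abs(estimate - reference) <= tolerance * reference

def accuracy2(estimate, reference, tolerance=0.04,
              factors=(1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)):
    """Like accuracy1, but octave errors (factors 2, 3, 1/2, 1/3)
    also count as correct."""
    return any(accuracy1(estimate, reference * f, tolerance)
               for f in factors)
```

Note that the tolerance is relative to the (scaled) reference, so an estimate of 60 BPM against a 120 BPM reference is wrong under Accuracy1 but correct under Accuracy2.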
Accuracy Results for 1.0
Estimator | Accuracy1 | Accuracy2 |
---|---|---|
sun2021/default | 0.8042 | 0.9096 |
schreiber2018/fcn | 0.7982 | 0.9066 |
schreiber2017/mirex2017 | 0.7696 | 0.8916 |
boeck2020/dar | 0.7651 | 0.8916 |
schreiber2018/cnn | 0.7636 | 0.8931 |
schreiber2018/ismir2018 | 0.7349 | 0.8946 |
boeck2019/multi_task_hjdb | 0.6913 | 0.8931 |
boeck2019/multi_task | 0.6431 | 0.8886 |
schreiber2017/ismir2017 | 0.6235 | 0.8870 |
boeck2015/tempodetector2016_default | 0.5798 | 0.8630 |
schreiber2014/default | 0.5663 | 0.8690 |
percival2014/stem | 0.5151 | 0.8855 |
davies2009/mirex_qm_tempotracker | 0.2666 | 0.4639 |
Table 3: Mean accuracy of estimates compared to version 1.0 with 4% tolerance ordered by Accuracy1.
Accuracy1 for 1.0
Figure 4: Mean Accuracy1 for estimates compared to version 1.0 depending on tolerance.
Accuracy2 for 1.0
Figure 5: Mean Accuracy2 for estimates compared to version 1.0 depending on tolerance.
Accuracy Results for 2.0
Estimator | Accuracy1 | Accuracy2 |
---|---|---|
schreiber2018/fcn | 0.8986 | 0.9803 |
sun2021/default | 0.8911 | 0.9849 |
boeck2020/dar | 0.8744 | 0.9697 |
schreiber2018/cnn | 0.8638 | 0.9758 |
schreiber2017/mirex2017 | 0.8260 | 0.9607 |
schreiber2018/ismir2018 | 0.8245 | 0.9758 |
boeck2019/multi_task_hjdb | 0.7670 | 0.9622 |
boeck2019/multi_task | 0.7005 | 0.9622 |
schreiber2017/ismir2017 | 0.6899 | 0.9561 |
boeck2015/tempodetector2016_default | 0.6384 | 0.9410 |
schreiber2014/default | 0.6051 | 0.9319 |
percival2014/stem | 0.5703 | 0.9607 |
davies2009/mirex_qm_tempotracker | 0.3540 | 0.5325 |
Table 4: Mean accuracy of estimates compared to version 2.0 with 4% tolerance ordered by Accuracy1.
Accuracy1 for 2.0
Figure 6: Mean Accuracy1 for estimates compared to version 2.0 depending on tolerance.
Accuracy2 for 2.0
Figure 7: Mean Accuracy2 for estimates compared to version 2.0 depending on tolerance.
Differing Items
For which items did a given estimator fail to estimate a correct value with respect to a given ground truth? And are there items that are never estimated correctly, regardless of which estimator is used — because they are very difficult, unsuitable for the task, or incorrectly annotated?
Differing Items Accuracy1
Items for which an estimator’s tempo does not match the reference annotation (Accuracy1, 4% tolerance), listed per reference version and estimator:
1.0 compared with boeck2015/tempodetector2016_default (279 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1092771.LOFI’ ‘1177875.LOFI’ ‘1198571.LOFI’ ‘1234668.LOFI’ ‘1240669.LOFI’ ‘1240672.LOFI’ ‘1317507.LOFI’ ‘1327052.LOFI’ ‘1329955.LOFI’ … CSV
1.0 compared with boeck2019/multi_task (237 differences): ‘1068430.LOFI’ ‘1120171.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1240672.LOFI’ ‘1317507.LOFI’ ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ … CSV
1.0 compared with boeck2019/multi_task_hjdb (205 differences): ‘1068430.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240672.LOFI’ ‘1317507.LOFI’ ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1461087.LOFI’ ‘1479462.LOFI’ … CSV
1.0 compared with boeck2020/dar (156 differences): ‘1240672.LOFI’ ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1461087.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ … CSV
1.0 compared with davies2009/mirex_qm_tempotracker (487 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1084996.LOFI’ ‘1092771.LOFI’ ‘1114156.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1171800.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1183908.LOFI’ … CSV
1.0 compared with percival2014/stem (322 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1092771.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1240672.LOFI’ … CSV
1.0 compared with schreiber2014/default (288 differences): ‘1068430.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1240672.LOFI’ ‘1317507.LOFI’ ‘1327052.LOFI’ … CSV
1.0 compared with schreiber2017/ismir2017 (250 differences): ‘1068430.LOFI’ ‘1120171.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1240672.LOFI’ ‘1317507.LOFI’ ‘1327052.LOFI’ ‘1380256.LOFI’ ‘1461087.LOFI’ ‘1479462.LOFI’ … CSV
1.0 compared with schreiber2017/mirex2017 (153 differences): ‘1234668.LOFI’ ‘1240669.LOFI’ ‘1240672.LOFI’ ‘1317507.LOFI’ ‘1327052.LOFI’ ‘1380256.LOFI’ ‘1479462.LOFI’ ‘172384.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ … CSV
1.0 compared with schreiber2018/cnn (157 differences): ‘1114156.LOFI’ ‘1240669.LOFI’ ‘1240672.LOFI’ ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1461087.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ … CSV
1.0 compared with schreiber2018/fcn (134 differences): ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1461087.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1765409.LOFI’ ‘1885798.LOFI’ ‘1905592.LOFI’ ‘1950701.LOFI’ … CSV
1.0 compared with schreiber2018/ismir2018 (176 differences): ‘1068430.LOFI’ ‘1327052.LOFI’ ‘1461087.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1765409.LOFI’ ‘1817444.LOFI’ ‘1855660.LOFI’ ‘1905591.LOFI’ … CSV
1.0 compared with sun2021/default (130 differences): ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1418652.LOFI’ ‘1461087.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1765409.LOFI’ ‘1839656.LOFI’ ‘1874244.LOFI’ … CSV
2.0 compared with boeck2015/tempodetector2016_default (239 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1092771.LOFI’ ‘1177875.LOFI’ ‘1198571.LOFI’ ‘1234668.LOFI’ ‘1240669.LOFI’ ‘1317507.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ … CSV
2.0 compared with boeck2019/multi_task (198 differences): ‘1068430.LOFI’ ‘1120171.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1317507.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1424458.LOFI’ ‘1569136.LOFI’ … CSV
2.0 compared with boeck2019/multi_task_hjdb (154 differences): ‘1068430.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1317507.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1728723.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1765409.LOFI’ … CSV
2.0 compared with boeck2020/dar (83 differences): ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1698047.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1874244.LOFI’ ‘2422602.LOFI’ ‘2734649.LOFI’ ‘28952.LOFI’ ‘3069960.LOFI’ … CSV
2.0 compared with davies2009/mirex_qm_tempotracker (427 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1084996.LOFI’ ‘1092771.LOFI’ ‘1114156.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1171800.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1183908.LOFI’ … CSV
2.0 compared with percival2014/stem (284 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1092771.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1317507.LOFI’ … CSV
2.0 compared with schreiber2014/default (261 differences): ‘1068430.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1317507.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ … CSV
2.0 compared with schreiber2017/ismir2017 (205 differences): ‘1068430.LOFI’ ‘1120171.LOFI’ ‘1234668.LOFI’ ‘1234669.LOFI’ ‘1240669.LOFI’ ‘1317507.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1623443.LOFI’ ‘1676961.LOFI’ ‘1728723.LOFI’ … CSV
2.0 compared with schreiber2017/mirex2017 (115 differences): ‘1234668.LOFI’ ‘1240669.LOFI’ ‘1317507.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1461087.LOFI’ ‘1698047.LOFI’ ‘172384.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1855660.LOFI’ … CSV
2.0 compared with schreiber2018/cnn (90 differences): ‘1114156.LOFI’ ‘1240669.LOFI’ ‘1329955.LOFI’ ‘1418652.LOFI’ ‘1698047.LOFI’ ‘1743969.LOFI’ ‘1874244.LOFI’ ‘1905591.LOFI’ ‘2083969.LOFI’ ‘2757093.LOFI’ ‘3023605.LOFI’ … CSV
2.0 compared with schreiber2018/fcn (67 differences): ‘1240672.LOFI’ ‘1329955.LOFI’ ‘1418652.LOFI’ ‘1698047.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1885798.LOFI’ ‘2083969.LOFI’ ‘2741734.LOFI’ ‘3013673.LOFI’ ‘3023605.LOFI’ … CSV
2.0 compared with schreiber2018/ismir2018 (116 differences): ‘1240672.LOFI’ ‘1418652.LOFI’ ‘1698047.LOFI’ ‘1747518.LOFI’ ‘1817444.LOFI’ ‘1855660.LOFI’ ‘1905591.LOFI’ ‘1929611.LOFI’ ‘2706792.LOFI’ ‘2734649.LOFI’ ‘2734862.LOFI’ … CSV
2.0 compared with sun2021/default (72 differences): ‘1240672.LOFI’ ‘1329955.LOFI’ ‘1418652.LOFI’ ‘1698047.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1839656.LOFI’ ‘1874244.LOFI’ ‘1950701.LOFI’ ‘210560.LOFI’ ‘3013772.LOFI’ … CSV
None of the estimators estimated the following 5 items ‘correctly’ using Accuracy1: ‘3630279.LOFI’ ‘3980001.LOFI’ ‘5089294.LOFI’ ‘5137154.LOFI’ ‘942357.LOFI’ CSV
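Such never-correct items follow directly from the per-estimator error sets by set intersection; a small sketch with hypothetical inputs (not tempo_eval code):

```python
def never_correct(errors_by_estimator):
    """Given a mapping estimator name -> set of item ids that failed
    Accuracy1 against the reference, return the items that *every*
    estimator got wrong."""
    sets = list(errors_by_estimator.values())
    return set.intersection(*sets) if sets else set()
```

An item appears in the result only if it is present in each estimator’s error set, so a single correct estimate by any system removes it.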
Differing Items Accuracy2
Items for which an estimator’s tempo does not match the reference annotation (Accuracy2, 4% tolerance), listed per reference version and estimator:
1.0 compared with boeck2015/tempodetector2016_default (91 differences): ‘1240669.LOFI’ ‘1327052.LOFI’ ‘1479462.LOFI’ ‘1698047.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1905592.LOFI’ ‘1950701.LOFI’ ‘2726350.LOFI’ … CSV
1.0 compared with boeck2019/multi_task (74 differences): ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1424458.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1905592.LOFI’ ‘2422602.LOFI’ … CSV
1.0 compared with boeck2019/multi_task_hjdb (71 differences): ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1905592.LOFI’ ‘2039114.LOFI’ ‘2422602.LOFI’ … CSV
1.0 compared with boeck2020/dar (72 differences): ‘1327052.LOFI’ ‘1380256.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1747518.LOFI’ ‘1905592.LOFI’ ‘2422602.LOFI’ ‘2726350.LOFI’ ‘2726353.LOFI’ … CSV
1.0 compared with davies2009/mirex_qm_tempotracker (356 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1084996.LOFI’ ‘1092771.LOFI’ ‘1114156.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1171800.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1183908.LOFI’ … CSV
1.0 compared with percival2014/stem (76 differences): ‘1030011.LOFI’ ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1424458.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1874244.LOFI’ ‘1905592.LOFI’ ‘2432724.LOFI’ … CSV
1.0 compared with schreiber2014/default (87 differences): ‘1327052.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1479462.LOFI’ ‘1676961.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1905592.LOFI’ ‘1974485.LOFI’ ‘2676506.LOFI’ ‘2726350.LOFI’ … CSV
1.0 compared with schreiber2017/ismir2017 (75 differences): ‘1327052.LOFI’ ‘1380256.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1905592.LOFI’ ‘1974485.LOFI’ ‘2422602.LOFI’ ‘2726350.LOFI’ ‘2726353.LOFI’ … CSV
1.0 compared with schreiber2017/mirex2017 (72 differences): ‘1327052.LOFI’ ‘1380256.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1905592.LOFI’ ‘1974485.LOFI’ ‘2422602.LOFI’ ‘2726350.LOFI’ ‘2726353.LOFI’ … CSV
1.0 compared with schreiber2018/cnn (71 differences): ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1874244.LOFI’ ‘1905592.LOFI’ ‘2726350.LOFI’ ‘2726353.LOFI’ ‘2726355.LOFI’ … CSV
1.0 compared with schreiber2018/fcn (62 differences): ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1905592.LOFI’ ‘2726350.LOFI’ ‘2726353.LOFI’ ‘2726355.LOFI’ ‘2745205.LOFI’ … CSV
1.0 compared with schreiber2018/ismir2018 (70 differences): ‘1068430.LOFI’ ‘1327052.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1905591.LOFI’ ‘1905592.LOFI’ ‘2726350.LOFI’ ‘2726353.LOFI’ ‘2726355.LOFI’ … CSV
1.0 compared with sun2021/default (60 differences): ‘1327052.LOFI’ ‘1329955.LOFI’ ‘1418652.LOFI’ ‘1479462.LOFI’ ‘1728723.LOFI’ ‘1735621.LOFI’ ‘1743969.LOFI’ ‘1905592.LOFI’ ‘2726350.LOFI’ ‘2726353.LOFI’ ‘2726355.LOFI’ … CSV
2.0 compared with boeck2015/tempodetector2016_default (39 differences): ‘1240669.LOFI’ ‘1418652.LOFI’ ‘1747518.LOFI’ ‘3069960.LOFI’ ‘3312045.LOFI’ ‘3377892.LOFI’ ‘3414605.LOFI’ ‘3480108.LOFI’ ‘3509304.LOFI’ ‘3564559.LOFI’ ‘3630279.LOFI’ … CSV
2.0 compared with boeck2019/multi_task (25 differences): ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1424458.LOFI’ ‘1747518.LOFI’ ‘2422602.LOFI’ ‘3069960.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3787878.LOFI’ … CSV
2.0 compared with boeck2019/multi_task_hjdb (25 differences): ‘1329955.LOFI’ ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1747518.LOFI’ ‘2039114.LOFI’ ‘2422602.LOFI’ ‘3069960.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3787878.LOFI’ … CSV
2.0 compared with boeck2020/dar (20 differences): ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1747518.LOFI’ ‘2422602.LOFI’ ‘3069960.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3801596.LOFI’ ‘4120277.LOFI’ ‘4237913.LOFI’ … CSV
2.0 compared with davies2009/mirex_qm_tempotracker (309 differences): ‘1030011.LOFI’ ‘1068430.LOFI’ ‘1084996.LOFI’ ‘1092771.LOFI’ ‘1114156.LOFI’ ‘1118326.LOFI’ ‘1120171.LOFI’ ‘1171800.LOFI’ ‘1174239.LOFI’ ‘1177875.LOFI’ ‘1183908.LOFI’ … CSV
2.0 compared with percival2014/stem (26 differences): ‘1030011.LOFI’ ‘1329955.LOFI’ ‘1418652.LOFI’ ‘1424458.LOFI’ ‘1874244.LOFI’ ‘2432724.LOFI’ ‘3435022.LOFI’ ‘3480108.LOFI’ ‘3565815.LOFI’ ‘3711752.LOFI’ ‘3787878.LOFI’ … CSV
2.0 compared with schreiber2014/default (45 differences): ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1676961.LOFI’ ‘1728723.LOFI’ ‘1974485.LOFI’ ‘2676506.LOFI’ ‘2759853.LOFI’ ‘3013772.LOFI’ ‘3023605.LOFI’ ‘3069960.LOFI’ ‘3128068.LOFI’ … CSV
2.0 compared with schreiber2017/ismir2017 (29 differences): ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1974485.LOFI’ ‘2422602.LOFI’ ‘3023605.LOFI’ ‘3069960.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3692859.LOFI’ ‘3787878.LOFI’ … CSV
2.0 compared with schreiber2017/mirex2017 (26 differences): ‘1380256.LOFI’ ‘1418652.LOFI’ ‘1974485.LOFI’ ‘2422602.LOFI’ ‘3023605.LOFI’ ‘3069960.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3787878.LOFI’ ‘4017611.LOFI’ … CSV
2.0 compared with schreiber2018/cnn (16 differences): ‘1329955.LOFI’ ‘1418652.LOFI’ ‘1874244.LOFI’ ‘3023605.LOFI’ ‘3069960.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3787878.LOFI’ ‘4044591.LOFI’ ‘4283854.LOFI’ … CSV
2.0 compared with schreiber2018/fcn (13 differences): ‘1329955.LOFI’ ‘1418652.LOFI’ ‘3023605.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3787878.LOFI’ ‘4043892.LOFI’ ‘4288893.LOFI’ ‘4332592.LOFI’ ‘5137153.LOFI’ … CSV
2.0 compared with schreiber2018/ismir2018 (16 differences): ‘1418652.LOFI’ ‘1905591.LOFI’ ‘3069960.LOFI’ ‘3189712.LOFI’ ‘3480108.LOFI’ ‘3630279.LOFI’ ‘3642438.LOFI’ ‘3787878.LOFI’ ‘4043892.LOFI’ ‘4044591.LOFI’ ‘4288893.LOFI’ … CSV
2.0 compared with sun2021/default (10 differences): ‘1329955.LOFI’ ‘1418652.LOFI’ ‘3069960.LOFI’ ‘3630279.LOFI’ ‘3787878.LOFI’ ‘4264210.LOFI’ ‘4288893.LOFI’ ‘5137153.LOFI’ ‘5137154.LOFI’ ‘906760.LOFI’ CSV
All tracks were estimated ‘correctly’ by at least one system.
Significance of Differences
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0011 | 0.0000 | 0.0000 | 0.0000 | 0.0026 | 0.5938 | 0.0374 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0011 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.2084 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task_hjdb | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0068 | 0.0000 |
boeck2020/dar | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.8453 | 1.0000 | 0.0189 | 0.0624 | 0.0034 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0026 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0047 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.5938 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0047 | 1.0000 | 0.0009 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0374 | 0.2084 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0009 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0000 | 0.8453 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.7728 | 0.0648 | 0.0255 | 0.0368 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.7728 | 1.0000 | 0.0044 | 0.0536 | 0.0048 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0189 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0648 | 0.0044 | 1.0000 | 0.0000 | 0.6985 |
schreiber2018/ismir2018 | 0.0000 | 0.0000 | 0.0068 | 0.0624 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0255 | 0.0536 | 0.0000 | 1.0000 | 0.0000 |
sun2021/default | 0.0000 | 0.0000 | 0.0000 | 0.0034 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0368 | 0.0048 | 0.6985 | 0.0000 | 1.0000 |
Table 5: McNemar p-values, using reference annotations 1.0 as ground truth with Accuracy1 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, H0 is rejected, i.e., the two estimators differ significantly in their disagreement with the ground truth. p-values below 0.05 indicate a significant difference at α = 0.05.
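The p-values in Tables 5–8 come from McNemar’s test, which considers only the discordant items — those that exactly one of the two compared estimators gets right. One common exact two-sided formulation can be sketched as follows (illustrative; tempo_eval may use a different variant, e.g. a mid-p or chi-square approximation):

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value.

    b: items estimator A gets right but estimator B gets wrong;
    c: items B gets right but A gets wrong (both judged against the
    same ground truth). Under H0 the b + c discordant items split
    50/50, so the p-value is a two-sided binomial tail probability.
    """
    n = b + c
    if n == 0:
        return 1.0
    # sum the binomial tail up to the smaller discordant count,
    # then double it for a two-sided test (capped at 1)
    tail = sum(comb(n, i) for i in range(min(b, c) + 1))
    return min(1.0, 2.0 * tail * 0.5 ** n)
```

This also explains the diagonal entries of 1.0000: an estimator compared with itself has no discordant items, so H0 can never be rejected.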
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0022 | 0.0000 | 0.0000 | 0.0000 | 0.0022 | 0.1716 | 0.0172 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0022 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5248 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task_hjdb | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0007 | 0.0000 | 0.0000 | 0.0004 | 0.0000 |
boeck2020/dar | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0029 | 0.5102 | 0.0976 | 0.0016 | 0.2418 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0022 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0620 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.1716 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0620 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0172 | 0.5248 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0007 | 0.0029 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0215 | 0.0000 | 1.0000 | 0.0001 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.5102 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0215 | 1.0000 | 0.0059 | 0.0099 | 0.0700 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0976 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0059 | 1.0000 | 0.0000 | 0.6198 |
schreiber2018/ismir2018 | 0.0000 | 0.0000 | 0.0004 | 0.0016 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0099 | 0.0000 | 1.0000 | 0.0000 |
sun2021/default | 0.0000 | 0.0000 | 0.0000 | 0.2418 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0700 | 0.6198 | 0.0000 | 1.0000 |
Table 6: McNemar p-values, using reference annotations 2.0 as ground truth with Accuracy1 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, H0 is rejected, i.e., the two estimators differ significantly in their disagreement with the ground truth. p-values below 0.05 indicate a significant difference at α = 0.05.
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0076 | 0.0012 | 0.0034 | 0.0000 | 0.0444 | 0.6936 | 0.0166 | 0.0043 | 0.0022 | 0.0000 | 0.0015 | 0.0000 |
boeck2019/multi_task | 0.0076 | 1.0000 | 0.5078 | 0.7744 | 0.0000 | 0.8555 | 0.0789 | 1.0000 | 0.8388 | 0.6636 | 0.0118 | 0.5413 | 0.0005 |
boeck2019/multi_task_hjdb | 0.0012 | 0.5078 | 1.0000 | 1.0000 | 0.0000 | 0.4731 | 0.0293 | 0.5235 | 1.0000 | 1.0000 | 0.0784 | 1.0000 | 0.0127 |
boeck2020/dar | 0.0034 | 0.7744 | 1.0000 | 1.0000 | 0.0000 | 0.6076 | 0.0400 | 0.6636 | 1.0000 | 1.0000 | 0.0525 | 0.8388 | 0.0118 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0444 | 0.8555 | 0.4731 | 0.6076 | 0.0000 | 1.0000 | 0.1690 | 1.0000 | 0.5966 | 0.4049 | 0.0043 | 0.3771 | 0.0025 |
schreiber2014/default | 0.6936 | 0.0789 | 0.0293 | 0.0400 | 0.0000 | 0.1690 | 1.0000 | 0.0576 | 0.0167 | 0.0195 | 0.0001 | 0.0161 | 0.0000 |
schreiber2017/ismir2017 | 0.0166 | 1.0000 | 0.5235 | 0.6636 | 0.0000 | 1.0000 | 0.0576 | 1.0000 | 0.2500 | 0.5235 | 0.0044 | 0.3833 | 0.0015 |
schreiber2017/mirex2017 | 0.0043 | 0.8388 | 1.0000 | 1.0000 | 0.0000 | 0.5966 | 0.0167 | 0.2500 | 1.0000 | 1.0000 | 0.0213 | 0.8145 | 0.0075 |
schreiber2018/cnn | 0.0022 | 0.6636 | 1.0000 | 1.0000 | 0.0000 | 0.4049 | 0.0195 | 0.5235 | 1.0000 | 1.0000 | 0.0117 | 1.0000 | 0.0034 |
schreiber2018/fcn | 0.0000 | 0.0118 | 0.0784 | 0.0525 | 0.0000 | 0.0043 | 0.0001 | 0.0044 | 0.0213 | 0.0117 | 1.0000 | 0.0386 | 0.7539 |
schreiber2018/ismir2018 | 0.0015 | 0.5413 | 1.0000 | 0.8388 | 0.0000 | 0.3771 | 0.0161 | 0.3833 | 0.8145 | 1.0000 | 0.0386 | 1.0000 | 0.0129 |
sun2021/default | 0.0000 | 0.0005 | 0.0127 | 0.0118 | 0.0000 | 0.0025 | 0.0000 | 0.0015 | 0.0075 | 0.0034 | 0.7539 | 0.0129 | 1.0000 |
Table 7: McNemar p-values, using reference annotations 1.0 as ground truth with Accuracy2 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, H0 is rejected, i.e., the two estimators differ significantly in their disagreement with the ground truth. p-values below 0.05 indicate a significant difference at α = 0.05.
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0243 | 0.0243 | 0.0013 | 0.0000 | 0.0725 | 0.5044 | 0.1433 | 0.0533 | 0.0003 | 0.0000 | 0.0002 | 0.0000 |
boeck2019/multi_task | 0.0243 | 1.0000 | 1.0000 | 0.2266 | 0.0000 | 1.0000 | 0.0045 | 0.5235 | 1.0000 | 0.0784 | 0.0075 | 0.0784 | 0.0001 |
boeck2019/multi_task_hjdb | 0.0243 | 1.0000 | 1.0000 | 0.2266 | 0.0000 | 1.0000 | 0.0045 | 0.5235 | 1.0000 | 0.0784 | 0.0075 | 0.0636 | 0.0001 |
boeck2020/dar | 0.0013 | 0.2266 | 0.2266 | 1.0000 | 0.0000 | 0.3616 | 0.0002 | 0.0784 | 0.2863 | 0.5034 | 0.1435 | 0.4807 | 0.0213 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0725 | 1.0000 | 1.0000 | 0.3616 | 0.0000 | 1.0000 | 0.0127 | 0.7283 | 1.0000 | 0.0639 | 0.0072 | 0.0987 | 0.0004 |
schreiber2014/default | 0.5044 | 0.0045 | 0.0045 | 0.0002 | 0.0000 | 0.0127 | 1.0000 | 0.0052 | 0.0009 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.1433 | 0.5235 | 0.5235 | 0.0784 | 0.0000 | 0.7283 | 0.0052 | 1.0000 | 0.2500 | 0.0106 | 0.0001 | 0.0072 | 0.0001 |
schreiber2017/mirex2017 | 0.0533 | 1.0000 | 1.0000 | 0.2863 | 0.0000 | 1.0000 | 0.0009 | 0.2500 | 1.0000 | 0.0414 | 0.0010 | 0.0309 | 0.0004 |
schreiber2018/cnn | 0.0003 | 0.0784 | 0.0784 | 0.5034 | 0.0000 | 0.0639 | 0.0000 | 0.0106 | 0.0414 | 1.0000 | 0.5488 | 1.0000 | 0.1460 |
schreiber2018/fcn | 0.0000 | 0.0075 | 0.0075 | 0.1435 | 0.0000 | 0.0072 | 0.0000 | 0.0001 | 0.0010 | 0.5488 | 1.0000 | 0.5488 | 0.4531 |
schreiber2018/ismir2018 | 0.0002 | 0.0784 | 0.0636 | 0.4807 | 0.0000 | 0.0987 | 0.0000 | 0.0072 | 0.0309 | 1.0000 | 0.5488 | 1.0000 | 0.1460 |
sun2021/default | 0.0000 | 0.0001 | 0.0001 | 0.0213 | 0.0000 | 0.0004 | 0.0000 | 0.0001 | 0.0004 | 0.1460 | 0.4531 | 0.1460 | 1.0000 |
Table 8: McNemar p-values, using reference annotations 2.0 as ground truth with Accuracy2 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, we reject H0, i.e., there is a significant difference in the estimators' disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.
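The McNemar p-values above compare two estimators through their per-track correctness. As a rough illustration, here is a minimal exact (binomial) version of the test; the data and names are hypothetical, and tempo_eval's own implementation may differ (e.g., a chi-square approximation with continuity correction).

```python
from math import comb

def mcnemar_exact_p(correct_a, correct_b):
    """Exact (binomial) McNemar p-value from per-track correctness flags."""
    # Discordant counts: tracks where exactly one estimator is correct.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # the estimators agree on every track
    # Under H0 the discordant tracks split 50/50; two-sided binomial tail.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical correctness flags (1 = Accuracy2 hit) for two estimators.
algo_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
algo_b = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
p_value = mcnemar_exact_p(algo_a, algo_b)
```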
Accuracy1 on Tempo-Subsets
How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean Accuracy1 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
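The windowed averaging described above can be sketched as follows (names and data are hypothetical, not tempo_eval's actual implementation):

```python
def subset_mean(metric_values, reference_tempi, t, width=10.0):
    """Mean metric over tracks whose reference tempo lies in [t - width, t + width]."""
    selected = [m for m, r in zip(metric_values, reference_tempi)
                if t - width <= r <= t + width]
    return sum(selected) / len(selected) if selected else float("nan")

# Hypothetical per-track Accuracy1 flags and reference tempi (BPM).
acc1 = [1, 0, 1, 1, 0]
tempi = [98.0, 126.0, 104.0, 131.0, 95.0]
mean_around_100 = subset_mean(acc1, tempi, 100.0)  # averages the 3 tracks in [90, 110]
```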
Accuracy1 on Tempo-Subsets for 1.0
Figure 8: Mean Accuracy1 for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy1 on Tempo-Subsets for 2.0
Figure 9: Mean Accuracy1 for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy2 on Tempo-Subsets
How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean Accuracy2 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
Accuracy2 on Tempo-Subsets for 1.0
Figure 10: Mean Accuracy2 for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy2 on Tempo-Subsets for 2.0
Figure 11: Mean Accuracy2 for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Accuracy1 for Tempo
When fitting a generalized additive model (GAM) to Accuracy1 values as a function of the ground truth tempo, what Accuracy1 can we expect with confidence?
Estimated Accuracy1 for Tempo for 1.0
Predictions of GAMs trained on Accuracy1 for estimates for reference 1.0.
Figure 12: Accuracy1 predictions of a generalized additive model (GAM) fit to Accuracy1 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Accuracy1 for Tempo for 2.0
Predictions of GAMs trained on Accuracy1 for estimates for reference 2.0.
Figure 13: Accuracy1 predictions of a generalized additive model (GAM) fit to Accuracy1 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Accuracy2 for Tempo
When fitting a generalized additive model (GAM) to Accuracy2 values as a function of the ground truth tempo, what Accuracy2 can we expect with confidence?
Estimated Accuracy2 for Tempo for 1.0
Predictions of GAMs trained on Accuracy2 for estimates for reference 1.0.
Figure 14: Accuracy2 predictions of a generalized additive model (GAM) fit to Accuracy2 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Accuracy2 for Tempo for 2.0
Predictions of GAMs trained on Accuracy2 for estimates for reference 2.0.
Figure 15: Accuracy2 predictions of a generalized additive model (GAM) fit to Accuracy2 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy1 for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a particular label? Note that some values may be based on very few estimates.
Accuracy1 for ‘tag_open’ Tags for 1.0
Figure 16: Mean Accuracy1 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy1 for ‘tag_open’ Tags for 2.0
Figure 17: Mean Accuracy1 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy2 for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a particular label? Note that some values may be based on very few estimates.
Accuracy2 for ‘tag_open’ Tags for 1.0
Figure 18: Mean Accuracy2 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy2 for ‘tag_open’ Tags for 2.0
Figure 19: Mean Accuracy2 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
MIREX-Style Evaluation
P-Score is the average correctness for the two reference tempi, weighted by their perceptual strength (salience), allowing an 8% tolerance for both tempo values [MIREX 2006 Definition].
One Correct is the fraction of estimate pairs in which at least one of the two estimated values matches a reference value (within an 8% tolerance).
Both Correct is the fraction of estimate pairs in which both estimated values match the reference values (within an 8% tolerance).
See [McKinney2007].
Note: Very few datasets actually have multiple annotations per track along with a salience distribution. References without suitable annotations are not shown.
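Under these definitions, the per-track computation can be sketched as follows. This is a minimal illustration assuming the reference provides two tempi and the salience of the first; tempo_eval's own code may differ.

```python
def mirex_metrics(estimates, ref_t1, ref_t2, salience1, tol=0.08):
    """Per-track P-Score, One Correct and Both Correct (MIREX-style).

    `estimates` is the pair of estimated tempi; `salience1` is the
    perceptual strength of the first reference tempo ref_t1.
    """
    def identified(ref):
        # A reference tempo counts as found if either estimate lies
        # within the relative tolerance of it.
        return any(abs(e - ref) <= tol * ref for e in estimates)

    tt1, tt2 = identified(ref_t1), identified(ref_t2)
    p_score = salience1 * tt1 + (1.0 - salience1) * tt2
    return p_score, tt1 or tt2, tt1 and tt2

p, one, both = mirex_metrics((125.0, 250.0), ref_t1=124.0, ref_t2=248.0, salience1=0.8)
```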
MIREX Results for 2.0
Estimator | P-Score | One Correct | Both Correct |
---|---|---|---|
schreiber2018/cnn | 0.9636 | 0.9955 | 0.6354 |
schreiber2018/fcn | 0.9572 | 0.9939 | 0.6157 |
schreiber2018/ismir2018 | 0.9564 | 0.9924 | 0.6142 |
schreiber2017/mirex2017 | 0.9537 | 0.9879 | 0.6384 |
boeck2019/multi_task_hjdb | 0.9466 | 0.9924 | 0.6006 |
schreiber2017/ismir2017 | 0.9446 | 0.9758 | 0.6369 |
boeck2020/dar | 0.9436 | 0.9834 | 0.5567 |
boeck2019/multi_task | 0.9268 | 0.9834 | 0.5643 |
schreiber2014/default | 0.9119 | 0.9531 | 0.6021 |
sun2021/default | 0.8452 | 0.9879 | 0.1528 |
boeck2015/tempodetector2016_default | 0.8442 | 0.9818 | 0.3132 |
percival2014/stem | 0.5909 | 0.8956 | 0.0408 |
davies2009/mirex_qm_tempotracker | 0.4060 | 0.5507 | 0.2148 |
Table 9: Compared to 2.0 with 8.0% tolerance.
Raw data P-Score: CSV JSON LATEX PICKLE
Raw data One Correct: CSV JSON LATEX PICKLE
Raw data Both Correct: CSV JSON LATEX PICKLE
P-Score for 2.0
Figure 20: Mean P-Score for estimates compared to version 2.0 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct for 2.0
Figure 21: Mean One Correct for estimates compared to version 2.0 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct for 2.0
Figure 22: Mean Both Correct for estimates compared to version 2.0 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
P-Score on Tempo-Subsets
How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean P-Score for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
P-Score on Tempo-Subsets for 1.0
Figure 23: Mean P-Score for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
P-Score on Tempo-Subsets for 2.0
Figure 24: Mean P-Score for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct on Tempo-Subsets
How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean One Correct for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
One Correct on Tempo-Subsets for 1.0
Figure 25: Mean One Correct for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct on Tempo-Subsets for 2.0
Figure 26: Mean One Correct for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct on Tempo-Subsets
How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean Both Correct for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
Both Correct on Tempo-Subsets for 1.0
Figure 27: Mean Both Correct for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct on Tempo-Subsets for 2.0
Figure 28: Mean Both Correct for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated P-Score for Tempo
When fitting a generalized additive model (GAM) to P-Score values as a function of the ground truth tempo, what P-Score can we expect with confidence?
Estimated P-Score for Tempo for 1.0
Predictions of GAMs trained on P-Score for estimates for reference 1.0.
Figure 29: P-Score predictions of a generalized additive model (GAM) fit to P-Score results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated P-Score for Tempo for 2.0
Predictions of GAMs trained on P-Score for estimates for reference 2.0.
Figure 30: P-Score predictions of a generalized additive model (GAM) fit to P-Score results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated One Correct for Tempo
When fitting a generalized additive model (GAM) to One Correct values as a function of the ground truth tempo, what One Correct can we expect with confidence?
Estimated One Correct for Tempo for 1.0
Predictions of GAMs trained on One Correct for estimates for reference 1.0.
Figure 31: One Correct predictions of a generalized additive model (GAM) fit to One Correct results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated One Correct for Tempo for 2.0
Predictions of GAMs trained on One Correct for estimates for reference 2.0.
Figure 32: One Correct predictions of a generalized additive model (GAM) fit to One Correct results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Both Correct for Tempo
When fitting a generalized additive model (GAM) to Both Correct values as a function of the ground truth tempo, what Both Correct can we expect with confidence?
Estimated Both Correct for Tempo for 1.0
Predictions of GAMs trained on Both Correct for estimates for reference 1.0.
Figure 33: Both Correct predictions of a generalized additive model (GAM) fit to Both Correct results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Both Correct for Tempo for 2.0
Predictions of GAMs trained on Both Correct for estimates for reference 2.0.
Figure 34: Both Correct predictions of a generalized additive model (GAM) fit to Both Correct results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
P-Score for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a particular label? Note that some values may be based on very few estimates.
P-Score for ‘tag_open’ Tags for 1.0
Figure 35: Mean P-Score of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
P-Score for ‘tag_open’ Tags for 2.0
Figure 36: Mean P-Score of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a particular label? Note that some values may be based on very few estimates.
One Correct for ‘tag_open’ Tags for 1.0
Figure 37: Mean One Correct of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct for ‘tag_open’ Tags for 2.0
Figure 38: Mean One Correct of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a particular label? Note that some values may be based on very few estimates.
Both Correct for ‘tag_open’ Tags for 1.0
Figure 39: Mean Both Correct of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct for ‘tag_open’ Tags for 2.0
Figure 40: Mean Both Correct of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
CSV JSON LATEX PICKLE SVG PDF PNG
OE1 and OE2
OE1 is defined as the octave error between an estimate E and a reference value R: OE1(E) = log2(E/R). This means that the most common errors, being off by a factor of 2 or ½, have the same magnitude, namely 1.
OE2 is the signed OE1 corresponding to the minimum absolute OE1, allowing the octave errors 2, 3, ½, and ⅓: OE2(E) = arg min_x(|x|) with x ∈ {OE1(E), OE1(2E), OE1(3E), OE1(½E), OE1(⅓E)}.
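These definitions translate directly into code; the following is a minimal sketch, not tempo_eval's actual implementation:

```python
from math import log2

def oe1(estimate, reference):
    """Signed octave error: log2 of the ratio estimate/reference."""
    return log2(estimate / reference)

def oe2(estimate, reference):
    """Signed error after allowing the octave factors 2, 3, 1/2 and 1/3."""
    candidates = [oe1(f * estimate, reference) for f in (1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)]
    return min(candidates, key=abs)  # the candidate with minimum |OE1|

# A half-tempo estimate: OE1 is -1 octave, OE2 forgives the factor 2.
err1 = oe1(60.0, 120.0)  # -1.0
err2 = oe2(60.0, 120.0)  #  0.0
```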
Mean OE1/OE2 Results for 1.0
Estimator | OE1_MEAN | OE1_STDEV | OE2_MEAN | OE2_STDEV |
---|---|---|---|---|
sun2021/default | 0.0294 | 0.3648 | -0.0089 | 0.0856 |
schreiber2018/fcn | 0.0231 | 0.3658 | -0.0025 | 0.0874 |
schreiber2017/mirex2017 | -0.0423 | 0.3815 | -0.0055 | 0.0939 |
boeck2020/dar | -0.0276 | 0.3905 | -0.0060 | 0.0932 |
schreiber2018/cnn | -0.0118 | 0.3910 | -0.0052 | 0.0981 |
schreiber2018/ismir2018 | -0.0744 | 0.4236 | -0.0045 | 0.0917 |
boeck2019/multi_task_hjdb | -0.1351 | 0.4633 | 0.0028 | 0.0978 |
boeck2019/multi_task | -0.2110 | 0.4819 | 0.0017 | 0.0995 |
schreiber2017/ismir2017 | -0.2106 | 0.4933 | -0.0043 | 0.0971 |
schreiber2014/default | -0.2953 | 0.4982 | -0.0056 | 0.1088 |
percival2014/stem | -0.3716 | 0.5215 | 0.0130 | 0.1003 |
davies2009/mirex_qm_tempotracker | -0.2466 | 0.5271 | 0.1751 | 0.2304 |
boeck2015/tempodetector2016_default | -0.3380 | 0.6087 | -0.0163 | 0.1192 |
Table 10: Mean OE1/OE2 for estimates compared to version 1.0 ordered by standard deviation.
Raw data OE1: CSV JSON LATEX PICKLE
Raw data OE2: CSV JSON LATEX PICKLE
OE1 distribution for 1.0
Figure 41: OE1 for estimates compared to version 1.0. Shown are the mean OE1 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
OE2 distribution for 1.0
Figure 42: OE2 for estimates compared to version 1.0. Shown are the mean OE2 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
Mean OE1/OE2 Results for 2.0
Estimator | OE1_MEAN | OE1_STDEV | OE2_MEAN | OE2_STDEV |
---|---|---|---|---|
schreiber2018/fcn | -0.0020 | 0.2935 | 0.0010 | 0.0465 |
boeck2020/dar | -0.0520 | 0.3224 | -0.0018 | 0.0588 |
sun2021/default | 0.0035 | 0.3269 | -0.0019 | 0.0423 |
schreiber2018/cnn | -0.0388 | 0.3414 | 0.0005 | 0.0568 |
schreiber2017/mirex2017 | -0.0676 | 0.3735 | -0.0041 | 0.0648 |
schreiber2018/ismir2018 | -0.1022 | 0.3833 | -0.0008 | 0.0541 |
boeck2019/multi_task_hjdb | -0.1596 | 0.4231 | 0.0077 | 0.0659 |
boeck2019/multi_task | -0.2358 | 0.4676 | 0.0056 | 0.0682 |
schreiber2017/ismir2017 | -0.2367 | 0.4709 | -0.0037 | 0.0691 |
davies2009/mirex_qm_tempotracker | -0.2755 | 0.4871 | 0.1805 | 0.2200 |
percival2014/stem | -0.3978 | 0.4936 | 0.0121 | 0.0723 |
schreiber2014/default | -0.3239 | 0.4941 | -0.0046 | 0.0874 |
boeck2015/tempodetector2016_default | -0.3658 | 0.5811 | -0.0120 | 0.0920 |
Table 11: Mean OE1/OE2 for estimates compared to version 2.0 ordered by standard deviation.
Raw data OE1: CSV JSON LATEX PICKLE
Raw data OE2: CSV JSON LATEX PICKLE
OE1 distribution for 2.0
Figure 43: OE1 for estimates compared to version 2.0. Shown are the mean OE1 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
OE2 distribution for 2.0
Figure 44: OE2 for estimates compared to version 2.0. Shown are the mean OE2 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
Significance of Differences
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.1542 | 0.0921 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0571 | 0.0000 | 0.0000 | 0.9816 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task_hjdb | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0000 |
boeck2020/dar | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.3335 | 0.2459 | 0.0002 | 0.0017 | 0.0000 |
davies2009/mirex_qm_tempotracker | 0.0001 | 0.0571 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0080 | 0.0533 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.1542 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0921 | 0.0000 | 0.0000 | 0.0000 | 0.0080 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0000 | 0.9816 | 0.0000 | 0.0000 | 0.0533 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0000 | 0.3335 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0438 | 0.0000 | 0.0254 | 0.0000 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.2459 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0438 | 1.0000 | 0.0031 | 0.0000 | 0.0035 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0031 | 1.0000 | 0.0000 | 0.6052 |
schreiber2018/ismir2018 | 0.0000 | 0.0000 | 0.0001 | 0.0017 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0254 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
sun2021/default | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0035 | 0.6052 | 0.0000 | 1.0000 |
Table 12: Paired t-test p-values, using reference annotations 1.0 as ground truth with OE1. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.1738 | 0.0968 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0334 | 0.0000 | 0.0000 | 0.9492 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task_hjdb | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0000 |
boeck2020/dar | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.3023 | 0.3297 | 0.0002 | 0.0007 | 0.0000 |
davies2009/mirex_qm_tempotracker | 0.0002 | 0.0334 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0086 | 0.0371 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.1738 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0968 | 0.0000 | 0.0000 | 0.0000 | 0.0086 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0000 | 0.9492 | 0.0000 | 0.0000 | 0.0371 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0000 | 0.3023 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0563 | 0.0000 | 0.0159 | 0.0000 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.3297 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0563 | 1.0000 | 0.0017 | 0.0000 | 0.0026 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0017 | 1.0000 | 0.0000 | 0.6498 |
schreiber2018/ismir2018 | 0.0000 | 0.0000 | 0.0002 | 0.0007 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0159 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
sun2021/default | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0026 | 0.6498 | 0.0000 | 1.0000 |
Table 13: Paired t-test p-values, using reference annotations 2.0 as ground truth with OE1. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.
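In principle, p-values like those in Tables 12 and 13 can be reproduced with a standard paired t-test on the per-track OE1 (or OE2) values of two estimators. A sketch using SciPy, with hypothetical data:

```python
from scipy.stats import ttest_rel

# Hypothetical per-track OE1 values for two estimators on the same tracks.
oe1_algo_a = [0.02, -1.00, 0.01, -0.03, 0.00, -1.02, 0.04, -0.01]
oe1_algo_b = [0.01, -0.02, 0.03, -1.01, 0.02, -0.98, 0.00, -0.04]

# H0: the true mean of the paired differences is zero.
stat, p_value = ttest_rel(oe1_algo_a, oe1_algo_b)
significant = p_value <= 0.05
```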
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0167 | 0.0000 | 0.0000 | 0.0357 | 0.0096 | 0.0233 | 0.0200 | 0.0018 | 0.0095 | 0.0870 |
boeck2019/multi_task | 0.0000 | 1.0000 | 0.5670 | 0.0002 | 0.0000 | 0.0013 | 0.0794 | 0.0541 | 0.0263 | 0.0390 | 0.1616 | 0.0259 | 0.0002 |
boeck2019/multi_task_hjdb | 0.0000 | 0.5670 | 1.0000 | 0.0004 | 0.0000 | 0.0060 | 0.0331 | 0.0334 | 0.0102 | 0.0100 | 0.0515 | 0.0027 | 0.0000 |
boeck2020/dar | 0.0167 | 0.0002 | 0.0004 | 1.0000 | 0.0000 | 0.0000 | 0.9142 | 0.5475 | 0.8776 | 0.7724 | 0.2084 | 0.5476 | 0.2777 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0000 | 0.0013 | 0.0060 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0357 | 0.0794 | 0.0331 | 0.9142 | 0.0000 | 0.0000 | 1.0000 | 0.7006 | 0.9942 | 0.9175 | 0.3962 | 0.7899 | 0.3801 |
schreiber2017/ismir2017 | 0.0096 | 0.0541 | 0.0334 | 0.5475 | 0.0000 | 0.0000 | 0.7006 | 1.0000 | 0.3385 | 0.7779 | 0.5219 | 0.9417 | 0.1427 |
schreiber2017/mirex2017 | 0.0233 | 0.0263 | 0.0102 | 0.8776 | 0.0000 | 0.0000 | 0.9942 | 0.3385 | 1.0000 | 0.8900 | 0.2295 | 0.7191 | 0.2396 |
schreiber2018/cnn | 0.0200 | 0.0390 | 0.0100 | 0.7724 | 0.0000 | 0.0000 | 0.9175 | 0.7779 | 0.8900 | 1.0000 | 0.2475 | 0.8119 | 0.1152 |
schreiber2018/fcn | 0.0018 | 0.1616 | 0.0515 | 0.2084 | 0.0000 | 0.0000 | 0.3962 | 0.5219 | 0.2295 | 0.2475 | 1.0000 | 0.3874 | 0.0093 |
schreiber2018/ismir2018 | 0.0095 | 0.0259 | 0.0027 | 0.5476 | 0.0000 | 0.0000 | 0.7899 | 0.9417 | 0.7191 | 0.8119 | 0.3874 | 1.0000 | 0.0804 |
sun2021/default | 0.0870 | 0.0002 | 0.0000 | 0.2777 | 0.0000 | 0.0000 | 0.3801 | 0.1427 | 0.2396 | 0.1152 | 0.0093 | 0.0804 | 1.0000 |
Table 14: Paired t-test p-values, using reference annotations 1.0 as ground truth with OE2. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0040 | 0.0000 | 0.0000 | 0.0994 | 0.0302 | 0.0398 | 0.0014 | 0.0006 | 0.0050 | 0.0045 |
boeck2019/multi_task | 0.0000 | 1.0000 | 0.2759 | 0.0002 | 0.0000 | 0.0524 | 0.0061 | 0.0003 | 0.0002 | 0.0537 | 0.0670 | 0.0246 | 0.0005 |
boeck2019/multi_task_hjdb | 0.0000 | 0.2759 | 1.0000 | 0.0000 | 0.0000 | 0.1933 | 0.0008 | 0.0000 | 0.0000 | 0.0048 | 0.0055 | 0.0016 | 0.0000 |
boeck2020/dar | 0.0040 | 0.0002 | 0.0000 | 1.0000 | 0.0000 | 0.0001 | 0.4403 | 0.4434 | 0.3685 | 0.3747 | 0.2543 | 0.7125 | 0.9723 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0000 | 0.0524 | 0.1933 | 0.0001 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0002 | 0.0001 | 0.0000 |
schreiber2014/default | 0.0994 | 0.0061 | 0.0008 | 0.4403 | 0.0000 | 0.0000 | 1.0000 | 0.7546 | 0.8456 | 0.1341 | 0.0800 | 0.2735 | 0.4313 |
schreiber2017/ismir2017 | 0.0302 | 0.0003 | 0.0000 | 0.4434 | 0.0000 | 0.0000 | 0.7546 | 1.0000 | 0.7106 | 0.0891 | 0.0289 | 0.2838 | 0.4613 |
schreiber2017/mirex2017 | 0.0398 | 0.0002 | 0.0000 | 0.3685 | 0.0000 | 0.0000 | 0.8456 | 0.7106 | 1.0000 | 0.0472 | 0.0096 | 0.2009 | 0.3438 |
schreiber2018/cnn | 0.0014 | 0.0537 | 0.0048 | 0.3747 | 0.0000 | 0.0001 | 0.1341 | 0.0891 | 0.0472 | 1.0000 | 0.7883 | 0.5223 | 0.2255 |
schreiber2018/fcn | 0.0006 | 0.0670 | 0.0055 | 0.2543 | 0.0000 | 0.0002 | 0.0800 | 0.0289 | 0.0096 | 0.7883 | 1.0000 | 0.3700 | 0.0789 |
schreiber2018/ismir2018 | 0.0050 | 0.0246 | 0.0016 | 0.7125 | 0.0000 | 0.0001 | 0.2735 | 0.2838 | 0.2009 | 0.5223 | 0.3700 | 1.0000 | 0.6350 |
sun2021/default | 0.0045 | 0.0005 | 0.0000 | 0.9723 | 0.0000 | 0.0000 | 0.4313 | 0.4613 | 0.3438 | 0.2255 | 0.0789 | 0.6350 | 1.0000 |
Table 15: Paired t-test p-values, using reference annotations 2.0 as ground truth with OE2. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.
OE1 on Tempo-Subsets
How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean OE1 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
OE1 on Tempo-Subsets for 1.0
Figure 45: Mean OE1 for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
OE1 on Tempo-Subsets for 2.0
Figure 46: Mean OE1 for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
OE2 on Tempo-Subsets
How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean OE2 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
OE2 on Tempo-Subsets for 1.0
Figure 47: Mean OE2 for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
OE2 on Tempo-Subsets for 2.0
Figure 48: Mean OE2 for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated OE1 for Tempo
When fitting a generalized additive model (GAM) to OE1 values as a function of the ground truth tempo, what OE1 can we expect with confidence?
Estimated OE1 for Tempo for 1.0
Predictions of GAMs trained on OE1 for estimates for reference 1.0.
Figure 49: OE1 predictions of a generalized additive model (GAM) fit to OE1 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated OE1 for Tempo for 2.0
Predictions of GAMs trained on OE1 for estimates for reference 2.0.
Figure 50: OE1 predictions of a generalized additive model (GAM) fit to OE1 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated OE2 for Tempo
When fitting a generalized additive model (GAM) to OE2 values as a function of the ground truth tempo, what OE2 can we expect with confidence?
Estimated OE2 for Tempo for 1.0
Predictions of GAMs trained on OE2 for estimates for reference 1.0.
Figure 51: OE2 predictions of a generalized additive model (GAM) fit to OE2 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated OE2 for Tempo for 2.0
Predictions of GAMs trained on OE2 for estimates for reference 2.0.
Figure 52: OE2 predictions of a generalized additive model (GAM) fit to OE2 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
OE1 for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a particular label? Note that some values may be based on very few estimates.
OE1 for ‘tag_open’ Tags for 1.0
Figure 53: OE1 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
OE1 for ‘tag_open’ Tags for 2.0
Figure 54: OE1 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
OE2 for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a particular label? Note that some values may be based on very few estimates.
OE2 for ‘tag_open’ Tags for 1.0
Figure 55: OE2 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
OE2 for ‘tag_open’ Tags for 2.0
Figure 56: OE2 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
AOE1 and AOE2
AOE1 is defined as the absolute octave error between an estimate E and a reference value R: AOE1(E) = |log2(E/R)|.
AOE2 is the minimum AOE1 allowing the octave errors 2, 3, ½, and ⅓: AOE2(E) = min(AOE1(E), AOE1(2E), AOE1(3E), AOE1(½E), AOE1(⅓E)).
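As with OE1/OE2, these absolute variants are straightforward to compute; a minimal sketch, not tempo_eval's actual implementation:

```python
from math import log2

def aoe1(estimate, reference):
    """Absolute octave error: |log2(estimate/reference)|."""
    return abs(log2(estimate / reference))

def aoe2(estimate, reference):
    """Minimum absolute octave error over the factors 1, 2, 3, 1/2 and 1/3."""
    return min(aoe1(f * estimate, reference) for f in (1.0, 2.0, 3.0, 0.5, 1.0 / 3.0))

# A half-tempo estimate has a full octave of AOE1 but zero AOE2.
err1 = aoe1(60.0, 120.0)  # 1.0
err2 = aoe2(60.0, 120.0)  # 0.0
```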
Mean AOE1/AOE2 Results for 1.0
Estimator | AOE1_MEAN | AOE1_STDEV | AOE2_MEAN | AOE2_STDEV |
---|---|---|---|---|
sun2021/default | 0.1493 | 0.3341 | 0.0286 | 0.0811 |
schreiber2018/fcn | 0.1504 | 0.3342 | 0.0275 | 0.0830 |
schreiber2017/mirex2017 | 0.1658 | 0.3462 | 0.0301 | 0.0891 |
boeck2020/dar | 0.1713 | 0.3521 | 0.0339 | 0.0870 |
schreiber2018/cnn | 0.1731 | 0.3508 | 0.0319 | 0.0930 |
schreiber2018/ismir2018 | 0.2032 | 0.3790 | 0.0295 | 0.0870 |
boeck2019/multi_task_hjdb | 0.2519 | 0.4117 | 0.0340 | 0.0917 |
boeck2019/multi_task | 0.2969 | 0.4342 | 0.0352 | 0.0930 |
schreiber2017/ismir2017 | 0.3078 | 0.4392 | 0.0318 | 0.0919 |
schreiber2014/default | 0.3589 | 0.4546 | 0.0386 | 0.1019 |
boeck2015/tempodetector2016_default | 0.4129 | 0.5606 | 0.0470 | 0.1108 |
percival2014/stem | 0.4278 | 0.4765 | 0.0342 | 0.0952 |
davies2009/mirex_qm_tempotracker | 0.4662 | 0.3484 | 0.2141 | 0.1946 |
Table 16: Mean AOE1/AOE2 for estimates compared to version 1.0, ordered by mean AOE1.
Raw data AOE1: CSV JSON LATEX PICKLE
Raw data AOE2: CSV JSON LATEX PICKLE
AOE1 distribution for 1.0
Figure 57: AOE1 for estimates compared to version 1.0. Shown are the mean AOE1 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
AOE2 distribution for 1.0
Figure 58: AOE2 for estimates compared to version 1.0. Shown are the mean AOE2 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
Mean AOE1/AOE2 Results for 2.0
Estimator | AOE1_MEAN | AOE1_STDEV | AOE2_MEAN | AOE2_STDEV |
---|---|---|---|---|
schreiber2018/fcn | 0.0968 | 0.2771 | 0.0135 | 0.0446 |
sun2021/default | 0.1127 | 0.3069 | 0.0134 | 0.0401 |
boeck2020/dar | 0.1179 | 0.3046 | 0.0189 | 0.0556 |
schreiber2018/cnn | 0.1299 | 0.3182 | 0.0159 | 0.0545 |
schreiber2017/mirex2017 | 0.1574 | 0.3454 | 0.0191 | 0.0621 |
schreiber2018/ismir2018 | 0.1683 | 0.3592 | 0.0154 | 0.0519 |
boeck2019/multi_task_hjdb | 0.2183 | 0.3961 | 0.0205 | 0.0631 |
boeck2019/multi_task | 0.2873 | 0.4378 | 0.0213 | 0.0650 |
schreiber2017/ismir2017 | 0.2911 | 0.4393 | 0.0207 | 0.0661 |
schreiber2014/default | 0.3662 | 0.4636 | 0.0287 | 0.0827 |
boeck2015/tempodetector2016_default | 0.3974 | 0.5599 | 0.0302 | 0.0877 |
percival2014/stem | 0.4156 | 0.4787 | 0.0217 | 0.0701 |
davies2009/mirex_qm_tempotracker | 0.4325 | 0.3552 | 0.2026 | 0.1998 |
Table 17: Mean AOE1/AOE2 for estimates compared to version 2.0, ordered by mean AOE1.
Raw data AOE1: CSV JSON LATEX PICKLE
Raw data AOE2: CSV JSON LATEX PICKLE
AOE1 distribution for 2.0
Figure 59: AOE1 for estimates compared to version 2.0. Shown are the mean AOE1 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
AOE2 distribution for 2.0
Figure 60: AOE2 for estimates compared to version 2.0. Shown are the mean AOE2 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
Significance of Differences
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0174 | 0.5153 | 0.0272 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.4269 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task_hjdb | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0010 | 0.0000 |
boeck2020/dar | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.7152 | 0.8927 | 0.1170 | 0.0303 | 0.0839 |
davies2009/mirex_qm_tempotracker | 0.0174 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0350 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.5153 | 0.0000 | 0.0000 | 0.0000 | 0.0350 | 1.0000 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0272 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 1.0000 | 0.0021 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0000 | 0.4269 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0021 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0000 | 0.7152 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.6264 | 0.2746 | 0.0083 | 0.2850 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.8927 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.6264 | 1.0000 | 0.0509 | 0.0291 | 0.0828 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.1170 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.2746 | 0.0509 | 1.0000 | 0.0001 | 0.9255 |
schreiber2018/ismir2018 | 0.0000 | 0.0000 | 0.0010 | 0.0303 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0083 | 0.0291 | 0.0001 | 1.0000 | 0.0004 |
sun2021/default | 0.0000 | 0.0000 | 0.0000 | 0.0839 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.2850 | 0.0828 | 0.9255 | 0.0004 | 1.0000 |
Table 18: Paired t-test p-values, using reference annotations 1.0 as ground truth, with AOE1. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.
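The test behind these tables is an ordinary paired t-test over per-track error values. A minimal sketch using SciPy's `ttest_rel` (the error values below are made up for illustration):

```python
from scipy.stats import ttest_rel

# Hypothetical per-track AOE1 values for two estimators on the same tracks.
# Pairing matters: both lists must be in the same track order.
errors_a = [0.02, 0.00, 1.00, 0.03, 0.58, 0.01]
errors_b = [0.05, 0.42, 1.00, 0.00, 0.60, 0.33]

# H0: the true mean difference between the paired samples is zero.
t_stat, p_value = ttest_rel(errors_a, errors_b)
print(p_value)  # reject H0 at alpha = 0.05 only if p <= 0.05
```

Note that with many pairwise comparisons, some p-values below 0.05 are expected by chance alone; the tables report raw, uncorrected p-values.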
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1200 | 0.4302 | 0.2061 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.7826 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task_hjdb | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0000 | 0.0000 | 0.0009 | 0.0000 |
boeck2020/dar | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0077 | 0.3601 | 0.1070 | 0.0005 | 0.6831 |
davies2009/mirex_qm_tempotracker | 0.1200 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.3503 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.4302 | 0.0000 | 0.0000 | 0.0000 | 0.3503 | 1.0000 | 0.0036 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.2061 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0036 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0000 | 0.7826 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0001 | 0.0077 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0629 | 0.0000 | 0.4387 | 0.0039 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.3601 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0629 | 1.0000 | 0.0036 | 0.0056 | 0.2156 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.1070 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0036 | 1.0000 | 0.0000 | 0.1835 |
schreiber2018/ismir2018 | 0.0000 | 0.0000 | 0.0009 | 0.0005 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.4387 | 0.0056 | 0.0000 | 1.0000 | 0.0003 |
sun2021/default | 0.0000 | 0.0000 | 0.0000 | 0.6831 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0039 | 0.2156 | 0.1835 | 0.0003 | 1.0000 |
Table 19: Paired t-test p-values, using reference annotations 2.0 as ground truth, with AOE1. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0009 | 0.0002 | 0.0001 | 0.0000 | 0.0020 | 0.0475 | 0.0000 | 0.0000 | 0.0001 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0009 | 1.0000 | 0.3909 | 0.4447 | 0.0000 | 0.7405 | 0.3388 | 0.1667 | 0.0436 | 0.1945 | 0.0020 | 0.0228 | 0.0027 |
boeck2019/multi_task_hjdb | 0.0002 | 0.3909 | 1.0000 | 0.9727 | 0.0000 | 0.9300 | 0.1819 | 0.3765 | 0.1252 | 0.4066 | 0.0078 | 0.0610 | 0.0118 |
boeck2020/dar | 0.0001 | 0.4447 | 0.9727 | 1.0000 | 0.0000 | 0.9172 | 0.1555 | 0.3075 | 0.0733 | 0.4135 | 0.0047 | 0.0366 | 0.0127 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0020 | 0.7405 | 0.9300 | 0.9172 | 0.0000 | 1.0000 | 0.2598 | 0.4521 | 0.2173 | 0.4194 | 0.0184 | 0.1419 | 0.0561 |
schreiber2014/default | 0.0475 | 0.3388 | 0.1819 | 0.1555 | 0.0000 | 0.2598 | 1.0000 | 0.0154 | 0.0028 | 0.0411 | 0.0004 | 0.0058 | 0.0020 |
schreiber2017/ismir2017 | 0.0000 | 0.1667 | 0.3765 | 0.3075 | 0.0000 | 0.4521 | 0.0154 | 1.0000 | 0.0698 | 0.9566 | 0.0428 | 0.2946 | 0.1710 |
schreiber2017/mirex2017 | 0.0000 | 0.0436 | 0.1252 | 0.0733 | 0.0000 | 0.2173 | 0.0028 | 0.0698 | 1.0000 | 0.4230 | 0.1700 | 0.7367 | 0.4765 |
schreiber2018/cnn | 0.0001 | 0.1945 | 0.4066 | 0.4135 | 0.0000 | 0.4194 | 0.0411 | 0.9566 | 0.4230 | 1.0000 | 0.0287 | 0.2110 | 0.1018 |
schreiber2018/fcn | 0.0000 | 0.0020 | 0.0078 | 0.0047 | 0.0000 | 0.0184 | 0.0004 | 0.0428 | 0.1700 | 0.0287 | 1.0000 | 0.3123 | 0.5603 |
schreiber2018/ismir2018 | 0.0000 | 0.0228 | 0.0610 | 0.0366 | 0.0000 | 0.1419 | 0.0058 | 0.2946 | 0.7367 | 0.2110 | 0.3123 | 1.0000 | 0.6381 |
sun2021/default | 0.0000 | 0.0027 | 0.0118 | 0.0127 | 0.0000 | 0.0561 | 0.0020 | 0.1710 | 0.4765 | 0.1018 | 0.5603 | 0.6381 | 1.0000 |
Table 20: Paired t-test p-values, using reference annotations 1.0 as ground truth, with AOE2. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.
Estimator | boeck2015/tempodetector2016_default | boeck2019/multi_task | boeck2019/multi_task_hjdb | boeck2020/dar | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn | schreiber2018/ismir2018 | sun2021/default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0102 | 0.0043 | 0.0005 | 0.0000 | 0.0360 | 0.7097 | 0.0076 | 0.0020 | 0.0001 | 0.0000 | 0.0000 | 0.0000 |
boeck2019/multi_task | 0.0102 | 1.0000 | 0.5613 | 0.1758 | 0.0000 | 0.8910 | 0.0400 | 0.7932 | 0.3719 | 0.0324 | 0.0013 | 0.0151 | 0.0001 |
boeck2019/multi_task_hjdb | 0.0043 | 0.5613 | 1.0000 | 0.3719 | 0.0000 | 0.6753 | 0.0197 | 0.9376 | 0.5745 | 0.0617 | 0.0028 | 0.0254 | 0.0003 |
boeck2020/dar | 0.0005 | 0.1758 | 0.3719 | 1.0000 | 0.0000 | 0.3648 | 0.0041 | 0.4428 | 0.9518 | 0.2092 | 0.0177 | 0.1006 | 0.0074 |
davies2009/mirex_qm_tempotracker | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0360 | 0.8910 | 0.6753 | 0.3648 | 0.0000 | 1.0000 | 0.0771 | 0.7435 | 0.4184 | 0.0416 | 0.0028 | 0.0431 | 0.0013 |
schreiber2014/default | 0.7097 | 0.0400 | 0.0197 | 0.0041 | 0.0000 | 0.0771 | 1.0000 | 0.0048 | 0.0007 | 0.0001 | 0.0000 | 0.0001 | 0.0000 |
schreiber2017/ismir2017 | 0.0076 | 0.7932 | 0.9376 | 0.4428 | 0.0000 | 0.7435 | 0.0048 | 1.0000 | 0.0857 | 0.0478 | 0.0008 | 0.0245 | 0.0026 |
schreiber2017/mirex2017 | 0.0020 | 0.3719 | 0.5745 | 0.9518 | 0.0000 | 0.4184 | 0.0007 | 0.0857 | 1.0000 | 0.1542 | 0.0041 | 0.0881 | 0.0112 |
schreiber2018/cnn | 0.0001 | 0.0324 | 0.0617 | 0.2092 | 0.0000 | 0.0416 | 0.0001 | 0.0478 | 0.1542 | 1.0000 | 0.1911 | 0.7931 | 0.1992 |
schreiber2018/fcn | 0.0000 | 0.0013 | 0.0028 | 0.0177 | 0.0000 | 0.0028 | 0.0000 | 0.0008 | 0.0041 | 0.1911 | 1.0000 | 0.3085 | 0.9701 |
schreiber2018/ismir2018 | 0.0000 | 0.0151 | 0.0254 | 0.1006 | 0.0000 | 0.0431 | 0.0001 | 0.0245 | 0.0881 | 0.7931 | 0.3085 | 1.0000 | 0.2806 |
sun2021/default | 0.0000 | 0.0001 | 0.0003 | 0.0074 | 0.0000 | 0.0013 | 0.0000 | 0.0026 | 0.0112 | 0.1992 | 0.9701 | 0.2806 | 1.0000 |
Table 21: Paired t-test p-values, using reference annotations 2.0 as ground truth, with AOE2. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.
AOE1 on Tempo-Subsets
How well does an estimator perform when taking into account only a subset of the reference annotations? The graphs show mean AOE1 for reference subsets with tempi in [T−10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
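A sketch of how such a tempo-subset curve could be computed, assuming per-track reference tempi and error values (plain Python, names illustrative):

```python
def subset_mean_error(references, errors, center, half_width=10.0):
    """Mean error over tracks whose reference tempo lies within
    [center - half_width, center + half_width] BPM; None if the subset is empty."""
    selected = [e for r, e in zip(references, errors)
                if center - half_width <= r <= center + half_width]
    return sum(selected) / len(selected) if selected else None

# Hypothetical reference tempi (BPM) and AOE1 values for five tracks.
refs = [88.0, 120.0, 126.0, 128.0, 174.0]
errs = [0.40, 0.02, 0.01, 0.03, 0.95]
print(subset_mean_error(refs, errs, 125.0))  # mean over the three ~125 BPM tracks
```

Sliding the window center T across the tempo axis yields the curves shown below; sparsely populated windows explain why the curves can be noisy at extreme tempi.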
AOE1 on Tempo-Subsets for 1.0
Figure 61: Mean AOE1 for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
AOE1 on Tempo-Subsets for 2.0
Figure 62: Mean AOE1 for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
AOE2 on Tempo-Subsets
How well does an estimator perform when taking into account only a subset of the reference annotations? The graphs show mean AOE2 for reference subsets with tempi in [T−10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
AOE2 on Tempo-Subsets for 1.0
Figure 63: Mean AOE2 for estimates compared to version 1.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
AOE2 on Tempo-Subsets for 2.0
Figure 64: Mean AOE2 for estimates compared to version 2.0 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated AOE1 for Tempo
When fitting a generalized additive model (GAM) to AOE1 values and a ground truth, what AOE1 can we expect, and with what confidence?
Estimated AOE1 for Tempo for 1.0
Predictions of GAMs trained on AOE1 for estimates for reference 1.0.
Figure 65: AOE1 predictions of a generalized additive model (GAM) fit to AOE1 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated AOE1 for Tempo for 2.0
Predictions of GAMs trained on AOE1 for estimates for reference 2.0.
Figure 66: AOE1 predictions of a generalized additive model (GAM) fit to AOE1 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated AOE2 for Tempo
When fitting a generalized additive model (GAM) to AOE2 values and a ground truth, what AOE2 can we expect, and with what confidence?
Estimated AOE2 for Tempo for 1.0
Predictions of GAMs trained on AOE2 for estimates for reference 1.0.
Figure 67: AOE2 predictions of a generalized additive model (GAM) fit to AOE2 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated AOE2 for Tempo for 2.0
Predictions of GAMs trained on AOE2 for estimates for reference 2.0.
Figure 68: AOE2 predictions of a generalized additive model (GAM) fit to AOE2 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
AOE1 for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a given label? Note that some values may be based on very few estimates.
AOE1 for ‘tag_open’ Tags for 1.0
Figure 69: AOE1 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
AOE1 for ‘tag_open’ Tags for 2.0
Figure 70: AOE1 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
AOE2 for ‘tag_open’ Tags
How well does an estimator perform when taking into account only tracks tagged with a given label? Note that some values may be based on very few estimates.
AOE2 for ‘tag_open’ Tags for 1.0
Figure 71: AOE2 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.
AOE2 for ‘tag_open’ Tags for 2.0
Figure 72: AOE2 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.
Generated by tempo_eval 0.1.1 on 2022-06-29 18:32.