
smc_mirex

This is the tempo_eval report for the ‘smc_mirex’ corpus.

Reports for other corpora may be found here.

Table of Contents

References for ‘smc_mirex’

References

1.0

Attribute Value
Corpus SMC_MIREX
Version 1.0
Curator Matthew Davies
Data Source manual annotation
Annotation Tools derived from beat annotations
Annotation Rules median of inter beat intervals
Annotator, bibtex Holzapfel2012
Annotator, ref_url https://repositorio.inesctec.pt/bitstream/123456789/2539/1/PS-07771.pdf
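
According to the annotation rules above, the reference tempo of version 1.0 is derived from the beat annotations as the median of the inter-beat intervals. A minimal sketch of that derivation (the function name is illustrative, not part of tempo_eval):

```python
import numpy as np

def tempo_from_beats(beat_times):
    """Derive a global tempo (BPM) from beat annotations, following the
    stated rule: 60 divided by the median inter-beat interval (seconds)."""
    beat_times = np.asarray(beat_times, dtype=float)
    ibis = np.diff(beat_times)      # inter-beat intervals in seconds
    return 60.0 / np.median(ibis)   # beats per minute

# Example: beats roughly every 0.75 s correspond to ~80 BPM
print(tempo_from_beats([0.0, 0.74, 1.50, 2.24, 3.00]))  # 80.0
```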

2.0

Attribute Value
Corpus SMC_MIREX
Version 2.0
Curator Graham Percival
Data Source manual annotation
Annotation Tools derived from beat annotations
Annotation Rules unknown
Annotator, bibtex Percival2014
Annotator, ref_url http://www.marsyas.info/tempo/

Basic Statistics

| Reference | Size | Min (BPM) | Max (BPM) | Avg (BPM) | Stdev (BPM) | Sweet Oct. Start (BPM) | Sweet Oct. Coverage |
|---|---|---|---|---|---|---|---|
| 1.0 | 217 | 32.71 | 206.90 | 78.02 | 31.89 | 51.00 | 0.69 |
| 2.0 | 217 | 32.71 | 206.90 | 78.01 | 31.89 | 51.00 | 0.69 |

Table 1: Basic statistics.

CSV JSON LATEX PICKLE
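
The ‘Sweet Oct.’ columns presumably describe the tempo octave [start, 2·start) BPM that contains the largest share of reference values, with the coverage column giving that share. A hedged sketch of how such values could be computed (illustrative only, not tempo_eval’s implementation):

```python
import numpy as np

def sweet_octave(tempi, start_candidates=range(30, 121)):
    """Return (start_bpm, coverage) of the octave [start, 2*start) BPM
    that covers the largest fraction of the given tempo values.
    Integer BPM start candidates are an illustrative assumption."""
    tempi = np.asarray(tempi, dtype=float)
    best_start, best_cov = None, -1.0
    for start in start_candidates:
        cov = np.mean((tempi >= start) & (tempi < 2 * start))
        if cov > best_cov:
            best_start, best_cov = start, cov
    return best_start, best_cov

# For version 1.0 the report lists start = 51 BPM with coverage 0.69.
```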

Smoothed Tempo Distribution

Figure 1: Percentage of values in tempo interval.

CSV JSON LATEX PICKLE SVG PDF PNG

Tag Distribution for ‘tag_open’

Figure 2: Percentage of tracks tagged with tags from namespace ‘tag_open’. Annotations are from reference 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Beat-Based Tempo Variation

Figure 3: Fraction of the dataset with beat-annotated tracks with cvar < τ.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimates for ‘smc_mirex’

Estimators

boeck2015/tempodetector2016_default

Attribute Value
Corpus smc_mirex
Version 0.17.dev0
Annotation Tools TempoDetector.2016, madmom, https://github.com/CPJKU/madmom
Annotator, bibtex Boeck2015

boeck2019/multi_task

Attribute Value
Corpus smc_mirex
Version 0.0.1
Annotation Tools model=multi_task, https://github.com/superbock/ISMIR2019
Annotator, bibtex Boeck2019

boeck2019/multi_task_hjdb

Attribute Value
Corpus smc_mirex
Version 0.0.1
Annotation Tools model=multi_task_hjdb, https://github.com/superbock/ISMIR2019
Annotator, bibtex Boeck2019

boeck2020/dar

Attribute Value
Corpus smc_mirex
Version 0.0.1
Annotation Tools https://github.com/superbock/ISMIR2020
Annotator, bibtex Boeck2020

davies2009/mirex_qm_tempotracker

Attribute Value  
Corpus smc_mirex  
Version 1.0  
Annotation Tools QM Tempotracker, Sonic Annotator plugin. https://code.soundsoftware.ac.uk/projects/mirex2013/repository/show/audio_tempo_estimation/qm-tempotracker Note that the current macOS build of ‘qm-vamp-plugins’ was used.  
Annotator, bibtex Davies2009 Davies2007

percival2014/stem

Attribute Value
Corpus smc_mirex
Version 1.0
Annotation Tools percival 2014, ‘tempo’ implementation from Marsyas, http://marsyas.info, git checkout tempo-stem
Annotator, bibtex Percival2014

schreiber2014/default

Attribute Value
Corpus smc_mirex
Version 0.0.1
Annotation Tools schreiber 2014, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex Schreiber2014

schreiber2017/ismir2017

Attribute Value
Corpus smc_mirex
Version 0.0.4
Annotation Tools schreiber 2017, model=ismir2017, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex Schreiber2017

schreiber2017/mirex2017

Attribute Value
Corpus smc_mirex
Version 0.0.4
Annotation Tools schreiber 2017, model=mirex2017, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex Schreiber2017

schreiber2018/cnn

Attribute Value
Corpus smc_mirex
Version 0.0.2
Data Source Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools schreiber tempo-cnn (model=cnn), https://github.com/hendriks73/tempo-cnn

schreiber2018/fcn

Attribute Value
Corpus smc_mirex
Version 0.0.2
Data Source Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools schreiber tempo-cnn (model=fcn), https://github.com/hendriks73/tempo-cnn

schreiber2018/ismir2018

Attribute Value
Corpus smc_mirex
Version 0.0.2
Data Source Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools schreiber tempo-cnn (model=ismir2018), https://github.com/hendriks73/tempo-cnn

sun2021/default

Attribute Value
Corpus smc_mirex
Version 0.0.2
Data Source Xiaoheng Sun, Qiqi He, Yongwei Gao, Wei Li. Musical Tempo Estimation Using a Multi-scale Network. in Proc. of the 22nd Int. Society for Music Information Retrieval Conf., Online, 2021
Annotation Tools https://github.com/Qqi-HE/TempoEstimation_MGANet
Annotator, bibtex Sun2021

Basic Statistics

| Estimator | Size | Min (BPM) | Max (BPM) | Avg (BPM) | Stdev (BPM) | Sweet Oct. Start (BPM) | Sweet Oct. Coverage |
|---|---|---|---|---|---|---|---|
| boeck2015/tempodetector2016_default | 217 | 40.82 | 240.00 | 102.30 | 44.73 | 58.00 | 0.58 |
| boeck2019/multi_task | 217 | 33.74 | 188.73 | 74.52 | 22.09 | 53.00 | 0.85 |
| boeck2019/multi_task_hjdb | 217 | 34.66 | 179.94 | 76.43 | 21.45 | 53.00 | 0.85 |
| boeck2020/dar | 217 | 33.62 | 199.66 | 74.91 | 25.54 | 53.00 | 0.75 |
| davies2009/mirex_qm_tempotracker | 217 | 71.78 | 215.33 | 136.10 | 31.83 | 96.00 | 0.86 |
| percival2014/stem | 217 | 51.94 | 150.89 | 92.58 | 20.31 | 69.00 | 0.88 |
| schreiber2014/default | 217 | 47.79 | 154.05 | 88.34 | 20.46 | 61.00 | 0.88 |
| schreiber2017/ismir2017 | 217 | 22.40 | 149.83 | 90.05 | 23.38 | 58.00 | 0.81 |
| schreiber2017/mirex2017 | 217 | 11.20 | 176.64 | 82.92 | 28.18 | 56.00 | 0.67 |
| schreiber2018/cnn | 217 | 49.00 | 224.00 | 96.84 | 27.94 | 61.00 | 0.85 |
| schreiber2018/fcn | 217 | 38.00 | 198.00 | 94.95 | 31.03 | 63.00 | 0.75 |
| schreiber2018/ismir2018 | 217 | 53.00 | 205.00 | 96.91 | 26.05 | 67.00 | 0.89 |
| sun2021/default | 217 | 40.00 | 232.00 | 91.19 | 34.12 | 59.00 | 0.82 |

Table 2: Basic statistics.

CSV JSON LATEX PICKLE

Smoothed Tempo Distribution

Figure 4: Percentage of values in tempo interval.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy

Accuracy1 is defined as the percentage of correct estimates, allowing a 4% tolerance for individual BPM values.

Accuracy2 additionally permits estimates to be wrong by a factor of 2, 3, 1/2 or 1/3 (so-called octave errors).

See [Gouyon2006].
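
A minimal sketch of both measures with the 4% tolerance used here (illustrative, not the exact evaluation code behind this report):

```python
import numpy as np

def accuracy1(estimates, references, tolerance=0.04):
    """Fraction of estimates within the tolerance of the reference tempo."""
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(references, dtype=float)
    return np.mean(np.abs(est - ref) <= tolerance * ref)

def accuracy2(estimates, references, tolerance=0.04, factors=(1, 2, 3, 1/2, 1/3)):
    """Like Accuracy1, but an estimate also counts as correct if it matches
    2, 3, 1/2 or 1/3 times the reference within the tolerance."""
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(references, dtype=float)
    correct = np.zeros(len(est), dtype=bool)
    for f in factors:
        correct |= np.abs(est - f * ref) <= tolerance * f * ref
    return np.mean(correct)
```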

Note: When comparing accuracy values for different algorithms, keep in mind that an algorithm may have been trained on the test set or that the test set may have even been created using one of the tested algorithms.

Accuracy Results for 1.0

Estimator Accuracy1 Accuracy2
boeck2020/dar 0.5668 0.6959
boeck2015/tempodetector2016_default 0.4654 0.6728
boeck2019/multi_task 0.4470 0.6452
boeck2019/multi_task_hjdb 0.4424 0.6221
schreiber2017/mirex2017 0.4424 0.5622
sun2021/default 0.3779 0.4793
schreiber2018/fcn 0.3594 0.4793
schreiber2017/ismir2017 0.3548 0.5438
schreiber2014/default 0.3502 0.5484
schreiber2018/cnn 0.3410 0.5115
schreiber2018/ismir2018 0.3041 0.4793
percival2014/stem 0.2765 0.4562
davies2009/mirex_qm_tempotracker 0.1336 0.3180

Table 3: Mean accuracy of estimates compared to version 1.0 with 4% tolerance ordered by Accuracy1.

CSV JSON LATEX PICKLE

Raw data Accuracy1: CSV JSON LATEX PICKLE

Raw data Accuracy2: CSV JSON LATEX PICKLE

Accuracy1 for 1.0

Figure 5: Mean Accuracy1 for estimates compared to version 1.0 depending on tolerance.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 for 1.0

Figure 6: Mean Accuracy2 for estimates compared to version 1.0 depending on tolerance.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy Results for 2.0

Estimator Accuracy1 Accuracy2
boeck2020/dar 0.5668 0.6959
boeck2015/tempodetector2016_default 0.4654 0.6728
boeck2019/multi_task 0.4470 0.6452
boeck2019/multi_task_hjdb 0.4424 0.6221
schreiber2017/mirex2017 0.4424 0.5622
sun2021/default 0.3779 0.4793
schreiber2018/fcn 0.3594 0.4793
schreiber2017/ismir2017 0.3548 0.5438
schreiber2014/default 0.3502 0.5484
schreiber2018/cnn 0.3410 0.5115
schreiber2018/ismir2018 0.3041 0.4793
percival2014/stem 0.2765 0.4562
davies2009/mirex_qm_tempotracker 0.1336 0.3134

Table 4: Mean accuracy of estimates compared to version 2.0 with 4% tolerance ordered by Accuracy1.

CSV JSON LATEX PICKLE

Raw data Accuracy1: CSV JSON LATEX PICKLE

Raw data Accuracy2: CSV JSON LATEX PICKLE

Accuracy1 for 2.0

Figure 7: Mean Accuracy1 for estimates compared to version 2.0 depending on tolerance.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 for 2.0

Figure 8: Mean Accuracy2 for estimates compared to version 2.0 depending on tolerance.

CSV JSON LATEX PICKLE SVG PDF PNG

Differing Items

For which items did a given estimator not estimate a correct value with respect to a given ground truth? Are there items that are either very difficult, not suitable for the task, or incorrectly annotated, and that are therefore never estimated correctly, regardless of which estimator is used?

Differing Items Accuracy1

Items with different tempo annotations (Accuracy1, 4% tolerance) in different versions:

1.0 compared with boeck2015/tempodetector2016_default (116 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_019’ … CSV

1.0 compared with boeck2019/multi_task (120 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ … CSV

1.0 compared with boeck2019/multi_task_hjdb (121 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ … CSV

1.0 compared with boeck2020/dar (94 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_015’ ‘SMC_018’ ‘SMC_019’ ‘SMC_021’ ‘SMC_022’ … CSV

1.0 compared with davies2009/mirex_qm_tempotracker (188 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_011’ ‘SMC_012’ ‘SMC_013’ … CSV

1.0 compared with percival2014/stem (157 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_007’ ‘SMC_008’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ … CSV

1.0 compared with schreiber2014/default (141 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ … CSV

1.0 compared with schreiber2017/ismir2017 (140 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ … CSV

1.0 compared with schreiber2017/mirex2017 (121 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_019’ … CSV

1.0 compared with schreiber2018/cnn (143 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ … CSV

1.0 compared with schreiber2018/fcn (139 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ … CSV

1.0 compared with schreiber2018/ismir2018 (151 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ … CSV

1.0 compared with sun2021/default (135 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ … CSV

2.0 compared with boeck2015/tempodetector2016_default (116 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_019’ … CSV

2.0 compared with boeck2019/multi_task (120 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ … CSV

2.0 compared with boeck2019/multi_task_hjdb (121 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ … CSV

2.0 compared with boeck2020/dar (94 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_015’ ‘SMC_018’ ‘SMC_019’ ‘SMC_021’ ‘SMC_022’ … CSV

2.0 compared with davies2009/mirex_qm_tempotracker (188 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_011’ ‘SMC_012’ ‘SMC_013’ … CSV

2.0 compared with percival2014/stem (157 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_007’ ‘SMC_008’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ … CSV

2.0 compared with schreiber2014/default (141 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ … CSV

2.0 compared with schreiber2017/ismir2017 (140 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ … CSV

2.0 compared with schreiber2017/mirex2017 (121 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_019’ … CSV

2.0 compared with schreiber2018/cnn (143 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ … CSV

2.0 compared with schreiber2018/fcn (139 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ … CSV

2.0 compared with schreiber2018/ismir2018 (151 differences): ‘SMC_001’ ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ … CSV

2.0 compared with sun2021/default (135 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ … CSV

None of the estimators estimated the following 37 items ‘correctly’ using Accuracy1: ‘SMC_002’ ‘SMC_015’ ‘SMC_018’ ‘SMC_019’ ‘SMC_023’ ‘SMC_024’ ‘SMC_032’ ‘SMC_084’ ‘SMC_105’ ‘SMC_111’ ‘SMC_116’ … CSV

Differing Items Accuracy2

Items with different tempo annotations (Accuracy2, 4% tolerance) in different versions:

1.0 compared with boeck2015/tempodetector2016_default (71 differences): ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_008’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ ‘SMC_024’ ‘SMC_032’ … CSV

1.0 compared with boeck2019/multi_task (77 differences): ‘SMC_002’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_023’ ‘SMC_024’ … CSV

1.0 compared with boeck2019/multi_task_hjdb (82 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ … CSV

1.0 compared with boeck2020/dar (66 differences): ‘SMC_002’ ‘SMC_006’ ‘SMC_008’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ ‘SMC_024’ ‘SMC_028’ ‘SMC_032’ ‘SMC_033’ … CSV

1.0 compared with davies2009/mirex_qm_tempotracker (148 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_011’ ‘SMC_012’ ‘SMC_013’ ‘SMC_014’ … CSV

1.0 compared with percival2014/stem (118 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_007’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ ‘SMC_018’ ‘SMC_022’ ‘SMC_023’ … CSV

1.0 compared with schreiber2014/default (98 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ … CSV

1.0 compared with schreiber2017/ismir2017 (99 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_021’ ‘SMC_023’ ‘SMC_024’ … CSV

1.0 compared with schreiber2017/mirex2017 (95 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_021’ ‘SMC_023’ ‘SMC_024’ ‘SMC_028’ … CSV

1.0 compared with schreiber2018/cnn (106 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_007’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_021’ ‘SMC_023’ … CSV

1.0 compared with schreiber2018/fcn (113 differences): ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_021’ … CSV

1.0 compared with schreiber2018/ismir2018 (113 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ ‘SMC_017’ ‘SMC_018’ … CSV

1.0 compared with sun2021/default (113 differences): ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ ‘SMC_018’ … CSV

2.0 compared with boeck2015/tempodetector2016_default (71 differences): ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_008’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ ‘SMC_024’ ‘SMC_032’ … CSV

2.0 compared with boeck2019/multi_task (77 differences): ‘SMC_002’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_023’ ‘SMC_024’ … CSV

2.0 compared with boeck2019/multi_task_hjdb (82 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ … CSV

2.0 compared with boeck2020/dar (66 differences): ‘SMC_002’ ‘SMC_006’ ‘SMC_008’ ‘SMC_015’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ ‘SMC_024’ ‘SMC_028’ ‘SMC_032’ ‘SMC_033’ … CSV

2.0 compared with davies2009/mirex_qm_tempotracker (149 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_011’ ‘SMC_012’ ‘SMC_013’ ‘SMC_014’ … CSV

2.0 compared with percival2014/stem (118 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_007’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ ‘SMC_018’ ‘SMC_022’ ‘SMC_023’ … CSV

2.0 compared with schreiber2014/default (98 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_021’ ‘SMC_022’ … CSV

2.0 compared with schreiber2017/ismir2017 (99 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_006’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_021’ ‘SMC_023’ ‘SMC_024’ … CSV

2.0 compared with schreiber2017/mirex2017 (95 differences): ‘SMC_002’ ‘SMC_004’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_021’ ‘SMC_023’ ‘SMC_024’ ‘SMC_028’ … CSV

2.0 compared with schreiber2018/cnn (106 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_007’ ‘SMC_011’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_021’ ‘SMC_023’ … CSV

2.0 compared with schreiber2018/fcn (113 differences): ‘SMC_003’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_011’ ‘SMC_014’ ‘SMC_015’ ‘SMC_017’ ‘SMC_018’ ‘SMC_021’ … CSV

2.0 compared with schreiber2018/ismir2018 (113 differences): ‘SMC_002’ ‘SMC_003’ ‘SMC_004’ ‘SMC_006’ ‘SMC_007’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ ‘SMC_017’ ‘SMC_018’ … CSV

2.0 compared with sun2021/default (113 differences): ‘SMC_003’ ‘SMC_004’ ‘SMC_005’ ‘SMC_006’ ‘SMC_007’ ‘SMC_008’ ‘SMC_009’ ‘SMC_014’ ‘SMC_015’ ‘SMC_016’ ‘SMC_018’ … CSV

None of the estimators estimated the following 14 items ‘correctly’ using Accuracy2: ‘SMC_015’ ‘SMC_032’ ‘SMC_111’ ‘SMC_137’ ‘SMC_158’ ‘SMC_174’ ‘SMC_209’ ‘SMC_215’ ‘SMC_223’ ‘SMC_226’ ‘SMC_235’ … CSV

Significance of Differences

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.6985 0.5758 0.0026 0.0000 0.0000 0.0022 0.0037 0.5682 0.0011 0.0052 0.0001 0.0248
boeck2019/multi_task 0.6985 1.0000 1.0000 0.0003 0.0000 0.0000 0.0125 0.0151 1.0000 0.0095 0.0327 0.0005 0.1060
boeck2019/multi_task_hjdb 0.5758 1.0000 1.0000 0.0002 0.0000 0.0000 0.0091 0.0183 0.8897 0.0092 0.0481 0.0004 0.0925
boeck2020/dar 0.0026 0.0003 0.0002 1.0000 0.0000 0.0000 0.0000 0.0000 0.0007 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
percival2014/stem 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0402 0.0241 0.0000 0.0649 0.0247 0.4514 0.0062
schreiber2014/default 0.0022 0.0125 0.0091 0.0000 0.0000 0.0402 1.0000 1.0000 0.0105 0.8679 0.8804 0.2026 0.4966
schreiber2017/ismir2017 0.0037 0.0151 0.0183 0.0000 0.0000 0.0241 1.0000 1.0000 0.0026 0.7660 1.0000 0.1524 0.6198
schreiber2017/mirex2017 0.5682 1.0000 0.8897 0.0007 0.0000 0.0000 0.0105 0.0026 1.0000 0.0054 0.0356 0.0003 0.1307
schreiber2018/cnn 0.0011 0.0095 0.0092 0.0000 0.0000 0.0649 0.8679 0.7660 0.0054 1.0000 0.6655 0.2005 0.3317
schreiber2018/fcn 0.0052 0.0327 0.0481 0.0000 0.0000 0.0247 0.8804 1.0000 0.0356 0.6655 1.0000 0.1114 0.6936
schreiber2018/ismir2018 0.0001 0.0005 0.0004 0.0000 0.0000 0.4514 0.2026 0.1524 0.0003 0.2005 0.1114 1.0000 0.0293
sun2021/default 0.0248 0.1060 0.0925 0.0000 0.0000 0.0062 0.4966 0.6198 0.1307 0.3317 0.6936 0.0293 1.0000

Table 5: McNemar p-values, using reference annotations 1.0 as ground truth with Accuracy1 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H0, i.e., there is a significant difference in the disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE
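
A hedged sketch of how such a pairwise McNemar test can be computed from per-track correctness vectors using statsmodels (an illustration; the exact variant used to produce these tables may differ):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_a, correct_b):
    """p-value for H0: estimators A and B disagree with the ground truth
    equally often, given per-track correctness booleans."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    # 2x2 table: rows = A correct/incorrect, columns = B correct/incorrect
    table = [[np.sum(a & b),  np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=True).pvalue
```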

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.6985 0.5758 0.0026 0.0000 0.0000 0.0022 0.0037 0.5682 0.0011 0.0052 0.0001 0.0248
boeck2019/multi_task 0.6985 1.0000 1.0000 0.0003 0.0000 0.0000 0.0125 0.0151 1.0000 0.0095 0.0327 0.0005 0.1060
boeck2019/multi_task_hjdb 0.5758 1.0000 1.0000 0.0002 0.0000 0.0000 0.0091 0.0183 0.8897 0.0092 0.0481 0.0004 0.0925
boeck2020/dar 0.0026 0.0003 0.0002 1.0000 0.0000 0.0000 0.0000 0.0000 0.0007 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
percival2014/stem 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000 0.0402 0.0241 0.0000 0.0649 0.0247 0.4514 0.0062
schreiber2014/default 0.0022 0.0125 0.0091 0.0000 0.0000 0.0402 1.0000 1.0000 0.0105 0.8679 0.8804 0.2026 0.4966
schreiber2017/ismir2017 0.0037 0.0151 0.0183 0.0000 0.0000 0.0241 1.0000 1.0000 0.0026 0.7660 1.0000 0.1524 0.6198
schreiber2017/mirex2017 0.5682 1.0000 0.8897 0.0007 0.0000 0.0000 0.0105 0.0026 1.0000 0.0054 0.0356 0.0003 0.1307
schreiber2018/cnn 0.0011 0.0095 0.0092 0.0000 0.0000 0.0649 0.8679 0.7660 0.0054 1.0000 0.6655 0.2005 0.3317
schreiber2018/fcn 0.0052 0.0327 0.0481 0.0000 0.0000 0.0247 0.8804 1.0000 0.0356 0.6655 1.0000 0.1114 0.6936
schreiber2018/ismir2018 0.0001 0.0005 0.0004 0.0000 0.0000 0.4514 0.2026 0.1524 0.0003 0.2005 0.1114 1.0000 0.0293
sun2021/default 0.0248 0.1060 0.0925 0.0000 0.0000 0.0062 0.4966 0.6198 0.1307 0.3317 0.6936 0.0293 1.0000

Table 6: McNemar p-values, using reference annotations 2.0 as ground truth with Accuracy1 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H0, i.e., there is a significant difference in the disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.4966 0.0989 0.5327 0.0000 0.0000 0.0004 0.0003 0.0015 0.0000 0.0000 0.0000 0.0000
boeck2019/multi_task 0.4966 1.0000 0.4869 0.1081 0.0000 0.0000 0.0086 0.0071 0.0300 0.0009 0.0000 0.0000 0.0000
boeck2019/multi_task_hjdb 0.0989 0.4869 1.0000 0.0113 0.0000 0.0001 0.0402 0.0331 0.0984 0.0071 0.0005 0.0004 0.0002
boeck2020/dar 0.5327 0.1081 0.0113 1.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0002 0.0000 0.0000 0.0000 0.0000 0.0001 0.0000 0.0001
percival2014/stem 0.0000 0.0000 0.0001 0.0000 0.0002 1.0000 0.0315 0.0377 0.0145 0.1550 0.6350 0.6025 0.6198
schreiber2014/default 0.0004 0.0086 0.0402 0.0000 0.0000 0.0315 1.0000 1.0000 0.7709 0.3740 0.0817 0.0872 0.0966
schreiber2017/ismir2017 0.0003 0.0071 0.0331 0.0000 0.0000 0.0377 1.0000 1.0000 0.3437 0.4497 0.1096 0.1096 0.1255
schreiber2017/mirex2017 0.0015 0.0300 0.0984 0.0002 0.0000 0.0145 0.7709 0.3437 1.0000 0.2077 0.0451 0.0385 0.0512
schreiber2018/cnn 0.0000 0.0009 0.0071 0.0000 0.0000 0.1550 0.3740 0.4497 0.2077 1.0000 0.4101 0.3713 0.4568
schreiber2018/fcn 0.0000 0.0000 0.0005 0.0000 0.0001 0.6350 0.0817 0.1096 0.0451 0.4101 1.0000 0.8897 0.8918
schreiber2018/ismir2018 0.0000 0.0000 0.0004 0.0000 0.0000 0.6025 0.0872 0.1096 0.0385 0.3713 0.8897 1.0000 0.8937
sun2021/default 0.0000 0.0000 0.0002 0.0000 0.0001 0.6198 0.0966 0.1255 0.0512 0.4568 0.8918 0.8937 1.0000

Table 7: McNemar p-values, using reference annotations 1.0 as ground truth with Accuracy2 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H0, i.e., there is a significant difference in the disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.4966 0.0989 0.5327 0.0000 0.0000 0.0004 0.0003 0.0015 0.0000 0.0000 0.0000 0.0000
boeck2019/multi_task 0.4966 1.0000 0.4869 0.1081 0.0000 0.0000 0.0086 0.0071 0.0300 0.0009 0.0000 0.0000 0.0000
boeck2019/multi_task_hjdb 0.0989 0.4869 1.0000 0.0113 0.0000 0.0001 0.0402 0.0331 0.0984 0.0071 0.0005 0.0004 0.0002
boeck2020/dar 0.5327 0.1081 0.0113 1.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
percival2014/stem 0.0000 0.0000 0.0001 0.0000 0.0001 1.0000 0.0315 0.0377 0.0145 0.1550 0.6350 0.6025 0.6198
schreiber2014/default 0.0004 0.0086 0.0402 0.0000 0.0000 0.0315 1.0000 1.0000 0.7709 0.3740 0.0817 0.0872 0.0966
schreiber2017/ismir2017 0.0003 0.0071 0.0331 0.0000 0.0000 0.0377 1.0000 1.0000 0.3437 0.4497 0.1096 0.1096 0.1255
schreiber2017/mirex2017 0.0015 0.0300 0.0984 0.0002 0.0000 0.0145 0.7709 0.3437 1.0000 0.2077 0.0451 0.0385 0.0512
schreiber2018/cnn 0.0000 0.0009 0.0071 0.0000 0.0000 0.1550 0.3740 0.4497 0.2077 1.0000 0.4101 0.3713 0.4568
schreiber2018/fcn 0.0000 0.0000 0.0005 0.0000 0.0000 0.6350 0.0817 0.1096 0.0451 0.4101 1.0000 0.8897 0.8918
schreiber2018/ismir2018 0.0000 0.0000 0.0004 0.0000 0.0000 0.6025 0.0872 0.1096 0.0385 0.3713 0.8897 1.0000 0.8937
sun2021/default 0.0000 0.0000 0.0002 0.0000 0.0000 0.6198 0.0966 0.1255 0.0512 0.4568 0.8918 0.8937 1.0000

Table 8: McNemar p-values, using reference annotations 2.0 as ground truth with Accuracy2 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H0, i.e., there is a significant difference in the disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Accuracy1 on cvar-Subsets

How well does an estimator perform when only tracks with a cvar-value of less than τ are taken into account, i.e., tracks with a more or less stable beat? A sketch of this subsetting follows below.
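
Assuming cvar denotes the coefficient of variation of the annotated inter-beat intervals, a minimal sketch of computing it and restricting the evaluation to stable-beat tracks (function names are illustrative):

```python
import numpy as np

def cvar(beat_times):
    """Coefficient of variation (stdev / mean) of the inter-beat intervals;
    small values indicate a stable beat."""
    ibis = np.diff(np.asarray(beat_times, dtype=float))
    return np.std(ibis) / np.mean(ibis)

def mean_accuracy_below(correct, cvars, tau):
    """Mean accuracy over tracks whose cvar is below the threshold tau."""
    correct = np.asarray(correct, dtype=float)
    mask = np.asarray(cvars, dtype=float) < tau
    return correct[mask].mean() if mask.any() else float('nan')
```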

Accuracy1 on cvar-Subsets for 1.0 based on cvar-Values from 1.0

Figure 9: Mean Accuracy1 compared to version 1.0 for tracks with cvar < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy1 on cvar-Subsets for 2.0 based on cvar-Values from 1.0

Figure 10: Mean Accuracy1 compared to version 2.0 for tracks with cvar < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 on cvar-Subsets

How well does an estimator perform when only tracks with a cvar-value of less than τ are taken into account, i.e., tracks with a more or less stable beat?

Accuracy2 on cvar-Subsets for 1.0 based on cvar-Values from 1.0

Figure 11: Mean Accuracy2 compared to version 1.0 for tracks with cvar < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 on cvar-Subsets for 2.0 based on cvar-Values from 1.0

Figure 12: Mean Accuracy2 compared to version 2.0 for tracks with cvar < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy1 on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy1 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates. A sketch of this subsetting follows below.
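
A minimal sketch of such a tempo-subset evaluation (illustrative names; the grid of centers T is an assumption):

```python
import numpy as np

def accuracy_per_tempo_subset(correct, ref_tempi, centers, half_width=10.0):
    """Mean accuracy over reference subsets with tempi in [T - 10, T + 10] BPM
    for each center T; subsets without tracks yield NaN."""
    correct = np.asarray(correct, dtype=float)
    ref_tempi = np.asarray(ref_tempi, dtype=float)
    means = []
    for t in centers:
        mask = np.abs(ref_tempi - t) <= half_width
        means.append(correct[mask].mean() if mask.any() else np.nan)
    return np.array(means)

# e.g. accuracy_per_tempo_subset(correct, ref_tempi, np.arange(40, 201, 5))
```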

Accuracy1 on Tempo-Subsets for 1.0

Figure 13: Mean Accuracy1 for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy1 on Tempo-Subsets for 2.0

Figure 14: Mean Accuracy1 for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy2 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

Accuracy2 on Tempo-Subsets for 1.0

Figure 15: Mean Accuracy2 for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 on Tempo-Subsets for 2.0

Figure 16: Mean Accuracy2 for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy1 for Tempo

When fitting a generalized additive model (GAM) to Accuracy1-values and a ground truth, what Accuracy1 can we expect with confidence?
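
A minimal sketch of such a fit using pygam (an assumed library choice; the report may use a different GAM implementation), with per-track Accuracy1 as a binary response and the reference tempo as the predictor:

```python
import numpy as np
from pygam import LogisticGAM, s  # assumed library choice

def expected_accuracy_vs_tempo(ref_tempi, correct, n_points=200):
    """Fit a logistic GAM of per-track correctness (0/1) on reference tempo
    and return a tempo grid with the predicted expected accuracy."""
    X = np.asarray(ref_tempi, dtype=float).reshape(-1, 1)
    y = np.asarray(correct, dtype=int)
    gam = LogisticGAM(s(0)).fit(X, y)
    grid = np.linspace(X.min(), X.max(), n_points).reshape(-1, 1)
    return grid.ravel(), gam.predict_proba(grid)  # expected Accuracy1 per tempo
```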

Estimated Accuracy1 for Tempo for 1.0

Predictions of GAMs trained on Accuracy1 for estimates for reference 1.0.

Figure 17: Accuracy1 predictions of a generalized additive model (GAM) fit to Accuracy1 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy1 for Tempo for 2.0

Predictions of GAMs trained on Accuracy1 for estimates for reference 2.0.

Figure 18: Accuracy1 predictions of a generalized additive model (GAM) fit to Accuracy1 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy2 for Tempo

When fitting a generalized additive model (GAM) to Accuracy2-values and a ground truth, what Accuracy2 can we expect with confidence?

Estimated Accuracy2 for Tempo for 1.0

Predictions of GAMs trained on Accuracy2 for estimates for reference 1.0.

Figure 19: Accuracy2 predictions of a generalized additive model (GAM) fit to Accuracy2 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy2 for Tempo for 2.0

Predictions of GAMs trained on Accuracy2 for estimates for reference 2.0.

Figure 20: Accuracy2 predictions of a generalized additive model (GAM) fit to Accuracy2 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy1 for ‘tag_open’ Tags

How well does an estimator perform when only tracks that are tagged with some kind of label are taken into account? Note that some values may be based on very few estimates.

Accuracy1 for ‘tag_open’ Tags for 1.0

Figure 21: Mean Accuracy1 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy1 for ‘tag_open’ Tags for 2.0

Figure 22: Mean Accuracy1 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 for ‘tag_open’ Tags

How well does an estimator perform when only tracks that are tagged with some kind of label are taken into account? Note that some values may be based on very few estimates.

Accuracy2 for ‘tag_open’ Tags for 1.0

Figure 23: Mean Accuracy2 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy2 for ‘tag_open’ Tags for 2.0

Figure 24: Mean Accuracy2 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

CSV JSON LATEX PICKLE SVG PDF PNG

OE1 and OE2

OE1 is defined as the octave error between an estimate E and a reference value R: OE1(E) = log2(E/R). This means that the most common errors, by a factor of 2 or 1/2, have the same magnitude, namely 1.

OE2 is the signed OE1 corresponding to the minimum absolute OE1, allowing the octave errors 2, 3, 1/2, and 1/3: OE2(E) = arg min_x |x| with x ∈ {OE1(E), OE1(2E), OE1(3E), OE1(½E), OE1(⅓E)}.
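
A minimal sketch of both error measures (illustrative, not the report's own code):

```python
import numpy as np

def oe1(estimate, reference):
    """Octave error: OE1(E) = log2(E / R)."""
    return np.log2(estimate / reference)

def oe2(estimate, reference, factors=(1, 2, 3, 1/2, 1/3)):
    """Signed OE1 with the smallest magnitude after allowing the
    octave errors 2, 3, 1/2 and 1/3."""
    return min((oe1(f * estimate, reference) for f in factors), key=abs)

# Example: estimating 100 BPM for a 200 BPM reference
print(oe1(100, 200))  # -1.0 (one octave low)
print(oe2(100, 200))  #  0.0 (the factor-2 error is forgiven)
```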

Mean OE1/OE2 Results for 1.0

Estimator OE1_MEAN OE1_STDEV OE2_MEAN OE2_STDEV
boeck2020/dar -0.0318 0.5132 0.0031 0.1401
sun2021/default 0.2525 0.5570 -0.0003 0.1905
boeck2019/multi_task_hjdb 0.0233 0.5676 -0.0128 0.1631
schreiber2018/ismir2018 0.3748 0.5772 -0.0151 0.2257
schreiber2018/cnn 0.3680 0.5783 -0.0043 0.1957
boeck2019/multi_task -0.0177 0.5818 0.0015 0.1474
schreiber2014/default 0.2463 0.5907 -0.0214 0.1942
percival2014/stem 0.3198 0.6156 0.0071 0.2074
schreiber2018/fcn 0.3147 0.6172 -0.0134 0.2054
schreiber2017/ismir2017 0.2603 0.6227 -0.0178 0.1988
schreiber2017/mirex2017 0.1017 0.6319 -0.0208 0.1903
davies2009/mirex_qm_tempotracker 0.8693 0.6634 0.0506 0.2209
boeck2015/tempodetector2016_default 0.3676 0.6921 0.0234 0.1460

Table 9: Mean OE1/OE2 for estimates compared to version 1.0 ordered by standard deviation.

CSV JSON LATEX PICKLE

Raw data OE1: CSV JSON LATEX PICKLE

Raw data OE2: CSV JSON LATEX PICKLE

OE1 distribution for 1.0

Figure 25: OE1 for estimates compared to version 1.0. Shown are the mean OE1 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

OE2 distribution for 1.0

Figure 26: OE2 for estimates compared to version 1.0. Shown are the mean OE2 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Mean OE1/OE2 Results for 2.0

Estimator OE1_MEAN OE1_STDEV OE2_MEAN OE2_STDEV
boeck2020/dar -0.0317 0.5134 0.0033 0.1399
sun2021/default 0.2526 0.5570 -0.0001 0.1909
boeck2019/multi_task_hjdb 0.0234 0.5676 -0.0127 0.1632
schreiber2018/ismir2018 0.3750 0.5772 -0.0149 0.2259
schreiber2018/cnn 0.3681 0.5783 -0.0041 0.1960
boeck2019/multi_task -0.0175 0.5818 0.0017 0.1475
schreiber2014/default 0.2465 0.5906 -0.0213 0.1943
percival2014/stem 0.3200 0.6158 0.0073 0.2076
schreiber2018/fcn 0.3149 0.6172 -0.0132 0.2056
schreiber2017/ismir2017 0.2605 0.6228 -0.0176 0.1988
schreiber2017/mirex2017 0.1019 0.6320 -0.0206 0.1903
davies2009/mirex_qm_tempotracker 0.8695 0.6634 0.0508 0.2210
boeck2015/tempodetector2016_default 0.3678 0.6922 0.0236 0.1461

Table 10: Mean OE1/OE2 for estimates compared to version 2.0 ordered by standard deviation.

CSV JSON LATEX PICKLE

Raw data OE1: CSV JSON LATEX PICKLE

Raw data OE2: CSV JSON LATEX PICKLE

OE1 distribution for 2.0

Figure 27: OE1 for estimates compared to version 2.0. Shown are the mean OE1 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

OE2 distribution for 2.0

Figure 28: OE2 for estimates compared to version 2.0. Shown are the mean OE2 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Significance of Differences

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.0000 0.0000 0.0000 0.0000 0.2863 0.0096 0.0223 0.0000 0.9936 0.2140 0.8693 0.0079
boeck2019/multi_task 0.0000 1.0000 0.0503 0.6456 0.0000 0.0000 0.0000 0.0000 0.0027 0.0000 0.0000 0.0000 0.0000
boeck2019/multi_task_hjdb 0.0000 0.0503 1.0000 0.0763 0.0000 0.0000 0.0000 0.0000 0.0417 0.0000 0.0000 0.0000 0.0000
boeck2020/dar 0.0000 0.6456 0.0763 1.0000 0.0000 0.0000 0.0000 0.0000 0.0010 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
percival2014/stem 0.2863 0.0000 0.0000 0.0000 0.0000 1.0000 0.0112 0.0590 0.0000 0.1344 0.8895 0.0750 0.0583
schreiber2014/default 0.0096 0.0000 0.0000 0.0000 0.0000 0.0112 1.0000 0.6544 0.0004 0.0000 0.0626 0.0000 0.8546
schreiber2017/ismir2017 0.0223 0.0000 0.0000 0.0000 0.0000 0.0590 0.6544 1.0000 0.0000 0.0009 0.1842 0.0008 0.8450
schreiber2017/mirex2017 0.0000 0.0027 0.0417 0.0010 0.0000 0.0000 0.0004 0.0000 1.0000 0.0000 0.0000 0.0000 0.0004
schreiber2018/cnn 0.9936 0.0000 0.0000 0.0000 0.0000 0.1344 0.0000 0.0009 0.0000 1.0000 0.1067 0.7890 0.0003
schreiber2018/fcn 0.2140 0.0000 0.0000 0.0000 0.0000 0.8895 0.0626 0.1842 0.0000 0.1067 1.0000 0.0889 0.1089
schreiber2018/ismir2018 0.8693 0.0000 0.0000 0.0000 0.0000 0.0750 0.0000 0.0008 0.0000 0.7890 0.0889 1.0000 0.0001
sun2021/default 0.0079 0.0000 0.0000 0.0000 0.0000 0.0583 0.8546 0.8450 0.0004 0.0003 0.1089 0.0001 1.0000

Table 11: Paired t-test p-values, using reference annotations 1.0 as ground truth with OE1. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e., there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE
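
A hedged sketch of such a pairwise comparison using scipy's paired t-test on per-track OE1 values (illustrative; the same approach applies to OE2, AOE1 and AOE2 below):

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_ttest_pvalue(errors_a, errors_b):
    """p-value for H0: the mean of the paired differences between two
    estimators' per-track errors is zero."""
    a = np.asarray(errors_a, dtype=float)
    b = np.asarray(errors_b, dtype=float)
    return ttest_rel(a, b).pvalue
```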

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.0000 0.0000 0.0000 0.0000 0.2863 0.0096 0.0223 0.0000 0.9936 0.2140 0.8693 0.0079
boeck2019/multi_task 0.0000 1.0000 0.0503 0.6456 0.0000 0.0000 0.0000 0.0000 0.0027 0.0000 0.0000 0.0000 0.0000
boeck2019/multi_task_hjdb 0.0000 0.0503 1.0000 0.0763 0.0000 0.0000 0.0000 0.0000 0.0417 0.0000 0.0000 0.0000 0.0000
boeck2020/dar 0.0000 0.6456 0.0763 1.0000 0.0000 0.0000 0.0000 0.0000 0.0010 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
percival2014/stem 0.2863 0.0000 0.0000 0.0000 0.0000 1.0000 0.0112 0.0590 0.0000 0.1344 0.8895 0.0750 0.0583
schreiber2014/default 0.0096 0.0000 0.0000 0.0000 0.0000 0.0112 1.0000 0.6544 0.0004 0.0000 0.0626 0.0000 0.8546
schreiber2017/ismir2017 0.0223 0.0000 0.0000 0.0000 0.0000 0.0590 0.6544 1.0000 0.0000 0.0009 0.1842 0.0008 0.8450
schreiber2017/mirex2017 0.0000 0.0027 0.0417 0.0010 0.0000 0.0000 0.0004 0.0000 1.0000 0.0000 0.0000 0.0000 0.0004
schreiber2018/cnn 0.9936 0.0000 0.0000 0.0000 0.0000 0.1344 0.0000 0.0009 0.0000 1.0000 0.1067 0.7890 0.0003
schreiber2018/fcn 0.2140 0.0000 0.0000 0.0000 0.0000 0.8895 0.0626 0.1842 0.0000 0.1067 1.0000 0.0889 0.1089
schreiber2018/ismir2018 0.8693 0.0000 0.0000 0.0000 0.0000 0.0750 0.0000 0.0008 0.0000 0.7890 0.0889 1.0000 0.0001
sun2021/default 0.0079 0.0000 0.0000 0.0000 0.0000 0.0583 0.8546 0.8450 0.0004 0.0003 0.1089 0.0001 1.0000

Table 12: Paired t-test p-values, using reference annotations 2.0 as ground truth with OE1. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e., there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.1528 0.0199 0.1502 0.0704 0.3036 0.0023 0.0073 0.0025 0.0780 0.0232 0.0246 0.1242
boeck2019/multi_task 0.1528 1.0000 0.1338 0.8793 0.0040 0.7484 0.1273 0.2260 0.1324 0.7146 0.3344 0.3444 0.9051
boeck2019/multi_task_hjdb 0.0199 0.1338 1.0000 0.1411 0.0003 0.2710 0.5852 0.7564 0.5938 0.6123 0.9750 0.8939 0.4168
boeck2020/dar 0.1502 0.8793 0.1411 1.0000 0.0054 0.8154 0.1277 0.2019 0.1258 0.6429 0.3037 0.2990 0.8280
davies2009/mirex_qm_tempotracker 0.0704 0.0040 0.0003 0.0054 1.0000 0.0398 0.0001 0.0003 0.0001 0.0069 0.0010 0.0013 0.0061
percival2014/stem 0.3036 0.7484 0.2710 0.8154 0.0398 1.0000 0.1258 0.1370 0.1046 0.4850 0.2493 0.2250 0.6576
schreiber2014/default 0.0023 0.1273 0.5852 0.1277 0.0001 0.1258 1.0000 0.7634 0.9600 0.2881 0.6110 0.7158 0.2204
schreiber2017/ismir2017 0.0073 0.2260 0.7564 0.2019 0.0003 0.1370 0.7634 1.0000 0.7408 0.4096 0.7819 0.8812 0.3274
schreiber2017/mirex2017 0.0025 0.1324 0.5938 0.1258 0.0001 0.1046 0.9600 0.7408 1.0000 0.3037 0.6335 0.7456 0.2253
schreiber2018/cnn 0.0780 0.7146 0.6123 0.6429 0.0069 0.4850 0.2881 0.4096 0.3037 1.0000 0.5390 0.5151 0.8235
schreiber2018/fcn 0.0232 0.3344 0.9750 0.3037 0.0010 0.2493 0.6110 0.7819 0.6335 0.5390 1.0000 0.9184 0.4696
schreiber2018/ismir2018 0.0246 0.3444 0.8939 0.2990 0.0013 0.2250 0.7158 0.8812 0.7456 0.5151 0.9184 1.0000 0.4052
sun2021/default 0.1242 0.9051 0.4168 0.8280 0.0061 0.6576 0.2204 0.3274 0.2253 0.8235 0.4696 0.4052 1.0000

Table 13: Paired t-test p-values, using reference annotations 1.0 as ground truth with OE2. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e., there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.1528 0.0199 0.1502 0.0704 0.3036 0.0023 0.0073 0.0025 0.0780 0.0232 0.0246 0.1242
boeck2019/multi_task 0.1528 1.0000 0.1338 0.8793 0.0040 0.7484 0.1273 0.2260 0.1324 0.7146 0.3344 0.3444 0.9051
boeck2019/multi_task_hjdb 0.0199 0.1338 1.0000 0.1411 0.0003 0.2710 0.5852 0.7564 0.5938 0.6123 0.9750 0.8939 0.4168
boeck2020/dar 0.1502 0.8793 0.1411 1.0000 0.0054 0.8154 0.1277 0.2019 0.1258 0.6429 0.3037 0.2990 0.8280
davies2009/mirex_qm_tempotracker 0.0704 0.0040 0.0003 0.0054 1.0000 0.0398 0.0001 0.0003 0.0001 0.0069 0.0010 0.0013 0.0061
percival2014/stem 0.3036 0.7484 0.2710 0.8154 0.0398 1.0000 0.1258 0.1370 0.1046 0.4850 0.2493 0.2250 0.6576
schreiber2014/default 0.0023 0.1273 0.5852 0.1277 0.0001 0.1258 1.0000 0.7634 0.9600 0.2881 0.6110 0.7158 0.2204
schreiber2017/ismir2017 0.0073 0.2260 0.7564 0.2019 0.0003 0.1370 0.7634 1.0000 0.7408 0.4096 0.7819 0.8812 0.3274
schreiber2017/mirex2017 0.0025 0.1324 0.5938 0.1258 0.0001 0.1046 0.9600 0.7408 1.0000 0.3037 0.6335 0.7456 0.2253
schreiber2018/cnn 0.0780 0.7146 0.6123 0.6429 0.0069 0.4850 0.2881 0.4096 0.3037 1.0000 0.5390 0.5151 0.8235
schreiber2018/fcn 0.0232 0.3344 0.9750 0.3037 0.0010 0.2493 0.6110 0.7819 0.6335 0.5390 1.0000 0.9184 0.4696
schreiber2018/ismir2018 0.0246 0.3444 0.8939 0.2990 0.0013 0.2250 0.7158 0.8812 0.7456 0.5151 0.9184 1.0000 0.4052
sun2021/default 0.1242 0.9051 0.4168 0.8280 0.0061 0.6576 0.2204 0.3274 0.2253 0.8235 0.4696 0.4052 1.0000

Table 14: Paired t-test p-values, using reference annotations 2.0 as ground truth with OE2. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e., there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

OE1 on cvar-Subsets

How well does an estimator perform when only tracks with a cvar-value of less than τ are taken into account, i.e., tracks with a more or less stable beat?

OE1 on cvar-Subsets for 1.0 based on cvar-Values from 1.0

Figure 29: Mean OE1 compared to version 1.0 for tracks with cvar < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE1 on cvar-Subsets for 2.0 based on cvar-Values from 1.0

Figure 30: Mean OE1 compared to version 2.0 for tracks with cvar < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE2 on cvar-Subsets

How well does an estimator perform when only tracks with a cvar-value of less than τ are taken into account, i.e., tracks with a more or less stable beat?

OE2 on cvar-Subsets for 1.0 based on cvar-Values from 1.0

Figure 31: Mean OE2 compared to version 1.0 for tracks with cvar < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE2 on cvar-Subsets for 2.0 based on cvar-Values from 1.0

Figure 32: Mean OE2 compared to version 2.0 for tracks with cvar < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE1 on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE1 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

OE1 on Tempo-Subsets for 1.0

Figure 33: Mean OE1 for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE1 on Tempo-Subsets for 2.0

Figure 34: Mean OE1 for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE2 on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE2 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

OE2 on Tempo-Subsets for 1.0

Figure 35: Mean OE2 for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE2 on Tempo-Subsets for 2.0

Figure 36: Mean OE2 for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE1 for Tempo

When fitting a generalized additive model (GAM) to OE1-values and a ground truth, what OE1 can we expect with confidence?

Estimated OE1 for Tempo for 1.0

Predictions of GAMs trained on OE1 for estimates for reference 1.0.

Figure 37: OE1 predictions of a generalized additive model (GAM) fit to OE1 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE1 for Tempo for 2.0

Predictions of GAMs trained on OE1 for estimates for reference 2.0.

Figure 38: OE1 predictions of a generalized additive model (GAM) fit to OE1 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE2 for Tempo

When fitting a generalized additive model (GAM) to OE2-values and a ground truth, what OE2 can we expect with confidence?

Estimated OE2 for Tempo for 1.0

Predictions of GAMs trained on OE2 for estimates for reference 1.0.

Figure 39: OE2 predictions of a generalized additive model (GAM) fit to OE2 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE2 for Tempo for 2.0

Predictions of GAMs trained on OE2 for estimates for reference 2.0.

Figure 40: OE2 predictions of a generalized additive model (GAM) fit to OE2 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

OE1 for ‘tag_open’ Tags

How well does an estimator perform when only tracks that are tagged with some kind of label are taken into account? Note that some values may be based on very few estimates.

OE1 for ‘tag_open’ Tags for 1.0

Figure 41: OE1 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG

OE1 for ‘tag_open’ Tags for 2.0

Figure 42: OE1 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG

OE2 for ‘tag_open’ Tags

How well does an estimator perform when only tracks that are tagged with some kind of label are taken into account? Note that some values may be based on very few estimates.

OE2 for ‘tag_open’ Tags for 1.0

Figure 43: OE2 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG

OE2 for ‘tag_open’ Tags for 2.0

Figure 44: OE2 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG

AOE1 and AOE2

AOE1 is defined as the absolute octave error between an estimate E and a reference value R: AOE1(E) = |log2(E/R)|.

AOE2 is the minimum of AOE1 allowing the octave errors 2, 3, 1/2, and 1/3: AOE2(E) = min(AOE1(E), AOE1(2E), AOE1(3E), AOE1(½E), AOE1(⅓E)).
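
A minimal sketch of both measures (illustrative):

```python
import numpy as np

def aoe1(estimate, reference):
    """Absolute octave error: AOE1(E) = |log2(E / R)|."""
    return abs(np.log2(estimate / reference))

def aoe2(estimate, reference, factors=(1, 2, 3, 1/2, 1/3)):
    """Minimum AOE1 after allowing the octave errors 2, 3, 1/2 and 1/3."""
    return min(aoe1(f * estimate, reference) for f in factors)

# Example: 100 BPM estimated for a 200 BPM reference
print(aoe1(100, 200))  # 1.0
print(aoe2(100, 200))  # 0.0
```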

Mean AOE1/AOE2 Results for 1.0

Estimator AOE1_MEAN AOE1_STDEV AOE2_MEAN AOE2_STDEV
boeck2020/dar 0.2893 0.4251 0.0748 0.1185
boeck2019/multi_task_hjdb 0.3661 0.4345 0.0949 0.1332
boeck2019/multi_task 0.3770 0.4435 0.0845 0.1208
sun2021/default 0.4101 0.4537 0.1285 0.1406
schreiber2017/mirex2017 0.4141 0.4880 0.1193 0.1497
schreiber2014/default 0.4576 0.4475 0.1268 0.1486
schreiber2018/fcn 0.4777 0.5018 0.1351 0.1552
schreiber2018/cnn 0.4820 0.4874 0.1272 0.1488
schreiber2017/ismir2017 0.4866 0.4676 0.1282 0.1529
boeck2015/tempodetector2016_default 0.5012 0.6025 0.0770 0.1262
schreiber2018/ismir2018 0.5014 0.4714 0.1546 0.1651
percival2014/stem 0.5055 0.4751 0.1464 0.1471
davies2009/mirex_qm_tempotracker 0.9124 0.6028 0.1675 0.1527

Table 15: Mean AOE1/AOE2 for estimates compared to version 1.0 ordered by mean.

CSV JSON LATEX PICKLE

Raw data AOE1: CSV JSON LATEX PICKLE

Raw data AOE2: CSV JSON LATEX PICKLE

AOE1 distribution for 1.0

Figure 45: AOE1 for estimates compared to version 1.0. Shown are the mean AOE1 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

AOE2 distribution for 1.0

Figure 46: AOE2 for estimates compared to version 1.0. Shown are the mean AOE2 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Mean AOE1/AOE2 Results for 2.0

Estimator AOE1_MEAN AOE1_STDEV AOE2_MEAN AOE2_STDEV
boeck2020/dar 0.2894 0.4252 0.0746 0.1183
boeck2019/multi_task_hjdb 0.3662 0.4344 0.0951 0.1333
boeck2019/multi_task 0.3771 0.4434 0.0847 0.1208
sun2021/default 0.4102 0.4537 0.1286 0.1410
schreiber2017/mirex2017 0.4142 0.4881 0.1194 0.1496
schreiber2014/default 0.4577 0.4473 0.1270 0.1486
schreiber2018/fcn 0.4778 0.5018 0.1352 0.1554
schreiber2018/cnn 0.4821 0.4873 0.1273 0.1491
schreiber2017/ismir2017 0.4868 0.4678 0.1283 0.1528
boeck2015/tempodetector2016_default 0.5013 0.6026 0.0771 0.1263
schreiber2018/ismir2018 0.5016 0.4714 0.1548 0.1652
percival2014/stem 0.5056 0.4753 0.1466 0.1472
davies2009/mirex_qm_tempotracker 0.9126 0.6028 0.1677 0.1526

Table 16: Mean AOE1/AOE2 for estimates compared to version 2.0 ordered by mean.

CSV JSON LATEX PICKLE

Raw data AOE1: CSV JSON LATEX PICKLE

Raw data AOE2: CSV JSON LATEX PICKLE

AOE1 distribution for 2.0

Figure 47: AOE1 for estimates compared to version 2.0. Shown are the mean AOE1 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

AOE2 distribution for 2.0

Figure 48: AOE2 for estimates compared to version 2.0. Shown are the mean AOE2 and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Significance of Differences

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.0076 0.0028 0.0000 0.0000 0.9178 0.3170 0.7240 0.0354 0.6262 0.5601 0.9953 0.0289
boeck2019/multi_task 0.0076 1.0000 0.5861 0.0017 0.0000 0.0002 0.0172 0.0015 0.2973 0.0074 0.0141 0.0015 0.3733
boeck2019/multi_task_hjdb 0.0028 0.5861 1.0000 0.0060 0.0000 0.0000 0.0033 0.0003 0.1626 0.0020 0.0055 0.0003 0.2166
boeck2020/dar 0.0000 0.0017 0.0060 1.0000 0.0000 0.0000 0.0000 0.0000 0.0007 0.0000 0.0000 0.0000 0.0011
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
percival2014/stem 0.9178 0.0002 0.0000 0.0000 0.0000 1.0000 0.0820 0.5074 0.0116 0.4472 0.4227 0.8909 0.0041
schreiber2014/default 0.3170 0.0172 0.0033 0.0000 0.0000 0.0820 1.0000 0.2737 0.2125 0.3601 0.5561 0.1023 0.1275
schreiber2017/ismir2017 0.7240 0.0015 0.0003 0.0000 0.0000 0.5074 0.2737 1.0000 0.0212 0.8765 0.8021 0.6343 0.0296
schreiber2017/mirex2017 0.0354 0.2973 0.1626 0.0007 0.0000 0.0116 0.2125 0.0212 1.0000 0.0651 0.1047 0.0199 0.9165
schreiber2018/cnn 0.6262 0.0074 0.0020 0.0000 0.0000 0.4472 0.3601 0.8765 0.0651 1.0000 0.8924 0.4249 0.0208
schreiber2018/fcn 0.5601 0.0141 0.0055 0.0000 0.0000 0.4227 0.5561 0.8021 0.1047 0.8924 1.0000 0.4833 0.0677
schreiber2018/ismir2018 0.9953 0.0015 0.0003 0.0000 0.0000 0.8909 0.1023 0.6343 0.0199 0.4249 0.4833 1.0000 0.0026
sun2021/default 0.0289 0.3733 0.2166 0.0011 0.0000 0.0041 0.1275 0.0296 0.9165 0.0208 0.0677 0.0026 1.0000

Table 17: Paired t-test p-values, using reference annotations 1.0 as ground truth with AOE1. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e., there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.0076 0.0028 0.0000 0.0000 0.9176 0.3171 0.7242 0.0354 0.6264 0.5600 0.9936 0.0290
boeck2019/multi_task 0.0076 1.0000 0.5861 0.0017 0.0000 0.0002 0.0172 0.0015 0.2973 0.0074 0.0141 0.0015 0.3734
boeck2019/multi_task_hjdb 0.0028 0.5861 1.0000 0.0060 0.0000 0.0000 0.0033 0.0003 0.1625 0.0020 0.0055 0.0003 0.2167
boeck2020/dar 0.0000 0.0017 0.0060 1.0000 0.0000 0.0000 0.0000 0.0000 0.0006 0.0000 0.0000 0.0000 0.0011
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
percival2014/stem 0.9176 0.0002 0.0000 0.0000 0.0000 1.0000 0.0820 0.5074 0.0116 0.4472 0.4225 0.8931 0.0041
schreiber2014/default 0.3171 0.0172 0.0033 0.0000 0.0000 0.0820 1.0000 0.2737 0.2125 0.3601 0.5564 0.1016 0.1274
schreiber2017/ismir2017 0.7242 0.0015 0.0003 0.0000 0.0000 0.5074 0.2737 1.0000 0.0212 0.8765 0.8018 0.6325 0.0296
schreiber2017/mirex2017 0.0354 0.2973 0.1625 0.0006 0.0000 0.0116 0.2125 0.0212 1.0000 0.0651 0.1048 0.0197 0.9163
schreiber2018/cnn 0.6264 0.0074 0.0020 0.0000 0.0000 0.4472 0.3601 0.8765 0.0651 1.0000 0.8920 0.4229 0.0208
schreiber2018/fcn 0.5600 0.0141 0.0055 0.0000 0.0000 0.4225 0.5564 0.8018 0.1048 0.8920 1.0000 0.4814 0.0677
schreiber2018/ismir2018 0.9936 0.0015 0.0003 0.0000 0.0000 0.8931 0.1016 0.6325 0.0197 0.4229 0.4814 1.0000 0.0026
sun2021/default 0.0290 0.3734 0.2167 0.0011 0.0000 0.0041 0.1274 0.0296 0.9163 0.0208 0.0677 0.0026 1.0000

Table 18: Paired t-test p-values, using reference annotations 2.0 as ground truth with AOE1. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e., there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.3943 0.0554 0.8077 0.0000 0.0000 0.0000 0.0000 0.0002 0.0000 0.0000 0.0000 0.0000
boeck2019/multi_task 0.3943 1.0000 0.1013 0.2022 0.0000 0.0000 0.0001 0.0002 0.0016 0.0004 0.0000 0.0000 0.0001
boeck2019/multi_task_hjdb 0.0554 0.1013 1.0000 0.0106 0.0000 0.0000 0.0034 0.0033 0.0223 0.0092 0.0010 0.0000 0.0031
boeck2020/dar 0.8077 0.2022 0.0106 1.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0976 0.0017 0.0040 0.0004 0.0040 0.0180 0.3658 0.0031
percival2014/stem 0.0000 0.0000 0.0000 0.0000 0.0976 1.0000 0.1049 0.1364 0.0278 0.0621 0.3485 0.4708 0.1072
schreiber2014/default 0.0000 0.0001 0.0034 0.0000 0.0017 0.1049 1.0000 0.8844 0.4248 0.9787 0.5182 0.0339 0.8885
schreiber2017/ismir2017 0.0000 0.0002 0.0033 0.0000 0.0040 0.1364 0.8844 1.0000 0.1994 0.9303 0.5872 0.0303 0.9834
schreiber2017/mirex2017 0.0002 0.0016 0.0223 0.0001 0.0004 0.0278 0.4248 0.1994 1.0000 0.5222 0.2088 0.0042 0.4290
schreiber2018/cnn 0.0000 0.0004 0.0092 0.0000 0.0040 0.0621 0.9787 0.9303 0.5222 1.0000 0.4522 0.0026 0.9053
schreiber2018/fcn 0.0000 0.0000 0.0010 0.0000 0.0180 0.3485 0.5182 0.5872 0.2088 0.4522 1.0000 0.0664 0.5444
schreiber2018/ismir2018 0.0000 0.0000 0.0000 0.0000 0.3658 0.4708 0.0339 0.0303 0.0042 0.0026 0.0664 1.0000 0.0244
sun2021/default 0.0000 0.0001 0.0031 0.0000 0.0031 0.1072 0.8885 0.9834 0.4290 0.9053 0.5444 0.0244 1.0000

Table 19: Paired t-test p-values, using reference annotations 1.0 as ground truth with AOE2. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator boeck2015/tempodetector2016_default boeck2019/multi_task boeck2019/multi_task_hjdb boeck2020/dar davies2009/mirex_qm_tempotracker percival2014/stem schreiber2014/default schreiber2017/ismir2017 schreiber2017/mirex2017 schreiber2018/cnn schreiber2018/fcn schreiber2018/ismir2018 sun2021/default
boeck2015/tempodetector2016_default 1.0000 0.3906 0.0545 0.7859 0.0000 0.0000 0.0000 0.0000 0.0002 0.0000 0.0000 0.0000 0.0000
boeck2019/multi_task 0.3906 1.0000 0.1007 0.1872 0.0000 0.0000 0.0001 0.0002 0.0016 0.0004 0.0000 0.0000 0.0001
boeck2019/multi_task_hjdb 0.0545 0.1007 1.0000 0.0094 0.0000 0.0000 0.0034 0.0033 0.0227 0.0092 0.0011 0.0000 0.0031
boeck2020/dar 0.7859 0.1872 0.0094 1.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0000 0.0000 0.0000 0.0000
davies2009/mirex_qm_tempotracker 0.0000 0.0000 0.0000 0.0000 1.0000 0.0974 0.0017 0.0039 0.0003 0.0040 0.0178 0.3658 0.0031
percival2014/stem 0.0000 0.0000 0.0000 0.0000 0.0974 1.0000 0.1050 0.1352 0.0274 0.0622 0.3459 0.4702 0.1075
schreiber2014/default 0.0000 0.0001 0.0034 0.0000 0.0017 0.1050 1.0000 0.8904 0.4202 0.9788 0.5220 0.0338 0.8879
schreiber2017/ismir2017 0.0000 0.0002 0.0033 0.0000 0.0039 0.1352 0.8904 1.0000 0.1992 0.9349 0.5873 0.0299 0.9780
schreiber2017/mirex2017 0.0002 0.0016 0.0227 0.0001 0.0003 0.0274 0.4202 0.1992 1.0000 0.5184 0.2087 0.0041 0.4250
schreiber2018/cnn 0.0000 0.0004 0.0092 0.0000 0.0040 0.0622 0.9788 0.9349 0.5184 1.0000 0.4565 0.0026 0.9046
schreiber2018/fcn 0.0000 0.0000 0.0011 0.0000 0.0178 0.3459 0.5220 0.5873 0.2087 0.4565 1.0000 0.0655 0.5496
schreiber2018/ismir2018 0.0000 0.0000 0.0000 0.0000 0.3658 0.4702 0.0338 0.0299 0.0041 0.0026 0.0655 1.0000 0.0245
sun2021/default 0.0000 0.0001 0.0031 0.0000 0.0031 0.1075 0.8879 0.9780 0.4250 0.9046 0.5496 0.0245 1.0000

Table 20: Paired t-test p-values, using reference annotations 2.0 as ground truth with AOE2. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

AOE1 on cvar-Subsets

How well does an estimator perform when taking only those tracks into account that have a cvar value of less than τ, i.e., a more or less stable beat?
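
As a rough illustration of this subsetting, the sketch below filters tracks by cvar < τ before averaging AOE1. It assumes that cvar is the coefficient of variation (standard deviation divided by mean) of the inter-beat intervals derived from the beat annotations; the track data is hypothetical and the code is not taken from tempo_eval.

```python
# Sketch of the cvar-subset evaluation (hypothetical data, not tempo_eval code).
import numpy as np

def cvar(beat_times):
    """Coefficient of variation (std/mean) of the inter-beat intervals."""
    ibis = np.diff(np.sort(np.asarray(beat_times, dtype=float)))
    return ibis.std() / ibis.mean()

# Hypothetical per-track beat annotations and AOE1 values of one estimator.
tracks = {
    "track_01": {"beats": [0.50, 1.00, 1.51, 2.00, 2.49], "aoe1": 0.03},
    "track_02": {"beats": [0.50, 1.10, 1.45, 2.20, 2.60], "aoe1": 0.62},
}

tau = 0.1
subset = [t["aoe1"] for t in tracks.values() if cvar(t["beats"]) < tau]
if subset:
    print(f"mean AOE1 on the cvar < {tau} subset: {np.mean(subset):.3f} ({len(subset)} tracks)")
```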

AOE1 on cvar-Subsets for 1.0 based on cvar-Values from 1.0

Figure 49: Mean AOE1 compared to version 1.0 for tracks with cvar < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE1 on cvar-Subsets for 2.0 based on cvar-Values from 1.0

Figure 50: Mean AOE1 compared to version 2.0 for tracks with cvar < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE2 on cvar-Subsets

How well does an estimator perform when taking only those tracks into account that have a cvar value of less than τ, i.e., a more or less stable beat?

AOE2 on cvar-Subsets for 1.0 based on cvar-Values from 1.0

Figure 51: Mean AOE2 compared to version 1.0 for tracks with cvar < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE2 on cvar-Subsets for 2.0 based on cvar-Values from 1.0

Figure 52: Mean AOE2 compared to version 2.0 for tracks with cvar < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE1 on Tempo-Subsets

How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean AOE1 for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
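
The windowing itself is straightforward. The following sketch, using hypothetical data rather than tempo_eval code, computes mean AOE1 for tracks whose reference tempo lies in [T-10, T+10] BPM on a grid of T values.

```python
# Sketch of the tempo-subset evaluation (hypothetical data, not tempo_eval code).
import numpy as np

reference_bpm = np.array([48.0, 65.0, 72.0, 90.0, 120.0, 150.0])  # hypothetical
aoe1 = np.array([0.71, 0.05, 0.02, 0.33, 0.58, 1.00])             # hypothetical

for T in range(40, 161, 10):
    mask = (reference_bpm >= T - 10) & (reference_bpm <= T + 10)
    if mask.any():
        print(f"T = {T:3d} BPM: mean AOE1 = {aoe1[mask].mean():.3f} (n = {int(mask.sum())})")
```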

AOE1 on Tempo-Subsets for 1.0

Figure 53: Mean AOE1 for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE1 on Tempo-Subsets for 2.0

Figure 54: Mean AOE1 for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE2 on Tempo-Subsets

How well does an estimator perform when taking only a subset of the reference annotations into account? The graphs show mean AOE2 for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

AOE2 on Tempo-Subsets for 1.0

Figure 55: Mean AOE2 for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE2 on Tempo-Subsets for 2.0

Figure 56: Mean AOE2 for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE1 for Tempo

When fitting a generalized additive model (GAM) to AOE1 values as a function of the ground-truth tempo, what AOE1 can we expect with confidence?
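
One way to produce such predictions is a univariate GAM fit with the pygam package; this is a sketch only, and not necessarily the model class, smoothing setup, or library used by tempo_eval. The tempo/AOE1 data below is synthetic.

```python
# Sketch: fit a GAM of AOE1 on reference tempo and obtain a 95% confidence band.
# pygam is assumed here; tempo_eval may use a different GAM implementation.
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
tempo = rng.uniform(40, 200, 200)                                    # synthetic tempi (BPM)
aoe1 = 0.3 + 0.002 * np.abs(tempo - 100) + rng.normal(0, 0.1, 200)   # synthetic AOE1 values

gam = LinearGAM(s(0)).fit(tempo.reshape(-1, 1), aoe1)

grid = gam.generate_X_grid(term=0, n=100)           # tempo values to predict at
prediction = gam.predict(grid)                      # expected AOE1 per tempo
band = gam.confidence_intervals(grid, width=0.95)   # lower/upper 95% confidence band
print(prediction[:3], band[:3])
```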

Estimated AOE1 for Tempo for 1.0

Predictions of GAMs trained on AOE1 for estimates for reference 1.0.

Figure 57: AOE1 predictions of a generalized additive model (GAM) fit to AOE1 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE1 for Tempo for 2.0

Predictions of GAMs trained on AOE1 for estimates for reference 2.0.

Figure 58: AOE1 predictions of a generalized additive model (GAM) fit to AOE1 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE2 for Tempo

When fitting a generalized additive model (GAM) to AOE2 values as a function of the ground-truth tempo, what AOE2 can we expect with confidence?

Estimated AOE2 for Tempo for 1.0

Predictions of GAMs trained on AOE2 for estimates for reference 1.0.

Figure 59: AOE2 predictions of a generalized additive model (GAM) fit to AOE2 results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE2 for Tempo for 2.0

Predictions of GAMs trained on AOE2 for estimates for reference 2.0.

Figure 60: AOE2 predictions of a generalized additive model (GAM) fit to AOE2 results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE1 for ‘tag_open’ Tags

How well does an estimator perform when taking only those tracks into account that are tagged with some kind of label? Note that some values may be based on very few estimates.
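
The per-tag breakdown amounts to grouping the per-track AOE1 values by tag; a minimal sketch with hypothetical data follows.

```python
# Sketch of the per-tag AOE1 breakdown (hypothetical data, not tempo_eval code).
import pandas as pd

df = pd.DataFrame({
    "track":    ["t01", "t02", "t03", "t04"],
    "tag_open": ["expressive timing", "steady", "steady", "expressive timing"],  # hypothetical tags
    "aoe1":     [0.80, 0.05, 0.10, 0.65],
})

# Mean AOE1 and track count per tag; the counts matter because some tags are rare.
print(df.groupby("tag_open")["aoe1"].agg(["mean", "count"]))
```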

AOE1 for ‘tag_open’ Tags for 1.0

Figure 61: AOE1 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG

AOE1 for ‘tag_open’ Tags for 2.0

Figure 62: AOE1 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG

AOE2 for ‘tag_open’ Tags

How well does an estimator perform when taking only those tracks into account that are tagged with some kind of label? Note that some values may be based on very few estimates.

AOE2 for ‘tag_open’ Tags for 1.0

Figure 63: AOE2 of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG

AOE2 for ‘tag_open’ Tags for 2.0

Figure 64: AOE2 of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

SVG PDF PNG


Generated by tempo_eval 0.1.1 on 2022-06-29 18:57.