hainsworth

Smoothed Tempo Distribution

Figure 1: Percentage of values in tempo interval.

Tag Distribution for ‘tag_open’

Figure 2: Percentage of tracks tagged with tags from namespace ‘tag_open’. Annotations are from reference 1.0.

Beat-Based Tempo Variation

Figure 3: Fraction of the dataset with beat-annotated tracks with c_var < τ.

Estimates for ‘hainsworth’

Estimators

boeck2015/tempodetector2016_default

Attribute	Value
Corpus	hainsworth
Version	0.17.dev0
Annotation Tools	TempoDetector.2016, madmom, https://github.com/CPJKU/madmom
Annotator, bibtex	Boeck2015

boeck2019/multi_task

Attribute	Value
Corpus	hainsworth
Version	0.0.1
Annotation Tools	model=multi_task, https://github.com/superbock/ISMIR2019
Annotator, bibtex	Boeck2019

boeck2019/multi_task_hjdb

Attribute	Value
Corpus	hainsworth
Version	0.0.1
Annotation Tools	model=multi_task_hjdb, https://github.com/superbock/ISMIR2019
Annotator, bibtex	Boeck2019

boeck2020/dar

Attribute	Value
Corpus	hainsworth
Version	0.0.1
Annotation Tools	https://github.com/superbock/ISMIR2020
Annotator, bibtex	Boeck2020

davies2009/mirex_qm_tempotracker

Attribute	Value
Corpus	hainsworth
Version	1.0
Annotation Tools	QM Tempotracker, Sonic Annotator plugin. https://code.soundsoftware.ac.uk/projects/mirex2013/repository/show/audio_tempo_estimation/qm-tempotracker Note that the current macOS build of ‘qm-vamp-plugins’ was used.
Annotator, bibtex	Davies2009, Davies2007

echonest/version_3_2_1

Attribute	Value
Corpus	hainsworth
Version	3.2.1
Data Source	Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and crosscorrelation with pulses. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1765–1776, 2014.
Annotation Tools	Echo Nest track analyzer v3.2.1
Annotator, bibtex	Percival2014

gkiokas2012/default

Attribute	Value
Corpus	hainsworth
Version	1.0
Data Source	Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and crosscorrelation with pulses. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1765–1776, 2014.
Annotation Tools	Gkiokas2012
Annotator, bibtex	Gkiokas2012

klapuri2006/percival2014

Attribute	Value
Corpus	hainsworth
Version	1.0
Data Source	Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and crosscorrelation with pulses. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1765–1776, 2014.
Annotation Tools	Klapuri 2006
Annotator, bibtex	Klapuri2006

oliveira2010/ibt

Attribute	Value
Corpus	hainsworth
Version	1.0
Data Source	Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and crosscorrelation with pulses. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1765–1776, 2014.
Annotation Tools	Oliveira 2010
Annotator, bibtex	Oliveira2010

percival2014/stem

Attribute	Value
Corpus	hainsworth
Version	1.0
Annotation Tools	percival 2014, ‘tempo’ implementation from Marsyas, http://marsyas.info, git checkout tempo-stem
Annotator, bibtex	Percival2014

scheirer1998/percival2014

Attribute	Value
Corpus	hainsworth
Version	1.0
Data Source	Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and crosscorrelation with pulses. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1765–1776, 2014.
Annotation Tools	Scheirer 1998
Annotator, bibtex	Scheirer1998

schreiber2014/default

Attribute	Value
Corpus	hainsworth
Version	0.0.1
Annotation Tools	schreiber 2014, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex	Schreiber2014

schreiber2017/ismir2017

Attribute	Value
Corpus	hainsworth
Version	0.0.4
Annotation Tools	schreiber 2017, model=ismir2017, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex	Schreiber2017

schreiber2017/mirex2017

Attribute	Value
Corpus	hainsworth
Version	0.0.4
Annotation Tools	schreiber 2017, model=mirex2017, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex	Schreiber2017

schreiber2018/cnn

Attribute	Value
Corpus	hainsworth
Version	0.0.2
Data Source	Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools	schreiber tempo-cnn (model=cnn), https://github.com/hendriks73/tempo-cnn

schreiber2018/fcn

Attribute	Value
Corpus	hainsworth
Version	0.0.2
Data Source	Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools	schreiber tempo-cnn (model=fcn), https://github.com/hendriks73/tempo-cnn

schreiber2018/ismir2018

Attribute	Value
Corpus	hainsworth
Version	0.0.2
Data Source	Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools	schreiber tempo-cnn (model=ismir2018), https://github.com/hendriks73/tempo-cnn

sun2021/default

Attribute	Value
Corpus	hainsworth
Version	0.0.2
Data Source	Xiaoheng Sun, Qiqi He, Yongwei Gao, Wei Li. Musical Tempo Estimation Using a Multi-scale Network. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 2021.
Annotation Tools	https://github.com/Qqi-HE/TempoEstimation_MGANet
Annotator, bibtex	Sun2021

zplane/auftakt_v3

Attribute	Value
Corpus	hainsworth
Version	3.0
Data Source	Graham Percival and George Tzanetakis. Streamlined tempo estimation based on autocorrelation and crosscorrelation with pulses. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(12):1765–1776, 2014.
Annotation Tools	zplane aufTAKT version 3.0, http://licensing.zplane.de/technology#auftakt
Annotator, bibtex	Percival2014

Basic Statistics

Estimator	Size	Min	Max	Avg	Stdev	Sweet Oct. Start	Sweet Oct. Coverage
boeck2015/tempodetector2016_default	222	41.96	230.77	112.29	30.61	70.00	0.82
boeck2019/multi_task	222	55.32	208.46	112.94	29.11	71.00	0.84
boeck2019/multi_task_hjdb	222	45.53	208.12	112.57	28.84	70.00	0.82
boeck2020/dar	222	33.35	231.58	114.71	33.64	79.00	0.78
davies2009/mirex_qm_tempotracker	222	80.75	234.91	125.65	26.74	81.00	0.93
echonest/version_3_2_1	221	58.30	191.72	100.20	27.44	71.00	0.76
gkiokas2012/default	222	52.00	244.00	112.24	31.59	73.00	0.83
klapuri2006/percival2014	222	74.36	161.50	114.11	19.76	76.00	0.98
oliveira2010/ibt	222	82.00	161.00	116.20	20.30	81.00	1.00
percival2014/stem	222	50.79	152.00	105.84	22.20	72.00	0.93
scheirer1998/percival2014	212	61.35	181.82	109.47	28.37	74.00	0.81
schreiber2014/default	222	54.85	164.50	101.41	22.60	69.00	0.90
schreiber2017/ismir2017	222	26.50	193.51	106.25	26.09	71.00	0.84
schreiber2017/mirex2017	222	13.25	197.54	105.21	29.14	74.00	0.81
schreiber2018/cnn	222	63.00	216.00	116.03	29.14	81.00	0.88
schreiber2018/fcn	222	50.00	208.00	114.58	30.17	75.00	0.82
schreiber2018/ismir2018	222	65.00	208.00	114.31	26.12	77.00	0.91
sun2021/default	222	56.00	218.00	115.41	32.97	73.00	0.79
zplane/auftakt_v3	222	65.50	164.80	111.46	22.53	76.00	0.92

Table 2: Basic statistics.
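
The two "Sweet Oct." columns are not defined in this report. The sketch below assumes that the sweet octave is the tempo interval [s, 2s) that covers the largest share of an estimator's values, with "Start" being s and "Coverage" being that share. The function name `sweet_octave`, the integer search grid, and the toy tempo values are illustrative assumptions, not the report's implementation.

```python
def sweet_octave(tempi, candidate_starts=range(20, 200)):
    """Return (start, coverage) of the octave [start, 2*start) that
    contains the largest fraction of the given tempo values.

    Assumption: candidate starts are searched on an integer BPM grid,
    and the first maximum wins when several starts tie.
    """
    if not tempi:
        raise ValueError("need at least one tempo value")
    best_start, best_coverage = None, -1.0
    for start in candidate_starts:
        inside = sum(1 for t in tempi if start <= t < 2 * start)
        coverage = inside / len(tempi)
        if coverage > best_coverage:
            best_start, best_coverage = start, coverage
    return best_start, best_coverage


# Toy data, not taken from the corpus.
toy_tempi = [60, 80, 100, 120, 140, 200]
start, coverage = sweet_octave(toy_tempi)
print(start, round(coverage, 2))
```

Under this reading, a coverage close to 1.0 would mean that nearly all of an estimator's values fall within a single octave.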

Smoothed Tempo Distribution

Figure 4: Percentage of values in tempo interval.

Accuracy

Accuracy₁ is defined as the percentage of correct estimates, allowing a 4% tolerance for individual BPM values.

Accuracy₂ additionally permits estimates to be wrong by a factor of 2, 3, 1/2 or 1/3 (so-called octave errors).

See [Gouyon2006].

Note: When comparing accuracy values for different algorithms, keep in mind that an algorithm may have been trained on the test set or that the test set may have even been created using one of the tested algorithms.
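
The two metrics defined above can be sketched in a few lines. This is a minimal illustration, not the code used to produce the tables below. The function names are hypothetical, and measuring the tolerance relative to the (scaled) reference value is an assumption.

```python
# Factors permitted by Accuracy2: exact match, double, triple, half, third.
OCTAVE_FACTORS = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)


def is_accurate1(reference_bpm, estimate_bpm, tolerance=0.04):
    """True if the estimate lies within +/- tolerance of the reference."""
    return abs(estimate_bpm - reference_bpm) <= tolerance * reference_bpm


def is_accurate2(reference_bpm, estimate_bpm, tolerance=0.04):
    """Like Accuracy1, but also accepts octave errors (factors 2, 3, 1/2, 1/3)."""
    return any(
        is_accurate1(reference_bpm * factor, estimate_bpm, tolerance)
        for factor in OCTAVE_FACTORS
    )


def mean_accuracy(references, estimates, metric, tolerance=0.04):
    """Fraction of tracks for which the metric accepts the estimate."""
    hits = [metric(r, e, tolerance) for r, e in zip(references, estimates)]
    return sum(hits) / len(hits)


# Toy values, not taken from the corpus.
refs = [120.0, 100.0, 90.0]
ests = [121.0, 50.0, 140.0]
print(mean_accuracy(refs, ests, is_accurate1))
print(mean_accuracy(refs, ests, is_accurate2))
```

In the toy data, the second estimate (50 BPM for a 100 BPM reference) is rejected by Accuracy₁ but accepted by Accuracy₂ as a half-tempo octave error.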

Accuracy Results for 1.0

Estimator	Accuracy₁	Accuracy₂
boeck2020/dar	0.8108	0.8919
boeck2015/tempodetector2016_default	0.8063	0.8829
sun2021/default	0.8018	0.9099
boeck2019/multi_task_hjdb	0.7973	0.8874
boeck2019/multi_task	0.7973	0.8964
schreiber2018/ismir2018	0.7748	0.8423
schreiber2018/fcn	0.7703	0.8649
schreiber2018/cnn	0.7658	0.8468
schreiber2017/mirex2017	0.7387	0.8604
schreiber2017/ismir2017	0.7297	0.8514
oliveira2010/ibt	0.7252	0.8198
davies2009/mirex_qm_tempotracker	0.7207	0.8288
klapuri2006/percival2014	0.7162	0.8423
schreiber2014/default	0.7072	0.8694
zplane/auftakt_v3	0.6982	0.8243
percival2014/stem	0.6982	0.8694
echonest/version_3_2_1	0.6667	0.8559
gkiokas2012/default	0.6441	0.8468
scheirer1998/percival2014	0.4910	0.6532

Table 3: Mean accuracy of estimates compared to version 1.0 with 4% tolerance, ordered by Accuracy₁.


Accuracy₁ for 1.0

Figure 5: Mean Accuracy₁ for estimates compared to version 1.0 depending on tolerance.

Accuracy₂ for 1.0

Figure 6: Mean Accuracy₂ for estimates compared to version 1.0 depending on tolerance.

Accuracy Results for 2.0

Estimator	Accuracy₁	Accuracy₂
boeck2020/dar	0.8514	0.9459
boeck2015/tempodetector2016_default	0.8514	0.9279
boeck2019/multi_task_hjdb	0.8333	0.9324
boeck2019/multi_task	0.8243	0.9324
sun2021/default	0.8243	0.9279
schreiber2018/ismir2018	0.8018	0.8784
schreiber2018/fcn	0.7928	0.8919
schreiber2018/cnn	0.7838	0.8739
davies2009/mirex_qm_tempotracker	0.7523	0.8649
schreiber2017/mirex2017	0.7477	0.8964
schreiber2017/ismir2017	0.7477	0.8919
oliveira2010/ibt	0.7432	0.8378
klapuri2006/percival2014	0.7297	0.8559
zplane/auftakt_v3	0.7117	0.8468
percival2014/stem	0.7117	0.9054
schreiber2014/default	0.6982	0.8829
echonest/version_3_2_1	0.6847	0.8829
gkiokas2012/default	0.6757	0.8829
scheirer1998/percival2014	0.5180	0.6847

Table 4: Mean accuracy of estimates compared to version 2.0 with 4% tolerance, ordered by Accuracy₁.


Accuracy₁ for 2.0

Figure 7: Mean Accuracy₁ for estimates compared to version 2.0 depending on tolerance.

Accuracy₂ for 2.0

Figure 8: Mean Accuracy₂ for estimates compared to version 2.0 depending on tolerance.

Accuracy Results for 3.0

Estimator	Accuracy₁	Accuracy₂
boeck2015/tempodetector2016_default	0.8604	0.9369
boeck2020/dar	0.8514	0.9414
boeck2019/multi_task_hjdb	0.8243	0.9279
sun2021/default	0.8198	0.9279
boeck2019/multi_task	0.8153	0.9279
schreiber2018/ismir2018	0.8018	0.8739
schreiber2018/fcn	0.7928	0.8919
schreiber2018/cnn	0.7793	0.8694
davies2009/mirex_qm_tempotracker	0.7523	0.8559
schreiber2017/ismir2017	0.7477	0.8919
schreiber2017/mirex2017	0.7432	0.8919
oliveira2010/ibt	0.7432	0.8378
klapuri2006/percival2014	0.7297	0.8559
zplane/auftakt_v3	0.7117	0.8423
percival2014/stem	0.7117	0.8919
schreiber2014/default	0.7072	0.8829
echonest/version_3_2_1	0.6802	0.8784
gkiokas2012/default	0.6712	0.8784
scheirer1998/percival2014	0.5180	0.6892

Table 5: Mean accuracy of estimates compared to version 3.0 with 4% tolerance, ordered by Accuracy₁.


Accuracy₁ for 3.0

Figure 9: Mean Accuracy₁ for estimates compared to version 3.0 depending on tolerance.

Accuracy₂ for 3.0

Figure 10: Mean Accuracy₂ for estimates compared to version 3.0 depending on tolerance.

Differing Items

For which items did a given estimator fail to estimate a correct value with respect to a given ground truth? Are there items that are very difficult, unsuitable for the task, or incorrectly annotated, and therefore never estimated correctly, regardless of which estimator is used?

Differing Items Accuracy₁

Items with different tempo annotations (Accuracy₁, 4% tolerance) in different versions:

1.0 compared with boeck2015/tempodetector2016_default (43 differences): ‘006’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘038’ ‘053’ ‘055’ ‘058’ ‘059’ … CSV

1.0 compared with boeck2019/multi_task (45 differences): ‘006’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘048’ ‘055’ ‘057’ ‘058’ … CSV

1.0 compared with boeck2019/multi_task_hjdb (45 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘055’ ‘057’ ‘058’ … CSV

1.0 compared with boeck2020/dar (42 differences): ‘006’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘059’ ‘062’ ‘072’ ‘073’ ‘075’ … CSV

1.0 compared with davies2009/mirex_qm_tempotracker (62 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘035’ ‘037’ ‘047’ ‘055’ … CSV

1.0 compared with echonest/version_3_2_1 (74 differences): ‘003’ ‘006’ ‘009’ ‘012’ ‘013’ ‘019’ ‘024’ ‘037’ ‘053’ ‘055’ ‘059’ … CSV

1.0 compared with gkiokas2012/default (79 differences): ‘003’ ‘006’ ‘007’ ‘008’ ‘009’ ‘010’ ‘012’ ‘013’ ‘022’ ‘024’ ‘037’ … CSV

1.0 compared with klapuri2006/percival2014 (63 differences): ‘006’ ‘007’ ‘009’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘053’ ‘055’ … CSV

1.0 compared with oliveira2010/ibt (61 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘035’ ‘047’ ‘055’ ‘058’ … CSV

1.0 compared with percival2014/stem (67 differences): ‘006’ ‘007’ ‘009’ ‘012’ ‘013’ ‘022’ ‘024’ ‘042’ ‘047’ ‘053’ ‘055’ … CSV

1.0 compared with scheirer1998/percival2014 (113 differences): ‘001’ ‘002’ ‘003’ ‘006’ ‘007’ ‘009’ ‘010’ ‘012’ ‘013’ ‘019’ ‘020’ … CSV

1.0 compared with schreiber2014/default (65 differences): ‘006’ ‘007’ ‘009’ ‘012’ ‘013’ ‘016’ ‘022’ ‘024’ ‘037’ ‘052’ ‘053’ … CSV

1.0 compared with schreiber2017/ismir2017 (60 differences): ‘006’ ‘009’ ‘012’ ‘013’ ‘016’ ‘024’ ‘025’ ‘055’ ‘058’ ‘059’ ‘061’ … CSV

1.0 compared with schreiber2017/mirex2017 (58 differences): ‘006’ ‘009’ ‘012’ ‘013’ ‘024’ ‘025’ ‘035’ ‘058’ ‘059’ ‘061’ ‘062’ … CSV

1.0 compared with schreiber2018/cnn (52 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘019’ ‘022’ ‘024’ ‘037’ ‘056’ ‘057’ ‘059’ … CSV

1.0 compared with schreiber2018/fcn (51 differences): ‘006’ ‘007’ ‘010’ ‘012’ ‘013’ ‘024’ ‘035’ ‘037’ ‘052’ ‘059’ ‘062’ … CSV

1.0 compared with schreiber2018/ismir2018 (50 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘037’ ‘047’ ‘053’ ‘057’ … CSV

1.0 compared with sun2021/default (44 differences): ‘006’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘056’ ‘057’ ‘059’ ‘062’ ‘072’ … CSV

1.0 compared with zplane/auftakt_v3 (67 differences): ‘006’ ‘007’ ‘010’ ‘012’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘055’ ‘057’ … CSV

2.0 compared with boeck2015/tempodetector2016_default (33 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘038’ ‘053’ ‘055’ ‘058’ ‘059’ ‘066’ ‘072’ … CSV

2.0 compared with boeck2019/multi_task (39 differences): ‘006’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘048’ ‘055’ ‘058’ ‘059’ ‘072’ … CSV

2.0 compared with boeck2019/multi_task_hjdb (37 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘047’ ‘055’ ‘058’ ‘059’ ‘060’ ‘072’ ‘073’ … CSV

2.0 compared with boeck2020/dar (33 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘059’ ‘072’ ‘073’ ‘075’ ‘097’ ‘103’ ‘107’ … CSV

2.0 compared with davies2009/mirex_qm_tempotracker (55 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘022’ ‘025’ ‘047’ ‘055’ ‘057’ ‘058’ ‘059’ … CSV

2.0 compared with echonest/version_3_2_1 (70 differences): ‘003’ ‘006’ ‘009’ ‘013’ ‘019’ ‘024’ ‘037’ ‘053’ ‘055’ ‘059’ ‘078’ … CSV

2.0 compared with gkiokas2012/default (72 differences): ‘003’ ‘006’ ‘007’ ‘008’ ‘009’ ‘010’ ‘013’ ‘022’ ‘053’ ‘055’ ‘059’ … CSV

2.0 compared with klapuri2006/percival2014 (60 differences): ‘006’ ‘007’ ‘009’ ‘013’ ‘022’ ‘025’ ‘047’ ‘053’ ‘055’ ‘057’ ‘059’ … CSV

2.0 compared with oliveira2010/ibt (57 differences): ‘006’ ‘007’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘055’ ‘058’ ‘059’ ‘069’ … CSV

2.0 compared with percival2014/stem (64 differences): ‘006’ ‘007’ ‘009’ ‘013’ ‘022’ ‘042’ ‘047’ ‘053’ ‘055’ ‘057’ ‘059’ … CSV

2.0 compared with scheirer1998/percival2014 (107 differences): ‘001’ ‘002’ ‘003’ ‘006’ ‘007’ ‘009’ ‘010’ ‘012’ ‘013’ ‘019’ ‘020’ … CSV

2.0 compared with schreiber2014/default (67 differences): ‘006’ ‘007’ ‘009’ ‘013’ ‘016’ ‘022’ ‘024’ ‘052’ ‘053’ ‘055’ ‘059’ … CSV

2.0 compared with schreiber2017/ismir2017 (56 differences): ‘006’ ‘009’ ‘013’ ‘016’ ‘024’ ‘025’ ‘055’ ‘058’ ‘059’ ‘061’ ‘067’ … CSV

2.0 compared with schreiber2017/mirex2017 (56 differences): ‘006’ ‘009’ ‘013’ ‘025’ ‘035’ ‘058’ ‘059’ ‘061’ ‘062’ ‘067’ ‘075’ … CSV

2.0 compared with schreiber2018/cnn (48 differences): ‘006’ ‘007’ ‘013’ ‘019’ ‘022’ ‘056’ ‘059’ ‘060’ ‘062’ ‘070’ ‘073’ … CSV

2.0 compared with schreiber2018/fcn (46 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘035’ ‘052’ ‘059’ ‘062’ ‘074’ ‘075’ ‘079’ … CSV

2.0 compared with schreiber2018/ismir2018 (44 differences): ‘006’ ‘007’ ‘013’ ‘022’ ‘025’ ‘047’ ‘053’ ‘057’ ‘059’ ‘062’ ‘075’ … CSV

2.0 compared with sun2021/default (39 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘056’ ‘057’ ‘059’ ‘062’ ‘072’ ‘073’ ‘075’ … CSV

2.0 compared with zplane/auftakt_v3 (64 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘022’ ‘025’ ‘047’ ‘055’ ‘057’ ‘059’ ‘062’ … CSV

3.0 compared with boeck2015/tempodetector2016_default (31 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘038’ ‘053’ ‘055’ ‘058’ ‘059’ ‘066’ ‘072’ … CSV

3.0 compared with boeck2019/multi_task (41 differences): ‘006’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘048’ ‘055’ ‘058’ ‘059’ ‘072’ … CSV

3.0 compared with boeck2019/multi_task_hjdb (39 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘047’ ‘055’ ‘058’ ‘059’ ‘060’ ‘072’ ‘073’ … CSV

3.0 compared with boeck2020/dar (33 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘059’ ‘072’ ‘073’ ‘075’ ‘097’ ‘103’ ‘107’ … CSV

3.0 compared with davies2009/mirex_qm_tempotracker (55 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘022’ ‘025’ ‘047’ ‘055’ ‘057’ ‘058’ ‘059’ … CSV

3.0 compared with echonest/version_3_2_1 (71 differences): ‘003’ ‘006’ ‘007’ ‘009’ ‘013’ ‘019’ ‘024’ ‘037’ ‘053’ ‘055’ ‘059’ … CSV

3.0 compared with gkiokas2012/default (73 differences): ‘003’ ‘006’ ‘007’ ‘008’ ‘009’ ‘010’ ‘013’ ‘022’ ‘053’ ‘055’ ‘059’ … CSV

3.0 compared with klapuri2006/percival2014 (60 differences): ‘006’ ‘007’ ‘009’ ‘013’ ‘022’ ‘025’ ‘047’ ‘053’ ‘055’ ‘057’ ‘059’ … CSV

3.0 compared with oliveira2010/ibt (57 differences): ‘006’ ‘007’ ‘013’ ‘022’ ‘024’ ‘025’ ‘047’ ‘055’ ‘058’ ‘059’ ‘069’ … CSV

3.0 compared with percival2014/stem (64 differences): ‘006’ ‘007’ ‘009’ ‘013’ ‘022’ ‘042’ ‘047’ ‘053’ ‘055’ ‘059’ ‘067’ … CSV

3.0 compared with scheirer1998/percival2014 (107 differences): ‘001’ ‘002’ ‘003’ ‘006’ ‘007’ ‘009’ ‘010’ ‘012’ ‘013’ ‘019’ ‘020’ … CSV

3.0 compared with schreiber2014/default (65 differences): ‘006’ ‘009’ ‘013’ ‘016’ ‘022’ ‘024’ ‘052’ ‘053’ ‘055’ ‘059’ ‘062’ … CSV

3.0 compared with schreiber2017/ismir2017 (56 differences): ‘006’ ‘009’ ‘013’ ‘016’ ‘024’ ‘025’ ‘055’ ‘058’ ‘059’ ‘061’ ‘067’ … CSV

3.0 compared with schreiber2017/mirex2017 (57 differences): ‘006’ ‘009’ ‘013’ ‘025’ ‘035’ ‘058’ ‘059’ ‘061’ ‘062’ ‘067’ ‘075’ … CSV

3.0 compared with schreiber2018/cnn (49 differences): ‘006’ ‘007’ ‘013’ ‘019’ ‘022’ ‘056’ ‘059’ ‘060’ ‘062’ ‘070’ ‘073’ … CSV

3.0 compared with schreiber2018/fcn (46 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘035’ ‘052’ ‘059’ ‘062’ ‘074’ ‘075’ ‘079’ … CSV

3.0 compared with schreiber2018/ismir2018 (44 differences): ‘006’ ‘007’ ‘013’ ‘022’ ‘025’ ‘047’ ‘053’ ‘057’ ‘059’ ‘062’ ‘075’ … CSV

3.0 compared with sun2021/default (40 differences): ‘006’ ‘013’ ‘022’ ‘025’ ‘056’ ‘059’ ‘062’ ‘072’ ‘073’ ‘075’ ‘079’ … CSV

3.0 compared with zplane/auftakt_v3 (64 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘022’ ‘025’ ‘047’ ‘055’ ‘057’ ‘059’ ‘062’ … CSV

None of the estimators estimated the following 4 items ‘correctly’ using Accuracy₁: ‘006’ ‘013’ ‘059’ ‘137’ CSV
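
One way to arrive at such a list of never-correct items is to intersect the per-comparison difference lists. The sketch below shows the idea with made-up item IDs. The function name `never_correct` and the dictionary layout are assumptions for illustration, not the report's actual data structures.

```python
def never_correct(differing_items):
    """Return the sorted item IDs that appear in every difference list.

    differing_items maps a comparison label (hypothetical layout, e.g.
    "reference vs. estimator") to the IDs that comparison got wrong.
    """
    id_sets = [set(ids) for ids in differing_items.values()]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


# Toy data: item IDs and labels are invented for this example.
toy_differences = {
    "ref vs. estimator_a": ["001", "004", "007"],
    "ref vs. estimator_b": ["001", "002", "007", "009"],
    "ref vs. estimator_c": ["001", "007", "008"],
}
print(never_correct(toy_differences))
```

Items that survive the intersection are candidates for closer inspection, since they may be difficult, unsuitable, or incorrectly annotated.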

Differing Items Accuracy₂

Items with different tempo annotations (Accuracy₂, 4% tolerance) in different versions:

1.0 compared with boeck2015/tempodetector2016_default (26 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘059’ ‘062’ ‘072’ ‘075’ ‘078’ ‘107’ ‘125’ … CSV

1.0 compared with boeck2019/multi_task (23 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘057’ ‘058’ ‘059’ ‘062’ ‘072’ ‘107’ ‘126’ … CSV

1.0 compared with boeck2019/multi_task_hjdb (25 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘024’ ‘057’ ‘058’ ‘059’ ‘062’ ‘072’ ‘075’ … CSV

1.0 compared with boeck2020/dar (24 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘059’ ‘062’ ‘072’ ‘075’ ‘107’ ‘122’ ‘126’ … CSV

1.0 compared with davies2009/mirex_qm_tempotracker (38 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘024’ ‘035’ ‘037’ ‘059’ ‘062’ ‘075’ ‘091’ … CSV

1.0 compared with echonest/version_3_2_1 (32 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘059’ ‘062’ ‘094’ ‘103’ ‘106’ ‘107’ ‘121’ … CSV

1.0 compared with gkiokas2012/default (34 differences): ‘003’ ‘006’ ‘010’ ‘012’ ‘013’ ‘024’ ‘037’ ‘057’ ‘059’ ‘062’ ‘091’ … CSV

1.0 compared with klapuri2006/percival2014 (35 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘057’ ‘059’ ‘062’ ‘075’ ‘091’ ‘103’ ‘107’ … CSV

1.0 compared with oliveira2010/ibt (40 differences): ‘006’ ‘007’ ‘012’ ‘024’ ‘035’ ‘058’ ‘059’ ‘062’ ‘070’ ‘075’ ‘091’ … CSV

1.0 compared with percival2014/stem (29 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘024’ ‘057’ ‘059’ ‘062’ ‘075’ ‘107’ ‘123’ … CSV

1.0 compared with scheirer1998/percival2014 (77 differences): ‘001’ ‘002’ ‘003’ ‘007’ ‘009’ ‘010’ ‘012’ ‘013’ ‘020’ ‘024’ ‘043’ … CSV

1.0 compared with schreiber2014/default (29 differences): ‘006’ ‘007’ ‘012’ ‘024’ ‘037’ ‘052’ ‘059’ ‘062’ ‘072’ ‘075’ ‘091’ … CSV

1.0 compared with schreiber2017/ismir2017 (33 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘058’ ‘059’ ‘061’ ‘062’ ‘075’ ‘091’ ‘107’ … CSV

1.0 compared with schreiber2017/mirex2017 (31 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘058’ ‘059’ ‘062’ ‘075’ ‘091’ ‘107’ ‘124’ … CSV

1.0 compared with schreiber2018/cnn (34 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘024’ ‘037’ ‘057’ ‘059’ ‘062’ ‘091’ ‘107’ … CSV

1.0 compared with schreiber2018/fcn (30 differences): ‘006’ ‘007’ ‘010’ ‘012’ ‘013’ ‘024’ ‘037’ ‘052’ ‘062’ ‘078’ ‘107’ … CSV

1.0 compared with schreiber2018/ismir2018 (35 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘024’ ‘037’ ‘057’ ‘059’ ‘062’ ‘078’ ‘091’ … CSV

1.0 compared with sun2021/default (20 differences): ‘006’ ‘012’ ‘013’ ‘024’ ‘057’ ‘059’ ‘072’ ‘075’ ‘107’ ‘127’ ‘129’ … CSV

1.0 compared with zplane/auftakt_v3 (39 differences): ‘006’ ‘007’ ‘010’ ‘012’ ‘013’ ‘024’ ‘057’ ‘059’ ‘062’ ‘075’ ‘107’ … CSV

2.0 compared with boeck2015/tempodetector2016_default (16 differences): ‘006’ ‘013’ ‘059’ ‘072’ ‘075’ ‘107’ ‘125’ ‘126’ ‘127’ ‘132’ ‘133’ … CSV

2.0 compared with boeck2019/multi_task (15 differences): ‘006’ ‘013’ ‘058’ ‘059’ ‘072’ ‘107’ ‘126’ ‘127’ ‘132’ ‘137’ ‘138’ … CSV

2.0 compared with boeck2019/multi_task_hjdb (15 differences): ‘006’ ‘013’ ‘058’ ‘059’ ‘072’ ‘075’ ‘107’ ‘127’ ‘132’ ‘134’ ‘137’ … CSV

2.0 compared with boeck2020/dar (12 differences): ‘006’ ‘059’ ‘072’ ‘075’ ‘122’ ‘126’ ‘133’ ‘137’ ‘139’ ‘140’ ‘150’ … CSV

2.0 compared with davies2009/mirex_qm_tempotracker (30 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘059’ ‘075’ ‘091’ ‘122’ ‘123’ ‘124’ ‘126’ … CSV

2.0 compared with echonest/version_3_2_1 (26 differences): ‘006’ ‘013’ ‘024’ ‘059’ ‘094’ ‘103’ ‘106’ ‘107’ ‘122’ ‘123’ ‘124’ … CSV

2.0 compared with gkiokas2012/default (26 differences): ‘003’ ‘006’ ‘010’ ‘013’ ‘059’ ‘091’ ‘107’ ‘121’ ‘122’ ‘124’ ‘125’ … CSV

2.0 compared with klapuri2006/percival2014 (32 differences): ‘006’ ‘013’ ‘057’ ‘059’ ‘075’ ‘090’ ‘091’ ‘103’ ‘107’ ‘121’ ‘122’ … CSV

2.0 compared with oliveira2010/ibt (36 differences): ‘006’ ‘007’ ‘024’ ‘058’ ‘059’ ‘070’ ‘075’ ‘091’ ‘103’ ‘121’ ‘122’ … CSV

2.0 compared with percival2014/stem (21 differences): ‘006’ ‘007’ ‘013’ ‘057’ ‘059’ ‘075’ ‘123’ ‘124’ ‘125’ ‘126’ ‘127’ … CSV

2.0 compared with scheirer1998/percival2014 (70 differences): ‘001’ ‘002’ ‘003’ ‘006’ ‘007’ ‘009’ ‘010’ ‘012’ ‘013’ ‘020’ ‘043’ … CSV

2.0 compared with schreiber2014/default (26 differences): ‘006’ ‘007’ ‘024’ ‘052’ ‘059’ ‘072’ ‘075’ ‘084’ ‘091’ ‘123’ ‘124’ … CSV

2.0 compared with schreiber2017/ismir2017 (24 differences): ‘006’ ‘013’ ‘058’ ‘059’ ‘061’ ‘075’ ‘091’ ‘124’ ‘125’ ‘126’ ‘127’ … CSV

2.0 compared with schreiber2017/mirex2017 (23 differences): ‘006’ ‘013’ ‘058’ ‘059’ ‘075’ ‘091’ ‘124’ ‘125’ ‘126’ ‘127’ ‘128’ … CSV

2.0 compared with schreiber2018/cnn (28 differences): ‘006’ ‘007’ ‘013’ ‘059’ ‘091’ ‘107’ ‘121’ ‘122’ ‘123’ ‘124’ ‘125’ … CSV

2.0 compared with schreiber2018/fcn (24 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘052’ ‘107’ ‘121’ ‘124’ ‘125’ ‘126’ ‘127’ … CSV

2.0 compared with schreiber2018/ismir2018 (27 differences): ‘006’ ‘007’ ‘013’ ‘057’ ‘059’ ‘091’ ‘107’ ‘122’ ‘123’ ‘125’ ‘126’ … CSV

2.0 compared with sun2021/default (16 differences): ‘006’ ‘013’ ‘057’ ‘059’ ‘072’ ‘075’ ‘127’ ‘137’ ‘138’ ‘139’ ‘140’ … CSV

2.0 compared with zplane/auftakt_v3 (34 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘057’ ‘059’ ‘075’ ‘107’ ‘121’ ‘122’ ‘123’ … CSV

3.0 compared with boeck2015/tempodetector2016_default (14 differences): ‘006’ ‘013’ ‘059’ ‘072’ ‘075’ ‘107’ ‘126’ ‘127’ ‘133’ ‘137’ ‘138’ … CSV

3.0 compared with boeck2019/multi_task (16 differences): ‘006’ ‘058’ ‘059’ ‘072’ ‘107’ ‘122’ ‘126’ ‘127’ ‘132’ ‘133’ ‘137’ … CSV

3.0 compared with boeck2019/multi_task_hjdb (16 differences): ‘006’ ‘058’ ‘059’ ‘072’ ‘075’ ‘107’ ‘122’ ‘127’ ‘132’ ‘133’ ‘134’ … CSV

3.0 compared with boeck2020/dar (13 differences): ‘006’ ‘059’ ‘072’ ‘075’ ‘107’ ‘122’ ‘126’ ‘133’ ‘137’ ‘139’ ‘140’ … CSV

3.0 compared with davies2009/mirex_qm_tempotracker (32 differences): ‘006’ ‘007’ ‘012’ ‘013’ ‘059’ ‘075’ ‘091’ ‘122’ ‘123’ ‘124’ ‘125’ … CSV

3.0 compared with echonest/version_3_2_1 (27 differences): ‘006’ ‘007’ ‘013’ ‘024’ ‘059’ ‘094’ ‘103’ ‘106’ ‘107’ ‘122’ ‘123’ … CSV

3.0 compared with gkiokas2012/default (27 differences): ‘003’ ‘006’ ‘010’ ‘013’ ‘059’ ‘091’ ‘107’ ‘121’ ‘122’ ‘124’ ‘125’ … CSV

3.0 compared with klapuri2006/percival2014 (32 differences): ‘006’ ‘013’ ‘057’ ‘059’ ‘075’ ‘090’ ‘091’ ‘103’ ‘107’ ‘121’ ‘122’ … CSV

3.0 compared with oliveira2010/ibt (36 differences): ‘006’ ‘007’ ‘024’ ‘058’ ‘059’ ‘070’ ‘075’ ‘091’ ‘103’ ‘121’ ‘122’ … CSV

3.0 compared with percival2014/stem (24 differences): ‘006’ ‘007’ ‘013’ ‘059’ ‘075’ ‘107’ ‘121’ ‘122’ ‘123’ ‘124’ ‘125’ … CSV

3.0 compared with scheirer1998/percival2014 (69 differences): ‘001’ ‘002’ ‘003’ ‘006’ ‘007’ ‘009’ ‘010’ ‘012’ ‘020’ ‘043’ ‘058’ … CSV

3.0 compared with schreiber2014/default (26 differences): ‘006’ ‘024’ ‘052’ ‘059’ ‘072’ ‘075’ ‘084’ ‘091’ ‘107’ ‘122’ ‘123’ … CSV

3.0 compared with schreiber2017/ismir2017 (24 differences): ‘006’ ‘013’ ‘058’ ‘059’ ‘061’ ‘075’ ‘091’ ‘107’ ‘124’ ‘125’ ‘126’ … CSV

3.0 compared with schreiber2017/mirex2017 (24 differences): ‘006’ ‘013’ ‘058’ ‘059’ ‘075’ ‘091’ ‘107’ ‘124’ ‘125’ ‘126’ ‘127’ … CSV

3.0 compared with schreiber2018/cnn (29 differences): ‘006’ ‘007’ ‘013’ ‘059’ ‘091’ ‘107’ ‘121’ ‘122’ ‘123’ ‘124’ ‘125’ … CSV

3.0 compared with schreiber2018/fcn (24 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘052’ ‘107’ ‘121’ ‘124’ ‘126’ ‘127’ ‘129’ … CSV

3.0 compared with schreiber2018/ismir2018 (28 differences): ‘006’ ‘007’ ‘013’ ‘057’ ‘059’ ‘091’ ‘107’ ‘122’ ‘123’ ‘125’ ‘126’ … CSV

3.0 compared with sun2021/default (16 differences): ‘006’ ‘013’ ‘059’ ‘072’ ‘075’ ‘121’ ‘127’ ‘133’ ‘137’ ‘138’ ‘139’ … CSV

3.0 compared with zplane/auftakt_v3 (35 differences): ‘006’ ‘007’ ‘010’ ‘013’ ‘057’ ‘059’ ‘075’ ‘107’ ‘112’ ‘121’ ‘122’ … CSV

All tracks were estimated ‘correctly’ by at least one system.

Significance of Differences

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.8318	0.8450	1.0000	0.0026	0.0000	0.0000	0.0029	0.0064	0.0003	0.0000	0.0013	0.0033	0.0201	0.2110	0.2430	0.2478	1.0000	0.0003
boeck2019/multi_task	0.8318	1.0000	1.0000	0.6776	0.0076	0.0002	0.0000	0.0079	0.0139	0.0013	0.0000	0.0078	0.0167	0.0596	0.3368	0.4408	0.4731	1.0000	0.0009
boeck2019/multi_task_hjdb	0.8450	1.0000	1.0000	0.6636	0.0060	0.0003	0.0000	0.0064	0.0113	0.0013	0.0000	0.0078	0.0201	0.0660	0.3105	0.4408	0.4583	1.0000	0.0007
boeck2020/dar	1.0000	0.6776	0.6636	1.0000	0.0037	0.0000	0.0000	0.0025	0.0054	0.0003	0.0000	0.0022	0.0039	0.0139	0.1214	0.1996	0.1849	0.8388	0.0002
davies2009/mirex_qm_tempotracker	0.0026	0.0076	0.0060	0.0037	1.0000	0.1337	0.0115	1.0000	1.0000	0.5114	0.0000	0.7838	0.8714	0.6177	0.1214	0.1173	0.0227	0.0175	0.4049
echonest/version_3_2_1	0.0000	0.0002	0.0003	0.0000	0.1337	1.0000	0.5114	0.1352	0.1048	0.3489	0.0000	0.2221	0.0436	0.0226	0.0038	0.0008	0.0009	0.0002	0.3916
gkiokas2012/default	0.0000	0.0000	0.0000	0.0000	0.0115	0.5114	1.0000	0.0139	0.0114	0.0730	0.0001	0.0488	0.0079	0.0031	0.0001	0.0000	0.0000	0.0000	0.0807
klapuri2006/percival2014	0.0029	0.0079	0.0064	0.0025	1.0000	0.1352	0.0139	1.0000	0.8145	0.5413	0.0000	0.8804	0.7428	0.5224	0.0895	0.0807	0.0146	0.0145	0.5413
oliveira2010/ibt	0.0064	0.0139	0.0113	0.0054	1.0000	0.1048	0.0114	0.8145	1.0000	0.3915	0.0000	0.6885	1.0000	0.7428	0.1755	0.1641	0.0522	0.0270	0.3269
percival2014/stem	0.0003	0.0013	0.0013	0.0003	0.5114	0.3489	0.0730	0.5413	0.3915	1.0000	0.0000	0.8714	0.2962	0.1877	0.0357	0.0195	0.0060	0.0018	1.0000
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0001	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0013	0.0078	0.0078	0.0022	0.7838	0.2221	0.0488	0.8804	0.6885	0.8714	0.0000	1.0000	0.5224	0.3489	0.0854	0.0595	0.0444	0.0065	0.8776
schreiber2017/ismir2017	0.0033	0.0167	0.0201	0.0039	0.8714	0.0436	0.0079	0.7428	1.0000	0.2962	0.0000	0.5224	1.0000	0.8036	0.2800	0.1996	0.1214	0.0259	0.3240
schreiber2017/mirex2017	0.0201	0.0596	0.0660	0.0139	0.6177	0.0226	0.0031	0.5224	0.7428	0.1877	0.0000	0.3489	0.8036	1.0000	0.4408	0.3368	0.2559	0.0649	0.1755
schreiber2018/cnn	0.2110	0.3368	0.3105	0.1214	0.1214	0.0038	0.0001	0.0895	0.1755	0.0357	0.0000	0.0854	0.2800	0.4408	1.0000	1.0000	0.8388	0.2682	0.0315
schreiber2018/fcn	0.2430	0.4408	0.4408	0.1996	0.1173	0.0008	0.0000	0.0807	0.1641	0.0195	0.0000	0.0595	0.1996	0.3368	1.0000	1.0000	1.0000	0.3489	0.0195
schreiber2018/ismir2018	0.2478	0.4731	0.4583	0.1849	0.0227	0.0009	0.0000	0.0146	0.0522	0.0060	0.0000	0.0444	0.1214	0.2559	0.8388	1.0000	1.0000	0.4177	0.0033
sun2021/default	1.0000	1.0000	1.0000	0.8388	0.0175	0.0002	0.0000	0.0145	0.0270	0.0018	0.0000	0.0065	0.0259	0.0649	0.2682	0.3489	0.4177	1.0000	0.0018
zplane/auftakt_v3	0.0003	0.0009	0.0007	0.0002	0.4049	0.3916	0.0807	0.5413	0.3269	1.0000	0.0000	0.8776	0.3240	0.1755	0.0315	0.0195	0.0033	0.0018	1.0000

Table 6: McNemar p-values, using reference annotations 1.0 as ground truth with Accuracy₁ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e., the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.
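
As a rough illustration of how such a p-value can be obtained, the sketch below computes an exact (binomial) two-sided McNemar test from two per-track correctness lists. This report does not state whether it uses the exact variant or a chi-square approximation, so the choice of the exact test, the function name `mcnemar_exact`, and the toy data are all assumptions.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired correctness lists.

    Only discordant pairs matter: tracks where exactly one of the two
    estimators agrees with the ground truth.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) probability of a split at least as uneven as k.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy data: per-track Accuracy1 outcomes for two hypothetical estimators.
est_a = [True, True, True, True, True, False, True]
est_b = [False, False, False, False, False, True, True]
print(mcnemar_exact(est_a, est_b))
```

Under this variant, comparing an estimator with itself yields p = 1.0, which is consistent with the diagonal of the table.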

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.3075	0.5716	1.0000	0.0016	0.0000	0.0000	0.0001	0.0004	0.0000	0.0000	0.0000	0.0001	0.0006	0.0237	0.0596	0.0708	0.3915	0.0000
boeck2019/multi_task	0.3075	1.0000	0.7905	0.3449	0.0226	0.0001	0.0001	0.0031	0.0079	0.0005	0.0000	0.0001	0.0060	0.0161	0.2110	0.3713	0.4996	1.0000	0.0005
boeck2019/multi_task_hjdb	0.5716	0.7905	1.0000	0.5572	0.0064	0.0000	0.0000	0.0008	0.0029	0.0002	0.0000	0.0001	0.0034	0.0066	0.1081	0.2327	0.2962	0.8601	0.0001
boeck2020/dar	1.0000	0.3449	0.5572	1.0000	0.0026	0.0000	0.0000	0.0002	0.0009	0.0000	0.0000	0.0000	0.0006	0.0008	0.0167	0.0596	0.0801	0.3449	0.0000
davies2009/mirex_qm_tempotracker	0.0016	0.0226	0.0064	0.0026	1.0000	0.0722	0.0161	0.3833	0.8036	0.1877	0.0000	0.1480	1.0000	1.0000	0.3240	0.2430	0.0614	0.0440	0.1221
echonest/version_3_2_1	0.0000	0.0001	0.0000	0.0000	0.0722	1.0000	0.8776	0.1934	0.1048	0.4408	0.0000	0.7709	0.0488	0.0595	0.0046	0.0009	0.0007	0.0002	0.4799
gkiokas2012/default	0.0000	0.0001	0.0000	0.0000	0.0161	0.8776	1.0000	0.0730	0.0357	0.2682	0.0000	0.5682	0.0328	0.0293	0.0005	0.0003	0.0001	0.0001	0.2912
klapuri2006/percival2014	0.0001	0.0031	0.0008	0.0002	0.3833	0.1934	0.0730	1.0000	0.6476	0.5413	0.0000	0.3817	0.6358	0.6358	0.0652	0.0436	0.0037	0.0099	0.5572
oliveira2010/ibt	0.0004	0.0079	0.0029	0.0009	0.8036	0.1048	0.0357	0.6476	1.0000	0.3240	0.0000	0.2370	1.0000	1.0000	0.1628	0.1263	0.0241	0.0247	0.2478
percival2014/stem	0.0000	0.0005	0.0002	0.0000	0.1877	0.4408	0.2682	0.5413	0.3240	1.0000	0.0000	0.7493	0.2295	0.2559	0.0226	0.0096	0.0022	0.0013	1.0000
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0001	0.0001	0.0000	0.1480	0.7709	0.5682	0.3817	0.2370	0.7493	0.0000	1.0000	0.1081	0.1173	0.0110	0.0055	0.0022	0.0004	0.7608
schreiber2017/ismir2017	0.0001	0.0060	0.0034	0.0006	1.0000	0.0488	0.0328	0.6358	1.0000	0.2295	0.0000	0.1081	1.0000	1.0000	0.2800	0.1641	0.0652	0.0241	0.2800
schreiber2017/mirex2017	0.0006	0.0161	0.0066	0.0008	1.0000	0.0595	0.0293	0.6358	1.0000	0.2559	0.0000	0.1173	1.0000	1.0000	0.2682	0.1433	0.0730	0.0270	0.2430
schreiber2018/cnn	0.0237	0.2110	0.1081	0.0167	0.3240	0.0046	0.0005	0.0652	0.1628	0.0226	0.0000	0.0110	0.2800	0.2682	1.0000	0.8450	0.5413	0.2221	0.0195
schreiber2018/fcn	0.0596	0.3713	0.2327	0.0596	0.2430	0.0009	0.0003	0.0436	0.1263	0.0096	0.0000	0.0055	0.1641	0.1433	0.8450	1.0000	0.8450	0.3604	0.0079
schreiber2018/ismir2018	0.0708	0.4996	0.2962	0.0801	0.0614	0.0007	0.0001	0.0037	0.0241	0.0022	0.0000	0.0022	0.0652	0.0730	0.5413	0.8450	1.0000	0.5224	0.0005
sun2021/default	0.3915	1.0000	0.8601	0.3449	0.0440	0.0002	0.0001	0.0099	0.0247	0.0013	0.0000	0.0004	0.0241	0.0270	0.2221	0.3604	0.5224	1.0000	0.0010
zplane/auftakt_v3	0.0000	0.0005	0.0001	0.0000	0.1221	0.4799	0.2912	0.5572	0.2478	1.0000	0.0000	0.7608	0.2800	0.2430	0.0195	0.0079	0.0005	0.0010	1.0000

Table 7: McNemar p-values, using reference annotations 2.0 as ground truth with Accuracy₁ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e., the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.0639	0.1849	0.8506	0.0007	0.0000	0.0000	0.0000	0.0002	0.0000	0.0000	0.0000	0.0000	0.0001	0.0079	0.0275	0.0351	0.1496	0.0000
boeck2019/multi_task	0.0639	1.0000	0.7905	0.1686	0.0436	0.0001	0.0001	0.0066	0.0166	0.0011	0.0000	0.0007	0.0167	0.0195	0.2682	0.5424	0.7283	1.0000	0.0011
boeck2019/multi_task_hjdb	0.1849	0.7905	1.0000	0.3075	0.0139	0.0000	0.0000	0.0019	0.0064	0.0005	0.0000	0.0004	0.0095	0.0079	0.1325	0.3604	0.4731	1.0000	0.0002
boeck2020/dar	0.8506	0.1686	0.3075	1.0000	0.0026	0.0000	0.0000	0.0002	0.0009	0.0000	0.0000	0.0000	0.0006	0.0004	0.0113	0.0596	0.0801	0.2478	0.0000
davies2009/mirex_qm_tempotracker	0.0007	0.0436	0.0139	0.0026	1.0000	0.0519	0.0096	0.3833	0.8036	0.1877	0.0000	0.2288	1.0000	0.8746	0.4050	0.2430	0.0614	0.0534	0.1221
echonest/version_3_2_1	0.0000	0.0001	0.0000	0.0000	0.0519	1.0000	0.8746	0.1439	0.0759	0.3489	0.0000	0.4614	0.0357	0.0595	0.0038	0.0005	0.0004	0.0002	0.3916
gkiokas2012/default	0.0000	0.0001	0.0000	0.0000	0.0096	0.8746	1.0000	0.0470	0.0226	0.1877	0.0000	0.3020	0.0213	0.0259	0.0005	0.0001	0.0001	0.0001	0.2221
klapuri2006/percival2014	0.0000	0.0066	0.0019	0.0002	0.3833	0.1439	0.0470	1.0000	0.6476	0.5413	0.0000	0.5515	0.6358	0.7493	0.0895	0.0436	0.0037	0.0119	0.5572
oliveira2010/ibt	0.0002	0.0166	0.0064	0.0009	0.8036	0.0759	0.0226	0.6476	1.0000	0.3105	0.0000	0.3497	1.0000	1.0000	0.2153	0.1263	0.0241	0.0270	0.2478
percival2014/stem	0.0000	0.0011	0.0005	0.0000	0.1877	0.3489	0.1877	0.5413	0.3105	1.0000	0.0000	1.0000	0.2295	0.3105	0.0315	0.0079	0.0022	0.0018	1.0000
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0007	0.0004	0.0000	0.2288	0.4614	0.3020	0.5515	0.3497	1.0000	0.0000	1.0000	0.1877	0.2430	0.0328	0.0110	0.0046	0.0019	1.0000
schreiber2017/ismir2017	0.0000	0.0167	0.0095	0.0006	1.0000	0.0357	0.0213	0.6358	1.0000	0.2295	0.0000	0.1877	1.0000	1.0000	0.3604	0.1641	0.0652	0.0328	0.2800
schreiber2017/mirex2017	0.0001	0.0195	0.0079	0.0004	0.8746	0.0595	0.0259	0.7493	1.0000	0.3105	0.0000	0.2430	1.0000	1.0000	0.2559	0.0989	0.0470	0.0213	0.3105
schreiber2018/cnn	0.0079	0.2682	0.1325	0.0113	0.4050	0.0038	0.0005	0.0895	0.2153	0.0315	0.0000	0.0328	0.3604	0.2559	1.0000	0.6900	0.4049	0.1996	0.0275
schreiber2018/fcn	0.0275	0.5424	0.3604	0.0596	0.2430	0.0005	0.0001	0.0436	0.1263	0.0079	0.0000	0.0110	0.1641	0.0989	0.6900	1.0000	0.8450	0.4177	0.0079
schreiber2018/ismir2018	0.0351	0.7283	0.4731	0.0801	0.0614	0.0004	0.0001	0.0037	0.0241	0.0022	0.0000	0.0046	0.0652	0.0470	0.4049	0.8450	1.0000	0.6177	0.0005
sun2021/default	0.1496	1.0000	1.0000	0.2478	0.0534	0.0002	0.0001	0.0119	0.0270	0.0018	0.0000	0.0019	0.0328	0.0213	0.1996	0.4177	0.6177	1.0000	0.0015
zplane/auftakt_v3	0.0000	0.0011	0.0002	0.0000	0.1221	0.3916	0.2221	0.5572	0.2478	1.0000	0.0000	1.0000	0.2800	0.3105	0.0275	0.0079	0.0005	0.0015	1.0000

Table 8: McNemar p-values, using reference annotations 3.0 as ground truth with Accuracy₁ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e., the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.5078	1.0000	0.6250	0.0118	0.2101	0.0963	0.0636	0.0066	0.5488	0.0000	0.6072	0.0923	0.2668	0.1153	0.4545	0.0352	0.1094	0.0106
boeck2019/multi_task	0.5078	1.0000	0.6250	1.0000	0.0026	0.0490	0.0127	0.0169	0.0009	0.1460	0.0000	0.2632	0.0129	0.0574	0.0266	0.1892	0.0042	0.5078	0.0015
boeck2019/multi_task_hjdb	1.0000	0.6250	1.0000	1.0000	0.0072	0.1892	0.0784	0.0525	0.0041	0.3437	0.0000	0.4807	0.0768	0.2101	0.0784	0.3833	0.0309	0.1797	0.0066
boeck2020/dar	0.6250	1.0000	1.0000	1.0000	0.0013	0.0574	0.0309	0.0192	0.0015	0.2266	0.0000	0.3018	0.0225	0.0923	0.0309	0.2632	0.0127	0.2891	0.0015
davies2009/mirex_qm_tempotracker	0.0118	0.0026	0.0072	0.0013	1.0000	0.2863	0.4807	0.6476	0.8036	0.0225	0.0000	0.0636	0.3323	0.1435	0.4545	0.1338	0.6072	0.0003	1.0000
echonest/version_3_2_1	0.2101	0.0490	0.1892	0.0574	0.2863	1.0000	0.8145	0.6072	0.1153	0.6072	0.0000	0.6776	1.0000	1.0000	0.8145	0.8145	0.6291	0.0118	0.1671
gkiokas2012/default	0.0963	0.0127	0.0784	0.0309	0.4807	0.8145	1.0000	1.0000	0.2863	0.3323	0.0000	0.4049	1.0000	0.6072	1.0000	0.4807	1.0000	0.0043	0.3323
klapuri2006/percival2014	0.0636	0.0169	0.0525	0.0192	0.6476	0.6072	1.0000	1.0000	0.3323	0.2632	0.0000	0.3269	0.8036	0.4545	1.0000	0.3593	1.0000	0.0015	0.5235
oliveira2010/ibt	0.0066	0.0009	0.0041	0.0015	0.8036	0.1153	0.2863	0.3323	1.0000	0.0192	0.0000	0.0522	0.1671	0.0636	0.2632	0.0872	0.3593	0.0005	1.0000
percival2014/stem	0.5488	0.1460	0.3437	0.2266	0.0225	0.6072	0.3323	0.2632	0.0192	1.0000	0.0000	1.0000	0.3877	0.7744	0.3018	1.0000	0.1094	0.0225	0.0129
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.6072	0.2632	0.4807	0.3018	0.0636	0.6776	0.4049	0.3269	0.0522	1.0000	0.0000	1.0000	0.5034	0.8238	0.3593	1.0000	0.2632	0.0490	0.0755
schreiber2017/ismir2017	0.0923	0.0129	0.0768	0.0225	0.3323	1.0000	1.0000	0.8036	0.1671	0.3877	0.0000	0.5034	1.0000	0.5000	1.0000	0.6636	0.8145	0.0044	0.2101
schreiber2017/mirex2017	0.2668	0.0574	0.2101	0.0923	0.1435	1.0000	0.6072	0.4545	0.0636	0.7744	0.0000	0.8238	0.5000	1.0000	0.6291	1.0000	0.4807	0.0192	0.0768
schreiber2018/cnn	0.1153	0.0266	0.0784	0.0309	0.4545	0.8145	1.0000	1.0000	0.2632	0.3018	0.0000	0.3593	1.0000	0.6291	1.0000	0.4807	1.0000	0.0043	0.3323
schreiber2018/fcn	0.4545	0.1892	0.3833	0.2632	0.1338	0.8145	0.4807	0.3593	0.0872	1.0000	0.0000	1.0000	0.6636	1.0000	0.4807	1.0000	0.3323	0.0414	0.0931
schreiber2018/ismir2018	0.0352	0.0042	0.0309	0.0127	0.6072	0.6291	1.0000	1.0000	0.3593	0.1094	0.0000	0.2632	0.8145	0.4807	1.0000	0.3323	1.0000	0.0007	0.5034
sun2021/default	0.1094	0.5078	0.1797	0.2891	0.0003	0.0118	0.0043	0.0015	0.0005	0.0225	0.0000	0.0490	0.0044	0.0192	0.0043	0.0414	0.0007	1.0000	0.0003
zplane/auftakt_v3	0.0106	0.0015	0.0066	0.0015	1.0000	0.1671	0.3323	0.5235	1.0000	0.0129	0.0000	0.0755	0.2101	0.0768	0.3323	0.0931	0.5034	0.0003	1.0000

Table 9: McNemar p-values, using reference annotations 1.0 as ground truth with Accuracy₂ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e., the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	1.0000	1.0000	0.2891	0.0094	0.0309	0.0213	0.0015	0.0001	0.3018	0.0000	0.0309	0.0386	0.0654	0.0042	0.0963	0.0127	1.0000	0.0003
boeck2019/multi_task	1.0000	1.0000	1.0000	0.5488	0.0059	0.0192	0.0074	0.0023	0.0000	0.2379	0.0000	0.0266	0.0225	0.0386	0.0044	0.0636	0.0075	1.0000	0.0003
boeck2019/multi_task_hjdb	1.0000	1.0000	1.0000	0.5488	0.0059	0.0266	0.0192	0.0015	0.0001	0.2379	0.0000	0.0266	0.0352	0.0574	0.0072	0.0784	0.0169	1.0000	0.0005
boeck2020/dar	0.2891	0.5488	0.5488	1.0000	0.0003	0.0043	0.0043	0.0001	0.0000	0.0636	0.0000	0.0013	0.0042	0.0074	0.0004	0.0169	0.0007	0.3437	0.0000
davies2009/mirex_qm_tempotracker	0.0094	0.0059	0.0059	0.0003	1.0000	0.5413	0.5235	0.8318	0.2379	0.0352	0.0000	0.5572	0.2863	0.1892	0.8145	0.3075	0.6476	0.0094	0.5413
echonest/version_3_2_1	0.0309	0.0192	0.0266	0.0043	0.5413	1.0000	1.0000	0.2379	0.0309	0.3323	0.0000	1.0000	0.8238	0.6476	0.8145	0.8318	1.0000	0.0755	0.1338
gkiokas2012/default	0.0213	0.0074	0.0192	0.0043	0.5235	1.0000	1.0000	0.2632	0.0414	0.3593	0.0000	1.0000	0.7905	0.5811	0.7905	0.8145	1.0000	0.0639	0.0963
klapuri2006/percival2014	0.0015	0.0023	0.0015	0.0001	0.8318	0.2379	0.2632	1.0000	0.4807	0.0347	0.0000	0.3915	0.1516	0.0931	0.4807	0.1338	0.4049	0.0025	0.8318
oliveira2010/ibt	0.0001	0.0000	0.0001	0.0000	0.2379	0.0309	0.0414	0.4807	1.0000	0.0007	0.0000	0.0755	0.0118	0.0044	0.0768	0.0227	0.0636	0.0005	0.8238
percival2014/stem	0.3018	0.2379	0.2379	0.0636	0.0352	0.3323	0.3593	0.0347	0.0007	1.0000	0.0000	0.3833	0.6072	0.7905	0.1185	0.6476	0.1796	0.3593	0.0010
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0309	0.0266	0.0266	0.0013	0.5572	1.0000	1.0000	0.3915	0.0755	0.3833	0.0000	1.0000	0.8238	0.6476	0.8318	0.8450	1.0000	0.0309	0.1686
schreiber2017/ismir2017	0.0386	0.0225	0.0352	0.0042	0.2863	0.8238	0.7905	0.1516	0.0118	0.6072	0.0000	0.8238	1.0000	1.0000	0.4545	1.0000	0.6291	0.0963	0.0414
schreiber2017/mirex2017	0.0654	0.0386	0.0574	0.0074	0.1892	0.6476	0.5811	0.0931	0.0044	0.7905	0.0000	0.6476	1.0000	1.0000	0.3018	1.0000	0.4545	0.1435	0.0192
schreiber2018/cnn	0.0042	0.0044	0.0072	0.0004	0.8145	0.8145	0.7905	0.4807	0.0768	0.1185	0.0000	0.8318	0.4545	0.3018	1.0000	0.4240	1.0000	0.0227	0.2101
schreiber2018/fcn	0.0963	0.0636	0.0784	0.0169	0.3075	0.8318	0.8145	0.1338	0.0227	0.6476	0.0000	0.8450	1.0000	1.0000	0.4240	1.0000	0.6476	0.1153	0.0414
schreiber2018/ismir2018	0.0127	0.0075	0.0169	0.0007	0.6476	1.0000	1.0000	0.4049	0.0636	0.1796	0.0000	1.0000	0.6291	0.4545	1.0000	0.6476	1.0000	0.0266	0.1435
sun2021/default	1.0000	1.0000	1.0000	0.3437	0.0094	0.0755	0.0639	0.0025	0.0005	0.3593	0.0000	0.0309	0.0963	0.1435	0.0227	0.1153	0.0266	1.0000	0.0009
zplane/auftakt_v3	0.0003	0.0003	0.0005	0.0000	0.5413	0.1338	0.0963	0.8318	0.8238	0.0010	0.0000	0.1686	0.0414	0.0192	0.2101	0.0414	0.1435	0.0009	1.0000

Table 10: McNemar p-values, using reference annotations 2.0 as ground truth with Accuracy₂ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e., the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.6875	0.6875	1.0000	0.0003	0.0072	0.0044	0.0003	0.0000	0.0213	0.0000	0.0018	0.0129	0.0063	0.0007	0.0309	0.0026	0.6875	0.0000
boeck2019/multi_task	0.6875	1.0000	1.0000	0.4531	0.0025	0.0192	0.0127	0.0037	0.0000	0.0963	0.0000	0.0129	0.0574	0.0386	0.0044	0.1338	0.0075	1.0000	0.0002
boeck2019/multi_task_hjdb	0.6875	1.0000	1.0000	0.4531	0.0025	0.0266	0.0266	0.0025	0.0001	0.0963	0.0000	0.0129	0.0768	0.0574	0.0072	0.1516	0.0169	1.0000	0.0003
boeck2020/dar	1.0000	0.4531	0.4531	1.0000	0.0002	0.0043	0.0043	0.0002	0.0000	0.0127	0.0000	0.0002	0.0127	0.0074	0.0004	0.0266	0.0007	0.5078	0.0000
davies2009/mirex_qm_tempotracker	0.0003	0.0025	0.0025	0.0002	1.0000	0.3833	0.3593	1.0000	0.4240	0.0574	0.0000	0.2863	0.1153	0.0963	0.5811	0.1338	0.4545	0.0015	0.6476
echonest/version_3_2_1	0.0072	0.0192	0.0266	0.0043	0.3833	1.0000	1.0000	0.3593	0.0490	0.5811	0.0000	1.0000	0.6636	0.6476	0.8036	0.6636	1.0000	0.0347	0.1153
gkiokas2012/default	0.0044	0.0127	0.0266	0.0043	0.3593	1.0000	1.0000	0.3593	0.0636	0.6291	0.0000	1.0000	0.5811	0.5811	0.7905	0.6291	1.0000	0.0266	0.0963
klapuri2006/percival2014	0.0003	0.0037	0.0025	0.0002	1.0000	0.3593	0.3593	1.0000	0.4807	0.1153	0.0000	0.3449	0.1516	0.1338	0.6291	0.1338	0.5235	0.0015	0.6776
oliveira2010/ibt	0.0000	0.0000	0.0001	0.0000	0.4240	0.0490	0.0636	0.4807	1.0000	0.0042	0.0000	0.0639	0.0169	0.0118	0.1185	0.0227	0.0963	0.0002	1.0000
percival2014/stem	0.0213	0.0963	0.0963	0.0127	0.0574	0.5811	0.6291	0.1153	0.0042	1.0000	0.0000	0.8238	1.0000	1.0000	0.2266	1.0000	0.4240	0.1153	0.0010
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0018	0.0129	0.0129	0.0002	0.2863	1.0000	1.0000	0.3449	0.0639	0.8238	0.0000	1.0000	0.8145	0.8036	0.6476	0.8388	0.8238	0.0309	0.1078
schreiber2017/ismir2017	0.0129	0.0574	0.0768	0.0127	0.1153	0.6636	0.5811	0.1516	0.0169	1.0000	0.0000	0.8145	1.0000	1.0000	0.3323	1.0000	0.4807	0.0963	0.0192
schreiber2017/mirex2017	0.0063	0.0386	0.0574	0.0074	0.0963	0.6476	0.5811	0.1338	0.0118	1.0000	0.0000	0.8036	1.0000	1.0000	0.3018	1.0000	0.4545	0.0768	0.0127
schreiber2018/cnn	0.0007	0.0044	0.0072	0.0004	0.5811	0.8036	0.7905	0.6291	0.1185	0.2266	0.0000	0.6476	0.3323	0.3018	1.0000	0.2668	1.0000	0.0044	0.1796
schreiber2018/fcn	0.0309	0.1338	0.1516	0.0266	0.1338	0.6636	0.6291	0.1338	0.0227	1.0000	0.0000	0.8388	1.0000	1.0000	0.2668	1.0000	0.5034	0.0963	0.0192
schreiber2018/ismir2018	0.0026	0.0075	0.0169	0.0007	0.4545	1.0000	1.0000	0.5235	0.0963	0.4240	0.0000	0.8238	0.4807	0.4545	1.0000	0.5034	1.0000	0.0118	0.1435
sun2021/default	0.6875	1.0000	1.0000	0.5078	0.0015	0.0347	0.0266	0.0015	0.0002	0.1153	0.0000	0.0309	0.0963	0.0768	0.0044	0.0963	0.0118	1.0000	0.0002
zplane/auftakt_v3	0.0000	0.0002	0.0003	0.0000	0.6476	0.1153	0.0963	0.6776	1.0000	0.0010	0.0000	0.1078	0.0192	0.0127	0.1796	0.0192	0.1435	0.0002	1.0000

Table 11: McNemar p-values, using reference annotations 3.0 as ground truth with Accuracy₂ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e., the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.
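
The McNemar p-values above depend only on the tracks where exactly one of the two estimators is counted as correct. The following is a minimal sketch of the exact (binomial) variant of the test. Whether the report uses this variant or a chi-squared approximation is an assumption, and the function name `mcnemar_exact` and the sample data are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two paired lists of booleans.

    correct_a[i] and correct_b[i] state whether estimator A and estimator B
    were counted as correct for track i.
    """
    # Discordant pairs: only one of the two estimators is correct.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the estimators never disagree, so there is no evidence against H0
    k = min(only_a, only_b)
    # Under H0, each discordant pair favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-track results for two estimators.
a = [True, True, True, False, True, False]
b = [False, False, False, True, True, False]
print(mcnemar_exact(a, b))  # 3 vs. 1 discordant pairs
```

Tracks on which both estimators are right, or both are wrong, do not affect the result. Two estimators with very different accuracies can therefore still yield a large p-value when they disagree on only a handful of tracks.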

Accuracy₁ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?
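
As an illustration of the subset selection, the sketch below assumes that c_var is the coefficient of variation (standard deviation divided by mean) of a track's inter-beat intervals, computed with the population standard deviation. That definition, the helper names, and the sample beat times are assumptions made for this example and are not taken from the report's code.

```python
from statistics import mean, pstdev


def c_var(beat_times):
    """Coefficient of variation of the inter-beat intervals (assumed definition)."""
    ibis = [later - earlier for earlier, later in zip(beat_times, beat_times[1:])]
    return pstdev(ibis) / mean(ibis)


def mean_accuracy_below(tracks, tau):
    """Mean accuracy over tracks with c_var < tau, or None if no track qualifies.

    tracks is a list of (beat_times, is_correct) pairs.
    """
    hits = [is_correct for beats, is_correct in tracks if c_var(beats) < tau]
    return sum(hits) / len(hits) if hits else None


# Hypothetical data: one perfectly steady track and one with alternating intervals.
steady = [0.0, 0.5, 1.0, 1.5, 2.0]
wobbly = [0.0, 0.4, 1.0, 1.4, 2.0]
tracks = [(steady, True), (wobbly, False)]

print(c_var(steady), c_var(wobbly))
print(mean_accuracy_below(tracks, tau=0.1))  # only the steady track qualifies
print(mean_accuracy_below(tracks, tau=0.5))  # both tracks qualify
```

As τ grows, more tracks with unstable beats enter the subset, which is why the curves in the following figures approach the overall mean accuracy for large τ.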

Accuracy₁ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 11: Mean Accuracy₁ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

Accuracy₁ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 12: Mean Accuracy₁ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 2.0.

Accuracy₁ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 13: Mean Accuracy₁ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 3.0.

Accuracy₂ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?

Accuracy₂ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 14: Mean Accuracy₂ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

Accuracy₂ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 15: Mean Accuracy₂ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 2.0.

Accuracy₂ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 16: Mean Accuracy₂ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 3.0.

Accuracy₁ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy₁ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
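
The windowed mean can be sketched as follows. The example assumes the commonly used 4% tolerance for Accuracy₁, which this section does not restate. The function names and the estimate/reference pairs are hypothetical.

```python
def accuracy1(estimate, reference, tolerance=0.04):
    """True if the estimate lies within the relative tolerance of the reference.

    The 4% default is an assumption; the report cites [Gouyon2006] for the metric.
    """
    return abs(estimate - reference) <= tolerance * reference


def windowed_accuracy(pairs, center, half_width=10.0):
    """Mean Accuracy1 over pairs whose reference tempo lies in [center-10, center+10].

    pairs is a list of (estimate_bpm, reference_bpm). Returns (mean, count);
    the mean is None when the window contains no reference tempo.
    """
    hits = [
        accuracy1(estimate, reference)
        for estimate, reference in pairs
        if center - half_width <= reference <= center + half_width
    ]
    return (sum(hits) / len(hits) if hits else None), len(hits)


# Hypothetical (estimate, reference) pairs in BPM.
pairs = [(120.0, 120.0), (60.0, 118.0), (125.0, 126.0), (90.0, 90.0)]
print(windowed_accuracy(pairs, center=120.0))  # three references fall into [110, 130]
print(windowed_accuracy(pairs, center=90.0))   # a single reference falls into [80, 100]
print(windowed_accuracy(pairs, center=200.0))  # empty window
```

Returning the count alongside the mean makes the caveat above explicit: a value of 1.0 based on a single track carries little weight.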

Accuracy₁ on Tempo-Subsets for 1.0

Figure 17: Mean Accuracy₁ for estimates compared to version 1.0 for tempo intervals around T.

Accuracy₁ on Tempo-Subsets for 2.0

Figure 18: Mean Accuracy₁ for estimates compared to version 2.0 for tempo intervals around T.

Accuracy₁ on Tempo-Subsets for 3.0

Figure 19: Mean Accuracy₁ for estimates compared to version 3.0 for tempo intervals around T.

Accuracy₂ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy₂ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

Accuracy₂ on Tempo-Subsets for 1.0

Figure 20: Mean Accuracy₂ for estimates compared to version 1.0 for tempo intervals around T.

Accuracy₂ on Tempo-Subsets for 2.0

Figure 21: Mean Accuracy₂ for estimates compared to version 2.0 for tempo intervals around T.

Accuracy₂ on Tempo-Subsets for 3.0

Figure 22: Mean Accuracy₂ for estimates compared to version 3.0 for tempo intervals around T.

Estimated Accuracy₁ for Tempo

When a generalized additive model (GAM) is fit to Accuracy₁ values and a ground truth, what Accuracy₁ can we expect, and with what confidence?

Estimated Accuracy₁ for Tempo for 1.0

Predictions of GAMs trained on Accuracy₁ for estimates for reference 1.0.

Figure 23: Accuracy₁ predictions of a generalized additive model (GAM) fit to Accuracy₁ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated Accuracy₁ for Tempo for 2.0

Predictions of GAMs trained on Accuracy₁ for estimates for reference 2.0.

Figure 24: Accuracy₁ predictions of a generalized additive model (GAM) fit to Accuracy₁ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated Accuracy₁ for Tempo for 3.0

Predictions of GAMs trained on Accuracy₁ for estimates for reference 3.0.

Figure 25: Accuracy₁ predictions of a generalized additive model (GAM) fit to Accuracy₁ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated Accuracy₂ for Tempo

When a generalized additive model (GAM) is fit to Accuracy₂ values and a ground truth, what Accuracy₂ can we expect, and with what confidence?

Estimated Accuracy₂ for Tempo for 1.0

Predictions of GAMs trained on Accuracy₂ for estimates for reference 1.0.

Figure 26: Accuracy₂ predictions of a generalized additive model (GAM) fit to Accuracy₂ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated Accuracy₂ for Tempo for 2.0

Predictions of GAMs trained on Accuracy₂ for estimates for reference 2.0.

Figure 27: Accuracy₂ predictions of a generalized additive model (GAM) fit to Accuracy₂ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated Accuracy₂ for Tempo for 3.0

Predictions of GAMs trained on Accuracy₂ for estimates for reference 3.0.

Figure 28: Accuracy₂ predictions of a generalized additive model (GAM) fit to Accuracy₂ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

Accuracy₁ for ‘tag_open’ Tags

How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.

Accuracy₁ for ‘tag_open’ Tags for 1.0

Figure 29: Mean Accuracy₁ of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

Accuracy₁ for ‘tag_open’ Tags for 2.0

Figure 30: Mean Accuracy₁ of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

Accuracy₁ for ‘tag_open’ Tags for 3.0

Figure 31: Mean Accuracy₁ of estimates compared to version 3.0 depending on tag from namespace ‘tag_open’.

Accuracy₂ for ‘tag_open’ Tags

How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.

Accuracy₂ for ‘tag_open’ Tags for 1.0

Figure 32: Mean Accuracy₂ of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

Accuracy₂ for ‘tag_open’ Tags for 2.0

Figure 33: Mean Accuracy₂ of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

Accuracy₂ for ‘tag_open’ Tags for 3.0

Figure 34: Mean Accuracy₂ of estimates compared to version 3.0 depending on tag from namespace ‘tag_open’.

OE₁ and OE₂

OE₁ is defined as the octave error between an estimate E and a reference value R. This means that the most common errors, by a factor of 2 or ½, have the same magnitude, namely 1: OE₁(E) = log₂(E/R).

OE₂ is the signed OE₁ corresponding to the minimum absolute OE₁ when allowing the octave errors 2, 3, 1/2, and 1/3: OE₂(E) = arg min_x(|x|) with x ∈ {OE₁(E), OE₁(2E), OE₁(3E), OE₁(½E), OE₁(⅓E)}
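
The two definitions translate directly into code. The following is a small sketch; the function names `oe1` and `oe2` and the sample tempi are chosen for illustration and are not taken from the report's implementation.

```python
from math import log2


def oe1(estimate, reference):
    """Signed octave error: log2 of the ratio between estimate and reference."""
    return log2(estimate / reference)


def oe2(estimate, reference):
    """Signed OE1 with the smallest magnitude over the scaled estimates
    E, 2E, 3E, E/2 and E/3."""
    scaled = (estimate, 2 * estimate, 3 * estimate, estimate / 2, estimate / 3)
    return min((oe1(s, reference) for s in scaled), key=abs)


# Hypothetical tempi in BPM.
print(oe1(240.0, 120.0))  # double tempo
print(oe1(60.0, 120.0))   # half tempo
print(oe2(60.0, 120.0))   # the half-tempo error is removed by the factor 2
print(oe2(62.0, 120.0))   # residual error after doubling, i.e. log2(124/120)
```

OE₁ therefore shows how often and in which direction an estimator makes octave errors, while OE₂ measures the remaining deviation once such errors are forgiven. This matches the much smaller standard deviations reported for OE₂ in the tables below.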

Mean OE₁/OE₂ Results for 1.0

Estimator	OE1_MEAN	OE1_STDEV	OE2_MEAN	OE2_STDEV
boeck2015/tempodetector2016_default	-0.0178	0.3321	0.0092	0.0843
schreiber2018/ismir2018	0.0251	0.3390	0.0090	0.1093
boeck2020/dar	0.0025	0.3515	0.0078	0.0837
schreiber2018/cnn	0.0402	0.3517	0.0150	0.1248
boeck2019/multi_task_hjdb	-0.0100	0.3579	0.0132	0.0946
sun2021/default	0.0177	0.3594	0.0087	0.0914
boeck2019/multi_task	-0.0042	0.3690	0.0138	0.0918
oliveira2010/ibt	0.0633	0.3835	0.0002	0.1433
schreiber2018/fcn	0.0136	0.4133	0.0147	0.0950
schreiber2014/default	-0.1485	0.4165	0.0002	0.1083
klapuri2006/percival2014	0.0366	0.4218	0.0096	0.1268
schreiber2017/ismir2017	-0.0908	0.4244	0.0117	0.1149
zplane/auftakt_v3	-0.0050	0.4379	-0.0005	0.1391
echonest/version_3_2_1	-0.1828	0.4396	0.0009	0.1036
davies2009/mirex_qm_tempotracker	0.1669	0.4416	0.0095	0.1098
percival2014/stem	-0.0826	0.4537	0.0165	0.1003
schreiber2017/mirex2017	-0.1249	0.4554	0.0099	0.1200
gkiokas2012/default	-0.0182	0.5383	0.0189	0.1088
scheirer1998/percival2014	-0.0452	0.5400	0.0256	0.1798

Table 12: Mean OE₁/OE₂ for estimates compared to version 1.0, ordered by OE₁ standard deviation.

Raw data OE₁: CSV JSON LATEX PICKLE

Raw data OE₂: CSV JSON LATEX PICKLE

OE₁ distribution for 1.0

Figure 35: OE₁ for estimates compared to version 1.0. Shown are the mean OE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

OE₂ distribution for 1.0

Figure 36: OE₂ for estimates compared to version 1.0. Shown are the mean OE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

Mean OE₁/OE₂ Results for 2.0

Estimator	OE1_MEAN	OE1_STDEV	OE2_MEAN	OE2_STDEV
boeck2015/tempodetector2016_default	-0.0332	0.3376	-0.0035	0.0612
schreiber2018/ismir2018	0.0098	0.3512	-0.0018	0.0976
schreiber2018/cnn	0.0249	0.3575	-0.0003	0.1097
boeck2020/dar	-0.0128	0.3581	-0.0075	0.0680
sun2021/default	0.0024	0.3705	-0.0066	0.0711
boeck2019/multi_task_hjdb	-0.0254	0.3737	-0.0021	0.0711
oliveira2010/ibt	0.0480	0.3851	0.0029	0.1337
boeck2019/multi_task	-0.0196	0.3865	-0.0015	0.0681
klapuri2006/percival2014	0.0213	0.4203	-0.0057	0.1104
schreiber2014/default	-0.1638	0.4234	-0.0106	0.0995
schreiber2018/fcn	-0.0017	0.4235	0.0039	0.0750
davies2009/mirex_qm_tempotracker	0.1515	0.4353	0.0085	0.1013
schreiber2017/ismir2017	-0.1061	0.4372	0.0009	0.0961
echonest/version_3_2_1	-0.1981	0.4434	-0.0054	0.0916
zplane/auftakt_v3	-0.0203	0.4444	-0.0068	0.1232
percival2014/stem	-0.0979	0.4600	0.0102	0.0831
schreiber2017/mirex2017	-0.1403	0.4666	-0.0010	0.1050
scheirer1998/percival2014	-0.0610	0.5405	0.0239	0.1732
gkiokas2012/default	-0.0336	0.5414	0.0081	0.0896

Table 13: Mean OE₁/OE₂ for estimates compared to version 2.0, ordered by OE₁ standard deviation.

Raw data OE₁: CSV JSON LATEX PICKLE

Raw data OE₂: CSV JSON LATEX PICKLE

OE₁ distribution for 2.0

Figure 37: OE₁ for estimates compared to version 2.0. Shown are the mean OE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

OE₂ distribution for 2.0

Figure 38: OE₂ for estimates compared to version 2.0. Shown are the mean OE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

Mean OE₁/OE₂ Results for 3.0

Estimator	OE1_MEAN	OE1_STDEV	OE2_MEAN	OE2_STDEV
boeck2015/tempodetector2016_default	-0.0342	0.3365	-0.0046	0.0615
schreiber2018/ismir2018	0.0087	0.3481	-0.0029	0.0960
schreiber2018/cnn	0.0238	0.3559	0.0031	0.1099
boeck2020/dar	-0.0139	0.3568	-0.0086	0.0682
sun2021/default	0.0013	0.3691	-0.0077	0.0713
boeck2019/multi_task_hjdb	-0.0264	0.3721	-0.0031	0.0709
oliveira2010/ibt	0.0469	0.3828	0.0018	0.1350
boeck2019/multi_task	-0.0206	0.3841	-0.0026	0.0691
klapuri2006/percival2014	0.0202	0.4177	-0.0068	0.1106
schreiber2018/fcn	-0.0028	0.4193	0.0028	0.0758
schreiber2014/default	-0.1649	0.4221	-0.0117	0.0997
davies2009/mirex_qm_tempotracker	0.1505	0.4324	0.0074	0.1005
schreiber2017/ismir2017	-0.1072	0.4349	-0.0002	0.0956
zplane/auftakt_v3	-0.0214	0.4411	-0.0079	0.1262
echonest/version_3_2_1	-0.1993	0.4417	-0.0066	0.0913
percival2014/stem	-0.0989	0.4568	0.0092	0.0810
schreiber2017/mirex2017	-0.1413	0.4656	-0.0020	0.1059
gkiokas2012/default	-0.0346	0.5384	0.0070	0.0904
scheirer1998/percival2014	-0.0621	0.5388	0.0228	0.1736

Table 14: Mean OE₁/OE₂ for estimates compared to version 3.0, ordered by OE₁ standard deviation.

Raw data OE₁: CSV JSON LATEX PICKLE

Raw data OE₂: CSV JSON LATEX PICKLE

OE₁ distribution for 3.0

Figure 39: OE₁ for estimates compared to version 3.0. Shown are the mean OE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

OE₂ distribution for 3.0

Figure 40: OE₂ for estimates compared to version 3.0. Shown are the mean OE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

Significance of Differences

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.5938	0.7672	0.4306	0.0000	0.0000	0.9913	0.0677	0.0026	0.0341	0.4576	0.0000	0.0153	0.0018	0.0208	0.2878	0.0625	0.1422	0.6677
boeck2019/multi_task	0.5938	1.0000	0.7486	0.7762	0.0000	0.0000	0.7105	0.1644	0.0096	0.0063	0.2050	0.0000	0.0008	0.0001	0.0745	0.5399	0.1978	0.3664	0.9777
boeck2019/multi_task_hjdb	0.7672	0.7486	1.0000	0.5725	0.0000	0.0000	0.8269	0.1139	0.0047	0.0114	0.2685	0.0000	0.0026	0.0001	0.0393	0.4422	0.1235	0.2709	0.8561
boeck2020/dar	0.4306	0.7762	0.5725	1.0000	0.0000	0.0000	0.5987	0.3078	0.0474	0.0088	0.1971	0.0000	0.0012	0.0001	0.1351	0.7191	0.3435	0.5247	0.8132
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
echonest/version_3_2_1	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0007	0.0010	0.2335	0.0010	0.0488	0.0000	0.0000	0.0000	0.0000	0.0000
gkiokas2012/default	0.9913	0.7105	0.8269	0.5987	0.0000	0.0000	1.0000	0.0495	0.0072	0.0355	0.3517	0.0000	0.0466	0.0119	0.0839	0.3557	0.2170	0.3410	0.6674
klapuri2006/percival2014	0.0677	0.1644	0.1139	0.3078	0.0000	0.0000	0.0495	1.0000	0.1107	0.0000	0.0047	0.0000	0.0000	0.0000	0.8964	0.4430	0.6348	0.5606	0.0621
oliveira2010/ibt	0.0026	0.0096	0.0047	0.0474	0.0000	0.0000	0.0072	0.1107	1.0000	0.0000	0.0001	0.0000	0.0000	0.0000	0.3292	0.0826	0.0809	0.1000	0.0010
percival2014/stem	0.0341	0.0063	0.0114	0.0088	0.0000	0.0007	0.0355	0.0000	0.0000	1.0000	0.3064	0.0075	0.7640	0.2000	0.0000	0.0019	0.0000	0.0011	0.0015
scheirer1998/percival2014	0.4576	0.2050	0.2685	0.1971	0.0000	0.0010	0.3517	0.0047	0.0001	0.3064	1.0000	0.0058	0.2376	0.0591	0.0078	0.1425	0.0348	0.0697	0.1864
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.2335	0.0000	0.0000	0.0000	0.0075	0.0058	1.0000	0.0350	0.4504	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0153	0.0008	0.0026	0.0012	0.0000	0.0010	0.0466	0.0000	0.0000	0.7640	0.2376	0.0350	1.0000	0.1374	0.0000	0.0008	0.0000	0.0005	0.0035
schreiber2017/mirex2017	0.0018	0.0001	0.0001	0.0001	0.0000	0.0488	0.0119	0.0000	0.0000	0.2000	0.0591	0.4504	0.1374	1.0000	0.0000	0.0001	0.0000	0.0000	0.0004
schreiber2018/cnn	0.0208	0.0745	0.0393	0.1351	0.0000	0.0000	0.0839	0.8964	0.3292	0.0000	0.0078	0.0000	0.0000	0.0000	1.0000	0.2661	0.4333	0.3520	0.1130
schreiber2018/fcn	0.2878	0.5399	0.4422	0.7191	0.0000	0.0000	0.3557	0.4430	0.0826	0.0019	0.1425	0.0000	0.0008	0.0001	0.2661	1.0000	0.6005	0.8887	0.5287
schreiber2018/ismir2018	0.0625	0.1978	0.1235	0.3435	0.0000	0.0000	0.2170	0.6348	0.0809	0.0000	0.0348	0.0000	0.0000	0.0000	0.4333	0.6005	1.0000	0.7609	0.2161
sun2021/default	0.1422	0.3664	0.2709	0.5247	0.0000	0.0000	0.3410	0.5606	0.1000	0.0011	0.0697	0.0000	0.0005	0.0000	0.3520	0.8887	0.7609	1.0000	0.4568
zplane/auftakt_v3	0.6677	0.9777	0.8561	0.8132	0.0000	0.0000	0.6674	0.0621	0.0010	0.0015	0.1864	0.0000	0.0035	0.0004	0.1130	0.5287	0.2161	0.4568	1.0000

Table 15: Paired t-test p-values, using reference annotations 1.0 as ground truth with OE₁. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.
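As a rough illustration of the test behind these tables, the following sketch computes the paired t statistic for two estimators' per-track error values. The function name and the sample values are hypothetical and not taken from the report. Converting t into a p-value additionally needs the CDF of Student's t distribution with n-1 degrees of freedom; `scipy.stats.ttest_rel` performs the whole test, but whether the report used it is not stated here.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for the paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    # The test works on per-track differences, not on the two samples separately.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    # Under H0 the mean difference is zero, so t = mean / standard error.
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-track OE1 values of two estimators on the same three tracks.
oe1_a = [1.0, 2.0, 3.0]
oe1_b = [0.0, 0.0, 0.0]
t, dof = paired_t_statistic(oe1_a, oe1_b)
print(f"t = {t:.4f}, dof = {dof}")
```

A large |t| relative to the t distribution with `dof` degrees of freedom yields a small p-value, which is what the tables report for each pair of estimators.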

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.5938	0.7672	0.4306	0.0000	0.0000	0.9913	0.0677	0.0026	0.0341	0.4576	0.0000	0.0153	0.0018	0.0208	0.2878	0.0625	0.1422	0.6677
boeck2019/multi_task	0.5938	1.0000	0.7486	0.7762	0.0000	0.0000	0.7105	0.1644	0.0096	0.0063	0.2050	0.0000	0.0008	0.0001	0.0745	0.5399	0.1978	0.3664	0.9777
boeck2019/multi_task_hjdb	0.7672	0.7486	1.0000	0.5725	0.0000	0.0000	0.8269	0.1139	0.0047	0.0114	0.2685	0.0000	0.0026	0.0001	0.0393	0.4422	0.1235	0.2709	0.8561
boeck2020/dar	0.4306	0.7762	0.5725	1.0000	0.0000	0.0000	0.5987	0.3078	0.0474	0.0088	0.1971	0.0000	0.0012	0.0001	0.1351	0.7191	0.3435	0.5247	0.8132
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
echonest/version_3_2_1	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0007	0.0010	0.2335	0.0010	0.0488	0.0000	0.0000	0.0000	0.0000	0.0000
gkiokas2012/default	0.9913	0.7105	0.8269	0.5987	0.0000	0.0000	1.0000	0.0495	0.0072	0.0355	0.3517	0.0000	0.0466	0.0119	0.0839	0.3557	0.2170	0.3410	0.6674
klapuri2006/percival2014	0.0677	0.1644	0.1139	0.3078	0.0000	0.0000	0.0495	1.0000	0.1107	0.0000	0.0047	0.0000	0.0000	0.0000	0.8964	0.4430	0.6348	0.5606	0.0621
oliveira2010/ibt	0.0026	0.0096	0.0047	0.0474	0.0000	0.0000	0.0072	0.1107	1.0000	0.0000	0.0001	0.0000	0.0000	0.0000	0.3292	0.0826	0.0809	0.1000	0.0010
percival2014/stem	0.0341	0.0063	0.0114	0.0088	0.0000	0.0007	0.0355	0.0000	0.0000	1.0000	0.3064	0.0075	0.7640	0.2000	0.0000	0.0019	0.0000	0.0011	0.0015
scheirer1998/percival2014	0.4576	0.2050	0.2685	0.1971	0.0000	0.0010	0.3517	0.0047	0.0001	0.3064	1.0000	0.0058	0.2376	0.0591	0.0078	0.1425	0.0348	0.0697	0.1864
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.2335	0.0000	0.0000	0.0000	0.0075	0.0058	1.0000	0.0350	0.4504	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0153	0.0008	0.0026	0.0012	0.0000	0.0010	0.0466	0.0000	0.0000	0.7640	0.2376	0.0350	1.0000	0.1374	0.0000	0.0008	0.0000	0.0005	0.0035
schreiber2017/mirex2017	0.0018	0.0001	0.0001	0.0001	0.0000	0.0488	0.0119	0.0000	0.0000	0.2000	0.0591	0.4504	0.1374	1.0000	0.0000	0.0001	0.0000	0.0000	0.0004
schreiber2018/cnn	0.0208	0.0745	0.0393	0.1351	0.0000	0.0000	0.0839	0.8964	0.3292	0.0000	0.0078	0.0000	0.0000	0.0000	1.0000	0.2661	0.4333	0.3520	0.1130
schreiber2018/fcn	0.2878	0.5399	0.4422	0.7191	0.0000	0.0000	0.3557	0.4430	0.0826	0.0019	0.1425	0.0000	0.0008	0.0001	0.2661	1.0000	0.6005	0.8887	0.5287
schreiber2018/ismir2018	0.0625	0.1978	0.1235	0.3435	0.0000	0.0000	0.2170	0.6348	0.0809	0.0000	0.0348	0.0000	0.0000	0.0000	0.4333	0.6005	1.0000	0.7609	0.2161
sun2021/default	0.1422	0.3664	0.2709	0.5247	0.0000	0.0000	0.3410	0.5606	0.1000	0.0011	0.0697	0.0000	0.0005	0.0000	0.3520	0.8887	0.7609	1.0000	0.4568
zplane/auftakt_v3	0.6677	0.9777	0.8561	0.8132	0.0000	0.0000	0.6674	0.0621	0.0010	0.0015	0.1864	0.0000	0.0035	0.0004	0.1130	0.5287	0.2161	0.4568	1.0000

Table 16: Paired t-test p-values, using reference annotations 2.0 as ground truth with OE₁. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.5938	0.7672	0.4306	0.0000	0.0000	0.9913	0.0677	0.0026	0.0341	0.4576	0.0000	0.0153	0.0018	0.0208	0.2878	0.0625	0.1422	0.6677
boeck2019/multi_task	0.5938	1.0000	0.7486	0.7762	0.0000	0.0000	0.7105	0.1644	0.0096	0.0063	0.2050	0.0000	0.0008	0.0001	0.0745	0.5399	0.1978	0.3664	0.9777
boeck2019/multi_task_hjdb	0.7672	0.7486	1.0000	0.5725	0.0000	0.0000	0.8269	0.1139	0.0047	0.0114	0.2685	0.0000	0.0026	0.0001	0.0393	0.4422	0.1235	0.2709	0.8561
boeck2020/dar	0.4306	0.7762	0.5725	1.0000	0.0000	0.0000	0.5987	0.3078	0.0474	0.0088	0.1971	0.0000	0.0012	0.0001	0.1351	0.7191	0.3435	0.5247	0.8132
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
echonest/version_3_2_1	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0007	0.0010	0.2335	0.0010	0.0488	0.0000	0.0000	0.0000	0.0000	0.0000
gkiokas2012/default	0.9913	0.7105	0.8269	0.5987	0.0000	0.0000	1.0000	0.0495	0.0072	0.0355	0.3517	0.0000	0.0466	0.0119	0.0839	0.3557	0.2170	0.3410	0.6674
klapuri2006/percival2014	0.0677	0.1644	0.1139	0.3078	0.0000	0.0000	0.0495	1.0000	0.1107	0.0000	0.0047	0.0000	0.0000	0.0000	0.8964	0.4430	0.6348	0.5606	0.0621
oliveira2010/ibt	0.0026	0.0096	0.0047	0.0474	0.0000	0.0000	0.0072	0.1107	1.0000	0.0000	0.0001	0.0000	0.0000	0.0000	0.3292	0.0826	0.0809	0.1000	0.0010
percival2014/stem	0.0341	0.0063	0.0114	0.0088	0.0000	0.0007	0.0355	0.0000	0.0000	1.0000	0.3064	0.0075	0.7640	0.2000	0.0000	0.0019	0.0000	0.0011	0.0015
scheirer1998/percival2014	0.4576	0.2050	0.2685	0.1971	0.0000	0.0010	0.3517	0.0047	0.0001	0.3064	1.0000	0.0058	0.2376	0.0591	0.0078	0.1425	0.0348	0.0697	0.1864
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.2335	0.0000	0.0000	0.0000	0.0075	0.0058	1.0000	0.0350	0.4504	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0153	0.0008	0.0026	0.0012	0.0000	0.0010	0.0466	0.0000	0.0000	0.7640	0.2376	0.0350	1.0000	0.1374	0.0000	0.0008	0.0000	0.0005	0.0035
schreiber2017/mirex2017	0.0018	0.0001	0.0001	0.0001	0.0000	0.0488	0.0119	0.0000	0.0000	0.2000	0.0591	0.4504	0.1374	1.0000	0.0000	0.0001	0.0000	0.0000	0.0004
schreiber2018/cnn	0.0208	0.0745	0.0393	0.1351	0.0000	0.0000	0.0839	0.8964	0.3292	0.0000	0.0078	0.0000	0.0000	0.0000	1.0000	0.2661	0.4333	0.3520	0.1130
schreiber2018/fcn	0.2878	0.5399	0.4422	0.7191	0.0000	0.0000	0.3557	0.4430	0.0826	0.0019	0.1425	0.0000	0.0008	0.0001	0.2661	1.0000	0.6005	0.8887	0.5287
schreiber2018/ismir2018	0.0625	0.1978	0.1235	0.3435	0.0000	0.0000	0.2170	0.6348	0.0809	0.0000	0.0348	0.0000	0.0000	0.0000	0.4333	0.6005	1.0000	0.7609	0.2161
sun2021/default	0.1422	0.3664	0.2709	0.5247	0.0000	0.0000	0.3410	0.5606	0.1000	0.0011	0.0697	0.0000	0.0005	0.0000	0.3520	0.8887	0.7609	1.0000	0.4568
zplane/auftakt_v3	0.6677	0.9777	0.8561	0.8132	0.0000	0.0000	0.6674	0.0621	0.0010	0.0015	0.1864	0.0000	0.0035	0.0004	0.1130	0.5287	0.2161	0.4568	1.0000

Table 17: Paired t-test p-values, using reference annotations 3.0 as ground truth with OE₁. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.3130	0.3172	0.7394	0.9645	0.2667	0.1958	0.9588	0.3735	0.2712	0.2077	0.2162	0.7056	0.9208	0.4597	0.4862	0.9792	0.9168	0.3027
boeck2019/multi_task	0.3130	1.0000	0.8774	0.2001	0.6401	0.1176	0.4638	0.6059	0.1784	0.6925	0.4067	0.0741	0.7705	0.5394	0.8701	0.9070	0.5250	0.2943	0.1235
boeck2019/multi_task_hjdb	0.3172	0.8774	1.0000	0.1831	0.6741	0.0916	0.4501	0.6344	0.2038	0.6330	0.3691	0.0761	0.8055	0.5923	0.8267	0.8564	0.6069	0.3033	0.1459
boeck2020/dar	0.7394	0.2001	0.1831	1.0000	0.8389	0.3329	0.1223	0.8195	0.4421	0.1909	0.1897	0.2634	0.5800	0.7646	0.4021	0.3945	0.8925	0.8247	0.3822
davies2009/mirex_qm_tempotracker	0.9645	0.6401	0.6741	0.8389	1.0000	0.2750	0.2925	0.9967	0.3551	0.4907	0.2803	0.3580	0.8176	0.9732	0.6158	0.5626	0.9480	0.9290	0.3809
echonest/version_3_2_1	0.2667	0.1176	0.0916	0.3329	0.2750	1.0000	0.0265	0.3708	0.8338	0.0632	0.0648	0.9395	0.1327	0.3018	0.1642	0.1410	0.4892	0.3086	0.7348
gkiokas2012/default	0.1958	0.4638	0.4501	0.1223	0.2925	0.0265	1.0000	0.3049	0.0531	0.7723	0.6182	0.0394	0.3670	0.2816	0.6248	0.6293	0.2500	0.1340	0.0464
klapuri2006/percival2014	0.9588	0.6059	0.6344	0.8195	0.9967	0.3708	0.3049	1.0000	0.4537	0.4561	0.2315	0.3106	0.8123	0.9757	0.6024	0.6133	0.9517	0.9121	0.3130
oliveira2010/ibt	0.3735	0.1784	0.2038	0.4421	0.3551	0.8338	0.0531	0.4537	1.0000	0.1193	0.0729	0.9963	0.2679	0.3669	0.1840	0.2229	0.4451	0.4163	0.9452
percival2014/stem	0.2712	0.6925	0.6330	0.1909	0.4907	0.0632	0.7723	0.4561	0.1193	1.0000	0.5529	0.0278	0.5084	0.3971	0.8589	0.8297	0.3998	0.2366	0.0921
scheirer1998/percival2014	0.2077	0.4067	0.3691	0.1897	0.2803	0.0648	0.6182	0.2315	0.0729	0.5529	1.0000	0.0604	0.3219	0.2407	0.4966	0.5532	0.2625	0.2341	0.0342
schreiber2014/default	0.2162	0.0741	0.0761	0.2634	0.3580	0.9395	0.0394	0.3106	0.9963	0.0278	0.0604	1.0000	0.1791	0.2454	0.1183	0.0728	0.3575	0.2489	0.9420
schreiber2017/ismir2017	0.7056	0.7705	0.8055	0.5800	0.8176	0.1327	0.3670	0.8123	0.2679	0.5084	0.3219	0.1791	1.0000	0.7501	0.6886	0.7298	0.7602	0.6356	0.2305
schreiber2017/mirex2017	0.9208	0.5394	0.5923	0.7646	0.9732	0.3018	0.2816	0.9757	0.3669	0.3971	0.2407	0.2454	0.7501	1.0000	0.5770	0.5967	0.9170	0.8705	0.2765
schreiber2018/cnn	0.4597	0.8701	0.8267	0.4021	0.6158	0.1642	0.6248	0.6024	0.1840	0.8589	0.4966	0.1183	0.6886	0.5770	1.0000	0.9670	0.4497	0.4054	0.1689
schreiber2018/fcn	0.4862	0.9070	0.8564	0.3945	0.5626	0.1410	0.6293	0.6133	0.2229	0.8297	0.5532	0.0728	0.7298	0.5967	0.9670	1.0000	0.3957	0.4358	0.1688
schreiber2018/ismir2018	0.9792	0.5250	0.6069	0.8925	0.9480	0.4892	0.2500	0.9517	0.4451	0.3998	0.2625	0.3575	0.7602	0.9170	0.4497	0.3957	1.0000	0.9756	0.4232
sun2021/default	0.9168	0.2943	0.3033	0.8247	0.9290	0.3086	0.1340	0.9121	0.4163	0.2366	0.2341	0.2489	0.6356	0.8705	0.4054	0.4358	0.9756	1.0000	0.3086
zplane/auftakt_v3	0.3027	0.1235	0.1459	0.3822	0.3809	0.7348	0.0464	0.3130	0.9452	0.0921	0.0342	0.9420	0.2305	0.2765	0.1689	0.1688	0.4232	0.3086	1.0000

Table 18: Paired t-test p-values, using reference annotations 1.0 as ground truth with OE₂. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.5828	0.6174	0.2239	0.0540	0.7574	0.0678	0.7561	0.4967	0.0145	0.0204	0.2464	0.4697	0.6796	0.6627	0.2139	0.7983	0.3774	0.6877
boeck2019/multi_task	0.5828	1.0000	0.8774	0.2001	0.1604	0.7759	0.1325	0.6059	0.6504	0.0752	0.0409	0.1939	0.7240	0.9262	0.8701	0.3899	0.9603	0.2943	0.5350
boeck2019/multi_task_hjdb	0.6174	0.8774	1.0000	0.1831	0.1183	0.6196	0.1413	0.6344	0.6196	0.0367	0.0314	0.2022	0.6270	0.8598	0.8267	0.3613	0.9755	0.3033	0.5851
boeck2020/dar	0.2239	0.2001	0.1831	1.0000	0.0170	0.7527	0.0184	0.8195	0.2840	0.0053	0.0101	0.6141	0.2280	0.3310	0.4021	0.0940	0.4417	0.8247	0.9391
davies2009/mirex_qm_tempotracker	0.0540	0.1604	0.1183	0.0170	1.0000	0.0480	0.9614	0.0909	0.6117	0.8248	0.2332	0.0357	0.3838	0.2754	0.3368	0.5510	0.1703	0.0403	0.1196
echonest/version_3_2_1	0.7574	0.7759	0.6196	0.7527	0.0480	1.0000	0.1280	0.9844	0.5174	0.0600	0.0198	0.5478	0.3465	0.6479	0.6167	0.2098	0.8044	0.8647	0.7065
gkiokas2012/default	0.0678	0.1325	0.1413	0.0184	0.9614	0.1280	1.0000	0.0893	0.5487	0.7994	0.1868	0.0211	0.3254	0.2450	0.2863	0.5466	0.1577	0.0179	0.0748
klapuri2006/percival2014	0.7561	0.6059	0.6344	0.8195	0.0909	0.9844	0.0893	1.0000	0.4294	0.0695	0.0195	0.5780	0.4397	0.5771	0.6024	0.2936	0.6748	0.9121	0.9085
oliveira2010/ibt	0.4967	0.6504	0.6196	0.2840	0.6117	0.5174	0.5487	0.4294	1.0000	0.5189	0.1586	0.2027	0.8259	0.7071	0.7632	0.9255	0.6612	0.3287	0.3737
percival2014/stem	0.0145	0.0752	0.0367	0.0053	0.8248	0.0600	0.7994	0.0695	0.5189	1.0000	0.3388	0.0059	0.1796	0.1400	0.2460	0.3643	0.1360	0.0044	0.0847
scheirer1998/percival2014	0.0204	0.0409	0.0314	0.0101	0.2332	0.0198	0.1868	0.0195	0.1586	0.3388	1.0000	0.0073	0.0745	0.0438	0.0670	0.1632	0.0429	0.0153	0.0110
schreiber2014/default	0.2464	0.1939	0.2022	0.6141	0.0357	0.5478	0.0211	0.5780	0.2027	0.0059	0.0073	1.0000	0.1517	0.2131	0.2511	0.0509	0.3291	0.5565	0.6997
schreiber2017/ismir2017	0.4697	0.7240	0.6270	0.2280	0.3838	0.3465	0.3254	0.4397	0.8259	0.1796	0.0745	0.1517	1.0000	0.7501	0.8829	0.6894	0.7425	0.2472	0.4464
schreiber2017/mirex2017	0.6796	0.9262	0.8598	0.3310	0.2754	0.6479	0.2450	0.5771	0.7071	0.1400	0.0438	0.2131	0.7501	1.0000	0.9432	0.5486	0.9099	0.4274	0.5359
schreiber2018/cnn	0.6627	0.8701	0.8267	0.4021	0.3368	0.6167	0.2863	0.6024	0.7632	0.2460	0.0670	0.2511	0.8829	0.9432	1.0000	0.5661	0.8198	0.4054	0.5750
schreiber2018/fcn	0.2139	0.3899	0.3613	0.0940	0.5510	0.2098	0.5466	0.2936	0.9255	0.3643	0.1632	0.0509	0.6894	0.5486	0.5661	1.0000	0.3957	0.0922	0.2714
schreiber2018/ismir2018	0.7983	0.9603	0.9755	0.4417	0.1703	0.8044	0.1577	0.6748	0.6612	0.1360	0.0429	0.3291	0.7425	0.9099	0.8198	0.3957	1.0000	0.4778	0.6070
sun2021/default	0.3774	0.2943	0.3033	0.8247	0.0403	0.8647	0.0179	0.9121	0.3287	0.0044	0.0153	0.5565	0.2472	0.4274	0.4054	0.0922	0.4778	1.0000	0.9788
zplane/auftakt_v3	0.6877	0.5350	0.5851	0.9391	0.1196	0.7065	0.0748	0.9085	0.3737	0.0847	0.0110	0.6997	0.4464	0.5359	0.5750	0.2714	0.6070	0.9788	1.0000

Table 19: Paired t-test p-values, using reference annotations 2.0 as ground truth with OE₂. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.5828	0.6174	0.2239	0.0540	0.7574	0.0678	0.7561	0.4629	0.0145	0.0204	0.2464	0.4697	0.6796	0.3168	0.2139	0.7983	0.3774	0.6877
boeck2019/multi_task	0.5828	1.0000	0.8774	0.2001	0.1604	0.7759	0.1325	0.6059	0.6258	0.0752	0.0409	0.1939	0.7240	0.9262	0.4723	0.3899	0.9603	0.2943	0.5350
boeck2019/multi_task_hjdb	0.6174	0.8774	1.0000	0.1831	0.1183	0.6196	0.1413	0.6344	0.5933	0.0367	0.0314	0.2022	0.6270	0.8598	0.4325	0.3613	0.9755	0.3033	0.5851
boeck2020/dar	0.2239	0.2001	0.1831	1.0000	0.0170	0.7527	0.0184	0.8195	0.2544	0.0053	0.0101	0.6141	0.2280	0.3310	0.1914	0.0940	0.4417	0.8247	0.9391
davies2009/mirex_qm_tempotracker	0.0540	0.1604	0.1183	0.0170	1.0000	0.0480	0.9614	0.0909	0.5583	0.8248	0.2332	0.0357	0.3838	0.2754	0.6543	0.5510	0.1703	0.0403	0.1196
echonest/version_3_2_1	0.7574	0.7759	0.6196	0.7527	0.0480	1.0000	0.1280	0.9844	0.4789	0.0600	0.0198	0.5478	0.3465	0.6479	0.3465	0.2098	0.8044	0.8647	0.7065
gkiokas2012/default	0.0678	0.1325	0.1413	0.0184	0.9614	0.1280	1.0000	0.0893	0.5614	0.7994	0.1868	0.0211	0.3254	0.2450	0.6367	0.5466	0.1577	0.0179	0.0748
klapuri2006/percival2014	0.7561	0.6059	0.6344	0.8195	0.0909	0.9844	0.0893	1.0000	0.4460	0.0695	0.0195	0.5780	0.4397	0.5771	0.3428	0.2936	0.6748	0.9121	0.9085
oliveira2010/ibt	0.4629	0.6258	0.5933	0.2544	0.5583	0.4789	0.5614	0.4460	1.0000	0.5029	0.1451	0.2069	0.8336	0.6947	0.9053	0.9257	0.6264	0.3262	0.4063
percival2014/stem	0.0145	0.0752	0.0367	0.0053	0.8248	0.0600	0.7994	0.0695	0.5029	1.0000	0.3388	0.0059	0.1796	0.1400	0.4484	0.3643	0.1360	0.0044	0.0847
scheirer1998/percival2014	0.0204	0.0409	0.0314	0.0101	0.2332	0.0198	0.1868	0.0195	0.1451	0.3388	1.0000	0.0073	0.0745	0.0438	0.1537	0.1632	0.0429	0.0153	0.0110
schreiber2014/default	0.2464	0.1939	0.2022	0.6141	0.0357	0.5478	0.0211	0.5780	0.2069	0.0059	0.0073	1.0000	0.1517	0.2131	0.1064	0.0509	0.3291	0.5565	0.6997
schreiber2017/ismir2017	0.4697	0.7240	0.6270	0.2280	0.3838	0.3465	0.3254	0.4397	0.8336	0.1796	0.0745	0.1517	1.0000	0.7501	0.7085	0.6894	0.7425	0.2472	0.4464
schreiber2017/mirex2017	0.6796	0.9262	0.8598	0.3310	0.2754	0.6479	0.2450	0.5771	0.6947	0.1400	0.0438	0.2131	0.7501	1.0000	0.5977	0.5486	0.9099	0.4274	0.5359
schreiber2018/cnn	0.3168	0.4723	0.4325	0.1914	0.6543	0.3465	0.6367	0.3428	0.9053	0.4484	0.1537	0.1064	0.7085	0.5977	1.0000	0.9629	0.4377	0.1582	0.3610
schreiber2018/fcn	0.2139	0.3899	0.3613	0.0940	0.5510	0.2098	0.5466	0.2936	0.9257	0.3643	0.1632	0.0509	0.6894	0.5486	0.9629	1.0000	0.3957	0.0922	0.2714
schreiber2018/ismir2018	0.7983	0.9603	0.9755	0.4417	0.1703	0.8044	0.1577	0.6748	0.6264	0.1360	0.0429	0.3291	0.7425	0.9099	0.4377	0.3957	1.0000	0.4778	0.6070
sun2021/default	0.3774	0.2943	0.3033	0.8247	0.0403	0.8647	0.0179	0.9121	0.3262	0.0044	0.0153	0.5565	0.2472	0.4274	0.1582	0.0922	0.4778	1.0000	0.9788
zplane/auftakt_v3	0.6877	0.5350	0.5851	0.9391	0.1196	0.7065	0.0748	0.9085	0.4063	0.0847	0.0110	0.6997	0.4464	0.5359	0.3610	0.2714	0.6070	0.9788	1.0000

Table 20: Paired t-test p-values, using reference annotations 3.0 as ground truth with OE₂. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.

OE₁ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?
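The subset selection can be sketched as follows. This is a minimal illustration under stated assumptions: c_var is taken to be the coefficient of variation (standard deviation divided by mean) of a track's inter-beat intervals, the population standard deviation is used, and the function names and sample data are hypothetical. The report's own computation may differ in these details.

```python
from statistics import mean, pstdev


def c_var(beat_times):
    """Coefficient of variation of the inter-beat intervals (assumed definition)."""
    ibis = [t1 - t0 for t0, t1 in zip(beat_times, beat_times[1:])]
    if len(ibis) < 2:
        raise ValueError("need at least three beats")
    # Assumption: population standard deviation; a sample estimate is also plausible.
    return pstdev(ibis) / mean(ibis)


def mean_error_below(tracks, tau):
    """Mean error over tracks whose c_var is below tau; None if no track qualifies."""
    selected = [err for beats, err in tracks if c_var(beats) < tau]
    return mean(selected) if selected else None


# Hypothetical tracks: (beat times in seconds, OE1 of some estimator).
tracks = [
    ([0.0, 0.5, 1.0, 1.5], 0.0),  # perfectly steady beat, c_var = 0
    ([0.0, 0.4, 1.0, 1.3], 1.0),  # unsteady beat, c_var of roughly 0.29
]
print(mean_error_below(tracks, tau=0.1))  # only the steady track is used
print(mean_error_below(tracks, tau=0.5))  # both tracks are used
```

Sweeping `tau` over a range of thresholds and plotting the resulting means gives curves of the kind the figures in this section describe.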

OE₁ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 41: Mean OE₁ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

OE₁ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 42: Mean OE₁ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 2.0.

OE₁ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 43: Mean OE₁ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 3.0.

OE₂ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?

OE₂ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 44: Mean OE₂ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

OE₂ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 45: Mean OE₂ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 2.0.

OE₂ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 46: Mean OE₂ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 3.0.

OE₁ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE₁ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
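The windowed mean can be sketched as below. The sketch assumes OE₁ is the signed octave error log₂(E/R), i.e., the signed counterpart of the AOE₁ definition given in this report; the function name and the sample (estimate, reference) pairs are hypothetical.

```python
import math


def mean_oe1_around(pairs, T, half_width=10.0):
    """Mean OE1 over (estimate, reference) pairs whose reference lies in [T-10, T+10] BPM."""
    values = [
        math.log2(estimate / reference)  # assumed definition of OE1
        for estimate, reference in pairs
        if T - half_width <= reference <= T + half_width
    ]
    # A window may contain few or no tracks, so return None instead of dividing by zero.
    return sum(values) / len(values) if values else None


# Hypothetical (estimate, reference) tempo pairs in BPM.
pairs = [(120.0, 120.0), (60.0, 120.0), (180.0, 90.0)]
print(mean_oe1_around(pairs, T=120.0))  # one exact estimate, one half-tempo estimate
print(mean_oe1_around(pairs, T=90.0))   # one double-tempo estimate
```

Because each window can hold very few tracks, a single octave error can dominate the mean, which is why the values should be read with caution.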

OE₁ on Tempo-Subsets for 1.0

Figure 47: Mean OE₁ for estimates compared to version 1.0 for tempo intervals around T.

OE₁ on Tempo-Subsets for 2.0

Figure 48: Mean OE₁ for estimates compared to version 2.0 for tempo intervals around T.

OE₁ on Tempo-Subsets for 3.0

Figure 49: Mean OE₁ for estimates compared to version 3.0 for tempo intervals around T.

OE₂ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE₂ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

OE₂ on Tempo-Subsets for 1.0

Figure 50: Mean OE₂ for estimates compared to version 1.0 for tempo intervals around T.

OE₂ on Tempo-Subsets for 2.0

Figure 51: Mean OE₂ for estimates compared to version 2.0 for tempo intervals around T.

OE₂ on Tempo-Subsets for 3.0

Figure 52: Mean OE₂ for estimates compared to version 3.0 for tempo intervals around T.

Estimated OE₁ for Tempo

When a generalized additive model (GAM) is fit to OE₁ values and the ground-truth tempo, what OE₁ can we expect, and with what confidence?

Estimated OE₁ for Tempo for 1.0

Predictions of GAMs trained on OE₁ for estimates for reference 1.0.

Figure 53: OE₁ predictions of a generalized additive model (GAM) fit to OE₁ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated OE₁ for Tempo for 2.0

Predictions of GAMs trained on OE₁ for estimates for reference 2.0.

Figure 54: OE₁ predictions of a generalized additive model (GAM) fit to OE₁ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated OE₁ for Tempo for 3.0

Predictions of GAMs trained on OE₁ for estimates for reference 3.0.

Figure 55: OE₁ predictions of a generalized additive model (GAM) fit to OE₁ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated OE₂ for Tempo

When a generalized additive model (GAM) is fit to OE₂ values and the ground-truth tempo, what OE₂ can we expect, and with what confidence?

Estimated OE₂ for Tempo for 1.0

Predictions of GAMs trained on OE₂ for estimates for reference 1.0.

Figure 56: OE₂ predictions of a generalized additive model (GAM) fit to OE₂ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated OE₂ for Tempo for 2.0

Predictions of GAMs trained on OE₂ for estimates for reference 2.0.

Figure 57: OE₂ predictions of a generalized additive model (GAM) fit to OE₂ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated OE₂ for Tempo for 3.0

Predictions of GAMs trained on OE₂ for estimates for reference 3.0.

Figure 58: OE₂ predictions of a generalized additive model (GAM) fit to OE₂ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

OE₁ for ‘tag_open’ Tags

How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.

OE₁ for ‘tag_open’ Tags for 1.0

Figure 59: OE₁ of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

OE₁ for ‘tag_open’ Tags for 2.0

Figure 60: OE₁ of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

OE₁ for ‘tag_open’ Tags for 3.0

Figure 61: OE₁ of estimates compared to version 3.0 depending on tag from namespace ‘tag_open’.

OE₂ for ‘tag_open’ Tags

How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.

OE₂ for ‘tag_open’ Tags for 1.0

Figure 62: OE₂ of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

OE₂ for ‘tag_open’ Tags for 2.0

Figure 63: OE₂ of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

OE₂ for ‘tag_open’ Tags for 3.0

Figure 64: OE₂ of estimates compared to version 3.0 depending on tag from namespace ‘tag_open’.

AOE₁ and AOE₂

AOE₁ is defined as the absolute octave error between an estimate E and a reference value R: AOE₁(E) = |log₂(E/R)|.

AOE₂ is the minimum of AOE₁ allowing the octave errors 2, 3, 1/2, and 1/3: AOE₂(E) = min(AOE₁(E), AOE₁(2E), AOE₁(3E), AOE₁(½E), AOE₁(⅓E)).
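The two definitions translate directly into code. The following is a minimal sketch; the function names and the example tempi are chosen for illustration only.

```python
import math


def aoe1(estimate, reference):
    """Absolute octave error: |log2(E / R)|."""
    return abs(math.log2(estimate / reference))


def aoe2(estimate, reference):
    """Minimum AOE1 over the estimate scaled by 1, 2, 3, 1/2, and 1/3."""
    factors = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)
    return min(aoe1(f * estimate, reference) for f in factors)


# Hypothetical tempi in BPM.
print(aoe1(60.0, 120.0))   # half-tempo estimate: one full octave off
print(aoe2(60.0, 120.0))   # the factor 2 absorbs the octave error
print(aoe2(100.0, 120.0))  # not an octave error, so AOE2 equals AOE1
```

An estimate at half, double, a third of, or triple the reference tempo therefore has a large AOE₁ but an AOE₂ near zero, which explains why the AOE₂ means in the tables are much smaller than the AOE₁ means.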

Mean AOE₁/AOE₂ Results for 1.0

Estimator	AOE1_MEAN	AOE1_STDEV	AOE2_MEAN	AOE2_STDEV
boeck2015/tempodetector2016_default	0.1277	0.3071	0.0323	0.0784
boeck2020/dar	0.1325	0.3256	0.0306	0.0783
schreiber2018/ismir2018	0.1437	0.3080	0.0459	0.0996
boeck2019/multi_task_hjdb	0.1440	0.3278	0.0341	0.0893
boeck2019/multi_task	0.1508	0.3368	0.0328	0.0868
sun2021/default	0.1537	0.3254	0.0366	0.0842
schreiber2018/cnn	0.1555	0.3180	0.0519	0.1144
schreiber2018/fcn	0.1804	0.3720	0.0392	0.0878
oliveira2010/ibt	0.1940	0.3368	0.0661	0.1271
schreiber2017/ismir2017	0.2083	0.3808	0.0465	0.1057
klapuri2006/percival2014	0.2110	0.3670	0.0526	0.1158
schreiber2017/mirex2017	0.2200	0.4178	0.0451	0.1117
schreiber2014/default	0.2250	0.3807	0.0438	0.0991
zplane/auftakt_v3	0.2261	0.3750	0.0583	0.1263
percival2014/stem	0.2357	0.3964	0.0384	0.0941
davies2009/mirex_qm_tempotracker	0.2413	0.4057	0.0584	0.0934
echonest/version_3_2_1	0.2493	0.4055	0.0405	0.0954
gkiokas2012/default	0.2992	0.4479	0.0434	0.1015
scheirer1998/percival2014	0.3447	0.4182	0.1016	0.1506

Table 21: Mean AOE₁/AOE₂ for estimates compared to version 1.0, ordered by mean AOE₁.

AOE₁ distribution for 1.0

Figure 65: AOE₁ for estimates compared to version 1.0. Shown are the mean AOE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

AOE₂ distribution for 1.0

Figure 66: AOE₂ for estimates compared to version 1.0. Shown are the mean AOE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

Mean AOE₁/AOE₂ Results for 2.0

Estimator	AOE1_MEAN	AOE1_STDEV	AOE2_MEAN	AOE2_STDEV
boeck2015/tempodetector2016_default	0.1246	0.3156	0.0217	0.0573
boeck2020/dar	0.1295	0.3341	0.0206	0.0652
schreiber2018/ismir2018	0.1439	0.3205	0.0376	0.0901
boeck2019/multi_task_hjdb	0.1460	0.3449	0.0221	0.0676
schreiber2018/cnn	0.1506	0.3252	0.0420	0.1013
sun2021/default	0.1529	0.3376	0.0272	0.0660
boeck2019/multi_task	0.1541	0.3550	0.0210	0.0648
schreiber2018/fcn	0.1814	0.3827	0.0290	0.0693
oliveira2010/ibt	0.1884	0.3392	0.0584	0.1203
klapuri2006/percival2014	0.2036	0.3683	0.0416	0.1024
schreiber2017/ismir2017	0.2118	0.3970	0.0339	0.0900
schreiber2017/mirex2017	0.2233	0.4331	0.0334	0.0996
zplane/auftakt_v3	0.2261	0.3831	0.0470	0.1141
davies2009/mirex_qm_tempotracker	0.2297	0.3996	0.0505	0.0882
schreiber2014/default	0.2304	0.3912	0.0361	0.0934
percival2014/stem	0.2369	0.4063	0.0266	0.0794
echonest/version_3_2_1	0.2514	0.4155	0.0335	0.0854
gkiokas2012/default	0.2964	0.4543	0.0324	0.0840
scheirer1998/percival2014	0.3435	0.4218	0.0953	0.1466

Table 22: Mean AOE₁/AOE₂ for estimates compared to version 2.0, ordered by mean AOE₁.

AOE₁ distribution for 2.0

Figure 67: AOE₁ for estimates compared to version 2.0. Shown are the mean AOE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

AOE₂ distribution for 2.0

Figure 68: AOE₂ for estimates compared to version 2.0. Shown are the mean AOE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

Mean AOE₁/AOE₂ Results for 3.0

Estimator	AOE1_MEAN	AOE1_STDEV	AOE2_MEAN	AOE2_STDEV
boeck2015/tempodetector2016_default	0.1243	0.3146	0.0218	0.0577
boeck2020/dar	0.1290	0.3329	0.0208	0.0655
schreiber2018/ismir2018	0.1422	0.3178	0.0377	0.0884
boeck2019/multi_task_hjdb	0.1455	0.3435	0.0223	0.0673
schreiber2018/cnn	0.1504	0.3234	0.0428	0.1013
sun2021/default	0.1530	0.3359	0.0281	0.0660
boeck2019/multi_task	0.1536	0.3526	0.0218	0.0657
schreiber2018/fcn	0.1801	0.3787	0.0299	0.0698
oliveira2010/ibt	0.1875	0.3370	0.0592	0.1214
klapuri2006/percival2014	0.2026	0.3659	0.0420	0.1025
schreiber2017/ismir2017	0.2111	0.3951	0.0346	0.0891
schreiber2017/mirex2017	0.2234	0.4323	0.0344	0.1002
zplane/auftakt_v3	0.2249	0.3801	0.0491	0.1165
davies2009/mirex_qm_tempotracker	0.2283	0.3968	0.0504	0.0872
schreiber2014/default	0.2306	0.3901	0.0372	0.0932
percival2014/stem	0.2353	0.4038	0.0270	0.0769
echonest/version_3_2_1	0.2514	0.4142	0.0340	0.0850
gkiokas2012/default	0.2961	0.4510	0.0336	0.0842
scheirer1998/percival2014	0.3424	0.4207	0.0955	0.1467

Table 23: Mean AOE₁/AOE₂ for estimates compared to version 3.0, ordered by mean AOE₁.

AOE₁ distribution for 3.0

Figure 69: AOE₁ for estimates compared to version 3.0. Shown are the mean AOE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

AOE₂ distribution for 3.0

Figure 70: AOE₂ for estimates compared to version 3.0. Shown are the mean AOE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

Significance of Differences

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.3183	0.4767	0.8417	0.0001	0.0001	0.0000	0.0029	0.0102	0.0002	0.0000	0.0003	0.0022	0.0026	0.2520	0.0637	0.4657	0.2744	0.0004
boeck2019/multi_task	0.3183	1.0000	0.6885	0.4067	0.0034	0.0013	0.0000	0.0299	0.0743	0.0024	0.0000	0.0113	0.0230	0.0226	0.8367	0.2850	0.7403	0.8967	0.0046
boeck2019/multi_task_hjdb	0.4767	0.6885	1.0000	0.5659	0.0011	0.0005	0.0000	0.0155	0.0340	0.0011	0.0000	0.0058	0.0135	0.0094	0.6135	0.2170	0.9895	0.6791	0.0017
boeck2020/dar	0.8417	0.4067	0.5659	1.0000	0.0009	0.0004	0.0000	0.0112	0.0257	0.0006	0.0000	0.0022	0.0042	0.0034	0.3207	0.1000	0.6101	0.3316	0.0011
davies2009/mirex_qm_tempotracker	0.0001	0.0034	0.0011	0.0009	1.0000	0.7607	0.0539	0.1281	0.0123	0.8503	0.0010	0.6292	0.2979	0.5218	0.0027	0.0508	0.0002	0.0076	0.5185
echonest/version_3_2_1	0.0001	0.0013	0.0005	0.0004	0.7607	1.0000	0.1165	0.2270	0.0732	0.6450	0.0034	0.3711	0.1157	0.2505	0.0030	0.0383	0.0008	0.0028	0.4183
gkiokas2012/default	0.0000	0.0000	0.0000	0.0000	0.0539	0.1165	1.0000	0.0011	0.0003	0.0340	0.2384	0.0142	0.0062	0.0203	0.0000	0.0002	0.0000	0.0000	0.0107
klapuri2006/percival2014	0.0029	0.0299	0.0155	0.0112	0.1281	0.2270	0.0011	1.0000	0.2724	0.2784	0.0000	0.6140	0.9220	0.7737	0.0336	0.2745	0.0023	0.0650	0.4645
oliveira2010/ibt	0.0102	0.0743	0.0340	0.0257	0.0123	0.0732	0.0003	0.2724	1.0000	0.0730	0.0000	0.2803	0.5851	0.3812	0.0990	0.6170	0.0147	0.1344	0.0923
percival2014/stem	0.0002	0.0024	0.0011	0.0006	0.8503	0.6450	0.0340	0.2784	0.0730	1.0000	0.0002	0.6548	0.2578	0.5728	0.0042	0.0613	0.0003	0.0059	0.6778
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0010	0.0034	0.2384	0.0000	0.0000	0.0002	1.0000	0.0002	0.0000	0.0002	0.0000	0.0000	0.0000	0.0000	0.0001
schreiber2014/default	0.0003	0.0113	0.0058	0.0022	0.6292	0.3711	0.0142	0.6140	0.2803	0.6548	0.0002	1.0000	0.5022	0.8588	0.0117	0.1548	0.0033	0.0211	0.9653
schreiber2017/ismir2017	0.0022	0.0230	0.0135	0.0042	0.2979	0.1157	0.0062	0.9220	0.5851	0.2578	0.0000	0.5022	1.0000	0.5635	0.0510	0.3250	0.0076	0.0589	0.5011
schreiber2017/mirex2017	0.0026	0.0226	0.0094	0.0034	0.5218	0.2505	0.0203	0.7737	0.3812	0.5728	0.0002	0.8588	0.5635	1.0000	0.0306	0.1818	0.0048	0.0357	0.8302
schreiber2018/cnn	0.2520	0.8367	0.6135	0.3207	0.0027	0.0030	0.0000	0.0336	0.0990	0.0042	0.0000	0.0117	0.0510	0.0306	1.0000	0.2784	0.5194	0.9382	0.0090
schreiber2018/fcn	0.0637	0.2850	0.2170	0.1000	0.0508	0.0383	0.0002	0.2745	0.6170	0.0613	0.0000	0.1548	0.3250	0.1818	0.2784	1.0000	0.0857	0.3424	0.1073
schreiber2018/ismir2018	0.4657	0.7403	0.9895	0.6101	0.0002	0.0008	0.0000	0.0023	0.0147	0.0003	0.0000	0.0033	0.0076	0.0048	0.5194	0.0857	1.0000	0.6658	0.0003
sun2021/default	0.2744	0.8967	0.6791	0.3316	0.0076	0.0028	0.0000	0.0650	0.1344	0.0059	0.0000	0.0211	0.0589	0.0357	0.9382	0.3424	0.6658	1.0000	0.0117
zplane/auftakt_v3	0.0004	0.0046	0.0017	0.0011	0.5185	0.4183	0.0107	0.4645	0.0923	0.6778	0.0001	0.9653	0.5011	0.8302	0.0090	0.1073	0.0003	0.0117	1.0000

Table 24: Paired t-test p-values for AOE₁, using reference annotations 1.0 as ground truth. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., the estimates of the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.
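The test behind these tables can be sketched as follows. This is a minimal illustration, not the code that produced the report. It assumes that AOE₁ is the absolute octave error |log₂(estimate/reference)|, and it approximates the two-sided p-value with a normal distribution to stay within the standard library (a statistics package such as SciPy, via `scipy.stats.ttest_rel`, uses Student's t distribution instead). All tempi below are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def aoe1(estimate_bpm, reference_bpm):
    """Absolute octave error |log2(estimate / reference)| (assumed definition)."""
    return abs(math.log2(estimate_bpm / reference_bpm))


def paired_t_test(errors_a, errors_b):
    """Return (t, p) for H0: the mean of the paired differences is zero.

    Assumption: the two-sided p-value uses a normal approximation to
    Student's t distribution, which is only adequate for large samples.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    d_mean, d_sd = mean(diffs), stdev(diffs)
    if d_sd == 0.0:
        # All differences are identical: p is 1 if they are all zero, else 0.
        if d_mean == 0.0:
            return 0.0, 1.0
        return math.copysign(math.inf, d_mean), 0.0
    t = d_mean / (d_sd / math.sqrt(n))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical tempi in BPM, not taken from the corpus.
reference = [120.0, 90.0, 140.0, 100.0, 75.0, 128.0]
estimates_a = [120.0, 180.0, 140.0, 100.0, 150.0, 128.0]
estimates_b = [121.0, 90.0, 70.0, 101.0, 75.0, 64.0]

errors_a = [aoe1(e, r) for e, r in zip(estimates_a, reference)]
errors_b = [aoe1(e, r) for e, r in zip(estimates_b, reference)]
t, p = paired_t_test(errors_a, errors_b)
alpha = 0.05
print(f"t={t:.3f}, p={p:.4f}, reject H0: {p <= alpha}")
```

Because the test is symmetric in its two inputs, each p-value matrix is symmetric, and an estimator compared with itself yields p = 1.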

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.2161	0.3639	0.8429	0.0005	0.0000	0.0000	0.0063	0.0142	0.0001	0.0000	0.0001	0.0013	0.0016	0.2930	0.0484	0.3820	0.2382	0.0003
boeck2019/multi_task	0.2161	1.0000	0.6403	0.2972	0.0188	0.0020	0.0001	0.0859	0.1726	0.0035	0.0000	0.0100	0.0227	0.0267	0.8841	0.3384	0.6409	0.9569	0.0082
boeck2019/multi_task_hjdb	0.3639	0.6403	1.0000	0.4446	0.0066	0.0008	0.0000	0.0440	0.0835	0.0013	0.0000	0.0046	0.0126	0.0098	0.8483	0.2397	0.9222	0.7784	0.0025
boeck2020/dar	0.8429	0.2972	0.4446	1.0000	0.0028	0.0003	0.0000	0.0206	0.0393	0.0005	0.0000	0.0013	0.0030	0.0026	0.3900	0.0804	0.5334	0.3175	0.0009
davies2009/mirex_qm_tempotracker	0.0005	0.0188	0.0066	0.0028	1.0000	0.5078	0.0306	0.1988	0.0322	0.8137	0.0003	0.9837	0.5866	0.8505	0.0065	0.1314	0.0014	0.0215	0.8816
echonest/version_3_2_1	0.0000	0.0020	0.0008	0.0003	0.5078	1.0000	0.1697	0.1404	0.0464	0.6304	0.0058	0.4628	0.1433	0.2845	0.0021	0.0399	0.0009	0.0029	0.3887
gkiokas2012/default	0.0000	0.0001	0.0000	0.0000	0.0306	0.1697	1.0000	0.0008	0.0003	0.0498	0.2243	0.0311	0.0125	0.0365	0.0000	0.0005	0.0000	0.0001	0.0166
klapuri2006/percival2014	0.0063	0.0859	0.0440	0.0206	0.1988	0.1404	0.0008	1.0000	0.3551	0.1568	0.0000	0.3418	0.7835	0.5433	0.0491	0.4473	0.0110	0.1118	0.3055
oliveira2010/ibt	0.0142	0.1726	0.0835	0.0393	0.0322	0.0464	0.0003	0.3551	1.0000	0.0461	0.0000	0.1541	0.3918	0.2564	0.1053	0.7984	0.0348	0.1904	0.0597
percival2014/stem	0.0001	0.0035	0.0013	0.0005	0.8137	0.6304	0.0498	0.1568	0.0461	1.0000	0.0004	0.7904	0.3085	0.6317	0.0028	0.0644	0.0004	0.0057	0.6471
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0003	0.0058	0.2243	0.0000	0.0000	0.0004	1.0000	0.0004	0.0000	0.0005	0.0000	0.0000	0.0000	0.0000	0.0001
schreiber2014/default	0.0001	0.0100	0.0046	0.0013	0.9837	0.4628	0.0311	0.3418	0.1541	0.7904	0.0004	1.0000	0.4617	0.8023	0.0047	0.1249	0.0022	0.0141	0.8667
schreiber2017/ismir2017	0.0013	0.0227	0.0126	0.0030	0.5866	0.1433	0.0125	0.7835	0.3918	0.3085	0.0000	0.4617	1.0000	0.5926	0.0307	0.2988	0.0065	0.0478	0.5993
schreiber2017/mirex2017	0.0016	0.0267	0.0098	0.0026	0.8505	0.2845	0.0365	0.5433	0.2564	0.6317	0.0005	0.8023	0.5926	1.0000	0.0178	0.1656	0.0044	0.0286	0.9227
schreiber2018/cnn	0.2930	0.8841	0.8483	0.3900	0.0065	0.0021	0.0000	0.0491	0.1053	0.0028	0.0000	0.0047	0.0307	0.0178	1.0000	0.1861	0.7209	0.9233	0.0063
schreiber2018/fcn	0.0484	0.3384	0.2397	0.0804	0.1314	0.0399	0.0005	0.4473	0.7984	0.0644	0.0000	0.1249	0.2988	0.1656	0.1861	1.0000	0.0821	0.3125	0.1189
schreiber2018/ismir2018	0.3820	0.6409	0.9222	0.5334	0.0014	0.0009	0.0000	0.0110	0.0348	0.0004	0.0000	0.0022	0.0065	0.0044	0.7209	0.0821	1.0000	0.6984	0.0004
sun2021/default	0.2382	0.9569	0.7784	0.3175	0.0215	0.0029	0.0001	0.1118	0.1904	0.0057	0.0000	0.0141	0.0478	0.0286	0.9233	0.3125	0.6984	1.0000	0.0111
zplane/auftakt_v3	0.0003	0.0082	0.0025	0.0009	0.8816	0.3887	0.0166	0.3055	0.0597	0.6471	0.0001	0.8667	0.5993	0.9227	0.0063	0.1189	0.0004	0.0111	1.0000

Table 25: Paired t-test p-values for AOE₁, using reference annotations 2.0 as ground truth. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., the estimates of the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.2192	0.3690	0.8520	0.0006	0.0000	0.0000	0.0065	0.0145	0.0001	0.0000	0.0001	0.0013	0.0015	0.2893	0.0507	0.4143	0.2310	0.0003
boeck2019/multi_task	0.2192	1.0000	0.6426	0.2945	0.0190	0.0019	0.0001	0.0864	0.1726	0.0037	0.0000	0.0092	0.0229	0.0249	0.8935	0.3471	0.6018	0.9793	0.0084
boeck2019/multi_task_hjdb	0.3690	0.6426	1.0000	0.4415	0.0069	0.0007	0.0000	0.0447	0.0840	0.0014	0.0000	0.0043	0.0130	0.0090	0.8374	0.2476	0.8793	0.7580	0.0026
boeck2020/dar	0.8520	0.2945	0.4415	1.0000	0.0029	0.0003	0.0000	0.0206	0.0391	0.0005	0.0000	0.0011	0.0030	0.0023	0.3797	0.0832	0.5638	0.3005	0.0010
davies2009/mirex_qm_tempotracker	0.0006	0.0190	0.0069	0.0029	1.0000	0.4803	0.0279	0.2053	0.0334	0.8179	0.0003	0.9461	0.6011	0.8851	0.0073	0.1305	0.0013	0.0233	0.8878
echonest/version_3_2_1	0.0000	0.0019	0.0007	0.0003	0.4803	1.0000	0.1687	0.1300	0.0424	0.5915	0.0062	0.4674	0.1370	0.2862	0.0019	0.0352	0.0007	0.0029	0.3682
gkiokas2012/default	0.0000	0.0001	0.0000	0.0000	0.0279	0.1687	1.0000	0.0007	0.0002	0.0443	0.2315	0.0314	0.0116	0.0365	0.0000	0.0004	0.0000	0.0001	0.0145
klapuri2006/percival2014	0.0065	0.0864	0.0447	0.0206	0.2053	0.1300	0.0007	1.0000	0.3602	0.1617	0.0000	0.3170	0.7720	0.5189	0.0522	0.4414	0.0099	0.1179	0.3057
oliveira2010/ibt	0.0145	0.1726	0.0840	0.0391	0.0334	0.0424	0.0002	0.3602	1.0000	0.0482	0.0000	0.1420	0.3850	0.2426	0.1111	0.7875	0.0310	0.2004	0.0591
percival2014/stem	0.0001	0.0037	0.0014	0.0005	0.8179	0.5915	0.0443	0.1617	0.0482	1.0000	0.0004	0.8466	0.3246	0.6755	0.0031	0.0657	0.0004	0.0064	0.6593
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0003	0.0062	0.2315	0.0000	0.0000	0.0004	1.0000	0.0005	0.0000	0.0006	0.0000	0.0000	0.0000	0.0000	0.0001
schreiber2014/default	0.0001	0.0092	0.0043	0.0011	0.9461	0.4674	0.0314	0.3170	0.1420	0.8466	0.0005	1.0000	0.4413	0.7998	0.0043	0.1125	0.0017	0.0138	0.8243
schreiber2017/ismir2017	0.0013	0.0229	0.0130	0.0030	0.6011	0.1370	0.0116	0.7720	0.3850	0.3246	0.0000	0.4413	1.0000	0.5648	0.0312	0.2870	0.0057	0.0498	0.6128
schreiber2017/mirex2017	0.0015	0.0249	0.0090	0.0023	0.8851	0.2862	0.0365	0.5189	0.2426	0.6755	0.0006	0.7998	0.5648	1.0000	0.0171	0.1515	0.0037	0.0280	0.9588
schreiber2018/cnn	0.2893	0.8935	0.8374	0.3797	0.0073	0.0019	0.0000	0.0522	0.1111	0.0031	0.0000	0.0043	0.0312	0.0171	1.0000	0.1986	0.6613	0.9108	0.0066
schreiber2018/fcn	0.0507	0.3471	0.2476	0.0832	0.1305	0.0352	0.0004	0.4414	0.7875	0.0657	0.0000	0.1125	0.2870	0.1515	0.1986	1.0000	0.0781	0.3323	0.1174
schreiber2018/ismir2018	0.4143	0.6018	0.8793	0.5638	0.0013	0.0007	0.0000	0.0099	0.0310	0.0004	0.0000	0.0017	0.0057	0.0037	0.6613	0.0781	1.0000	0.6399	0.0003
sun2021/default	0.2310	0.9793	0.7580	0.3005	0.0233	0.0029	0.0001	0.1179	0.2004	0.0064	0.0000	0.0138	0.0498	0.0280	0.9108	0.3323	0.6399	1.0000	0.0122
zplane/auftakt_v3	0.0003	0.0084	0.0026	0.0010	0.8878	0.3682	0.0145	0.3057	0.0591	0.6593	0.0001	0.8243	0.6128	0.9588	0.0066	0.1174	0.0003	0.0122	1.0000

Table 26: Paired t-test p-values for AOE₁, using reference annotations 3.0 as ground truth. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., the estimates of the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.9060	0.5168	0.4308	0.0000	0.1275	0.0461	0.0022	0.0000	0.1672	0.0000	0.0238	0.0103	0.0330	0.0025	0.1621	0.0198	0.0797	0.0003
boeck2019/multi_task	0.9060	1.0000	0.6816	0.6050	0.0000	0.0640	0.0539	0.0068	0.0000	0.2931	0.0000	0.0750	0.0059	0.0463	0.0048	0.2258	0.0163	0.3371	0.0006
boeck2019/multi_task_hjdb	0.5168	0.6816	1.0000	0.3054	0.0000	0.2677	0.1180	0.0081	0.0000	0.3861	0.0000	0.0941	0.0216	0.0717	0.0105	0.3448	0.0562	0.4601	0.0015
boeck2020/dar	0.4308	0.6050	0.3054	1.0000	0.0000	0.0584	0.0257	0.0010	0.0000	0.0855	0.0000	0.0068	0.0049	0.0151	0.0008	0.1016	0.0062	0.0398	0.0001
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0002	0.0047	0.2812	0.1873	0.0000	0.0000	0.0183	0.0246	0.0339	0.3224	0.0007	0.0232	0.0001	0.9936
echonest/version_3_2_1	0.1275	0.0640	0.2677	0.0584	0.0002	1.0000	0.6962	0.0344	0.0002	0.7261	0.0000	0.6212	0.4650	0.5042	0.0956	0.8139	0.5058	0.5014	0.0164
gkiokas2012/default	0.0461	0.0539	0.1180	0.0257	0.0047	0.6962	1.0000	0.1517	0.0007	0.3903	0.0000	0.9594	0.6162	0.8127	0.1800	0.4263	0.6554	0.2375	0.0266
klapuri2006/percival2014	0.0022	0.0068	0.0081	0.0010	0.2812	0.0344	0.1517	1.0000	0.0328	0.0483	0.0000	0.2258	0.3421	0.2496	0.9375	0.0443	0.3750	0.0210	0.4732
oliveira2010/ibt	0.0000	0.0000	0.0000	0.0000	0.1873	0.0002	0.0007	0.0328	1.0000	0.0000	0.0030	0.0047	0.0017	0.0013	0.0401	0.0008	0.0024	0.0002	0.2674
percival2014/stem	0.1672	0.2931	0.3861	0.0855	0.0000	0.7261	0.3903	0.0483	0.0000	1.0000	0.0000	0.3870	0.1518	0.2931	0.0049	0.8921	0.1020	0.7203	0.0033
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0030	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0001
schreiber2014/default	0.0238	0.0750	0.0941	0.0068	0.0183	0.6212	0.9594	0.2258	0.0047	0.3870	0.0000	1.0000	0.6829	0.8442	0.2252	0.4292	0.7466	0.1925	0.0654
schreiber2017/ismir2017	0.0103	0.0059	0.0216	0.0049	0.0246	0.4650	0.6162	0.3421	0.0017	0.1518	0.0000	0.6829	1.0000	0.7473	0.4173	0.2605	0.9264	0.0895	0.0738
schreiber2017/mirex2017	0.0330	0.0463	0.0717	0.0151	0.0339	0.5042	0.8127	0.2496	0.0013	0.2931	0.0000	0.8442	0.7473	1.0000	0.3092	0.4097	0.8944	0.1945	0.0581
schreiber2018/cnn	0.0025	0.0048	0.0105	0.0008	0.3224	0.0956	0.1800	0.9375	0.0401	0.0049	0.0000	0.2252	0.4173	0.3092	1.0000	0.0493	0.1908	0.0246	0.3172
schreiber2018/fcn	0.1621	0.2258	0.3448	0.1016	0.0007	0.8139	0.4263	0.0443	0.0008	0.8921	0.0000	0.4292	0.2605	0.4097	0.0493	1.0000	0.2682	0.6235	0.0067
schreiber2018/ismir2018	0.0198	0.0163	0.0562	0.0062	0.0232	0.5058	0.6554	0.3750	0.0024	0.1020	0.0000	0.7466	0.9264	0.8944	0.1908	0.2682	1.0000	0.0921	0.0625
sun2021/default	0.0797	0.3371	0.4601	0.0398	0.0001	0.5014	0.2375	0.0210	0.0002	0.7203	0.0000	0.1925	0.0895	0.1945	0.0246	0.6235	0.0921	1.0000	0.0038
zplane/auftakt_v3	0.0003	0.0006	0.0015	0.0001	0.9936	0.0164	0.0266	0.4732	0.2674	0.0033	0.0001	0.0654	0.0738	0.0581	0.3172	0.0067	0.0625	0.0038	1.0000

Table 27: Paired t-test p-values for AOE₂, using reference annotations 1.0 as ground truth. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., the estimates of the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.
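The AOE₂ tables differ from the AOE₁ tables only in the error measure. The sketch below shows one common reading of the two measures. It assumes that AOE₂ is the smallest absolute octave error obtained after scaling the estimate by one of the factors 1, 2, 3, 1/2, or 1/3, so that typical octave errors are not penalized. The function names and the factor set are illustrative assumptions, not taken from this report.

```python
import math

# Tempo factors tolerated by the octave-insensitive variant (assumed set).
FACTORS = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)


def oe1(estimate_bpm, reference_bpm):
    """Signed octave error log2(estimate / reference)."""
    return math.log2(estimate_bpm / reference_bpm)


def oe2(estimate_bpm, reference_bpm):
    """Signed octave error with the smallest magnitude over all factors."""
    candidates = (oe1(f * estimate_bpm, reference_bpm) for f in FACTORS)
    return min(candidates, key=abs)


def aoe1(estimate_bpm, reference_bpm):
    return abs(oe1(estimate_bpm, reference_bpm))


def aoe2(estimate_bpm, reference_bpm):
    return abs(oe2(estimate_bpm, reference_bpm))


# Hypothetical example: a half-tempo estimate for a 120 BPM reference.
print(aoe1(60.0, 120.0), aoe2(60.0, 120.0))
```

Under this reading, an estimator that mostly makes octave errors has a large AOE₁ but a small AOE₂, which is one reason the two sets of p-value tables can disagree.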

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.8291	0.8866	0.6496	0.0000	0.0422	0.0593	0.0023	0.0000	0.3065	0.0000	0.0057	0.0241	0.0486	0.0019	0.1644	0.0079	0.0354	0.0005
boeck2019/multi_task	0.8291	1.0000	0.7342	0.9209	0.0000	0.0123	0.0445	0.0046	0.0000	0.3249	0.0000	0.0157	0.0083	0.0446	0.0022	0.1572	0.0035	0.1375	0.0005
boeck2019/multi_task_hjdb	0.8866	0.7342	1.0000	0.6756	0.0000	0.0698	0.0920	0.0051	0.0000	0.4078	0.0000	0.0171	0.0290	0.0648	0.0048	0.2303	0.0162	0.1568	0.0011
boeck2020/dar	0.6496	0.9209	0.6756	1.0000	0.0000	0.0301	0.0437	0.0022	0.0000	0.2510	0.0000	0.0026	0.0215	0.0355	0.0012	0.1505	0.0045	0.0277	0.0002
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0009	0.0008	0.1181	0.1561	0.0000	0.0000	0.0360	0.0039	0.0111	0.1941	0.0007	0.0312	0.0002	0.6198
echonest/version_3_2_1	0.0422	0.0123	0.0698	0.0301	0.0009	1.0000	0.7802	0.1836	0.0005	0.2631	0.0000	0.7055	0.8053	0.9399	0.2138	0.4377	0.6496	0.3274	0.0840
gkiokas2012/default	0.0593	0.0445	0.0920	0.0437	0.0008	0.7802	1.0000	0.1562	0.0002	0.3259	0.0000	0.6070	0.8167	0.8904	0.1274	0.5501	0.3844	0.3660	0.0395
klapuri2006/percival2014	0.0023	0.0046	0.0051	0.0022	0.1181	0.1836	0.1562	1.0000	0.0115	0.0385	0.0000	0.4580	0.2377	0.2062	0.9621	0.0655	0.5974	0.0356	0.5048
oliveira2010/ibt	0.0000	0.0000	0.0000	0.0000	0.1561	0.0005	0.0002	0.0115	1.0000	0.0000	0.0016	0.0065	0.0002	0.0003	0.0178	0.0005	0.0027	0.0001	0.1088
percival2014/stem	0.3065	0.3249	0.4078	0.2510	0.0000	0.2631	0.3259	0.0385	0.0000	1.0000	0.0000	0.1491	0.2293	0.3096	0.0021	0.7092	0.0386	0.9185	0.0027
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0016	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0057	0.0157	0.0171	0.0026	0.0360	0.7055	0.6070	0.4580	0.0065	0.1491	0.0000	1.0000	0.7438	0.6815	0.3991	0.2539	0.8290	0.1145	0.1722
schreiber2017/ismir2017	0.0241	0.0083	0.0290	0.0215	0.0039	0.8053	0.8167	0.2377	0.0002	0.2293	0.0000	0.7438	1.0000	0.9197	0.2382	0.4713	0.5222	0.2616	0.0534
schreiber2017/mirex2017	0.0486	0.0446	0.0648	0.0355	0.0111	0.9399	0.8904	0.2062	0.0003	0.3096	0.0000	0.6815	0.9197	1.0000	0.2210	0.5493	0.5396	0.3436	0.0490
schreiber2018/cnn	0.0019	0.0022	0.0048	0.0012	0.1941	0.2138	0.1274	0.9621	0.0178	0.0021	0.0000	0.3991	0.2382	0.2210	1.0000	0.0509	0.3561	0.0349	0.4262
schreiber2018/fcn	0.1644	0.1572	0.2303	0.1505	0.0007	0.4377	0.5501	0.0655	0.0005	0.7092	0.0000	0.2539	0.4713	0.5493	0.0509	1.0000	0.1582	0.7573	0.0131
schreiber2018/ismir2018	0.0079	0.0035	0.0162	0.0045	0.0312	0.6496	0.3844	0.5974	0.0027	0.0386	0.0000	0.8290	0.5222	0.5396	0.3561	0.1582	1.0000	0.0812	0.1594
sun2021/default	0.0354	0.1375	0.1568	0.0277	0.0002	0.3274	0.3660	0.0356	0.0001	0.9185	0.0000	0.1145	0.2616	0.3436	0.0349	0.7573	0.0812	1.0000	0.0093
zplane/auftakt_v3	0.0005	0.0005	0.0011	0.0002	0.6198	0.0840	0.0395	0.5048	0.1088	0.0027	0.0000	0.1722	0.0534	0.0490	0.4262	0.0131	0.1594	0.0093	1.0000

Table 28: Paired t-test p-values for AOE₂, using reference annotations 2.0 as ground truth. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., the estimates of the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	echonest/version_3_2_1	gkiokas2012/default	klapuri2006/percival2014	oliveira2010/ibt	percival2014/stem	scheirer1998/percival2014	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018	sun2021/default	zplane/auftakt_v3
boeck2015/tempodetector2016_default	1.0000	0.9840	0.8485	0.6923	0.0000	0.0333	0.0377	0.0020	0.0000	0.2639	0.0000	0.0022	0.0175	0.0340	0.0013	0.1243	0.0069	0.0136	0.0002
boeck2019/multi_task	0.9840	1.0000	0.8621	0.8305	0.0000	0.0120	0.0363	0.0055	0.0000	0.3522	0.0000	0.0124	0.0088	0.0411	0.0022	0.1500	0.0040	0.1249	0.0003
boeck2019/multi_task_hjdb	0.8485	0.8621	1.0000	0.6846	0.0000	0.0586	0.0637	0.0047	0.0000	0.3819	0.0000	0.0098	0.0220	0.0495	0.0042	0.1882	0.0156	0.1017	0.0006
boeck2020/dar	0.6923	0.8305	0.6846	1.0000	0.0000	0.0241	0.0311	0.0019	0.0000	0.2249	0.0000	0.0011	0.0173	0.0260	0.0011	0.1216	0.0041	0.0163	0.0001
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0012	0.0014	0.1337	0.1112	0.0000	0.0000	0.0459	0.0053	0.0167	0.2309	0.0011	0.0315	0.0002	0.8510
echonest/version_3_2_1	0.0333	0.0120	0.0586	0.0241	0.0012	1.0000	0.8508	0.1813	0.0004	0.2386	0.0000	0.6388	0.8436	0.9852	0.2052	0.4819	0.7166	0.3397	0.0481
gkiokas2012/default	0.0377	0.0363	0.0637	0.0311	0.0014	0.8508	1.0000	0.1959	0.0003	0.2471	0.0000	0.6076	0.8728	0.9132	0.1437	0.5160	0.4905	0.3428	0.0298
klapuri2006/percival2014	0.0020	0.0055	0.0047	0.0019	0.1337	0.1813	0.1959	1.0000	0.0099	0.0339	0.0000	0.5105	0.2604	0.2408	0.9205	0.0823	0.5724	0.0414	0.3757
oliveira2010/ibt	0.0000	0.0000	0.0000	0.0000	0.1112	0.0004	0.0003	0.0099	1.0000	0.0000	0.0020	0.0070	0.0002	0.0005	0.0179	0.0005	0.0017	0.0001	0.1536
percival2014/stem	0.2639	0.3522	0.3819	0.2249	0.0000	0.2386	0.2471	0.0339	0.0000	1.0000	0.0000	0.1045	0.1920	0.2565	0.0013	0.6254	0.0330	0.8293	0.0011
scheirer1998/percival2014	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0020	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0022	0.0124	0.0098	0.0011	0.0459	0.6388	0.6076	0.5105	0.0070	0.1045	0.0000	1.0000	0.6984	0.6623	0.4187	0.2328	0.9391	0.1037	0.1346
schreiber2017/ismir2017	0.0175	0.0088	0.0220	0.0173	0.0053	0.8436	0.8728	0.2604	0.0002	0.1920	0.0000	0.6984	1.0000	0.9653	0.2304	0.4808	0.5831	0.2681	0.0355
schreiber2017/mirex2017	0.0340	0.0411	0.0495	0.0260	0.0167	0.9852	0.9132	0.2408	0.0005	0.2565	0.0000	0.6623	0.9653	1.0000	0.2401	0.5416	0.6197	0.3376	0.0350
schreiber2018/cnn	0.0013	0.0022	0.0042	0.0011	0.2309	0.2052	0.1437	0.9205	0.0179	0.0013	0.0000	0.4187	0.2304	0.2401	1.0000	0.0486	0.2729	0.0364	0.3148
schreiber2018/fcn	0.1243	0.1500	0.1882	0.1216	0.0011	0.4819	0.5160	0.0823	0.0005	0.6254	0.0000	0.2328	0.4808	0.5416	0.0486	1.0000	0.1934	0.7559	0.0081
schreiber2018/ismir2018	0.0069	0.0040	0.0156	0.0041	0.0315	0.7166	0.4905	0.5724	0.0017	0.0330	0.0000	0.9391	0.5831	0.6197	0.2729	0.1934	1.0000	0.0981	0.0833
sun2021/default	0.0136	0.1249	0.1017	0.0163	0.0002	0.3397	0.3428	0.0414	0.0001	0.8293	0.0000	0.1037	0.2681	0.3376	0.0364	0.7559	0.0981	1.0000	0.0061
zplane/auftakt_v3	0.0002	0.0003	0.0006	0.0001	0.8510	0.0481	0.0298	0.3757	0.1536	0.0011	0.0000	0.1346	0.0355	0.0350	0.3148	0.0081	0.0833	0.0061	1.0000

Table 29: Paired t-test p-values for AOE₂, using reference annotations 3.0 as ground truth. H₀: the true mean difference between paired samples is zero. If p ≤ α, we reject H₀, i.e., the estimates of the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

AOE₁ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?
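The subset selection can be sketched as follows. The example assumes that c_var is the coefficient of variation (standard deviation divided by mean) of a track's inter-beat intervals and uses the population standard deviation. The track names, beat times, and error values are hypothetical.

```python
from statistics import mean, pstdev


def c_var(beat_times):
    """Coefficient of variation of the inter-beat intervals (assumed definition)."""
    ibis = [b - a for a, b in zip(beat_times, beat_times[1:])]
    return pstdev(ibis) / mean(ibis)


def mean_error_below(tau, c_vars, errors):
    """Mean error over tracks with c_var < tau, or None if no track qualifies."""
    selected = [errors[t] for t, cv in c_vars.items() if cv < tau and t in errors]
    return mean(selected) if selected else None


# Hypothetical beat annotations in seconds and per-track AOE values.
beats = {
    "steady": [0.0, 0.5, 1.0, 1.5, 2.0],
    "rubato": [0.0, 0.4, 1.0, 1.5, 2.3],
}
errors = {"steady": 0.0, "rubato": 0.5}

c_vars = {track: c_var(times) for track, times in beats.items()}
for tau in (0.1, 0.5):
    print(tau, mean_error_below(tau, c_vars, errors))
```

Raising τ admits tracks with less stable beats, so each curve in the figures summarizes a growing subset of the dataset.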

AOE₁ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 71: Mean AOE₁ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

AOE₁ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 72: Mean AOE₁ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 1.0.

AOE₁ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 73: Mean AOE₁ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 1.0.

AOE₂ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?

AOE₂ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 74: Mean AOE₂ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

AOE₂ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 75: Mean AOE₂ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 1.0.

AOE₂ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 76: Mean AOE₂ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 1.0.

AOE₁ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show the mean AOE₁ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
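A minimal sketch of the windowed mean follows. It assumes that a track belongs to the subset for T when its reference tempo lies in the closed interval [T-10, T+10] BPM. The track identifiers, tempi, and error values are hypothetical.

```python
from statistics import mean


def mean_error_around(T, references, errors, half_width=10.0):
    """Mean error over tracks whose reference tempo lies in [T - w, T + w]."""
    selected = [
        errors[track]
        for track, ref in references.items()
        if T - half_width <= ref <= T + half_width and track in errors
    ]
    # None signals an empty subset; small subsets give unreliable means.
    return mean(selected) if selected else None


# Hypothetical reference tempi in BPM and per-track AOE values.
references = {"a": 62.0, "b": 118.0, "c": 125.0, "d": 190.0}
errors = {"a": 1.0, "b": 0.0, "c": 0.5, "d": 1.0}

for T in (60, 120, 300):
    print(T, mean_error_around(T, references, errors))
```

Since neighboring windows overlap, a single track contributes to the values at several T, which smooths the curves but does not add information.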

AOE₁ on Tempo-Subsets for 1.0

Figure 77: Mean AOE₁ for estimates compared to version 1.0 for tempo intervals around T.

AOE₁ on Tempo-Subsets for 2.0

Figure 78: Mean AOE₁ for estimates compared to version 2.0 for tempo intervals around T.

AOE₁ on Tempo-Subsets for 3.0

Figure 79: Mean AOE₁ for estimates compared to version 3.0 for tempo intervals around T.

AOE₂ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show the mean AOE₂ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

AOE₂ on Tempo-Subsets for 1.0

Figure 80: Mean AOE₂ for estimates compared to version 1.0 for tempo intervals around T.

AOE₂ on Tempo-Subsets for 2.0

Figure 81: Mean AOE₂ for estimates compared to version 2.0 for tempo intervals around T.

AOE₂ on Tempo-Subsets for 3.0

Figure 82: Mean AOE₂ for estimates compared to version 3.0 for tempo intervals around T.

Estimated AOE₁ for Tempo

When a generalized additive model (GAM) is fit to the AOE₁ values as a function of the ground-truth tempo, what AOE₁ can we expect at a given tempo, and with what confidence?
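For orientation, a GAM with a single smooth term can be written as below. This is a generic textbook form under stated assumptions (Gaussian errors, identity link, a spline basis b_k with K basis functions, a pointwise normal-approximation interval). It is not necessarily the exact model configuration used for the figures.

```latex
\begin{align*}
\mathrm{AOE}_1 &= \beta_0 + f(T) + \varepsilon,
  \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2), \\
f(T) &= \sum_{k=1}^{K} \beta_k \, b_k(T), \\
\mathrm{CI}_{95\%}(T) &= \widehat{\mathrm{AOE}}_1(T) \pm 1.96 \,
  \widehat{\mathrm{se}}\bigl(\widehat{\mathrm{AOE}}_1(T)\bigr).
\end{align*}
```

Here T is the reference tempo in BPM and f is a smooth function whose wiggliness is typically controlled by a penalty. The interval widens where few reference tempi are available.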

Estimated AOE₁ for Tempo for 1.0

Predictions of GAMs fit to the AOE₁ of the estimates relative to reference 1.0.

Figure 83: AOE₁ predictions of a generalized additive model (GAM) fit to AOE₁ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated AOE₁ for Tempo for 2.0

Predictions of GAMs fit to the AOE₁ of the estimates relative to reference 2.0.

Figure 84: AOE₁ predictions of a generalized additive model (GAM) fit to AOE₁ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated AOE₁ for Tempo for 3.0

Predictions of GAMs fit to the AOE₁ of the estimates relative to reference 3.0.

Figure 85: AOE₁ predictions of a generalized additive model (GAM) fit to AOE₁ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated AOE₂ for Tempo

When a generalized additive model (GAM) is fit to the AOE₂ values as a function of the ground-truth tempo, what AOE₂ can we expect at a given tempo, and with what confidence?

Estimated AOE₂ for Tempo for 1.0

Predictions of GAMs fit to the AOE₂ of the estimates relative to reference 1.0.

Figure 86: AOE₂ predictions of a generalized additive model (GAM) fit to AOE₂ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated AOE₂ for Tempo for 2.0

Predictions of GAMs fit to the AOE₂ of the estimates relative to reference 2.0.

Figure 87: AOE₂ predictions of a generalized additive model (GAM) fit to AOE₂ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

Estimated AOE₂ for Tempo for 3.0

Predictions of GAMs fit to the AOE₂ of the estimates relative to reference 3.0.

Figure 88: AOE₂ predictions of a generalized additive model (GAM) fit to AOE₂ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

AOE₁ for ‘tag_open’ Tags

How well does an estimator perform when only tracks that carry a given tag are taken into account? Note that some values may be based on very few estimates.
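The per-tag aggregation can be sketched as follows. The example reports the number of tracks next to each mean so that values based on very few estimates are easy to spot. The tag names, track identifiers, and error values are hypothetical and do not come from the ‘tag_open’ namespace of this corpus.

```python
from collections import defaultdict
from statistics import mean


def mean_error_per_tag(track_tags, errors):
    """Map each tag to (mean error, number of tracks) over the tagged tracks."""
    by_tag = defaultdict(list)
    for track, tags in track_tags.items():
        if track not in errors:
            continue  # no estimate available for this track
        for tag in tags:
            by_tag[tag].append(errors[track])
    return {tag: (mean(values), len(values)) for tag, values in by_tag.items()}


# Hypothetical tags and per-track AOE values.
track_tags = {"t1": ["jazz"], "t2": ["jazz", "solo"], "t3": ["choral"]}
errors = {"t1": 0.0, "t2": 1.0, "t3": 0.5}

for tag, (value, count) in sorted(mean_error_per_tag(track_tags, errors).items()):
    print(f"{tag}: mean={value:.2f} (n={count})")
```

A track with several tags contributes to every one of its tags, so the per-tag subsets may overlap.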

AOE₁ for ‘tag_open’ Tags for 1.0

Figure 89: AOE₁ of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

AOE₁ for ‘tag_open’ Tags for 2.0

Figure 90: AOE₁ of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

AOE₁ for ‘tag_open’ Tags for 3.0

Figure 91: AOE₁ of estimates compared to version 3.0 depending on tag from namespace ‘tag_open’.

AOE₂ for ‘tag_open’ Tags

How well does an estimator perform when only tracks that carry a given tag are taken into account? Note that some values may be based on very few estimates.

AOE₂ for ‘tag_open’ Tags for 1.0

Figure 92: AOE₂ of estimates compared to version 1.0 depending on tag from namespace ‘tag_open’.

AOE₂ for ‘tag_open’ Tags for 2.0

Figure 93: AOE₂ of estimates compared to version 2.0 depending on tag from namespace ‘tag_open’.

AOE₂ for ‘tag_open’ Tags for 3.0

Figure 94: AOE₂ of estimates compared to version 3.0 depending on tag from namespace ‘tag_open’.