hjdb

This is the tempo_eval report for the ‘hjdb’ corpus.

Reports for other corpora may be found here.


References for ‘hjdb’

References

1.0

Attribute	Value
Corpus	hjdb
Version	1.0
Curator	Jason Hockman
Data Source	manual annotation, GitHub repository of Sebastian Böck
Annotation Tools	derived from beat annotations
Annotation Rules	median of inter beat intervals
Annotator, bibtex	Hockman2012
Annotator, ref_url	https://github.com/superbock/ISMIR2019
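
The annotation rule "median of inter beat intervals" can be illustrated with a short sketch. The function name and the example beat times below are hypothetical, and the code assumes beat positions are given in seconds; it is not taken from the tooling that produced these annotations.

```python
from statistics import median


def tempo_from_beats(beat_times):
    """Return a tempo in BPM as 60 divided by the median inter-beat interval.

    beat_times: ascending beat positions in seconds (assumed unit).
    """
    if len(beat_times) < 2:
        raise ValueError("need at least two beats to form an interval")
    # Inter-beat intervals: differences between consecutive beat positions.
    ibis = [later - earlier for earlier, later in zip(beat_times, beat_times[1:])]
    return 60.0 / median(ibis)


# Hypothetical beats spaced 0.5 s apart correspond to 120 BPM.
print(tempo_from_beats([0.0, 0.5, 1.0, 1.5, 2.0]))
```

Using the median rather than the mean keeps a single misplaced beat from shifting the derived tempo.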

2.0

Attribute	Value
Corpus	hjdb
Version	2.0
Curator	Jason Hockman
Data Source	manual annotation, GitHub repository of Sebastian Böck
Annotation Tools	derived from beat annotations
Annotation Rules	based on median of inter corresponding beat intervals
Annotator, bibtex	Hockman2012
Annotator, ref_url	https://github.com/superbock/ISMIR2019
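
The report does not define "corresponding" beats. One plausible reading, sketched below purely as an assumption, is that intervals are measured between beats a fixed number of positions apart (for example one bar) and then divided by that span, which reduces the effect of timing jitter on individual beats. The function name, the `span` parameter and its default of 4 are hypothetical.

```python
from statistics import median


def tempo_from_corresponding_beats(beat_times, span=4):
    """Tempo in BPM from intervals between beats that lie `span` positions apart.

    beat_times: ascending beat positions in seconds (assumed unit).
    span: number of beats between "corresponding" beats (assumed, e.g. one 4/4 bar).
    """
    if span < 1:
        raise ValueError("span must be at least 1")
    if len(beat_times) <= span:
        raise ValueError("not enough beats for the requested span")
    # Each long interval covers `span` beat periods, so divide to get one period.
    periods = [
        (later - earlier) / span
        for earlier, later in zip(beat_times, beat_times[span:])
    ]
    return 60.0 / median(periods)


# Hypothetical, slightly jittered beats around a 0.5 s period.
beats = [0.0, 0.51, 1.0, 1.49, 2.0, 2.51, 3.0, 3.49, 4.0]
print(round(tempo_from_corresponding_beats(beats), 2))
```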

3.0

Attribute	Value
Corpus	hjdb
Version	3.0
Curator	Sebastian Böck
Data Source	GitHub repository of Sebastian Böck
Annotation Tools	unknown
Annotation Rules	unknown
Annotator, bibtex	Boeck2019
Annotator, ref_url	https://github.com/superbock/ISMIR2019

Basic Statistics

Reference	Size	Min	Max	Avg	Stdev	Sweet Oct. Start	Sweet Oct. Coverage
1.0	235	120.00	171.43	152.15	10.24	86.00	1.00
2.0	235	120.00	172.66	152.38	10.36	87.00	1.00
3.0	235	120.00	171.43	152.14	10.24	86.00	1.00

Table 1: Basic statistics.
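
The "Sweet Oct." columns can be read as the tempo octave [start, 2·start) that covers the largest share of the annotations, together with that share. The sketch below is a minimal illustration under that assumption; the integer search grid, the half-open interval and the tie-breaking rule (lowest start wins) are guesses rather than the documented behaviour of tempo_eval.

```python
def sweet_octave(tempi, starts=range(20, 200)):
    """Return (start, coverage) of the octave [start, 2*start) covering most tempi.

    tempi: tempo values in BPM.
    starts: candidate octave starts in BPM (assumed integer grid).
    Ties are resolved in favour of the lowest start (assumption).
    """
    if not tempi:
        raise ValueError("need at least one tempo value")
    best_start, best_coverage = None, -1.0
    for start in starts:
        coverage = sum(start <= t < 2 * start for t in tempi) / len(tempi)
        if coverage > best_coverage:
            best_start, best_coverage = start, coverage
    return best_start, best_coverage


# Toy input spanning the Min and Max reported for version 1.0.
print(sweet_octave([120.0, 150.0, 171.43]))
```

With these assumptions, a set of tempi between 120.00 and 171.43 BPM yields a start of 86 and a coverage of 1.0, which is consistent with the first row of Table 1.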


Smoothed Tempo Distribution

Figure 1: Percentage of values in tempo interval.


Beat-Based Tempo Variation

Figure 2: Fraction of the dataset with beat-annotated tracks with c_var < τ.
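
The report does not spell out c_var. The sketch below assumes it is the coefficient of variation of a track's inter-beat intervals (standard deviation divided by mean) and shows how the fraction of tracks below a threshold τ could be computed. The function names and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def c_var(beat_times):
    """Coefficient of variation of the inter-beat intervals of one track."""
    if len(beat_times) < 2:
        raise ValueError("need at least two beats to form an interval")
    ibis = [later - earlier for earlier, later in zip(beat_times, beat_times[1:])]
    # Population standard deviation is an assumption; a sample estimate would also work.
    return pstdev(ibis) / mean(ibis)


def fraction_below(c_vars, tau):
    """Fraction of tracks whose c_var is strictly below the threshold tau."""
    return sum(value < tau for value in c_vars) / len(c_vars)


# Hypothetical tracks: one perfectly steady, one with a tempo change.
steady = c_var([0.0, 0.5, 1.0, 1.5])
varying = c_var([0.0, 0.5, 1.0, 2.0])
print(steady, fraction_below([steady, varying], tau=0.05))
```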


Estimates for ‘hjdb’

Estimators

boeck2015/tempodetector2016_default

Attribute	Value
Corpus	hjdb
Version	0.17.dev0
Annotation Tools	TempoDetector.2016, madmom, https://github.com/CPJKU/madmom
Annotator, bibtex	Boeck2015

boeck2019/multi_task

Attribute	Value
Corpus	hjdb
Version	0.0.1
Annotation Tools	model=multi_task, https://github.com/superbock/ISMIR2019
Annotator, bibtex	Boeck2019

boeck2019/multi_task_hjdb

Attribute	Value
Corpus	hjdb
Version	0.0.1
Annotation Tools	model=multi_task_hjdb, 8-fold cross validation, https://github.com/superbock/ISMIR2019
Annotator, bibtex	Boeck2019

boeck2020/dar

Attribute	Value
Corpus	hjdb
Version	0.0.1
Annotation Tools	https://github.com/superbock/ISMIR2020
Annotator, bibtex	Boeck2020

davies2009/mirex_qm_tempotracker

Attribute	Value
Corpus	hjdb
Version	1.0
Annotation Tools	QM Tempotracker, Sonic Annotator plugin. https://code.soundsoftware.ac.uk/projects/mirex2013/repository/show/audio_tempo_estimation/qm-tempotracker Note that the current macOS build of ‘qm-vamp-plugins’ was used.
Annotator, bibtex	Davies2009, Davies2007

percival2014/stem

Attribute	Value
Corpus	hjdb
Version	1.0
Annotation Tools	percival 2014, ‘tempo’ implementation from Marsyas, http://marsyas.info, git checkout tempo-stem
Annotator, bibtex	Percival2014

schreiber2014/default

Attribute	Value
Corpus	hjdb
Version	0.0.1
Annotation Tools	schreiber 2014, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex	Schreiber2014

schreiber2017/ismir2017

Attribute	Value
Corpus	hjdb
Version	0.0.4
Annotation Tools	schreiber 2017, model=ismir2017, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex	Schreiber2017

schreiber2017/mirex2017

Attribute	Value
Corpus	hjdb
Version	0.0.4
Annotation Tools	schreiber 2017, model=mirex2017, http://www.tagtraum.com/tempo_estimation.html
Annotator, bibtex	Schreiber2017

schreiber2018/cnn

Attribute	Value
Corpus	hjdb
Version	0.0.4
Data Source	Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools	schreiber tempo-cnn (model=cnn), https://github.com/hendriks73/tempo-cnn

schreiber2018/fcn

Attribute	Value
Corpus	hjdb
Version	0.0.4
Data Source	Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools	schreiber tempo-cnn (model=fcn), https://github.com/hendriks73/tempo-cnn

schreiber2018/ismir2018

Attribute	Value
Corpus	hjdb
Version	0.0.4
Data Source	Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018.
Annotation Tools	schreiber tempo-cnn (model=ismir2018), https://github.com/hendriks73/tempo-cnn

Basic Statistics

Estimator	Size	Min	Max	Avg	Stdev	Sweet Oct. Start	Sweet Oct. Coverage
boeck2015/tempodetector2016_default	236	42.55	171.43	146.62	22.57	86.00	0.93
boeck2019/multi_task	235	76.04	204.93	141.07	27.73	100.00	0.86
boeck2019/multi_task_hjdb	235	119.79	170.96	152.42	10.26	86.00	1.00
boeck2020/dar	235	120.26	172.58	151.71	10.03	87.00	1.00
davies2009/mirex_qm_tempotracker	236	71.78	166.71	125.27	30.86	79.00	0.82
percival2014/stem	236	69.84	159.63	96.34	29.52	70.00	0.83
schreiber2014/default	236	64.90	169.46	111.01	35.65	76.00	0.69
schreiber2017/ismir2017	236	69.00	167.18	120.13	34.81	75.00	0.70
schreiber2017/mirex2017	236	72.53	169.46	132.10	31.83	80.00	0.77
schreiber2018/cnn	236	77.00	170.00	148.10	19.45	85.00	0.95
schreiber2018/fcn	236	79.00	173.00	149.74	16.49	87.00	0.97
schreiber2018/ismir2018	236	77.00	176.00	145.80	22.31	85.00	0.92

Table 2: Basic statistics.


Smoothed Tempo Distribution

Figure 3: Percentage of values in tempo interval.


Accuracy

Accuracy₁ is defined as the percentage of correct estimates, allowing a 4% tolerance for individual BPM values.

Accuracy₂ additionally permits estimates to be wrong by a factor of 2, 3, 1/2 or 1/3 (so-called octave errors).

See [Gouyon2006].
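
The two definitions above can be sketched in a few lines. The function names are illustrative, and the choice to apply the tolerance relative to the (scaled) reference tempo is an assumption; this is not the tempo_eval implementation.

```python
def accuracy1(reference, estimate, tolerance=0.04):
    """True if the estimate lies within ±tolerance (relative) of the reference BPM."""
    return abs(estimate - reference) <= tolerance * reference


def accuracy2(reference, estimate, tolerance=0.04, factors=(1, 2, 3, 1 / 2, 1 / 3)):
    """Like accuracy1, but also accepts octave errors by the given factors."""
    return any(accuracy1(reference * f, estimate, tolerance) for f in factors)


def mean_accuracy(references, estimates, metric):
    """Fraction of items for which the metric judges the estimate correct."""
    hits = [metric(ref, est) for ref, est in zip(references, estimates)]
    return sum(hits) / len(hits)


# Hypothetical values: a 150 BPM reference with four different estimates.
refs = [150.0, 150.0, 150.0, 150.0]
ests = [151.0, 75.0, 300.0, 100.0]
print(mean_accuracy(refs, ests, accuracy1), mean_accuracy(refs, ests, accuracy2))
```

In this toy example the half-tempo and double-tempo estimates count as errors under Accuracy₁ but as correct under Accuracy₂, while the estimate of 100 BPM is wrong under both.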

Note: When comparing accuracy values for different algorithms, keep in mind that an algorithm may have been trained on the test set or that the test set may have even been created using one of the tested algorithms.

Accuracy Results for 1.0

Estimator	Accuracy₁	Accuracy₂
boeck2019/multi_task_hjdb	1.0000	1.0000
boeck2020/dar	0.9957	0.9957
schreiber2018/fcn	0.9660	1.0000
schreiber2018/cnn	0.9447	1.0000
boeck2015/tempodetector2016_default	0.9277	1.0000
schreiber2018/ismir2018	0.9064	0.9915
boeck2019/multi_task	0.8255	0.9617
schreiber2017/mirex2017	0.7404	0.9830
schreiber2017/ismir2017	0.5915	0.9830
davies2009/mirex_qm_tempotracker	0.5660	0.7872
schreiber2014/default	0.4638	0.9787
percival2014/stem	0.2851	1.0000

Table 3: Mean accuracy of estimates compared to version 1.0 with 4% tolerance ordered by Accuracy₁.


Accuracy₁ for 1.0

Figure 4: Mean Accuracy₁ for estimates compared to version 1.0 depending on tolerance.


Accuracy₂ for 1.0

Figure 5: Mean Accuracy₂ for estimates compared to version 1.0 depending on tolerance.


Accuracy Results for 2.0

Estimator	Accuracy₁	Accuracy₂
boeck2019/multi_task_hjdb	1.0000	1.0000
boeck2020/dar	0.9957	0.9957
schreiber2018/fcn	0.9660	1.0000
schreiber2018/cnn	0.9447	1.0000
boeck2015/tempodetector2016_default	0.9277	1.0000
schreiber2018/ismir2018	0.9064	0.9915
boeck2019/multi_task	0.8298	0.9660
schreiber2017/mirex2017	0.7404	0.9872
schreiber2017/ismir2017	0.5915	0.9872
davies2009/mirex_qm_tempotracker	0.5660	0.7872
schreiber2014/default	0.4681	0.9830
percival2014/stem	0.2851	1.0000

Table 4: Mean accuracy of estimates compared to version 2.0 with 4% tolerance ordered by Accuracy₁.


Accuracy₁ for 2.0

Figure 6: Mean Accuracy₁ for estimates compared to version 2.0 depending on tolerance.


Accuracy₂ for 2.0

Figure 7: Mean Accuracy₂ for estimates compared to version 2.0 depending on tolerance.


Accuracy Results for 3.0

Estimator	Accuracy₁	Accuracy₂
boeck2019/multi_task_hjdb	1.0000	1.0000
boeck2020/dar	0.9957	0.9957
schreiber2018/fcn	0.9660	1.0000
schreiber2018/cnn	0.9447	1.0000
boeck2015/tempodetector2016_default	0.9277	1.0000
schreiber2018/ismir2018	0.9064	0.9915
boeck2019/multi_task	0.8255	0.9617
schreiber2017/mirex2017	0.7404	0.9830
schreiber2017/ismir2017	0.5915	0.9830
davies2009/mirex_qm_tempotracker	0.5660	0.7872
schreiber2014/default	0.4638	0.9787
percival2014/stem	0.2851	1.0000

Table 5: Mean accuracy of estimates compared to version 3.0 with 4% tolerance ordered by Accuracy₁.


Accuracy₁ for 3.0

Figure 8: Mean Accuracy₁ for estimates compared to version 3.0 depending on tolerance.


Accuracy₂ for 3.0

Figure 9: Mean Accuracy₂ for estimates compared to version 3.0 depending on tolerance.


Differing Items

For which items did a given estimator fail to estimate a correct value with respect to a given ground truth? Are there items that are very difficult, unsuitable for the task, or incorrectly annotated, and therefore never estimated correctly regardless of which estimator is used?
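
The second question can be answered by intersecting the per-estimator sets of differing items: an item that no estimator gets right appears in every set. The sketch below uses a hypothetical function name and only two of the estimators as input.

```python
def never_correct(differing_items):
    """Return the items that every estimator got wrong.

    differing_items: dict mapping estimator name -> iterable of item names
    that differ from the reference for that estimator.
    """
    item_sets = [set(items) for items in differing_items.values()]
    if not item_sets:
        return set()
    # An item is never estimated correctly only if it is in every set.
    return set.intersection(*item_sets)


# Two estimators taken from the Accuracy₁ lists for reference 1.0.
differing = {
    "boeck2019/multi_task_hjdb": [],   # "No differences."
    "boeck2020/dar": ["Ruff!"],
}
print(never_correct(differing))
```

Because at least one estimator has an empty list, the intersection is empty, in line with the statement that all tracks were estimated 'correctly' by at least one system.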

Differing Items Accuracy₁

Items with different tempo annotations (Accuracy₁, 4% tolerance) in different versions:

1.0 compared with boeck2015/tempodetector2016_default (17 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Champion_Sound’ ‘Champion_Sound_(Doc_Scott_Remix)’ ‘Dance_Factor’ ‘Here_Comes_The_Drumz_(Drumz_VIP_Mix)’ ‘Light_Years’ ‘NHS_(Midday_Mix)’ ‘Nightmare_Walking’ ‘Renegade_Snares_(Foul_Play_Remix)’ ‘Serious_Sounds’ ‘Spiritual_Aura’ … CSV

1.0 compared with boeck2019/multi_task (41 differences): ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Breaking_Free_2’ ‘Casanova_(Down_To_Earth_Remix)’ ‘Chasin_A_Dream’ ‘Dark_Stranger_(Origin_Unknown_Remix)’ ‘Darkman’ ‘Devotion’ … CSV

1.0 compared with boeck2019/multi_task_hjdb: No differences.

1.0 compared with boeck2020/dar (1 difference): ‘Ruff!’ CSV

1.0 compared with davies2009/mirex_qm_tempotracker (102 differences): ‘A_Musical_Box’ ‘Aftershock’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ ‘Breaking_Free_2’ ‘Cant_Stop_Thinking_About’ ‘Champion_Sound’ ‘Cold_Fresh_Air_(Remix)’ ‘Come_Back_2_Me’ … CSV

1.0 compared with percival2014/stem (168 differences): ‘4_Am’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A21’ ‘A_Musical_Box’ ‘Another_Direction’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bish_Bosh’ ‘Bouncing’ ‘Breakage4’ … CSV

1.0 compared with schreiber2014/default (126 differences): ‘2_Bad_Mice_Take_You’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Alright_With_Me’ ‘Being_With_You’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Breakage4’ ‘Cant_Stop_The_Rush_(93_Remix)’ ‘Cant_Stop_Thinking_About’ … CSV

1.0 compared with schreiber2017/ismir2017 (96 differences): ‘4_Am’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A21’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ … CSV

1.0 compared with schreiber2017/mirex2017 (61 differences): ‘A_Musical_Box’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ ‘Breaking_Free’ ‘Cant_Stop_The_Rush_(93_Remix)’ ‘Cant_Stop_Thinking_About’ ‘Casanova_(Down_To_Earth_Remix)’ ‘Champion_Sound’ … CSV

1.0 compared with schreiber2018/cnn (13 differences): ‘4_Am’ ‘Breakage4’ ‘Cant_Stop_Thinking_About’ ‘Devotion’ ‘Further_Intrigue’ ‘Last_Action_Hero’ ‘Open_Your_Mind’ ‘Rhythm’ ‘Rock_To_The_Groove’ ‘Sweet_Vibrations’ ‘The_Helicopter_Tune’ … CSV

1.0 compared with schreiber2018/fcn (8 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Jump_Mk_II’ ‘No_Worries’ ‘Skanka’ ‘Something_I_Feel_(2_Bad_Mice_Remix)’ ‘Stay_Calm_(Foul_Play_Remix)’ ‘Sweet_Vibrations’ ‘Touch_Somebody’ CSV

1.0 compared with schreiber2018/ismir2018 (22 differences): ‘A_Musical_Box’ ‘Being_With_You’ ‘Being_With_You_1’ ‘Breakage4’ ‘Cant_Stop_Thinking_About’ ‘Chasin_A_Dream’ ‘Come_Back_2_Me’ ‘Fearless_Wonder_(Remix)’ ‘Fuckin_Hardcore’ ‘Hands_Of_Time’ ‘I_Need_Your_Lovin’ … CSV

2.0 compared with boeck2015/tempodetector2016_default (17 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Champion_Sound’ ‘Champion_Sound_(Doc_Scott_Remix)’ ‘Dance_Factor’ ‘Here_Comes_The_Drumz_(Drumz_VIP_Mix)’ ‘Light_Years’ ‘NHS_(Midday_Mix)’ ‘Nightmare_Walking’ ‘Renegade_Snares_(Foul_Play_Remix)’ ‘Serious_Sounds’ ‘Spiritual_Aura’ … CSV

2.0 compared with boeck2019/multi_task (40 differences): ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Breaking_Free_2’ ‘Casanova_(Down_To_Earth_Remix)’ ‘Chasin_A_Dream’ ‘Dark_Stranger_(Origin_Unknown_Remix)’ ‘Darkman’ ‘Devotion’ … CSV

2.0 compared with boeck2019/multi_task_hjdb: No differences.

2.0 compared with boeck2020/dar (1 difference): ‘Ruff!’ CSV

2.0 compared with davies2009/mirex_qm_tempotracker (102 differences): ‘A_Musical_Box’ ‘Aftershock’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ ‘Breaking_Free_2’ ‘Cant_Stop_Thinking_About’ ‘Champion_Sound’ ‘Cold_Fresh_Air_(Remix)’ ‘Come_Back_2_Me’ … CSV

2.0 compared with percival2014/stem (168 differences): ‘4_Am’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A21’ ‘A_Musical_Box’ ‘Another_Direction’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bish_Bosh’ ‘Bouncing’ ‘Breakage4’ … CSV

2.0 compared with schreiber2014/default (125 differences): ‘2_Bad_Mice_Take_You’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Alright_With_Me’ ‘Being_With_You’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Breakage4’ ‘Cant_Stop_The_Rush_(93_Remix)’ ‘Cant_Stop_Thinking_About’ … CSV

2.0 compared with schreiber2017/ismir2017 (96 differences): ‘4_Am’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A21’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ … CSV

2.0 compared with schreiber2017/mirex2017 (61 differences): ‘A_Musical_Box’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ ‘Breaking_Free’ ‘Cant_Stop_The_Rush_(93_Remix)’ ‘Cant_Stop_Thinking_About’ ‘Casanova_(Down_To_Earth_Remix)’ ‘Champion_Sound’ … CSV

2.0 compared with schreiber2018/cnn (13 differences): ‘4_Am’ ‘Breakage4’ ‘Cant_Stop_Thinking_About’ ‘Devotion’ ‘Further_Intrigue’ ‘Last_Action_Hero’ ‘Open_Your_Mind’ ‘Rhythm’ ‘Rock_To_The_Groove’ ‘Sweet_Vibrations’ ‘The_Helicopter_Tune’ … CSV

2.0 compared with schreiber2018/fcn (8 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Jump_Mk_II’ ‘No_Worries’ ‘Skanka’ ‘Something_I_Feel_(2_Bad_Mice_Remix)’ ‘Stay_Calm_(Foul_Play_Remix)’ ‘Sweet_Vibrations’ ‘Touch_Somebody’ CSV

2.0 compared with schreiber2018/ismir2018 (22 differences): ‘A_Musical_Box’ ‘Being_With_You’ ‘Being_With_You_1’ ‘Breakage4’ ‘Cant_Stop_Thinking_About’ ‘Chasin_A_Dream’ ‘Come_Back_2_Me’ ‘Fearless_Wonder_(Remix)’ ‘Fuckin_Hardcore’ ‘Hands_Of_Time’ ‘I_Need_Your_Lovin’ … CSV

3.0 compared with boeck2015/tempodetector2016_default (17 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Champion_Sound’ ‘Champion_Sound_(Doc_Scott_Remix)’ ‘Dance_Factor’ ‘Here_Comes_The_Drumz_(Drumz_VIP_Mix)’ ‘Light_Years’ ‘NHS_(Midday_Mix)’ ‘Nightmare_Walking’ ‘Renegade_Snares_(Foul_Play_Remix)’ ‘Serious_Sounds’ ‘Spiritual_Aura’ … CSV

3.0 compared with boeck2019/multi_task (41 differences): ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Breaking_Free_2’ ‘Casanova_(Down_To_Earth_Remix)’ ‘Chasin_A_Dream’ ‘Dark_Stranger_(Origin_Unknown_Remix)’ ‘Darkman’ ‘Devotion’ … CSV

3.0 compared with boeck2019/multi_task_hjdb: No differences.

3.0 compared with boeck2020/dar (1 difference): ‘Ruff!’ CSV

3.0 compared with davies2009/mirex_qm_tempotracker (102 differences): ‘A_Musical_Box’ ‘Aftershock’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ ‘Breaking_Free_2’ ‘Cant_Stop_Thinking_About’ ‘Champion_Sound’ ‘Cold_Fresh_Air_(Remix)’ ‘Come_Back_2_Me’ … CSV

3.0 compared with percival2014/stem (168 differences): ‘4_Am’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A21’ ‘A_Musical_Box’ ‘Another_Direction’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bish_Bosh’ ‘Bouncing’ ‘Breakage4’ … CSV

3.0 compared with schreiber2014/default (126 differences): ‘2_Bad_Mice_Take_You’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Alright_With_Me’ ‘Being_With_You’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Breakage4’ ‘Cant_Stop_The_Rush_(93_Remix)’ ‘Cant_Stop_Thinking_About’ … CSV

3.0 compared with schreiber2017/ismir2017 (96 differences): ‘4_Am’ ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘A21’ ‘A_Musical_Box’ ‘A_New_Dawn’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Being_With_You_1’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ … CSV

3.0 compared with schreiber2017/mirex2017 (61 differences): ‘A_Musical_Box’ ‘Being_With_You’ ‘Being_With_You_(Foul_Play_Remix)’ ‘Beyond_Bass’ ‘Bouncing’ ‘Breakage4’ ‘Breaking_Free’ ‘Cant_Stop_The_Rush_(93_Remix)’ ‘Cant_Stop_Thinking_About’ ‘Casanova_(Down_To_Earth_Remix)’ ‘Champion_Sound’ … CSV

3.0 compared with schreiber2018/cnn (13 differences): ‘4_Am’ ‘Breakage4’ ‘Cant_Stop_Thinking_About’ ‘Devotion’ ‘Further_Intrigue’ ‘Last_Action_Hero’ ‘Open_Your_Mind’ ‘Rhythm’ ‘Rock_To_The_Groove’ ‘Sweet_Vibrations’ ‘The_Helicopter_Tune’ … CSV

3.0 compared with schreiber2018/fcn (8 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Jump_Mk_II’ ‘No_Worries’ ‘Skanka’ ‘Something_I_Feel_(2_Bad_Mice_Remix)’ ‘Stay_Calm_(Foul_Play_Remix)’ ‘Sweet_Vibrations’ ‘Touch_Somebody’ CSV

3.0 compared with schreiber2018/ismir2018 (22 differences): ‘A_Musical_Box’ ‘Being_With_You’ ‘Being_With_You_1’ ‘Breakage4’ ‘Cant_Stop_Thinking_About’ ‘Chasin_A_Dream’ ‘Come_Back_2_Me’ ‘Fearless_Wonder_(Remix)’ ‘Fuckin_Hardcore’ ‘Hands_Of_Time’ ‘I_Need_Your_Lovin’ … CSV

All tracks were estimated ‘correctly’ by at least one system.

Differing Items Accuracy₂

Items with different tempo annotations (Accuracy₂, 4% tolerance) in different versions:

1.0 compared with boeck2015/tempodetector2016_default: No differences.

1.0 compared with boeck2019/multi_task (9 differences): ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘Dark_Stranger_(Origin_Unknown_Remix)’ ‘Devotion’ ‘Finest_Illusion_(Legal_Mix)’ ‘Promised_Land’ ‘T-N-T’ ‘The_Element_(Highnoon)’ ‘The_R’ ‘We_Are_Hardcore’ CSV

1.0 compared with boeck2019/multi_task_hjdb: No differences.

1.0 compared with boeck2020/dar (1 difference): ‘Ruff!’ CSV

1.0 compared with davies2009/mirex_qm_tempotracker (50 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Breaking_Free_2’ ‘Cold_Fresh_Air_(Remix)’ ‘Come_Back_2_Me’ ‘Crystalize’ ‘Dream_Sequence’ ‘Enticer’ ‘Fearless_Wonder_(Remix)’ ‘Feel_(Feel_Good)’ ‘Finest_Illusion_(Legal_Mix)’ ‘Get_High_(New_Jack_London)’ … CSV

1.0 compared with percival2014/stem: No differences.

1.0 compared with schreiber2014/default (5 differences): ‘Cant_Stop_Thinking_About’ ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ ‘My_Own_(Hixxy_and_UFO_Remix)’ CSV

1.0 compared with schreiber2017/ismir2017 (4 differences): ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ ‘My_Own_(Hixxy_and_UFO_Remix)’ CSV

1.0 compared with schreiber2017/mirex2017 (4 differences): ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ ‘My_Own_(Hixxy_and_UFO_Remix)’ CSV

1.0 compared with schreiber2018/cnn: No differences.

1.0 compared with schreiber2018/fcn: No differences.

1.0 compared with schreiber2018/ismir2018 (2 differences): ‘Fearless_Wonder_(Remix)’ ‘Inna_Year_4000’ CSV

2.0 compared with boeck2015/tempodetector2016_default: No differences.

2.0 compared with boeck2019/multi_task (8 differences): ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘Dark_Stranger_(Origin_Unknown_Remix)’ ‘Devotion’ ‘Promised_Land’ ‘T-N-T’ ‘The_Element_(Highnoon)’ ‘The_R’ ‘We_Are_Hardcore’ CSV

2.0 compared with boeck2019/multi_task_hjdb: No differences.

2.0 compared with boeck2020/dar (1 difference): ‘Ruff!’ CSV

2.0 compared with davies2009/mirex_qm_tempotracker (50 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Breaking_Free_2’ ‘Cold_Fresh_Air_(Remix)’ ‘Come_Back_2_Me’ ‘Crystalize’ ‘Dream_Sequence’ ‘Enticer’ ‘Fearless_Wonder_(Remix)’ ‘Feel_(Feel_Good)’ ‘Finest_Illusion_(Legal_Mix)’ ‘Get_High_(New_Jack_London)’ … CSV

2.0 compared with percival2014/stem: No differences.

2.0 compared with schreiber2014/default (4 differences): ‘Cant_Stop_Thinking_About’ ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ CSV

2.0 compared with schreiber2017/ismir2017 (3 differences): ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ CSV

2.0 compared with schreiber2017/mirex2017 (3 differences): ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ CSV

2.0 compared with schreiber2018/cnn: No differences.

2.0 compared with schreiber2018/fcn: No differences.

2.0 compared with schreiber2018/ismir2018 (2 differences): ‘Fearless_Wonder_(Remix)’ ‘Inna_Year_4000’ CSV

3.0 compared with boeck2015/tempodetector2016_default: No differences.

3.0 compared with boeck2019/multi_task (9 differences): ‘6_Million_Ways_To_Die_(DJ_Hype_Remix)’ ‘Dark_Stranger_(Origin_Unknown_Remix)’ ‘Devotion’ ‘Finest_Illusion_(Legal_Mix)’ ‘Promised_Land’ ‘T-N-T’ ‘The_Element_(Highnoon)’ ‘The_R’ ‘We_Are_Hardcore’ CSV

3.0 compared with boeck2019/multi_task_hjdb: No differences.

3.0 compared with boeck2020/dar (1 difference): ‘Ruff!’ CSV

3.0 compared with davies2009/mirex_qm_tempotracker (50 differences): ‘Being_With_You_(Foul_Play_Remix)’ ‘Breaking_Free_2’ ‘Cold_Fresh_Air_(Remix)’ ‘Come_Back_2_Me’ ‘Crystalize’ ‘Dream_Sequence’ ‘Enticer’ ‘Fearless_Wonder_(Remix)’ ‘Feel_(Feel_Good)’ ‘Finest_Illusion_(Legal_Mix)’ ‘Get_High_(New_Jack_London)’ … CSV

3.0 compared with percival2014/stem: No differences.

3.0 compared with schreiber2014/default (5 differences): ‘Cant_Stop_Thinking_About’ ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ ‘My_Own_(Hixxy_and_UFO_Remix)’ CSV

3.0 compared with schreiber2017/ismir2017 (4 differences): ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ ‘My_Own_(Hixxy_and_UFO_Remix)’ CSV

3.0 compared with schreiber2017/mirex2017 (4 differences): ‘Crystalize’ ‘Deep_Space’ ‘Horizons’ ‘My_Own_(Hixxy_and_UFO_Remix)’ CSV

3.0 compared with schreiber2018/cnn: No differences.

3.0 compared with schreiber2018/fcn: No differences.

3.0 compared with schreiber2018/ismir2018 (2 differences): ‘Fearless_Wonder_(Remix)’ ‘Inna_Year_4000’ CSV

All tracks were estimated ‘correctly’ by at least one system.

Significance of Differences

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0018	0.0000	0.0001	0.0000	0.0000	0.0000	0.0000	0.0000	0.5716	0.0931	0.5114
boeck2019/multi_task	0.0018	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0205	0.0001	0.0000	0.0110
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0002	0.0078	0.0000
boeck2020/dar	0.0001	0.0000	1.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0018	0.0391	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0153	0.5713	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0153	0.0000	1.0000	0.0020	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.5713	0.0000	0.0020	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0205	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.5716	0.0001	0.0002	0.0018	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.3593	0.1078
schreiber2018/fcn	0.0931	0.0000	0.0078	0.0391	0.0000	0.0000	0.0000	0.0000	0.0000	0.3593	1.0000	0.0043
schreiber2018/ismir2018	0.5114	0.0110	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1078	0.0043	1.0000

Table 6: McNemar p-values, using reference annotations 3.0 as ground truth with Accuracy₁ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e. the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.
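
McNemar's test looks only at the discordant items, i.e. those that exactly one of the two estimators gets right. The sketch below computes the exact (binomial) two-sided p-value from the two discordant counts. The function names are hypothetical, and the assumption that the report uses the exact variant rather than the chi-square approximation is not stated in the source; it is merely consistent with values such as 0.0039 (nine versus zero discordant items) in the Accuracy₂ tables.

```python
from math import comb


def discordant_counts(wrong_a, wrong_b):
    """Count items that only estimator A, or only estimator B, got wrong."""
    wrong_a, wrong_b = set(wrong_a), set(wrong_b)
    return len(wrong_a - wrong_b), len(wrong_b - wrong_a)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under H0 each discordant item is equally likely to fall on either side,
    so min(b, c) follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant items: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(round(mcnemar_exact(9, 0), 4))
print(round(mcnemar_exact(9, 1), 4))
```

`math.comb` requires Python 3.8 or later.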


Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0027	0.0000	0.0001	0.0000	0.0000	0.0000	0.0000	0.0000	0.5716	0.0931	0.5114
boeck2019/multi_task	0.0027	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0154	0.0001	0.0000	0.0153
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0002	0.0078	0.0000
boeck2020/dar	0.0001	0.0000	1.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0018	0.0391	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0211	0.5713	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0211	0.0000	1.0000	0.0030	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.5713	0.0000	0.0030	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0154	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.5716	0.0001	0.0002	0.0018	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.3593	0.1078
schreiber2018/fcn	0.0931	0.0000	0.0078	0.0391	0.0000	0.0000	0.0000	0.0000	0.0000	0.3593	1.0000	0.0043
schreiber2018/ismir2018	0.5114	0.0153	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1078	0.0043	1.0000

Table 7: McNemar p-values, using reference annotations 2.0 as ground truth with Accuracy₁ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e. the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.


Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0018	0.0000	0.0001	0.0000	0.0000	0.0000	0.0000	0.0000	0.5716	0.0931	0.5114
boeck2019/multi_task	0.0018	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0205	0.0001	0.0000	0.0110
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0002	0.0078	0.0000
boeck2020/dar	0.0001	0.0000	1.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0018	0.0391	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0153	0.5713	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0153	0.0000	1.0000	0.0020	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.5713	0.0000	0.0020	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0205	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.5716	0.0001	0.0002	0.0018	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.3593	0.1078
schreiber2018/fcn	0.0931	0.0000	0.0078	0.0391	0.0000	0.0000	0.0000	0.0000	0.0000	0.3593	1.0000	0.0043
schreiber2018/ismir2018	0.5114	0.0110	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1078	0.0043	1.0000

Table 8: McNemar p-values, using reference annotations 1.0 as ground truth with Accuracy₁ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e. the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.


Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
boeck2019/multi_task	0.0039	1.0000	0.0039	0.0215	0.0000	0.0039	0.4240	0.2668	0.2668	0.0039	0.0039	0.0654
boeck2019/multi_task_hjdb	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
boeck2020/dar	1.0000	0.0215	1.0000	1.0000	0.0000	1.0000	0.2188	0.3750	0.3750	1.0000	1.0000	1.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
schreiber2014/default	0.0625	0.4240	0.0625	0.2188	0.0000	0.0625	1.0000	1.0000	1.0000	0.0625	0.0625	0.4531
schreiber2017/ismir2017	0.1250	0.2668	0.1250	0.3750	0.0000	0.1250	1.0000	1.0000	1.0000	0.1250	0.1250	0.6875
schreiber2017/mirex2017	0.1250	0.2668	0.1250	0.3750	0.0000	0.1250	1.0000	1.0000	1.0000	0.1250	0.1250	0.6875
schreiber2018/cnn	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
schreiber2018/fcn	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
schreiber2018/ismir2018	0.5000	0.0654	0.5000	1.0000	0.0000	0.5000	0.4531	0.6875	0.6875	0.5000	0.5000	1.0000

Table 9: McNemar p-values, using reference annotations 3.0 as ground truth with Accuracy₂ [Gouyon2006]. H₀: both estimators disagree with the ground truth to the same extent. If p ≤ α, reject H₀, i.e. the two estimators differ significantly in their disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.


Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0078	1.0000	1.0000	0.0000	1.0000	0.1250	0.2500	0.2500	1.0000	1.0000	0.5000
boeck2019/multi_task	0.0078	1.0000	0.0078	0.0391	0.0000	0.0078	0.3877	0.2266	0.2266	0.0078	0.0078	0.1094
boeck2019/multi_task_hjdb	1.0000	0.0078	1.0000	1.0000	0.0000	1.0000	0.1250	0.2500	0.2500	1.0000	1.0000	0.5000
boeck2020/dar	1.0000	0.0391	1.0000	1.0000	0.0000	1.0000	0.3750	0.6250	0.6250	1.0000	1.0000	1.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	1.0000	0.0078	1.0000	1.0000	0.0000	1.0000	0.1250	0.2500	0.2500	1.0000	1.0000	0.5000
schreiber2014/default	0.1250	0.3877	0.1250	0.3750	0.0000	0.1250	1.0000	1.0000	1.0000	0.1250	0.1250	0.6875
schreiber2017/ismir2017	0.2500	0.2266	0.2500	0.6250	0.0000	0.2500	1.0000	1.0000	1.0000	0.2500	0.2500	1.0000
schreiber2017/mirex2017	0.2500	0.2266	0.2500	0.6250	0.0000	0.2500	1.0000	1.0000	1.0000	0.2500	0.2500	1.0000
schreiber2018/cnn	1.0000	0.0078	1.0000	1.0000	0.0000	1.0000	0.1250	0.2500	0.2500	1.0000	1.0000	0.5000
schreiber2018/fcn	1.0000	0.0078	1.0000	1.0000	0.0000	1.0000	0.1250	0.2500	0.2500	1.0000	1.0000	0.5000
schreiber2018/ismir2018	0.5000	0.1094	0.5000	1.0000	0.0000	0.5000	0.6875	1.0000	1.0000	0.5000	0.5000	1.0000

Table 10: McNemar p-values, using reference annotations 2.0 as groundtruth with Accuracy₂ [Gouyon2006]. H₀: both estimators disagree with the groundtruth to the same amount. If p<=ɑ, reject H₀, i.e. we have a significant difference in the disagreement with the groundtruth. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
boeck2019/multi_task	0.0039	1.0000	0.0039	0.0215	0.0000	0.0039	0.4240	0.2668	0.2668	0.0039	0.0039	0.0654
boeck2019/multi_task_hjdb	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
boeck2020/dar	1.0000	0.0215	1.0000	1.0000	0.0000	1.0000	0.2188	0.3750	0.3750	1.0000	1.0000	1.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
schreiber2014/default	0.0625	0.4240	0.0625	0.2188	0.0000	0.0625	1.0000	1.0000	1.0000	0.0625	0.0625	0.4531
schreiber2017/ismir2017	0.1250	0.2668	0.1250	0.3750	0.0000	0.1250	1.0000	1.0000	1.0000	0.1250	0.1250	0.6875
schreiber2017/mirex2017	0.1250	0.2668	0.1250	0.3750	0.0000	0.1250	1.0000	1.0000	1.0000	0.1250	0.1250	0.6875
schreiber2018/cnn	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
schreiber2018/fcn	1.0000	0.0039	1.0000	1.0000	0.0000	1.0000	0.0625	0.1250	0.1250	1.0000	1.0000	0.5000
schreiber2018/ismir2018	0.5000	0.0654	0.5000	1.0000	0.0000	0.5000	0.4531	0.6875	0.6875	0.5000	0.5000	1.0000

Table 11: McNemar p-values, using reference annotations 1.0 as groundtruth with Accuracy₂ [Gouyon2006]. H₀: both estimators disagree with the groundtruth to the same amount. If p<=ɑ, reject H₀, i.e. we have a significant difference in the disagreement with the groundtruth. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE
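The McNemar p-values above depend only on the discordant tracks, i.e., tracks on which exactly one of the two estimators agrees with the ground truth. The following is a minimal Python sketch of such a test, assuming the exact (binomial) two-sided variant; the report does not state which variant was used, and the name `mcnemar_exact` and the sample data are illustrative. A value such as 0.0078 is at least consistent with this variant, since 2 · 0.5⁸ = 0.0078125.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two paired lists of booleans."""
    # Discordant pairs: tracks on which exactly one estimator is correct.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, nothing to distinguish
    k = min(only_a, only_b)
    # Under H0 each discordant pair favors either estimator with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical outcomes: A is correct on 8 tracks where B is wrong, never vice versa.
a = [True] * 8 + [True] * 4
b = [False] * 8 + [True] * 4
p_value = mcnemar_exact(a, b)  # 2 * 0.5**8 = 0.0078125
```

Tracks on which both estimators are correct, or both are wrong, do not influence the result, which is why many estimator pairs in the tables share the same p-value.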

Accuracy₁ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?
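The subset selection can be sketched in Python as follows. This is only an illustration under stated assumptions: c_var is taken to be the coefficient of variation of the inter-beat intervals (standard deviation divided by mean), Accuracy₁ is taken to use a 4% tolerance, and the function names and track data are hypothetical.

```python
from statistics import mean, pstdev


def c_var(beat_times):
    """Coefficient of variation of the inter-beat intervals (assumed definition)."""
    ibis = [t1 - t0 for t0, t1 in zip(beat_times, beat_times[1:])]
    return pstdev(ibis) / mean(ibis)


def accuracy1(estimate, reference, tolerance=0.04):
    """1 if the estimate lies within the (assumed) 4% tolerance, else 0."""
    return int(abs(estimate - reference) <= tolerance * reference)


def mean_accuracy1_below(tracks, tau):
    """Mean Accuracy1 over tracks with c_var < tau; None if no track qualifies."""
    subset = [t for t in tracks if c_var(t["beats"]) < tau]
    if not subset:
        return None
    return mean(accuracy1(t["estimate"], t["reference"]) for t in subset)


# Hypothetical tracks: one with a steady beat, one with an unsteady beat.
tracks = [
    {"beats": [0.0, 0.5, 1.0, 1.5, 2.0], "estimate": 120.0, "reference": 120.0},
    {"beats": [0.0, 0.4, 1.0, 1.3, 2.0], "estimate": 60.0, "reference": 120.0},
]
steady_only = mean_accuracy1_below(tracks, tau=0.05)  # only the first track
all_tracks = mean_accuracy1_below(tracks, tau=1.0)    # both tracks
```

Sweeping `tau` over a range of thresholds would yield one curve per estimator, comparable in spirit to the figures in this section.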

Accuracy₁ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 10: Mean Accuracy₁ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₁ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 11: Mean Accuracy₁ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₁ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 12: Mean Accuracy₁ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₂ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?

Accuracy₂ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 13: Mean Accuracy₂ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₂ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 14: Mean Accuracy₂ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₂ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 15: Mean Accuracy₂ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₁ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy₁ for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
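The windowed averaging described here can be sketched in Python as shown below. The function name, the input layout (pairs of reference tempo and per-track metric value), and the sample data are assumptions made for illustration. The sketch also returns the number of tracks in each window, since a mean based on very few estimates deserves caution.

```python
def mean_metric_around(pairs, center, half_width=10.0):
    """Mean metric over tracks whose reference tempo lies in
    [center - half_width, center + half_width], plus the track count."""
    values = [value for ref, value in pairs if abs(ref - center) <= half_width]
    if not values:
        return None, 0
    return sum(values) / len(values), len(values)


# Hypothetical (reference tempo in BPM, Accuracy1 value) pairs.
pairs = [(120.0, 1), (150.0, 0), (155.0, 1), (171.0, 1)]

mean_160, count_160 = mean_metric_around(pairs, center=160.0)  # uses 150 and 155 BPM
mean_100, count_100 = mean_metric_around(pairs, center=100.0)  # empty window
```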

Accuracy₁ on Tempo-Subsets for 1.0

Figure 16: Mean Accuracy₁ for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₁ on Tempo-Subsets for 2.0

Figure 17: Mean Accuracy₁ for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₁ on Tempo-Subsets for 3.0

Figure 18: Mean Accuracy₁ for estimates compared to version 3.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₂ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy₂ for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

Accuracy₂ on Tempo-Subsets for 1.0

Figure 19: Mean Accuracy₂ for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₂ on Tempo-Subsets for 2.0

Figure 20: Mean Accuracy₂ for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Accuracy₂ on Tempo-Subsets for 3.0

Figure 21: Mean Accuracy₂ for estimates compared to version 3.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy₁ for Tempo

When a generalized additive model (GAM) is fit to the Accuracy₁ values as a function of the ground-truth tempo, what Accuracy₁ can we expect, and with what confidence?

Estimated Accuracy₁ for Tempo for 1.0

Predictions of GAMs trained on Accuracy₁ for estimates for reference 1.0.

Figure 22: Accuracy₁ predictions of a generalized additive model (GAM) fit to Accuracy₁ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy₁ for Tempo for 2.0

Predictions of GAMs trained on Accuracy₁ for estimates for reference 2.0.

Figure 23: Accuracy₁ predictions of a generalized additive model (GAM) fit to Accuracy₁ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy₁ for Tempo for 3.0

Predictions of GAMs trained on Accuracy₁ for estimates for reference 3.0.

Figure 24: Accuracy₁ predictions of a generalized additive model (GAM) fit to Accuracy₁ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy₂ for Tempo

When a generalized additive model (GAM) is fit to the Accuracy₂ values as a function of the ground-truth tempo, what Accuracy₂ can we expect, and with what confidence?

Estimated Accuracy₂ for Tempo for 1.0

Predictions of GAMs trained on Accuracy₂ for estimates for reference 1.0.

Figure 25: Accuracy₂ predictions of a generalized additive model (GAM) fit to Accuracy₂ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy₂ for Tempo for 2.0

Predictions of GAMs trained on Accuracy₂ for estimates for reference 2.0.

Figure 26: Accuracy₂ predictions of a generalized additive model (GAM) fit to Accuracy₂ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated Accuracy₂ for Tempo for 3.0

Predictions of GAMs trained on Accuracy₂ for estimates for reference 3.0.

Figure 27: Accuracy₂ predictions of a generalized additive model (GAM) fit to Accuracy₂ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₁ and OE₂

OE₁ is defined as the octave error between an estimate E and a reference value R: OE₁(E) = log₂(E/R). This means that the most common errors, by a factor of 2 or ½, have the same magnitude, namely 1.

OE₂ is the signed OE₁ corresponding to the minimum absolute OE₁ when the octave errors 2, 3, 1/2, and 1/3 are allowed: OE₂(E) = arg min_x(|x|) with x ∈ {OE₁(E), OE₁(2E), OE₁(3E), OE₁(½E), OE₁(⅓E)}.
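The two definitions can be illustrated with a short Python sketch. The function names `oe1` and `oe2` are chosen here for illustration and are not taken from the report's tooling.

```python
from math import log2

# Factors tried for OE2: the estimate itself plus the allowed octave errors.
FACTORS = (1, 2, 3, 1 / 2, 1 / 3)


def oe1(estimate, reference):
    """Signed octave error: log2 of the ratio of estimate to reference."""
    return log2(estimate / reference)


def oe2(estimate, reference):
    """Signed OE1 with the smallest magnitude over all allowed factors."""
    candidates = [oe1(factor * estimate, reference) for factor in FACTORS]
    return min(candidates, key=abs)


# An estimate at half the reference tempo is a full octave error under OE1 ...
half_tempo_oe1 = oe1(80.0, 160.0)  # -1.0
# ... but vanishes under OE2, because doubling the estimate hits the reference.
half_tempo_oe2 = oe2(80.0, 160.0)  # 0.0
```

This is why estimators with strongly negative mean OE₁ in the tables below can still show a mean OE₂ close to zero: their errors are mostly by an allowed factor.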

Mean OE₁/OE₂ Results for 1.0

Estimator	OE1_MEAN	OE1_STDEV	OE2_MEAN	OE2_STDEV
boeck2019/multi_task_hjdb	0.0025	0.0106	0.0025	0.0106
boeck2020/dar	-0.0041	0.0107	-0.0041	0.0107
schreiber2018/fcn	-0.0312	0.1793	0.0029	0.0113
schreiber2018/cnn	-0.0525	0.2286	0.0029	0.0111
boeck2015/tempodetector2016_default	-0.0748	0.2705	0.0000	0.0077
schreiber2018/ismir2018	-0.0809	0.2805	0.0042	0.0205
boeck2019/multi_task	-0.1413	0.3502	0.0034	0.0740
davies2009/mirex_qm_tempotracker	-0.3228	0.4197	0.1027	0.1694
schreiber2017/mirex2017	-0.2529	0.4327	0.0024	0.0414
percival2014/stem	-0.7131	0.4507	0.0018	0.0105
schreiber2017/ismir2017	-0.4019	0.4877	0.0024	0.0414
schreiber2014/default	-0.5290	0.4968	0.0029	0.0332

Table 12: Mean OE₁/OE₂ for estimates compared to version 1.0 ordered by standard deviation.

CSV JSON LATEX PICKLE

Raw data OE₁: CSV JSON LATEX PICKLE

Raw data OE₂: CSV JSON LATEX PICKLE

OE₁ distribution for 1.0

Figure 28: OE₁ for estimates compared to version 1.0. Shown are the mean OE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG
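As a rough illustration of how a KDE turns a sample of error values into a smooth distribution, the following Python sketch uses a Gaussian kernel with a normal-reference bandwidth (1.06 · σ · n^(-1/5)). Both the kernel and the bandwidth rule are assumptions; the report does not specify which were used for the figures, and the sample values are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import stdev


def gaussian_kde(samples, bandwidth=None):
    """Return a density function estimated from samples with Gaussian kernels."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not the report's setting.
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)

    def density(x):
        # Average of Gaussian bumps centered on each sample.
        kernel_sum = sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return kernel_sum / (n * bandwidth * sqrt(2 * pi))

    return density


# Hypothetical OE1 sample: most estimates correct, some at half tempo.
sample = [-1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
density = gaussian_kde(sample)
peak_at_zero = density(0.0)
peak_at_minus_one = density(-1.0)
```

With such a sample the density has one mode near 0 and a smaller one near -1, which is the kind of shape to look for when reading the OE₁ distribution figures.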

OE₂ distribution for 1.0

Figure 29: OE₂ for estimates compared to version 1.0. Shown are the mean OE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Mean OE₁/OE₂ Results for 2.0

Estimator	OE1_MEAN	OE1_STDEV	OE2_MEAN	OE2_STDEV
boeck2019/multi_task_hjdb	0.0005	0.0036	0.0005	0.0036
boeck2020/dar	-0.0061	0.0053	-0.0061	0.0053
schreiber2018/fcn	-0.0332	0.1809	0.0008	0.0044
schreiber2018/cnn	-0.0545	0.2290	0.0008	0.0039
boeck2015/tempodetector2016_default	-0.0768	0.2701	-0.0020	0.0105
schreiber2018/ismir2018	-0.0829	0.2800	0.0022	0.0174
boeck2019/multi_task	-0.1433	0.3518	0.0013	0.0729
davies2009/mirex_qm_tempotracker	-0.3248	0.4213	0.1007	0.1696
schreiber2017/mirex2017	-0.2550	0.4325	0.0004	0.0404
percival2014/stem	-0.7152	0.4519	-0.0003	0.0029
schreiber2017/ismir2017	-0.4039	0.4880	0.0004	0.0404
schreiber2014/default	-0.5310	0.4986	0.0009	0.0319

Table 13: Mean OE₁/OE₂ for estimates compared to version 2.0 ordered by standard deviation.

CSV JSON LATEX PICKLE

Raw data OE₁: CSV JSON LATEX PICKLE

Raw data OE₂: CSV JSON LATEX PICKLE

OE₁ distribution for 2.0

Figure 30: OE₁ for estimates compared to version 2.0. Shown are the mean OE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ distribution for 2.0

Figure 31: OE₂ for estimates compared to version 2.0. Shown are the mean OE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Mean OE₁/OE₂ Results for 3.0

Estimator	OE1_MEAN	OE1_STDEV	OE2_MEAN	OE2_STDEV
boeck2020/dar	-0.0039	0.0107	-0.0039	0.0107
boeck2019/multi_task_hjdb	0.0027	0.0107	0.0027	0.0107
schreiber2018/fcn	-0.0310	0.1793	0.0030	0.0113
schreiber2018/cnn	-0.0523	0.2287	0.0030	0.0112
boeck2015/tempodetector2016_default	-0.0746	0.2706	0.0002	0.0075
schreiber2018/ismir2018	-0.0807	0.2805	0.0044	0.0205
boeck2019/multi_task	-0.1412	0.3503	0.0035	0.0740
davies2009/mirex_qm_tempotracker	-0.3226	0.4197	0.1029	0.1695
schreiber2017/mirex2017	-0.2528	0.4330	0.0026	0.0413
percival2014/stem	-0.7130	0.4505	0.0019	0.0106
schreiber2017/ismir2017	-0.4017	0.4878	0.0026	0.0413
schreiber2014/default	-0.5288	0.4968	0.0031	0.0332

Table 14: Mean OE₁/OE₂ for estimates compared to version 3.0 ordered by standard deviation.

CSV JSON LATEX PICKLE

Raw data OE₁: CSV JSON LATEX PICKLE

Raw data OE₂: CSV JSON LATEX PICKLE

OE₁ distribution for 3.0

Figure 32: OE₁ for estimates compared to version 3.0. Shown are the mean OE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ distribution for 3.0

Figure 33: OE₂ for estimates compared to version 3.0. Shown are the mean OE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Significance of Differences

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0269	0.0000	0.0001	0.0000	0.0000	0.0000	0.0000	0.0000	0.3345	0.0375	0.8146
boeck2019/multi_task	0.0269	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0010	0.0012	0.0000	0.0279
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0003	0.0048	0.0000
boeck2020/dar	0.0001	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0014	0.0222	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0263	0.0481	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0015	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.0263	0.0000	0.0015	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0010	0.0000	0.0000	0.0481	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.3345	0.0012	0.0003	0.0014	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.2518	0.1648
schreiber2018/fcn	0.0375	0.0000	0.0048	0.0222	0.0000	0.0000	0.0000	0.0000	0.0000	0.2518	1.0000	0.0088
schreiber2018/ismir2018	0.8146	0.0279	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1648	0.0088	1.0000

Table 15: Paired t-test p-values, using reference annotations 3.0 as groundtruth with OE₁. H₀: the true mean difference between paired samples is zero. If p<=ɑ, reject H₀, i.e. we have a significant difference between estimates from the two algorithms. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0269	0.0000	0.0001	0.0000	0.0000	0.0000	0.0000	0.0000	0.3345	0.0375	0.8146
boeck2019/multi_task	0.0269	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0010	0.0012	0.0000	0.0279
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0003	0.0048	0.0000
boeck2020/dar	0.0001	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0014	0.0222	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0263	0.0481	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0015	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.0263	0.0000	0.0015	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0010	0.0000	0.0000	0.0481	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.3345	0.0012	0.0003	0.0014	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.2518	0.1648
schreiber2018/fcn	0.0375	0.0000	0.0048	0.0222	0.0000	0.0000	0.0000	0.0000	0.0000	0.2518	1.0000	0.0088
schreiber2018/ismir2018	0.8146	0.0279	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1648	0.0088	1.0000

Table 16: Paired t-test p-values, using reference annotations 2.0 as groundtruth with OE₁. H₀: the true mean difference between paired samples is zero. If p<=ɑ, reject H₀, i.e. we have a significant difference between estimates from the two algorithms. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0269	0.0000	0.0001	0.0000	0.0000	0.0000	0.0000	0.0000	0.3345	0.0375	0.8146
boeck2019/multi_task	0.0269	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0010	0.0012	0.0000	0.0279
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0003	0.0048	0.0000
boeck2020/dar	0.0001	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0014	0.0222	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0263	0.0481	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0015	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.0263	0.0000	0.0015	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0010	0.0000	0.0000	0.0481	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.3345	0.0012	0.0003	0.0014	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.2518	0.1648
schreiber2018/fcn	0.0375	0.0000	0.0048	0.0222	0.0000	0.0000	0.0000	0.0000	0.0000	0.2518	1.0000	0.0088
schreiber2018/ismir2018	0.8146	0.0279	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1648	0.0088	1.0000

Table 17: Paired t-test p-values, using reference annotations 1.0 as groundtruth with OE₁. H₀: the true mean difference between paired samples is zero. If p<=ɑ, reject H₀, i.e. we have a significant difference between estimates from the two algorithms. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE
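A paired t-test on OE₁ works on per-track differences between two estimators. Because OE₁(E_a) - OE₁(E_b) = log₂(E_a/R) - log₂(E_b/R) = log₂(E_a/E_b), the reference R cancels, which is consistent with Tables 15 to 17 listing identical p-values. The Python sketch below computes only the t statistic with the standard library; obtaining the p-value additionally requires the t distribution, for example via SciPy's `scipy.stats.ttest_rel`. All tempo values are hypothetical.

```python
from math import log2, sqrt
from statistics import mean, stdev


def oe1(estimate, reference):
    """Signed octave error of an estimate relative to a reference."""
    return log2(estimate / reference)


def paired_t_statistic(xs, ys):
    """t statistic of the paired differences xs[i] - ys[i]."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical tempo estimates of two estimators and two reference versions.
est_a = [150.0, 86.0, 170.0, 160.0]
est_b = [75.0, 172.0, 171.0, 80.0]
ref_1 = [150.0, 172.0, 170.0, 160.0]
ref_2 = [151.0, 171.5, 85.0, 161.0]


def t_for(reference):
    oe_a = [oe1(e, r) for e, r in zip(est_a, reference)]
    oe_b = [oe1(e, r) for e, r in zip(est_b, reference)]
    return paired_t_statistic(oe_a, oe_b)


t_ref_1 = t_for(ref_1)
t_ref_2 = t_for(ref_2)  # equal to t_ref_1 up to floating-point rounding
```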

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.4893	0.0002	0.0000	0.0000	0.0099	0.1902	0.3876	0.3876	0.0001	0.0001	0.0018
boeck2019/multi_task	0.4893	1.0000	0.8610	0.1158	0.0000	0.7339	0.9291	0.8581	0.8581	0.9144	0.9180	0.8576
boeck2019/multi_task_hjdb	0.0002	0.8610	1.0000	0.0000	0.0000	0.0002	0.8617	0.9598	0.9598	0.2095	0.2480	0.1359
boeck2020/dar	0.0000	0.1158	0.0000	1.0000	0.0000	0.0000	0.0014	0.0165	0.0165	0.0000	0.0000	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0099	0.7339	0.0002	0.0000	0.0000	1.0000	0.5856	0.8084	0.8084	0.0000	0.0000	0.0279
schreiber2014/default	0.1902	0.9291	0.8617	0.0014	0.0000	0.5856	1.0000	0.7574	0.7574	0.9824	0.9903	0.5730
schreiber2017/ismir2017	0.3876	0.8581	0.9598	0.0165	0.0000	0.8084	0.7574	1.0000	0.8152	0.8629	0.8581	0.5226
schreiber2017/mirex2017	0.3876	0.8581	0.9598	0.0165	0.0000	0.8084	0.7574	0.8152	1.0000	0.8629	0.8581	0.5226
schreiber2018/cnn	0.0001	0.9144	0.2095	0.0000	0.0000	0.0000	0.9824	0.8629	0.8629	1.0000	0.9452	0.2291
schreiber2018/fcn	0.0001	0.9180	0.2480	0.0000	0.0000	0.0000	0.9903	0.8581	0.8581	0.9452	1.0000	0.2213
schreiber2018/ismir2018	0.0018	0.8576	0.1359	0.0000	0.0000	0.0279	0.5730	0.5226	0.5226	0.2291	0.2213	1.0000

Table 18: Paired t-test p-values, using reference annotations 3.0 as groundtruth with OE₂. H₀: the true mean difference between paired samples is zero. If p<=ɑ, reject H₀, i.e. we have a significant difference between estimates from the two algorithms. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.4893	0.0002	0.0000	0.0000	0.0099	0.1902	0.3876	0.3876	0.0001	0.0001	0.0018
boeck2019/multi_task	0.4893	1.0000	0.8610	0.1158	0.0000	0.7339	0.9291	0.8581	0.8581	0.9144	0.9180	0.8576
boeck2019/multi_task_hjdb	0.0002	0.8610	1.0000	0.0000	0.0000	0.0002	0.8617	0.9598	0.9598	0.2095	0.2480	0.1359
boeck2020/dar	0.0000	0.1158	0.0000	1.0000	0.0000	0.0000	0.0014	0.0165	0.0165	0.0000	0.0000	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0099	0.7339	0.0002	0.0000	0.0000	1.0000	0.5856	0.8084	0.8084	0.0000	0.0000	0.0279
schreiber2014/default	0.1902	0.9291	0.8617	0.0014	0.0000	0.5856	1.0000	0.7574	0.7574	0.9824	0.9903	0.5730
schreiber2017/ismir2017	0.3876	0.8581	0.9598	0.0165	0.0000	0.8084	0.7574	1.0000	0.8152	0.8629	0.8581	0.5226
schreiber2017/mirex2017	0.3876	0.8581	0.9598	0.0165	0.0000	0.8084	0.7574	0.8152	1.0000	0.8629	0.8581	0.5226
schreiber2018/cnn	0.0001	0.9144	0.2095	0.0000	0.0000	0.0000	0.9824	0.8629	0.8629	1.0000	0.9452	0.2291
schreiber2018/fcn	0.0001	0.9180	0.2480	0.0000	0.0000	0.0000	0.9903	0.8581	0.8581	0.9452	1.0000	0.2213
schreiber2018/ismir2018	0.0018	0.8576	0.1359	0.0000	0.0000	0.0279	0.5730	0.5226	0.5226	0.2291	0.2213	1.0000

Table 19: Paired t-test p-values, using reference annotations 2.0 as groundtruth with OE₂. H₀: the true mean difference between paired samples is zero. If p<=ɑ, reject H₀, i.e. we have a significant difference between estimates from the two algorithms. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.4893	0.0002	0.0000	0.0000	0.0099	0.1902	0.3876	0.3876	0.0001	0.0001	0.0018
boeck2019/multi_task	0.4893	1.0000	0.8610	0.1158	0.0000	0.7339	0.9291	0.8581	0.8581	0.9144	0.9180	0.8576
boeck2019/multi_task_hjdb	0.0002	0.8610	1.0000	0.0000	0.0000	0.0002	0.8617	0.9598	0.9598	0.2095	0.2480	0.1359
boeck2020/dar	0.0000	0.1158	0.0000	1.0000	0.0000	0.0000	0.0014	0.0165	0.0165	0.0000	0.0000	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0099	0.7339	0.0002	0.0000	0.0000	1.0000	0.5856	0.8084	0.8084	0.0000	0.0000	0.0279
schreiber2014/default	0.1902	0.9291	0.8617	0.0014	0.0000	0.5856	1.0000	0.7574	0.7574	0.9824	0.9903	0.5730
schreiber2017/ismir2017	0.3876	0.8581	0.9598	0.0165	0.0000	0.8084	0.7574	1.0000	0.8152	0.8629	0.8581	0.5226
schreiber2017/mirex2017	0.3876	0.8581	0.9598	0.0165	0.0000	0.8084	0.7574	0.8152	1.0000	0.8629	0.8581	0.5226
schreiber2018/cnn	0.0001	0.9144	0.2095	0.0000	0.0000	0.0000	0.9824	0.8629	0.8629	1.0000	0.9452	0.2291
schreiber2018/fcn	0.0001	0.9180	0.2480	0.0000	0.0000	0.0000	0.9903	0.8581	0.8581	0.9452	1.0000	0.2213
schreiber2018/ismir2018	0.0018	0.8576	0.1359	0.0000	0.0000	0.0279	0.5730	0.5226	0.5226	0.2291	0.2213	1.0000

Table 20: Paired t-test p-values, using reference annotations 1.0 as groundtruth with OE₂. H₀: the true mean difference between paired samples is zero. If p<=ɑ, reject H₀, i.e. we have a significant difference between estimates from the two algorithms. In the table, p-values<0.05 are set in bold.

CSV JSON LATEX PICKLE

OE₁ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?

OE₁ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 34: Mean OE₁ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₁ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 35: Mean OE₁ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₁ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 36: Mean OE₁ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?

OE₂ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 37: Mean OE₂ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 38: Mean OE₂ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 39: Mean OE₂ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₁ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE₁ for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

OE₁ on Tempo-Subsets for 1.0

Figure 40: Mean OE₁ for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₁ on Tempo-Subsets for 2.0

Figure 41: Mean OE₁ for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₁ on Tempo-Subsets for 3.0

Figure 42: Mean OE₁ for estimates compared to version 3.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE₂ for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

OE₂ on Tempo-Subsets for 1.0

Figure 43: Mean OE₂ for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ on Tempo-Subsets for 2.0

Figure 44: Mean OE₂ for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

OE₂ on Tempo-Subsets for 3.0

Figure 45: Mean OE₂ for estimates compared to version 3.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE₁ for Tempo

When a generalized additive model (GAM) is fit to the OE₁ values as a function of the ground-truth tempo, what OE₁ can we expect, and with what confidence?

Estimated OE₁ for Tempo for 1.0

Predictions of GAMs trained on OE₁ for estimates for reference 1.0.

Figure 46: OE₁ predictions of a generalized additive model (GAM) fit to OE₁ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE₁ for Tempo for 2.0

Predictions of GAMs trained on OE₁ for estimates for reference 2.0.

Figure 47: OE₁ predictions of a generalized additive model (GAM) fit to OE₁ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE₁ for Tempo for 3.0

Predictions of GAMs trained on OE₁ for estimates for reference 3.0.

Figure 48: OE₁ predictions of a generalized additive model (GAM) fit to OE₁ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE₂ for Tempo

When a generalized additive model (GAM) is fit to the OE₂ values as a function of the ground-truth tempo, what OE₂ can we expect, and with what confidence?

Estimated OE₂ for Tempo for 1.0

Predictions of GAMs trained on OE₂ for estimates for reference 1.0.

Figure 49: OE₂ predictions of a generalized additive model (GAM) fit to OE₂ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE₂ for Tempo for 2.0

Predictions of GAMs trained on OE₂ for estimates for reference 2.0.

Figure 50: OE₂ predictions of a generalized additive model (GAM) fit to OE₂ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated OE₂ for Tempo for 3.0

Predictions of GAMs trained on OE₂ for estimates for reference 3.0.

Figure 51: OE₂ predictions of a generalized additive model (GAM) fit to OE₂ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₁ and AOE₂

AOE₁ is defined as the absolute octave error between an estimate E and a reference value R: AOE₁(E) = |log₂(E/R)|.

AOE₂ is the minimum of AOE₁ allowing the octave errors 2, 3, 1/2, and 1/3: AOE₂(E) = min(AOE₁(E), AOE₁(2E), AOE₁(3E), AOE₁(½E), AOE₁(⅓E)).
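The following Python sketch shows how per-estimator summary values like those in the tables below could be derived from these definitions. The function names and sample tempi are hypothetical, and the use of the population standard deviation is an assumption; the report does not say which variant it uses.

```python
from math import log2
from statistics import mean, pstdev

# The estimate itself plus the allowed octave errors.
FACTORS = (1, 2, 3, 1 / 2, 1 / 3)


def aoe1(estimate, reference):
    """Absolute octave error."""
    return abs(log2(estimate / reference))


def aoe2(estimate, reference):
    """Smallest absolute octave error over all allowed factors."""
    return min(aoe1(factor * estimate, reference) for factor in FACTORS)


def summarize(estimates, references):
    """Mean and (population) standard deviation of AOE1 and AOE2."""
    a1 = [aoe1(e, r) for e, r in zip(estimates, references)]
    a2 = [aoe2(e, r) for e, r in zip(estimates, references)]
    return {
        "AOE1_MEAN": mean(a1), "AOE1_STDEV": pstdev(a1),
        "AOE2_MEAN": mean(a2), "AOE2_STDEV": pstdev(a2),
    }


# Hypothetical estimator that reports half the tempo for every second track.
summary = summarize([160.0, 80.0, 160.0, 80.0], [160.0] * 4)
```

In this example the AOE₁ mean is 0.5 while the AOE₂ mean is 0, mirroring the pattern seen for estimators with frequent octave errors.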

Mean AOE₁/AOE₂ Results for 1.0

Estimator	AOE1_MEAN	AOE1_STDEV	AOE2_MEAN	AOE2_STDEV
boeck2020/dar	0.0087	0.0074	0.0087	0.0074
boeck2019/multi_task_hjdb	0.0093	0.0057	0.0093	0.0057
schreiber2018/fcn	0.0425	0.1769	0.0096	0.0066
schreiber2018/cnn	0.0643	0.2256	0.0096	0.0063
boeck2015/tempodetector2016_default	0.0762	0.2702	0.0025	0.0073
schreiber2018/ismir2018	0.0944	0.2762	0.0103	0.0182
boeck2019/multi_task	0.1616	0.3413	0.0259	0.0695
schreiber2017/mirex2017	0.2607	0.4281	0.0136	0.0392
davies2009/mirex_qm_tempotracker	0.3488	0.3984	0.1079	0.1661
schreiber2017/ismir2017	0.4076	0.4829	0.0136	0.0392
schreiber2014/default	0.5329	0.4926	0.0129	0.0308
percival2014/stem	0.7157	0.4466	0.0091	0.0056

Table 21: Mean AOE₁/AOE₂ for estimates compared to version 1.0 ordered by mean.

CSV JSON LATEX PICKLE

Raw data AOE₁: CSV JSON LATEX PICKLE

Raw data AOE₂: CSV JSON LATEX PICKLE

AOE₁ distribution for 1.0

Figure 52: AOE₁ for estimates compared to version 1.0. Shown are the mean AOE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ distribution for 1.0

Figure 53: AOE₂ for estimates compared to version 1.0. Shown are the mean AOE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Mean AOE₁/AOE₂ Results for 2.0

Estimator	AOE1_MEAN	AOE1_STDEV	AOE2_MEAN	AOE2_STDEV
boeck2019/multi_task_hjdb	0.0028	0.0023	0.0028	0.0023
boeck2020/dar	0.0067	0.0045	0.0067	0.0045
schreiber2018/fcn	0.0369	0.1801	0.0032	0.0032
schreiber2018/cnn	0.0579	0.2282	0.0027	0.0029
boeck2015/tempodetector2016_default	0.0831	0.2683	0.0083	0.0067
schreiber2018/ismir2018	0.0897	0.2779	0.0051	0.0168
boeck2019/multi_task	0.1593	0.3449	0.0221	0.0695
schreiber2017/mirex2017	0.2562	0.4318	0.0068	0.0398
davies2009/mirex_qm_tempotracker	0.3499	0.4007	0.1058	0.1665
schreiber2017/ismir2017	0.4049	0.4871	0.0068	0.0398
schreiber2014/default	0.5319	0.4977	0.0062	0.0312
percival2014/stem	0.7160	0.4506	0.0023	0.0018

Table 22: Mean AOE₁/AOE₂ for estimates compared to version 2.0 ordered by mean.

CSV JSON LATEX PICKLE

Raw data AOE₁: CSV JSON LATEX PICKLE

Raw data AOE₂: CSV JSON LATEX PICKLE

AOE₁ distribution for 2.0

Figure 54: AOE₁ for estimates compared to version 2.0. Shown are the mean AOE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ distribution for 2.0

Figure 55: AOE₂ for estimates compared to version 2.0. Shown are the mean AOE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Mean AOE₁/AOE₂ Results for 3.0

Estimator	AOE1_MEAN	AOE1_STDEV	AOE2_MEAN	AOE2_STDEV
boeck2020/dar	0.0087	0.0074	0.0087	0.0074
boeck2019/multi_task_hjdb	0.0095	0.0057	0.0095	0.0057
schreiber2018/fcn	0.0425	0.1769	0.0096	0.0068
schreiber2018/cnn	0.0644	0.2256	0.0097	0.0063
boeck2015/tempodetector2016_default	0.0760	0.2702	0.0023	0.0071
schreiber2018/ismir2018	0.0945	0.2761	0.0105	0.0182
boeck2019/multi_task	0.1616	0.3413	0.0259	0.0695
schreiber2017/mirex2017	0.2608	0.4282	0.0137	0.0391
davies2009/mirex_qm_tempotracker	0.3488	0.3982	0.1081	0.1662
schreiber2017/ismir2017	0.4077	0.4828	0.0137	0.0391
schreiber2014/default	0.5329	0.4926	0.0130	0.0308
percival2014/stem	0.7155	0.4465	0.0092	0.0056

Table 23: Mean AOE₁/AOE₂ for estimates compared to version 3.0, ordered by mean AOE₁.

CSV JSON LATEX PICKLE

Raw data AOE₁: CSV JSON LATEX PICKLE

Raw data AOE₂: CSV JSON LATEX PICKLE

AOE₁ distribution for 3.0

Figure 56: AOE₁ for estimates compared to version 3.0. Shown are the mean AOE₁ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ distribution for 3.0

Figure 57: AOE₂ for estimates compared to version 3.0. Shown are the mean AOE₂ and an empirical distribution of the sample, using kernel density estimation (KDE).

CSV JSON LATEX PICKLE SVG PDF PNG

Significance of Differences

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0040	0.0002	0.0002	0.0000	0.0000	0.0000	0.0000	0.0000	0.6132	0.1082	0.4682
boeck2019/multi_task	0.0040	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0029	0.0003	0.0000	0.0130
boeck2019/multi_task_hjdb	0.0002	0.0000	1.0000	0.0941	0.0000	0.0000	0.0000	0.0000	0.0000	0.0002	0.0044	0.0000
boeck2020/dar	0.0002	0.0000	0.0941	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0002	0.0039	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0876	0.0103	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0015	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.0876	0.0000	0.0015	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0029	0.0000	0.0000	0.0103	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.6132	0.0003	0.0002	0.0002	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.2313	0.1356
schreiber2018/fcn	0.1082	0.0000	0.0044	0.0039	0.0000	0.0000	0.0000	0.0000	0.0000	0.2313	1.0000	0.0054
schreiber2018/ismir2018	0.4682	0.0130	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1356	0.0054	1.0000

Table 24: Paired t-test p-values, using reference annotations 3.0 as ground truth with AOE₁. H₀: the true mean difference between paired samples is zero. If p ≤ α, reject H₀, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE
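The p-values in these tables come from paired t-tests on per-track error values. The sketch below shows the computation for one pair of estimators using only the standard library: it forms the per-track differences, computes the t statistic, and obtains the two-sided p-value by integrating the Student t density numerically. The helper names and the sample values are hypothetical, and in practice a library routine such as SciPy's `ttest_rel` would be the usual choice.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def paired_t_test(a, b, steps=20000):
    """Return (t statistic, two-sided p-value) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.fmean(diffs) / (sd / math.sqrt(n))
    df = n - 1
    # Simpson's rule for the integral of the density from 0 to |t|
    # (steps must be even).
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    p = max(0.0, 1.0 - 2.0 * central)
    return t, p


# Hypothetical per-track AOE1 values for two estimators on the same tracks.
errors_a = [1.0, 2.0, 3.0, 4.0, 5.0]
errors_b = [0.0, 0.0, 0.0, 0.0, 0.0]
t_stat, p_value = paired_t_test(errors_a, errors_b)
```

Because the test is paired, both lists must refer to the same tracks in the same order; the test is symmetric in its arguments, which is why each table is symmetric about its diagonal.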

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0106	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.2739	0.0273	0.7960
boeck2019/multi_task	0.0106	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0040	0.0002	0.0000	0.0109
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0003	0.0041	0.0000
boeck2020/dar	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0007	0.0105	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.1146	0.0068	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0015	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.1146	0.0000	0.0015	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0040	0.0000	0.0000	0.0068	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.2739	0.0002	0.0003	0.0007	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.2589	0.1171
schreiber2018/fcn	0.0273	0.0000	0.0041	0.0105	0.0000	0.0000	0.0000	0.0000	0.0000	0.2589	1.0000	0.0052
schreiber2018/ismir2018	0.7960	0.0109	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1171	0.0052	1.0000

Table 25: Paired t-test p-values, using reference annotations 2.0 as ground truth with AOE₁. H₀: the true mean difference between paired samples is zero. If p ≤ α, reject H₀, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0041	0.0002	0.0002	0.0000	0.0000	0.0000	0.0000	0.0000	0.6037	0.1060	0.4762
boeck2019/multi_task	0.0041	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0030	0.0003	0.0000	0.0129
boeck2019/multi_task_hjdb	0.0002	0.0000	1.0000	0.2211	0.0000	0.0000	0.0000	0.0000	0.0000	0.0002	0.0042	0.0000
boeck2020/dar	0.0002	0.0000	0.2211	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0002	0.0040	0.0000
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0882	0.0101	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
schreiber2014/default	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.0015	0.0000	0.0000	0.0000	0.0000
schreiber2017/ismir2017	0.0000	0.0000	0.0000	0.0000	0.0882	0.0000	0.0015	1.0000	0.0000	0.0000	0.0000	0.0000
schreiber2017/mirex2017	0.0000	0.0030	0.0000	0.0000	0.0101	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000
schreiber2018/cnn	0.6037	0.0003	0.0002	0.0002	0.0000	0.0000	0.0000	0.0000	0.0000	1.0000	0.2336	0.1360
schreiber2018/fcn	0.1060	0.0000	0.0042	0.0040	0.0000	0.0000	0.0000	0.0000	0.0000	0.2336	1.0000	0.0055
schreiber2018/ismir2018	0.4762	0.0129	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.1360	0.0055	1.0000

Table 26: Paired t-test p-values, using reference annotations 1.0 as ground truth with AOE₁. H₀: the true mean difference between paired samples is zero. If p ≤ α, reject H₀, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
boeck2019/multi_task	0.0000	1.0000	0.0004	0.0002	0.0000	0.0003	0.0106	0.0212	0.0212	0.0005	0.0004	0.0012
boeck2019/multi_task_hjdb	0.0000	0.0004	1.0000	0.0941	0.0000	0.0999	0.0821	0.1018	0.1018	0.3790	0.7426	0.3823
boeck2020/dar	0.0000	0.0002	0.0941	1.0000	0.0000	0.2577	0.0309	0.0499	0.0499	0.0660	0.1183	0.1492
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0003	0.0999	0.2577	0.0000	1.0000	0.0608	0.0786	0.0786	0.0434	0.1318	0.2525
schreiber2014/default	0.0000	0.0106	0.0821	0.0309	0.0000	0.0608	1.0000	0.6556	0.6556	0.1070	0.0929	0.2851
schreiber2017/ismir2017	0.0000	0.0212	0.1018	0.0499	0.0000	0.0786	0.6556	1.0000	0.3734	0.1213	0.1076	0.2516
schreiber2017/mirex2017	0.0000	0.0212	0.1018	0.0499	0.0000	0.0786	0.6556	0.3734	1.0000	0.1213	0.1076	0.2516
schreiber2018/cnn	0.0000	0.0005	0.3790	0.0660	0.0000	0.0434	0.1070	0.1213	0.1213	1.0000	0.6820	0.4906
schreiber2018/fcn	0.0000	0.0004	0.7426	0.1183	0.0000	0.1318	0.0929	0.1076	0.1076	0.6820	1.0000	0.4153
schreiber2018/ismir2018	0.0000	0.0012	0.3823	0.1492	0.0000	0.2525	0.2851	0.2516	0.2516	0.4906	0.4153	1.0000

Table 27: Paired t-test p-values, using reference annotations 3.0 as ground truth with AOE₂. H₀: the true mean difference between paired samples is zero. If p ≤ α, reject H₀, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0030	0.0000	0.0011	0.0000	0.0000	0.3307	0.5766	0.5766	0.0000	0.0000	0.0068
boeck2019/multi_task	0.0030	1.0000	0.0000	0.0009	0.0000	0.0000	0.0017	0.0039	0.0039	0.0000	0.0000	0.0003
boeck2019/multi_task_hjdb	0.0000	0.0000	1.0000	0.0000	0.0000	0.0050	0.0896	0.1191	0.1191	0.7010	0.1546	0.0395
boeck2020/dar	0.0011	0.0009	0.0000	1.0000	0.0000	0.0000	0.8203	0.9608	0.9608	0.0000	0.0000	0.1690
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0000	0.0050	0.0000	0.0000	1.0000	0.0557	0.0853	0.0853	0.0727	0.0004	0.0127
schreiber2014/default	0.3307	0.0017	0.0896	0.8203	0.0000	0.0557	1.0000	0.7220	0.7220	0.0864	0.1300	0.6372
schreiber2017/ismir2017	0.5766	0.0039	0.1191	0.9608	0.0000	0.0853	0.7220	1.0000	0.9513	0.1164	0.1590	0.5531
schreiber2017/mirex2017	0.5766	0.0039	0.1191	0.9608	0.0000	0.0853	0.7220	0.9513	1.0000	0.1164	0.1590	0.5531
schreiber2018/cnn	0.0000	0.0000	0.7010	0.0000	0.0000	0.0727	0.0864	0.1164	0.1164	1.0000	0.0388	0.0314
schreiber2018/fcn	0.0000	0.0000	0.1546	0.0000	0.0000	0.0004	0.1300	0.1590	0.1590	0.0388	1.0000	0.0715
schreiber2018/ismir2018	0.0068	0.0003	0.0395	0.1690	0.0000	0.0127	0.6372	0.5531	0.5531	0.0314	0.0715	1.0000

Table 28: Paired t-test p-values, using reference annotations 2.0 as ground truth with AOE₂. H₀: the true mean difference between paired samples is zero. If p ≤ α, reject H₀, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

Estimator	boeck2015/tempodetector2016_default	boeck2019/multi_task	boeck2019/multi_task_hjdb	boeck2020/dar	davies2009/mirex_qm_tempotracker	percival2014/stem	schreiber2014/default	schreiber2017/ismir2017	schreiber2017/mirex2017	schreiber2018/cnn	schreiber2018/fcn	schreiber2018/ismir2018
boeck2015/tempodetector2016_default	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
boeck2019/multi_task	0.0000	1.0000	0.0003	0.0003	0.0000	0.0003	0.0101	0.0202	0.0202	0.0004	0.0004	0.0011
boeck2019/multi_task_hjdb	0.0000	0.0003	1.0000	0.2211	0.0000	0.2051	0.0753	0.0959	0.0959	0.2228	0.3526	0.3654
boeck2020/dar	0.0000	0.0003	0.2211	1.0000	0.0000	0.4141	0.0367	0.0565	0.0565	0.1113	0.1343	0.1938
davies2009/mirex_qm_tempotracker	0.0000	0.0000	0.0000	0.0000	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
percival2014/stem	0.0000	0.0003	0.2051	0.4141	0.0000	1.0000	0.0603	0.0792	0.0792	0.0379	0.0591	0.2658
schreiber2014/default	0.0000	0.0101	0.0753	0.0367	0.0000	0.0603	1.0000	0.6558	0.6558	0.1070	0.1014	0.2765
schreiber2017/ismir2017	0.0000	0.0202	0.0959	0.0565	0.0000	0.0792	0.6558	1.0000	0.3734	0.1227	0.1164	0.2466
schreiber2017/mirex2017	0.0000	0.0202	0.0959	0.0565	0.0000	0.0792	0.6558	0.3734	1.0000	0.1227	0.1164	0.2466
schreiber2018/cnn	0.0000	0.0004	0.2228	0.1113	0.0000	0.0379	0.1070	0.1227	0.1227	1.0000	0.9102	0.5143
schreiber2018/fcn	0.0000	0.0004	0.3526	0.1343	0.0000	0.0591	0.1014	0.1164	0.1164	0.9102	1.0000	0.4837
schreiber2018/ismir2018	0.0000	0.0011	0.3654	0.1938	0.0000	0.2658	0.2765	0.2466	0.2466	0.5143	0.4837	1.0000

Table 29: Paired t-test p-values, using reference annotations 1.0 as ground truth with AOE₂. H₀: the true mean difference between paired samples is zero. If p ≤ α, reject H₀, i.e., the estimates from the two algorithms differ significantly. In the table, p-values < 0.05 are set in bold.

CSV JSON LATEX PICKLE

AOE₁ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?
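As a sketch of the idea, the following example computes c_var for a track and averages an error measure over the tracks below a threshold τ. It assumes that c_var is the coefficient of variation of the inter-beat intervals (standard deviation divided by mean); the function names and the beat times are hypothetical.

```python
import statistics


def c_var(beat_times):
    """Coefficient of variation of the inter-beat intervals (assumed definition)."""
    ibis = [b - a for a, b in zip(beat_times, beat_times[1:])]
    if len(ibis) < 2:
        raise ValueError("need at least three beats")
    return statistics.pstdev(ibis) / statistics.fmean(ibis)


def mean_error_on_stable_subset(errors, c_vars, tau):
    """Mean error over tracks with c_var < tau, or None if no track qualifies."""
    selected = [e for e, c in zip(errors, c_vars) if c < tau]
    return statistics.fmean(selected) if selected else None


# Hypothetical beat annotations in seconds for two tracks.
steady = [0.0, 0.5, 1.0, 1.5, 2.0]
wobbly = [0.0, 0.4, 1.0, 1.4, 2.0]
c_vars = [c_var(steady), c_var(wobbly)]

# Hypothetical AOE1 values for the same two tracks.
errors = [0.01, 0.90]
stable_mean = mean_error_on_stable_subset(errors, c_vars, tau=0.1)
```

Sweeping τ from small to large values and plotting the resulting means yields a curve of the kind shown in the figures of this section.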

AOE₁ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 58: Mean AOE₁ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₁ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 59: Mean AOE₁ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₁ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 60: Mean AOE₁ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 3.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ on c_var-Subsets

How well does an estimator perform when only tracks with a c_var value below τ, i.e., tracks with a more or less stable beat, are taken into account?

AOE₂ on c_var-Subsets for 1.0 based on c_var-Values from 1.0

Figure 61: Mean AOE₂ compared to version 1.0 for tracks with c_var < τ based on beat annotations from 1.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ on c_var-Subsets for 2.0 based on c_var-Values from 1.0

Figure 62: Mean AOE₂ compared to version 2.0 for tracks with c_var < τ based on beat annotations from 2.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ on c_var-Subsets for 3.0 based on c_var-Values from 1.0

Figure 63: Mean AOE₂ compared to version 3.0 for tracks with c_var < τ based on beat annotations from 3.0.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₁ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean AOE₁ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
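The windowed mean described here can be sketched as follows. The function returns the subset size along with the mean, since a mean over very few tracks should be read with caution. All names and values are hypothetical.

```python
import statistics


def mean_error_around(tempo, reference_tempi, errors, half_width=10.0):
    """Mean error over tracks whose reference tempo lies in [T-10, T+10] BPM.

    Returns (mean, count); the mean is None when the window is empty.
    """
    selected = [
        e
        for r, e in zip(reference_tempi, errors)
        if tempo - half_width <= r <= tempo + half_width
    ]
    mean = statistics.fmean(selected) if selected else None
    return mean, len(selected)


# Hypothetical reference tempi (BPM) and AOE1 values for four tracks.
reference_tempi = [120.0, 150.0, 155.0, 171.0]
errors = [1.0, 0.02, 0.04, 0.01]

mean_155, count_155 = mean_error_around(155.0, reference_tempi, errors)
mean_120, count_120 = mean_error_around(120.0, reference_tempi, errors)
```

In this toy data the window around 120 BPM contains a single track, so its mean is just that track's error.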

AOE₁ on Tempo-Subsets for 1.0

Figure 64: Mean AOE₁ for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₁ on Tempo-Subsets for 2.0

Figure 65: Mean AOE₁ for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₁ on Tempo-Subsets for 3.0

Figure 66: Mean AOE₁ for estimates compared to version 3.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ on Tempo-Subsets

How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean AOE₂ for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.

AOE₂ on Tempo-Subsets for 1.0

Figure 67: Mean AOE₂ for estimates compared to version 1.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ on Tempo-Subsets for 2.0

Figure 68: Mean AOE₂ for estimates compared to version 2.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

AOE₂ on Tempo-Subsets for 3.0

Figure 69: Mean AOE₂ for estimates compared to version 3.0 for tempo intervals around T.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE₁ for Tempo

When a generalized additive model (GAM) is fit to AOE₁ values as a function of the ground-truth tempo, what AOE₁ can we expect, and with what confidence?

Estimated AOE₁ for Tempo for 1.0

Predictions of GAMs trained on AOE₁ for estimates for reference 1.0.

Figure 70: AOE₁ predictions of a generalized additive model (GAM) fit to AOE₁ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE₁ for Tempo for 2.0

Predictions of GAMs trained on AOE₁ for estimates for reference 2.0.

Figure 71: AOE₁ predictions of a generalized additive model (GAM) fit to AOE₁ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE₁ for Tempo for 3.0

Predictions of GAMs trained on AOE₁ for estimates for reference 3.0.

Figure 72: AOE₁ predictions of a generalized additive model (GAM) fit to AOE₁ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE₂ for Tempo

When a generalized additive model (GAM) is fit to AOE₂ values as a function of the ground-truth tempo, what AOE₂ can we expect, and with what confidence?

Estimated AOE₂ for Tempo for 1.0

Predictions of GAMs trained on AOE₂ for estimates for reference 1.0.

Figure 73: AOE₂ predictions of a generalized additive model (GAM) fit to AOE₂ results for 1.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE₂ for Tempo for 2.0

Predictions of GAMs trained on AOE₂ for estimates for reference 2.0.

Figure 74: AOE₂ predictions of a generalized additive model (GAM) fit to AOE₂ results for 2.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Estimated AOE₂ for Tempo for 3.0

Predictions of GAMs trained on AOE₂ for estimates for reference 3.0.

Figure 75: AOE₂ predictions of a generalized additive model (GAM) fit to AOE₂ results for 3.0. The 95% confidence interval around the prediction is shaded in gray.

CSV JSON LATEX PICKLE SVG PDF PNG

Generated by tempo_eval 0.1.1 on 2022-06-29 18:46. Size L.
