fma_small
This is the tempo_eval report for the ‘fma_small’ corpus.
Reports for other corpora may be found here.
Table of Contents
- References for ‘fma_small’
- Estimates for ‘fma_small’
- Estimators
- Basic Statistics
- Smoothed Tempo Distribution
- Accuracy
- Accuracy Results for schreiber2018/ismir2018
- Accuracy1 for schreiber2018/ismir2018
- Accuracy2 for schreiber2018/ismir2018
- Differing Items
- Significance of Differences
- Accuracy1 on Tempo-Subsets
- Accuracy2 on Tempo-Subsets
- Estimated Accuracy1 for Tempo
- Estimated Accuracy2 for Tempo
- Accuracy1 for ‘tag_fma_genre’ Tags
- Accuracy2 for ‘tag_fma_genre’ Tags
- MIREX-Style Evaluation
- MIREX Results for schreiber2018/ismir2018
- P-Score for schreiber2018/ismir2018
- One Correct for schreiber2018/ismir2018
- Both Correct for schreiber2018/ismir2018
- P-Score on Tempo-Subsets
- One Correct on Tempo-Subsets
- Both Correct on Tempo-Subsets
- Estimated P-Score for Tempo
- Estimated One Correct for Tempo
- Estimated Both Correct for Tempo
- P-Score for ‘tag_fma_genre’ Tags
- One Correct for ‘tag_fma_genre’ Tags
- Both Correct for ‘tag_fma_genre’ Tags
- OE1 and OE2
- Mean OE1/OE2 Results for schreiber2018/ismir2018
- OE1 distribution for schreiber2018/ismir2018
- OE2 distribution for schreiber2018/ismir2018
- Significance of Differences
- OE1 on Tempo-Subsets
- OE2 on Tempo-Subsets
- Estimated OE1 for Tempo
- Estimated OE2 for Tempo
- OE1 for ‘tag_fma_genre’ Tags
- OE2 for ‘tag_fma_genre’ Tags
- AOE1 and AOE2
- Mean AOE1/AOE2 Results for schreiber2018/ismir2018
- AOE1 distribution for schreiber2018/ismir2018
- AOE2 distribution for schreiber2018/ismir2018
- Significance of Differences
- AOE1 on Tempo-Subsets
- AOE2 on Tempo-Subsets
- Estimated AOE1 for Tempo
- Estimated AOE2 for Tempo
- AOE1 for ‘tag_fma_genre’ Tags
- AOE2 for ‘tag_fma_genre’ Tags
Because reference annotations are not available, we treat the estimate schreiber2018/ismir2018 as the reference: among all estimators it has the highest Mean Mutual Agreement (MMA), based on Accuracy1 with a 4% tolerance.
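The MMA selection step can be sketched as follows. This is an illustrative reconstruction, not tempo_eval's actual implementation; the `agree` function (pairwise Accuracy1-style agreement with 4% tolerance) is an assumption based on the description above.

```python
def agree(e, r, tolerance=0.04):
    # Accuracy1-style agreement: within 4% of the other estimator's value (assumed criterion)
    return abs(e - r) <= tolerance * r

def mean_mutual_agreement(estimates):
    """For each estimator, the mean agreement of its estimates with those of
    every other estimator. The estimator with the highest MMA is then used
    as the pseudo-reference."""
    names = list(estimates)
    mma = {}
    for a in names:
        scores = [
            sum(agree(x, y) for x, y in zip(estimates[a], estimates[b])) / len(estimates[a])
            for b in names if b != a
        ]
        mma[a] = sum(scores) / len(scores)
    return mma

# toy example: 'x' and 'y' agree with each other, 'z' is an outlier
mma = mean_mutual_agreement({
    "x": [100.0, 120.0],
    "y": [100.0, 121.0],
    "z": [140.0, 90.0],
})
```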
References for ‘fma_small’
References
schreiber2018/ismir2018
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 0.0.2 |
Data Source | Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018. |
Annotation Tools | schreiber tempo-cnn (model=ismir2018), https://github.com/hendriks73/tempo-cnn |
Basic Statistics
Reference | Size | Min | Max | Avg | Stdev | Sweet Oct. Start | Sweet Oct. Coverage |
---|---|---|---|---|---|---|---|
schreiber2018/ismir2018 | 7997 | 48.00 | 232.00 | 111.23 | 27.28 | 77.00 | 0.87 |
Smoothed Tempo Distribution
Figure 1: Percentage of values in tempo interval.
CSV JSON LATEX PICKLE SVG PDF PNG
Tag Distribution for ‘tag_fma_genre’
Figure 2: Percentage of tracks tagged with tags from namespace ‘tag_fma_genre’. Annotations are from reference 1.0.0.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimates for ‘fma_small’
Estimators
boeck2015/tempodetector2016_default
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 0.17.dev0 |
Annotation Tools | TempoDetector.2016, madmom, https://github.com/CPJKU/madmom |
Annotator, bibtex | Boeck2015 |
davies2009/mirex_qm_tempotracker
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 1.0 |
Annotation Tools | QM Tempotracker, Sonic Annotator plugin. https://code.soundsoftware.ac.uk/projects/mirex2013/repository/show/audio_tempo_estimation/qm-tempotracker Note that the current macOS build of ‘qm-vamp-plugins’ was used. |
Annotator, bibtex | Davies2009, Davies2007 |
percival2014/stem
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 1.0 |
Annotation Tools | percival 2014, ‘tempo’ implementation from Marsyas, http://marsyas.info, git checkout tempo-stem |
Annotator, bibtex | Percival2014 |
schreiber2014/default
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 0.0.1 |
Annotation Tools | schreiber 2014, http://www.tagtraum.com/tempo_estimation.html |
Annotator, bibtex | Schreiber2014 |
schreiber2017/ismir2017
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 0.0.4 |
Annotation Tools | schreiber 2017, model=ismir2017, http://www.tagtraum.com/tempo_estimation.html |
Annotator, bibtex | Schreiber2017 |
schreiber2017/mirex2017
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 0.0.4 |
Annotation Tools | schreiber 2017, model=mirex2017, http://www.tagtraum.com/tempo_estimation.html |
Annotator, bibtex | Schreiber2017 |
schreiber2018/cnn
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 0.0.2 |
Data Source | Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018. |
Annotation Tools | schreiber tempo-cnn (model=cnn), https://github.com/hendriks73/tempo-cnn |
schreiber2018/fcn
Attribute | Value |
---|---|
Corpus | fma_small |
Version | 0.0.2 |
Data Source | Hendrik Schreiber, Meinard Müller. A Single-Step Approach to Musical Tempo Estimation Using a Convolutional Neural Network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), Paris, France, Sept. 2018. |
Annotation Tools | schreiber tempo-cnn (model=fcn), https://github.com/hendriks73/tempo-cnn |
Basic Statistics
Estimator | Size | Min | Max | Avg | Stdev | Sweet Oct. Start | Sweet Oct. Coverage |
---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 7997 | 40.00 | 240.00 | 118.87 | 39.52 | 79.00 | 0.71 |
davies2009/mirex_qm_tempotracker | 7970 | 60.09 | 246.09 | 123.15 | 28.40 | 84.00 | 0.89 |
percival2014/stem | 7997 | 50.17 | inf | inf | nan | 71.00 | 0.87 |
schreiber2014/default | 7994 | 34.99 | 177.59 | 102.26 | 23.82 | 71.00 | 0.86 |
schreiber2017/ismir2017 | 7996 | 20.03 | 208.10 | 103.29 | 25.42 | 72.00 | 0.85 |
schreiber2017/mirex2017 | 7996 | 10.01 | 200.14 | 105.00 | 29.66 | 73.00 | 0.79 |
schreiber2018/cnn | 7997 | 41.00 | 232.00 | 114.44 | 31.45 | 77.00 | 0.80 |
schreiber2018/fcn | 7997 | 35.00 | 232.00 | 110.99 | 32.01 | 76.00 | 0.78 |
Smoothed Tempo Distribution
Figure 3: Percentage of values in tempo interval.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy
Accuracy1 is defined as the percentage of correct estimates, allowing a 4% tolerance for individual BPM values.
Accuracy2 additionally permits estimates to be wrong by a factor of 2, 3, 1/2 or 1/3 (so-called octave errors).
See [Gouyon2006].
Note: When comparing accuracy values for different algorithms, keep in mind that an algorithm may have been trained on the test set or that the test set may have even been created using one of the tested algorithms.
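The two metrics can be sketched as follows; this is a minimal reimplementation of the definitions above, not tempo_eval's own code:

```python
def accuracy1(estimate, reference, tolerance=0.04):
    """Correct if the estimated BPM is within the tolerance (default 4%) of the reference BPM."""
    return abs(estimate - reference) <= tolerance * reference

def accuracy2(estimate, reference, tolerance=0.04):
    """Like Accuracy1, but octave errors (factors 2, 3, 1/2, 1/3) also count as correct."""
    factors = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)
    return any(accuracy1(estimate * f, reference, tolerance) for f in factors)
```

The reported scores are then simply the means of these per-track booleans over the corpus.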
Accuracy Results for schreiber2018/ismir2018
Estimator | Accuracy1 | Accuracy2 |
---|---|---|
schreiber2018/cnn | 0.7757 | 0.8798 |
schreiber2018/fcn | 0.7389 | 0.8812 |
schreiber2017/ismir2017 | 0.7219 | 0.8614 |
schreiber2017/mirex2017 | 0.7194 | 0.8637 |
schreiber2014/default | 0.7086 | 0.8637 |
boeck2015/tempodetector2016_default | 0.6913 | 0.8727 |
percival2014/stem | 0.6860 | 0.8442 |
davies2009/mirex_qm_tempotracker | 0.6541 | 0.7953 |
Table 3: Mean accuracy of estimates compared to version schreiber2018/ismir2018 with 4% tolerance, ordered by Accuracy1.
Raw data Accuracy1: CSV JSON LATEX PICKLE
Raw data Accuracy2: CSV JSON LATEX PICKLE
Accuracy1 for schreiber2018/ismir2018
Figure 4: Mean Accuracy1 for estimates compared to version schreiber2018/ismir2018 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy2 for schreiber2018/ismir2018
Figure 5: Mean Accuracy2 for estimates compared to version schreiber2018/ismir2018 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
Differing Items
For which items did a given estimator not estimate a correct value with respect to a given ground truth? Are there items which are either very difficult, not suitable for the task, or incorrectly annotated and therefore never estimated correctly, regardless of which estimator is used?
Differing Items Accuracy1
Items with different tempo annotations (Accuracy1, 4% tolerance) in different versions:
schreiber2018/ismir2018 compared with boeck2015/tempodetector2016_default (2469 differences): ‘000/000140’ ‘000/000148’ ‘000/000193’ ‘000/000197’ ‘000/000200’ ‘000/000207’ ‘000/000210’ ‘000/000256’ ‘000/000424’ ‘000/000534’ ‘000/000540’ … CSV
schreiber2018/ismir2018 compared with davies2009/mirex_qm_tempotracker (2766 differences): ‘000/000148’ ‘000/000194’ ‘000/000197’ ‘000/000200’ ‘000/000207’ ‘000/000210’ ‘000/000256’ ‘000/000424’ ‘000/000534’ ‘000/000540’ ‘000/000602’ … CSV
schreiber2018/ismir2018 compared with percival2014/stem (2511 differences): ‘000/000141’ ‘000/000194’ ‘000/000197’ ‘000/000200’ ‘000/000204’ ‘000/000210’ ‘000/000213’ ‘000/000256’ ‘000/000424’ ‘000/000534’ ‘000/000540’ … CSV
schreiber2018/ismir2018 compared with schreiber2014/default (2330 differences): ‘000/000140’ ‘000/000148’ ‘000/000182’ ‘000/000200’ ‘000/000540’ ‘000/000574’ ‘000/000615’ ‘000/000621’ ‘000/000676’ ‘000/000714’ ‘000/000715’ … CSV
schreiber2018/ismir2018 compared with schreiber2017/ismir2017 (2224 differences): ‘000/000141’ ‘000/000148’ ‘000/000194’ ‘000/000207’ ‘000/000424’ ‘000/000540’ ‘000/000574’ ‘000/000620’ ‘000/000676’ ‘000/000714’ ‘000/000715’ … CSV
schreiber2018/ismir2018 compared with schreiber2017/mirex2017 (2244 differences): ‘000/000141’ ‘000/000148’ ‘000/000182’ ‘000/000194’ ‘000/000197’ ‘000/000207’ ‘000/000424’ ‘000/000540’ ‘000/000667’ ‘000/000676’ ‘000/000714’ … CSV
schreiber2018/ismir2018 compared with schreiber2018/cnn (1794 differences): ‘000/000005’ ‘000/000148’ ‘000/000193’ ‘000/000207’ ‘000/000424’ ‘000/000459’ ‘000/000534’ ‘000/000540’ ‘000/000546’ ‘000/000615’ ‘000/000625’ … CSV
schreiber2018/ismir2018 compared with schreiber2018/fcn (2088 differences): ‘000/000141’ ‘000/000148’ ‘000/000193’ ‘000/000197’ ‘000/000207’ ‘000/000213’ ‘000/000424’ ‘000/000459’ ‘000/000540’ ‘000/000602’ ‘000/000615’ … CSV
None of the estimators estimated the following 448 items ‘correctly’ using Accuracy1: ‘000/000540’ ‘000/000676’ ‘000/000714’ ‘000/000715’ ‘000/000718’ ‘000/000890’ ‘001/001197’ ‘001/001249’ ‘001/001673’ ‘003/003263’ ‘003/003534’ … CSV
Differing Items Accuracy2
Items with different tempo annotations (Accuracy2, 4% tolerance) in different versions:
schreiber2018/ismir2018 compared with boeck2015/tempodetector2016_default (1018 differences): ‘000/000140’ ‘000/000148’ ‘000/000197’ ‘000/000207’ ‘000/000424’ ‘000/000540’ ‘000/000615’ ‘000/000676’ ‘000/000690’ ‘000/000714’ ‘000/000715’ … CSV
schreiber2018/ismir2018 compared with davies2009/mirex_qm_tempotracker (1637 differences): ‘000/000148’ ‘000/000194’ ‘000/000197’ ‘000/000200’ ‘000/000210’ ‘000/000256’ ‘000/000424’ ‘000/000534’ ‘000/000540’ ‘000/000615’ ‘000/000676’ … CSV
schreiber2018/ismir2018 compared with percival2014/stem (1246 differences): ‘000/000141’ ‘000/000194’ ‘000/000197’ ‘000/000200’ ‘000/000204’ ‘000/000210’ ‘000/000424’ ‘000/000534’ ‘000/000540’ ‘000/000602’ ‘000/000615’ … CSV
schreiber2018/ismir2018 compared with schreiber2014/default (1090 differences): ‘000/000140’ ‘000/000148’ ‘000/000200’ ‘000/000540’ ‘000/000574’ ‘000/000615’ ‘000/000621’ ‘000/000714’ ‘000/000715’ ‘000/000814’ ‘000/000890’ … CSV
schreiber2018/ismir2018 compared with schreiber2017/ismir2017 (1108 differences): ‘000/000141’ ‘000/000148’ ‘000/000194’ ‘000/000424’ ‘000/000540’ ‘000/000714’ ‘000/000715’ ‘000/000718’ ‘000/000814’ ‘000/000890’ ‘000/000892’ … CSV
schreiber2018/ismir2018 compared with schreiber2017/mirex2017 (1090 differences): ‘000/000141’ ‘000/000148’ ‘000/000194’ ‘000/000424’ ‘000/000540’ ‘000/000714’ ‘000/000715’ ‘000/000718’ ‘000/000814’ ‘000/000890’ ‘000/000892’ … CSV
schreiber2018/ismir2018 compared with schreiber2018/cnn (961 differences): ‘000/000148’ ‘000/000207’ ‘000/000424’ ‘000/000534’ ‘000/000540’ ‘000/000615’ ‘000/000625’ ‘000/000690’ ‘000/000714’ ‘000/000715’ ‘000/000716’ … CSV
schreiber2018/ismir2018 compared with schreiber2018/fcn (950 differences): ‘000/000148’ ‘000/000197’ ‘000/000424’ ‘000/000540’ ‘000/000602’ ‘000/000615’ ‘000/000625’ ‘000/000690’ ‘000/000714’ ‘000/000715’ ‘000/000716’ … CSV
None of the estimators estimated the following 285 items ‘correctly’ using Accuracy2: ‘000/000540’ ‘000/000714’ ‘000/000715’ ‘000/000890’ ‘001/001197’ ‘001/001249’ ‘001/001673’ ‘003/003263’ ‘003/003537’ ‘003/003832’ ‘004/004017’ … CSV
Significance of Differences
Estimator | boeck2015/tempodetector2016_default | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn |
---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.3628 | 0.0029 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
davies2009/mirex_qm_tempotracker | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.3628 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0029 | 0.0000 | 0.0000 | 1.0000 | 0.0041 | 0.0317 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0000 | 0.0000 | 0.0000 | 0.0041 | 1.0000 | 0.4764 | 0.0000 | 0.0018 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0000 | 0.0317 | 0.4764 | 1.0000 | 0.0000 | 0.0002 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0018 | 0.0002 | 0.0000 | 1.0000 |
Table 4: McNemar p-values, using reference annotations schreiber2018/ismir2018 as ground truth with Accuracy1 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same degree. If p ≤ α, reject H0, i.e. there is a significant difference in the disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.
Estimator | boeck2015/tempodetector2016_default | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn |
---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0052 | 0.0005 | 0.0050 | 0.0306 | 0.0093 |
davies2009/mirex_qm_tempotracker | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0052 | 0.0000 | 0.0000 | 1.0000 | 0.3508 | 0.9562 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0005 | 0.0000 | 0.0000 | 0.3508 | 1.0000 | 0.0356 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0050 | 0.0000 | 0.0000 | 0.9562 | 0.0356 | 1.0000 | 0.0000 | 0.0000 |
schreiber2018/cnn | 0.0306 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.6767 |
schreiber2018/fcn | 0.0093 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.6767 | 1.0000 |
Table 5: McNemar p-values, using reference annotations schreiber2018/ismir2018 as ground truth with Accuracy2 [Gouyon2006]. H0: both estimators disagree with the ground truth to the same degree. If p ≤ α, reject H0, i.e. there is a significant difference in the disagreement with the ground truth. In the table, p-values < 0.05 are set in bold.
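The p-values above come from McNemar's test on paired per-track correct/incorrect outcomes. A sketch of the χ²-approximated variant with continuity correction follows; the report may use a different variant (e.g. the exact binomial test), so treat this as illustrative only.

```python
from scipy.stats import chi2

def mcnemar_p(a_correct, b_correct):
    """McNemar chi-square test with continuity correction on paired
    correct/incorrect outcomes of two estimators vs. one ground truth."""
    b = sum(x and not y for x, y in zip(a_correct, b_correct))  # A right, B wrong
    c = sum(not x and y for x, y in zip(a_correct, b_correct))  # A wrong, B right
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    stat = (abs(b - c) - 1.0) ** 2 / (b + c)
    return float(chi2.sf(stat, df=1))
```

Only the discordant pairs (items where exactly one estimator is correct) carry information; items where both agree with or both disagree with the ground truth cancel out.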
Accuracy1 on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy1 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
Accuracy1 on Tempo-Subsets for schreiber2018/ismir2018
Figure 6: Mean Accuracy1 for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy2 on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Accuracy2 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
Accuracy2 on Tempo-Subsets for schreiber2018/ismir2018
Figure 7: Mean Accuracy2 for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Accuracy1 for Tempo
When fitting a generalized additive model (GAM) to Accuracy1 values as a function of the ground-truth tempo, what Accuracy1 can we expect, and with what confidence?
Estimated Accuracy1 for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on Accuracy1 for estimates for reference schreiber2018/ismir2018.
Figure 8: Accuracy1 predictions of a generalized additive model (GAM) fit to Accuracy1 results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Accuracy2 for Tempo
When fitting a generalized additive model (GAM) to Accuracy2 values as a function of the ground-truth tempo, what Accuracy2 can we expect, and with what confidence?
Estimated Accuracy2 for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on Accuracy2 for estimates for reference schreiber2018/ismir2018.
Figure 9: Accuracy2 predictions of a generalized additive model (GAM) fit to Accuracy2 results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy1 for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
Accuracy1 for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 10: Mean Accuracy1 of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
CSV JSON LATEX PICKLE SVG PDF PNG
Accuracy2 for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
Accuracy2 for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 11: Mean Accuracy2 of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
CSV JSON LATEX PICKLE SVG PDF PNG
MIREX-Style Evaluation
P-Score is defined as the average of the two reference tempi's hit indicators, weighted by their perceptual strength, allowing an 8% tolerance for both tempo values [MIREX 2006 Definition].
One Correct is the fraction of estimate pairs of which at least one of the two values is equal to a reference value (within an 8% tolerance).
Both Correct is the fraction of estimate pairs of which both values are equal to the reference values (within an 8% tolerance).
See [McKinney2007].
Note: Very few datasets actually have multiple annotations per track along with a salience distribution. References without suitable annotations are not shown.
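The three MIREX metrics can be sketched as follows, assuming a two-tempo reference (T1, T2) where `salience` is the perceptual strength of T1. This is a minimal reconstruction of the definitions above, not tempo_eval's code:

```python
def mirex_scores(est_pair, ref_pair, salience, tolerance=0.08):
    """P-Score, One Correct, and Both Correct for a pair of estimated tempi
    against a two-tempo reference. `salience` weights the first reference tempo."""
    # hit[i]: True if any estimate matches reference tempo i within the tolerance
    hit = [any(abs(e - r) <= tolerance * r for e in est_pair) for r in ref_pair]
    p_score = salience * hit[0] + (1.0 - salience) * hit[1]
    return p_score, hit[0] or hit[1], hit[0] and hit[1]
```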
MIREX Results for schreiber2018/ismir2018
Estimator | P-Score | One Correct | Both Correct |
---|---|---|---|
schreiber2018/fcn | 0.9014 | 0.9619 | 0.6694 |
schreiber2018/cnn | 0.8998 | 0.9601 | 0.6805 |
schreiber2014/default | 0.8674 | 0.9298 | 0.6620 |
schreiber2017/ismir2017 | 0.8666 | 0.9295 | 0.6624 |
schreiber2017/mirex2017 | 0.8582 | 0.9185 | 0.6471 |
boeck2015/tempodetector2016_default | 0.8455 | 0.9252 | 0.5156 |
davies2009/mirex_qm_tempotracker | 0.7828 | 0.8730 | 0.4542 |
percival2014/stem | 0.6710 | 0.8661 | 0.0121 |
Table 6: Mean P-Score, One Correct, and Both Correct of estimates compared to schreiber2018/ismir2018 with 8% tolerance, ordered by P-Score.
Raw data P-Score: CSV JSON LATEX PICKLE
Raw data One Correct: CSV JSON LATEX PICKLE
Raw data Both Correct: CSV JSON LATEX PICKLE
P-Score for schreiber2018/ismir2018
Figure 12: Mean P-Score for estimates compared to version schreiber2018/ismir2018 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct for schreiber2018/ismir2018
Figure 13: Mean One Correct for estimates compared to version schreiber2018/ismir2018 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct for schreiber2018/ismir2018
Figure 14: Mean Both Correct for estimates compared to version schreiber2018/ismir2018 depending on tolerance.
CSV JSON LATEX PICKLE SVG PDF PNG
P-Score on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean P-Score for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
P-Score on Tempo-Subsets for schreiber2018/ismir2018
Figure 15: Mean P-Score for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean One Correct for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
One Correct on Tempo-Subsets for schreiber2018/ismir2018
Figure 16: Mean One Correct for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean Both Correct for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
Both Correct on Tempo-Subsets for schreiber2018/ismir2018
Figure 17: Mean Both Correct for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated P-Score for Tempo
When fitting a generalized additive model (GAM) to P-Score values as a function of the ground-truth tempo, what P-Score can we expect, and with what confidence?
Estimated P-Score for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on P-Score for estimates for reference schreiber2018/ismir2018.
Figure 18: P-Score predictions of a generalized additive model (GAM) fit to P-Score results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated One Correct for Tempo
When fitting a generalized additive model (GAM) to One Correct values as a function of the ground-truth tempo, what One Correct score can we expect, and with what confidence?
Estimated One Correct for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on One Correct for estimates for reference schreiber2018/ismir2018.
Figure 19: One Correct predictions of a generalized additive model (GAM) fit to One Correct results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated Both Correct for Tempo
When fitting a generalized additive model (GAM) to Both Correct values as a function of the ground-truth tempo, what Both Correct score can we expect, and with what confidence?
Estimated Both Correct for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on Both Correct for estimates for reference schreiber2018/ismir2018.
Figure 20: Both Correct predictions of a generalized additive model (GAM) fit to Both Correct results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
P-Score for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
P-Score for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 21: Mean P-Score of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
CSV JSON LATEX PICKLE SVG PDF PNG
One Correct for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
One Correct for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 22: Mean One Correct of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
CSV JSON LATEX PICKLE SVG PDF PNG
Both Correct for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
Both Correct for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 23: Mean Both Correct of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
CSV JSON LATEX PICKLE SVG PDF PNG
OE1 and OE2
OE1 is defined as the octave error between an estimate E and a reference value R: OE1(E) = log2(E/R). This means that the most common errors, by a factor of 2 or 1/2, have the same magnitude, namely 1.
OE2 is the signed OE1 with the minimum absolute value, allowing the octave errors 2, 3, 1/2, and 1/3: OE2(E) = arg minx(|x|) with x ∈ {OE1(E), OE1(2E), OE1(3E), OE1(½E), OE1(⅓E)}.
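The two error measures translate directly into code; a minimal sketch of the definitions above (not tempo_eval's implementation):

```python
import math

def oe1(estimate, reference):
    """Signed octave error in octaves: positive means the estimate is too fast."""
    return math.log2(estimate / reference)

def oe2(estimate, reference):
    """Signed OE1 of smallest magnitude over the allowed octave factors 2, 3, 1/2, 1/3."""
    factors = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0)
    return min((oe1(estimate * f, reference) for f in factors), key=abs)
```

For example, a double-tempo estimate has OE1 = 1 but OE2 = 0, so OE2 isolates the residual error after forgiving octave confusions.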
Mean OE1/OE2 Results for schreiber2018/ismir2018
Estimator | OE1_MEAN | OE1_STDEV | OE2_MEAN | OE2_STDEV |
---|---|---|---|---|
schreiber2018/cnn | 0.0306 | 0.3631 | 0.0096 | 0.1084 |
schreiber2014/default | -0.1181 | 0.4163 | -0.0045 | 0.1187 |
schreiber2018/fcn | -0.0213 | 0.4234 | 0.0001 | 0.1102 |
schreiber2017/ismir2017 | -0.1099 | 0.4295 | -0.0033 | 0.1237 |
davies2009/mirex_qm_tempotracker | 0.1518 | 0.4341 | 0.0366 | 0.1452 |
percival2014/stem | -0.1334 | 0.4610 | 0.0133 | 0.1706 |
schreiber2017/mirex2017 | -0.1064 | 0.4837 | -0.0050 | 0.1312 |
boeck2015/tempodetector2016_default | 0.0603 | 0.4955 | -0.0006 | 0.1143 |
Table 7: Mean OE1/OE2 for estimates compared to version schreiber2018/ismir2018, ordered by OE1 standard deviation.
Raw data OE1: CSV JSON LATEX PICKLE
Raw data OE2: CSV JSON LATEX PICKLE
OE1 distribution for schreiber2018/ismir2018
Figure 24: OE1 for estimates compared to version schreiber2018/ismir2018. Shown are the mean OE1 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
OE2 distribution for schreiber2018/ismir2018
Figure 25: OE2 for estimates compared to version schreiber2018/ismir2018. Shown are the mean OE2 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
Significance of Differences
Estimator | boeck2015/tempodetector2016_default | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn |
---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
davies2009/mirex_qm_tempotracker | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0000 | 0.0000 | 1.0000 | 0.0020 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0000 | 0.0000 | 0.0020 | 1.0000 | 0.0654 | 0.0283 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.0000 | 0.0000 | 0.0000 | 0.0654 | 1.0000 | 0.3823 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.0000 | 0.0000 | 0.0000 | 0.0283 | 0.3823 | 1.0000 | 0.0000 | 0.0000 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
Table 8: Paired t-test p-values, using reference annotations schreiber2018/ismir2018 as ground truth with OE1. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e. there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.
Estimator | boeck2015/tempodetector2016_default | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn |
---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.0158 | 0.1125 | 0.0131 | 0.0000 | 0.6413 |
davies2009/mirex_qm_tempotracker | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0802 | 0.0000 |
schreiber2014/default | 0.0158 | 0.0000 | 0.0000 | 1.0000 | 0.2778 | 0.8961 | 0.0000 | 0.0048 |
schreiber2017/ismir2017 | 0.1125 | 0.0000 | 0.0000 | 0.2778 | 1.0000 | 0.0554 | 0.0000 | 0.0400 |
schreiber2017/mirex2017 | 0.0131 | 0.0000 | 0.0000 | 0.8961 | 0.0554 | 1.0000 | 0.0000 | 0.0031 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0802 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
schreiber2018/fcn | 0.6413 | 0.0000 | 0.0000 | 0.0048 | 0.0400 | 0.0031 | 0.0000 | 1.0000 |
Table 9: Paired t-test p-values, using reference annotations schreiber2018/ismir2018 as ground truth with OE2. H0: the true mean difference between paired samples is zero. If p ≤ α, reject H0, i.e. there is a significant difference between estimates from the two algorithms. In the table, p-values < 0.05 are set in bold.
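Such p-values can be computed with a standard paired t-test on the per-track octave errors of two estimators, e.g. with SciPy. The values below are toy data, not the corpus results:

```python
from scipy.stats import ttest_rel

# per-track OE1 of two estimators against the same reference (toy values);
# estimator A makes occasional octave errors (values near 1.0), B does not
oe_a = [0.02, -0.03, 1.00, 0.01, -0.02, 0.98, 0.00, 0.03]
oe_b = [0.01, -0.02, 0.02, 0.00, -0.01, 0.01, 0.01, 0.02]

stat, p = ttest_rel(oe_a, oe_b)  # paired test: same tracks, two estimators
```

The pairing matters: each track contributes one difference, so track-to-track difficulty variation cancels out.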
OE1 on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE1 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
OE1 on Tempo-Subsets for schreiber2018/ismir2018
Figure 26: Mean OE1 for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
OE2 on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean OE2 for reference subsets with tempi in [T-10,T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
OE2 on Tempo-Subsets for schreiber2018/ismir2018
Figure 27: Mean OE2 for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated OE1 for Tempo
When fitting a generalized additive model (GAM) to OE1 values as a function of the ground-truth tempo, what OE1 can we expect, and with what confidence?
Estimated OE1 for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on OE1 for estimates for reference schreiber2018/ismir2018.
Figure 28: OE1 predictions of a generalized additive model (GAM) fit to OE1 results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated OE2 for Tempo
When fitting a generalized additive model (GAM) to OE2 values as a function of the ground-truth tempo, what OE2 can we expect, and with what confidence?
Estimated OE2 for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on OE2 for estimates for reference schreiber2018/ismir2018.
Figure 29: OE2 predictions of a generalized additive model (GAM) fit to OE2 results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
OE1 for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
OE1 for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 30: OE1 of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
OE2 for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
OE2 for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 31: OE2 of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
AOE1 and AOE2
AOE1 is defined as the absolute octave error between an estimate E and a reference value R: AOE1(E) = |log2(E/R)|.
AOE2 is the minimum AOE1 after allowing the octave errors 2, 3, 1/2, and 1/3: AOE2(E) = min(AOE1(E), AOE1(2E), AOE1(3E), AOE1(½E), AOE1(⅓E)).
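The two definitions translate directly into code; a minimal sketch (the function names `aoe1` and `aoe2` are illustrative, not part of tempo_eval's API):

```python
import math

def aoe1(estimate, reference):
    # Absolute octave error between estimate E and reference R: |log2(E/R)|
    return abs(math.log2(estimate / reference))

def aoe2(estimate, reference):
    # Minimum AOE1 over the allowed octave corrections 1, 2, 3, 1/2, 1/3
    return min(aoe1(f * estimate, reference) for f in (1, 2, 3, 0.5, 1 / 3))

print(aoe1(60.0, 120.0))  # exactly one octave off -> 1.0
print(aoe2(60.0, 120.0))  # the factor-2 correction removes the error -> 0.0
```

Note that AOE2 forgives half/double and third/triple confusions, which is why the AOE2 columns in the table below are much smaller than the AOE1 columns.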
Mean AOE1/AOE2 Results for schreiber2018/ismir2018
Estimator | AOE1_MEAN | AOE1_STDEV | AOE2_MEAN | AOE2_STDEV |
---|---|---|---|---|
schreiber2018/cnn | 0.1549 | 0.3298 | 0.0373 | 0.1023 |
schreiber2018/fcn | 0.2000 | 0.3738 | 0.0388 | 0.1032 |
schreiber2017/ismir2017 | 0.2135 | 0.3886 | 0.0456 | 0.1151 |
schreiber2014/default | 0.2152 | 0.3754 | 0.0441 | 0.1103 |
schreiber2017/mirex2017 | 0.2375 | 0.4346 | 0.0455 | 0.1232 |
percival2014/stem | 0.2402 | 0.4155 | 0.0537 | 0.1625 |
davies2009/mirex_qm_tempotracker | 0.2546 | 0.3830 | 0.0766 | 0.1287 |
boeck2015/tempodetector2016_default | 0.2635 | 0.4240 | 0.0448 | 0.1052 |
Table 10: Mean AOE1/AOE2 for estimates compared to version schreiber2018/ismir2018, ordered by mean AOE1.
Raw data AOE1: CSV JSON LATEX PICKLE
Raw data AOE2: CSV JSON LATEX PICKLE
AOE1 distribution for schreiber2018/ismir2018
Figure 32: AOE1 for estimates compared to version schreiber2018/ismir2018. Shown are the mean AOE1 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
AOE2 distribution for schreiber2018/ismir2018
Figure 33: AOE2 for estimates compared to version schreiber2018/ismir2018. Shown are the mean AOE2 and an empirical distribution of the sample, using kernel density estimation (KDE).
CSV JSON LATEX PICKLE SVG PDF PNG
Significance of Differences
Estimator | boeck2015/tempodetector2016_default | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn |
---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0995 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
davies2009/mirex_qm_tempotracker | 0.0995 | 1.0000 | 0.0054 | 0.0000 | 0.0000 | 0.0013 | 0.0000 | 0.0000 |
percival2014/stem | 0.0001 | 0.0054 | 1.0000 | 0.0000 | 0.0000 | 0.5452 | 0.0000 | 0.0000 |
schreiber2014/default | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.6534 | 0.0000 | 0.0000 | 0.0042 |
schreiber2017/ismir2017 | 0.0000 | 0.0000 | 0.0000 | 0.6534 | 1.0000 | 0.0000 | 0.0000 | 0.0094 |
schreiber2017/mirex2017 | 0.0000 | 0.0013 | 0.5452 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0042 | 0.0094 | 0.0000 | 0.0000 | 1.0000 |
Table 11: Paired t-test p-values, using reference annotations schreiber2018/ismir2018 as ground truth with AOE1. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.
Estimator | boeck2015/tempodetector2016_default | davies2009/mirex_qm_tempotracker | percival2014/stem | schreiber2014/default | schreiber2017/ismir2017 | schreiber2017/mirex2017 | schreiber2018/cnn | schreiber2018/fcn |
---|---|---|---|---|---|---|---|---|
boeck2015/tempodetector2016_default | 1.0000 | 0.0000 | 0.0000 | 0.5732 | 0.4759 | 0.5330 | 0.0000 | 0.0000 |
davies2009/mirex_qm_tempotracker | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
percival2014/stem | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
schreiber2014/default | 0.5732 | 0.0000 | 0.0000 | 1.0000 | 0.1461 | 0.2780 | 0.0000 | 0.0000 |
schreiber2017/ismir2017 | 0.4759 | 0.0000 | 0.0000 | 0.1461 | 1.0000 | 0.9209 | 0.0000 | 0.0000 |
schreiber2017/mirex2017 | 0.5330 | 0.0000 | 0.0000 | 0.2780 | 0.9209 | 1.0000 | 0.0000 | 0.0000 |
schreiber2018/cnn | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.1267 |
schreiber2018/fcn | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.1267 | 1.0000 |
Table 12: Paired t-test p-values, using reference annotations schreiber2018/ismir2018 as ground truth with AOE2. H0: the true mean difference between paired samples is zero. If p ≤ α, we reject H0, i.e., there is a significant difference between the estimates of the two algorithms. In the table, p-values < 0.05 are set in bold.
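The tables above can be reproduced from per-track error values. A minimal sketch of the paired t statistic with made-up per-track AOE1 values (in practice the p-value is then obtained from the t distribution with n-1 degrees of freedom, e.g. via `scipy.stats.ttest_rel`):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """t statistic for H0: the true mean of the per-track differences is zero.
    Uses the sample standard deviation (n - 1 denominator)."""
    d = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# made-up per-track AOE1 values for two estimators on the same five tracks
a = [0.11, 0.13, 0.12, 0.14, 0.15]
b = [0.10, 0.10, 0.10, 0.10, 0.10]
print(paired_t_statistic(a, b))  # ≈ 4.24
```

Because the same tracks are evaluated by every estimator, the paired test is appropriate here: it compares per-track differences rather than the two pooled error distributions.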
AOE1 on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean AOE1 for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
AOE1 on Tempo-Subsets for schreiber2018/ismir2018
Figure 34: Mean AOE1 for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
AOE2 on Tempo-Subsets
How well does an estimator perform when only a subset of the reference annotations is taken into account? The graphs show mean AOE2 for reference subsets with tempi in [T-10, T+10] BPM. Note that the graphs do not show confidence intervals and that some values may be based on very few estimates.
AOE2 on Tempo-Subsets for schreiber2018/ismir2018
Figure 35: Mean AOE2 for estimates compared to version schreiber2018/ismir2018 for tempo intervals around T.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated AOE1 for Tempo
If we fit a generalized additive model (GAM) to AOE1 values as a function of the ground-truth tempo, what AOE1 can we expect, and with what confidence?
Estimated AOE1 for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on AOE1 for estimates, using schreiber2018/ismir2018 as reference.
Figure 36: AOE1 predictions of a generalized additive model (GAM) fit to AOE1 results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
Estimated AOE2 for Tempo
If we fit a generalized additive model (GAM) to AOE2 values as a function of the ground-truth tempo, what AOE2 can we expect, and with what confidence?
Estimated AOE2 for Tempo for schreiber2018/ismir2018
Predictions of GAMs trained on AOE2 for estimates, using schreiber2018/ismir2018 as reference.
Figure 37: AOE2 predictions of a generalized additive model (GAM) fit to AOE2 results for schreiber2018/ismir2018. The 95% confidence interval around the prediction is shaded in gray.
CSV JSON LATEX PICKLE SVG PDF PNG
AOE1 for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
AOE1 for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 38: AOE1 of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
AOE2 for ‘tag_fma_genre’ Tags
How well does an estimator perform when only tracks tagged with a given label are taken into account? Note that some values may be based on very few estimates.
AOE2 for ‘tag_fma_genre’ Tags for schreiber2018/ismir2018
Figure 39: AOE2 of estimates compared to version schreiber2018/ismir2018 depending on tag from namespace ‘tag_fma_genre’.
Generated by tempo_eval 0.1.1 on 2022-06-29 18:28. Size L.