Tabular data

We evaluated FEDOT on the AMLB benchmark [1]. The framework setup was taken from ‘frameworks.yaml’ as of the date the experiments started, so the following stable versions were used: AutoGluon 0.7.0, TPOT 0.11.7, LightAutoML 0.3.7.3, H2OAutoML 3.40.0.2, FEDOT 0.7.2. Some AutoGluon runs failed due to errors (also described in Appendix D of the AMLB paper [1]).

The visualizations were obtained with the built-in critical difference (CD) plots from AutoMLBenchmark [1].

In a CD (Critical Difference) diagram, we display each framework’s average rank and highlight which ranks are statistically significantly different from one another.

To determine the average rank of each solution, we first replace any missing values with the result of the constant predictor, then compute the ranks of the represented AutoML solutions and the constant predictor on each dataset, and finally average these ranks across all datasets for each solution.
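The ranking procedure described above can be sketched in a few lines of plain Python. This is only an illustration, not the AMLB implementation: the helper names `rank_row` and `average_ranks` are hypothetical, the three rows are copied from the binary ROC AUC table in this section, and the choice to give tied scores their mean rank is an assumption.

```python
from statistics import mean

FRAMEWORKS = ["FEDOT", "AutoGluon(B)", "H2OAutoML", "LightAutoML", "TPOT", "ConstantPredictor"]

# ROC AUC values copied from three rows of the binary table in this section.
# None marks a run without a result.
AUC = {
    "blood-transfusion": [0.759, 0.690, 0.754, 0.750, 0.740, 0.500],
    "guillermo": [0.891, 0.900, None, 0.926, 0.783, 0.500],
    "kc1": [0.843, 0.822, None, 0.831, 0.845, 0.500],
}


def rank_row(scores, higher_is_better=True):
    """Rank one dataset's scores (1 = best); tied scores share their mean rank (assumption)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def average_ranks(table, baseline_index, higher_is_better=True):
    """Impute missing scores with the baseline, rank per dataset, average per framework."""
    per_dataset = {}
    for task, row in table.items():
        baseline = row[baseline_index]
        filled = [baseline if value is None else value for value in row]
        per_dataset[task] = rank_row(filled, higher_is_better)
    means = [mean(column) for column in zip(*per_dataset.values())]
    return per_dataset, means


per_dataset, means = average_ranks(AUC, FRAMEWORKS.index("ConstantPredictor"))
for name, value in zip(FRAMEWORKS, means):
    print(f"{name:18s} {value:.2f}")
```

With this imputation, a framework without a result ties with the constant predictor on that dataset, so both receive the same rank there. For log loss, the same sketch applies with `higher_is_better=False`.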

We assess statistical significance of the rank differences using a non-parametric Friedman test with a threshold of p < 0.05 (resulting in p ≈ 0 for all diagrams) and apply a Nemenyi post-hoc test to identify which framework pairs differ significantly.
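As a rough illustration of the two tests, the sketch below computes the Friedman statistic and the Nemenyi critical difference, CD = q_alpha * sqrt(k(k+1) / (6N)), for k frameworks and N datasets. It is not the code used by AutoMLBenchmark. The assumptions are: the ranks are the eight complete rows at the top of the binary rank table in this section, so the result does not reproduce the diagrams; the Friedman formula omits the tie correction, which is valid here because no row contains ties; and the critical values are the commonly tabulated ones for alpha = 0.05. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

FRAMEWORKS = ["FEDOT", "AutoGluon(B)", "H2OAutoML", "LightAutoML", "TPOT", "ConstantPredictor"]

# Ranks copied from eight complete rows of the binary rank table in this section.
RANKS = [
    [4, 5, 2, 1, 3, 6],  # adult
    [4, 5, 2, 1, 3, 6],  # amazon_employee_access
    [3, 4, 1, 2, 5, 6],  # apsfailure
    [3, 2, 4, 1, 5, 6],  # australian
    [3, 5, 2, 1, 4, 6],  # bank-marketing
    [1, 5, 2, 3, 4, 6],  # blood-transfusion
    [3, 5, 2, 1, 4, 6],  # christine
    [5, 3, 1, 2, 4, 6],  # credit-g
]

# Tabulated critical values for alpha = 0.05 (assumed constants, indexed by k or by df).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}
CHI2_CRIT_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def friedman_statistic(ranks):
    """Friedman chi-square statistic without tie correction, plus the mean ranks."""
    n, k = len(ranks), len(ranks[0])
    mean_ranks = [sum(column) / n for column in zip(*ranks)]
    stat = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return stat, mean_ranks


def nemenyi_cd(k, n):
    """Critical difference between mean ranks for the Nemenyi post-hoc test."""
    return Q_ALPHA_005[k] * sqrt(k * (k + 1) / (6 * n))


stat, mean_ranks = friedman_statistic(RANKS)
k, n = len(FRAMEWORKS), len(RANKS)
reject_null = stat > CHI2_CRIT_005[k - 1]  # chi-square approximation with k - 1 degrees of freedom
cd = nemenyi_cd(k, n)

# Two frameworks differ significantly when their mean ranks are more than one CD apart.
significant = [
    (FRAMEWORKS[i], FRAMEWORKS[j])
    for i, j in combinations(range(k), 2)
    if abs(mean_ranks[i] - mean_ranks[j]) > cd
]

print(f"Friedman statistic: {stat:.3f}, reject H0 at 0.05: {reject_null}")
print(f"Critical difference: {cd:.3f}")
for pair in significant:
    print("significantly different:", pair)
```

If SciPy is available, `scipy.stats.friedmanchisquare` returns the statistic together with a p-value, which replaces the comparison against the tabulated chi-square threshold.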

The time budget for all experiments is 1 hour, and 10 folds are used (the 1h8c setup for AMLB). The results were obtained on a server based on Xeon Cascade Lake (2900 MHz) with 12 cores and 16 GB of memory.

CD for all datasets (ROC AUC and negative log loss):

../_images/cd-all-1h8c-constantpredictor.png

The CD diagram for all datasets (ROC AUC and negative log loss) shows that all AutoML frameworks (LightAutoML, H2OAutoML, TPOT, AutoGluon, FEDOT) perform statistically significantly better than the constant predictor.

CD for binary classification (ROC AUC):

../_images/cd-binary-classification-1h8c-constantpredictor.png

The CD diagram for binary classification (ROC AUC) shows that all AutoML frameworks (LightAutoML, H2OAutoML, TPOT, AutoGluon, FEDOT) perform similarly, falling within the same CD interval, and significantly outperform the constant predictor.

CD for multiclass classification (negative log loss):

../_images/cd-multiclass-classification-1h8c-constantpredictor.png

The CD diagram for multiclass classification (negative log loss) shows that TPOT and FEDOT demonstrate intermediate performance: they lie on the border between the CD interval that contains the constant predictor and the CD interval that contains H2OAutoML.

We can conclude that FEDOT achieves performance comparable to that of its competitors on tabular tasks.

The ranks for frameworks are provided below:

Ranks on Binary Tasks
Task	FEDOT	AutoGluon(B)	H2OAutoML	LightAutoML	TPOT	ConstantPredictor
adult	4	5	2	1	3	6
airlines		2	1		3	4
amazon_employee_access	4	5	2	1	3	6
apsfailure	3	4	1	2	5	6
australian	3	2	4	1	5	6
bank-marketing	3	5	2	1	4	6
blood-transfusion	1	5	2	3	4	6
christine	3	5	2	1	4	6
credit-g	5	3	1	2	4	6
guillermo	3	2		1	4	5
jasmine	3	4	2	5	1	6
kc1	2	4		3	1	5
kddcup09_appetency	4	5	2	1	3	6
kr-vs-kp	3	5	2	4	1	6
miniboone	4	3		1	2	5
nomao	5	4	2	1	3	6
numerai28_6	3	4	2	1		5
phoneme	5	3	2	4	1	6
riccardo		1		2		3
sylvine	4	5	2	3	1	6

Ranks on MultiClass Tasks
Task	FEDOT	AutoGluon(B)	H2OAutoML	LightAutoML	TPOT	ConstantPredictor
car	3	4	2	1	5	6
cnae-9	3	5	4	2	1	6
connect-4	4	5	2	1	3	6
covertype	2	1	3			4
dilbert	4	3	2	1	5	6
fabert	5	2	3	1	4	6
fashion-mnist	3	2	4	1	5	6
helena		2	3	1	4	5
jannis	5	4	2	1	3	6
jungle_chess_2pcs_raw_end	2	5	4	1	3	6
mfeat-factors	3	5	2	1	4	6
robert		2		1		3
segment	3	5	1	2	4	6
shuttle	4	3	1	2		5
vehicle	2	5	1	4	3	6
volkert	4	2	3	1		5

The raw metrics (ROC AUC for binary and log loss for multiclass tasks) for the frameworks are provided below:

Binary (ROC AUC)
Task	FEDOT	AutoGluon(B)	H2OAutoML	LightAutoML	TPOT	ConstantPredictor
adult	0.929	0.910	0.931	0.932	0.927	0.500
airlines		0.725	0.730		0.694	0.500
amazon_employee_access	0.863	0.857	0.873	0.879	0.866	0.500
apsfailure	0.992	0.991	0.993	0.992	0.990	0.500
australian	0.939	0.940	0.938	0.945	0.936	0.500
bank-marketing	0.936	0.931	0.939	0.940	0.935	0.500
blood-transfusion	0.759	0.690	0.754	0.750	0.740	0.500
christine	0.817	0.804	0.815	0.830	0.807	0.500
credit-g	0.778	0.795	0.798	0.796	0.794	0.500
guillermo	0.891	0.900		0.926	0.783	0.500
jasmine	0.888	0.883	0.888	0.880	0.890	0.500
kc1	0.843	0.822		0.831	0.845	0.500
kddcup09_appetency	0.819	0.804	0.829	0.850	0.826	0.500
kr-vs-kp	1.000	0.999	1.000	1.000	1.000	0.500
miniboone	0.981	0.982		0.988	0.983	0.500
nomao	0.994	0.995	0.996	0.997	0.995	0.500
numerai28_6	0.531	0.517	0.531	0.531		0.500
phoneme	0.965	0.965	0.968	0.965	0.971	0.500
riccardo		1.000		1.000		0.500
sylvine	0.988	0.985	0.989	0.988	0.993	0.500

MultiClass (LogLoss)
Task	FEDOT	AutoGluon(B)	H2OAutoML	LightAutoML	TPOT	ConstantPredictor
car	0.011	0.117	0.003	0.002	0.643	0.840
cnae-9	0.211	0.332	0.262	0.156	0.154	2.200
connect-4	0.404	0.502	0.338	0.337	0.373	0.840
covertype	0.164	0.071	0.264			1.210
dilbert	0.040	0.148	0.103	0.033	0.168	1.610
fabert	0.859	0.788	0.792	0.766	0.892	1.870
fashion-mnist	0.388	0.333	0.383	0.252	0.535	2.300
helena		2.785	2.980	2.537	2.982	4.140
jannis	0.753	0.728	0.691	0.664	0.703	1.110
jungle_chess_2pcs_raw_end	0.349	0.431	0.240	0.149	0.219	0.940
mfeat-factors	0.089	0.161	0.093	0.082	0.107	2.300
robert		1.684		1.318		2.300
segment	0.062	0.094	0.060	0.061	0.077	1.950
shuttle	0.001	0.001	0.000	0.001		0.670
vehicle	0.354	0.515	0.331	0.404	0.392	1.390
volkert	1.040	0.920	0.978	0.812		2.050

The comparison with [1] shows that AutoGluon is underperforming in our hardware setup, while TPOT and H2O are quite close in both setups. To avoid any confusion, we provide below an additional comparison of the FEDOT metrics with the metrics from [1]. However, it should be noted that the conditions are different, as are the exact versions of the frameworks.

AutoML Benchmark Results (AUC)
Task	FEDOT	H2O	TPOT	AutoGluon(B)	LightAutoML	GAMA(B)	MLJAR(P)	FLAML
adult	0.929	0.931	0.927	0.932	0.932	0.929	0.931	0.932
airlines	0.716	0.731	0.722	0.732	0.727	0.717	0.730	0.731
albert	0.749	0.761	0.718	0.782	0.780	0.726	0.765	0.770
amazon_employee_access	0.863	0.877	0.864	0.902	0.879	0.867	0.903	0.876
apsfailure	0.992	0.993	0.989	0.993	0.993	0.990	0.992	0.992
australian	0.939	0.935	0.939	0.941	0.946	0.941	0.944	0.938
bank-marketing	0.936	0.938	0.935	0.941	0.940	0.936	0.940	0.937
blood-transfusion	0.759	0.764	0.724	0.758	0.753	0.753	0.753	0.730
christine	0.817	0.825	0.811	0.826	0.831	0.828	0.823	0.824
credit-g	0.778	0.779	0.791	0.796	0.796	0.794	0.785	0.788
guillermo	0.891	0.897	0.826	0.914	0.932	0.865	0.912	0.919
jasmine	0.888	0.887	0.886	0.886	0.880	0.891	0.886	0.887
kc1	0.843	0.829	0.844	0.840	0.831	0.852	0.824	0.841
kddcup09_appetency	0.753	0.837	0.831	0.849	0.851	0.818	0.837	0.825
kr-vs-kp	1.000	1.000	0.999	1.000	1.000	1.000	1.000	0.961
minibooNE	0.981	0.987	0.982	0.989	0.988	0.982	0.987	0.987
nomao	0.994	0.996	0.995	0.997	0.997	0.995	0.997	0.997
numerai28_6	0.531	0.531	0.528	0.531	0.531	0.530	0.531	0.528
phoneme	0.965	0.968	0.969	0.969	0.966	0.971	0.967	0.972
riccardo	0.998	1.000	0.998	1.000	1.000	0.999	1.000	1.000
sylvine	0.988	0.990	0.992	0.990	0.988	0.993	0.992	0.991

AutoML Benchmark Results (Logloss)
Task	FEDOT	H2O	TPOT	AutoGluon(B)	LightAutoML	GAMA(B)	MLJAR(P)	FLAML
car	0.011	0.001	0.788	0.002	0.001	0.022	0.010	0.002
cnae-9	0.211	0.200	0.146	0.126	0.152	0.126	0.323	0.164
connect-4	0.404	0.311	0.392	0.295	0.335	0.417	0.342	0.340
covertype	0.164	0.253	0.696	0.057	0.082	0.526	0.105	0.068
dilbert	0.040	0.065	0.150	0.014	0.033	0.176	0.030	0.024
fabert	0.859	0.746	0.886	0.683	0.768	0.763	0.771	0.766
fashion-mnist	0.388	0.283	0.431	0.221	0.248	0.439	0.259	0.253
helena	2.963	2.791	2.951	2.467	2.555	2.802	2.653	2.617
jannis	0.753	0.669	0.734	0.650	0.666	0.732	0.672	0.674
jungle_chess_2pcs_raw	0.349	0.136	1.766	0.012	0.145	0.243	0.198	0.210
mfeat-factors	0.089	0.096	0.135	0.071	0.080	0.077	0.096	0.092
robert	1.745	1.423	1.956	1.304	1.283	1.710	1.417	1.382
segment	0.062	0.061	0.075	0.052	0.061	0.067	0.059	0.067
shuttle	0.001	0.000	0.001	0.000	0.001	0.001	0.000	0.000
vehicle	0.354	0.351	0.417	0.312	0.389	0.378	0.349	0.439
volkert	1.040	0.844	1.013	0.672	0.815	1.102	0.808	0.795

[1] Gijsbers P. et al. AMLB: an AutoML benchmark // Journal of Machine Learning Research. 2024. Vol. 25, No. 101. pp. 1-65.