Stairway to Fairness: Connecting Group and Individual Fairness

Theresia Veronika Rampisela; Maria Maistro; Tuukka Ruotsalo; Falk Scholer; Christina Lioma

arXiv:2508.21334·cs.IR·September 1, 2025

TL;DR

This paper investigates the relationship between group and individual fairness in recommender systems, revealing that optimizing for one can lead to unfairness in the other, and provides a comprehensive evaluation framework.

Contribution

It introduces a systematic comparison of fairness measures for group and individual fairness, highlighting their trade-offs and interactions.

Findings

01 High group fairness can lead to individual unfairness.

02 Evaluation measures for both fairness types are comparable.

03 Insights aid practitioners in balancing fairness objectives.

Abstract

Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their…

Tables

Table 1. Preprocessed dataset statistics. $n_{G_{a}}$ is the number of groups for sensitive attribute $a$. We exclude empty groups.

	ML-1M (Harper and Konstan, 2015)	JobRec (Hamner et al., 2012)	LFM-1B (Schedl, 2016)
#interaction (all splits)	467,218	210,921	15,024,267
#item (all splits)	3,030	19,912	51,204
#user (test set)	620	523	16,611
sensitive attr. #1 ( $n_{G_{1}}$ )	gender (2)	degree (3)	gender (2)
sensitive attr. #2 ( $n_{G_{2}}$ )	age (3)	years of experience (3)	age (3)
sensitive attr. #3 ( $n_{G_{3}}$ )	occupation (2)	major (6)	country (5)
#intersectional groups	12	36	29
min–max subgroup size	2–279	1–70	1–4,260
median subgroup size	31	7	59

Table 2. Effectiveness (Eff) and fairness (Fair) scores at cut-off $k=10$ for intersectional groups (Grp.) and for individuals (Ind.) of LLMRecs with non-sensitive (NS) and sensitive (S) prompts. All Fair scores are computed with NDCG. All measures range in [0,1], except for the Grp. measures below the grey lines (CV, FStat, KL, and GCE). The best Eff / Fair scores are bolded. Darker green marks scores closer to the best Eff / Fair per measure. $\uparrow/\downarrow$ means the higher/lower the better.

	LLMRec	GLM-4-9B		Llama-3.1-8B		Ministral-8B		Qwen2.5-7B
	prompt type	NS	S	NS	S	NS	S	NS	S
		ML-1M
Eff	↑ HR	0.377	0.358	0.260	0.269	0.342	0.340	0.363	0.371
	↑ MRR	0.189	0.174	0.101	0.113	0.159	0.173	0.165	0.180
	↑ NDCG	0.231	0.215	0.140	0.148	0.200	0.208	0.207	0.221
Fair (Grp.)	↑ Min	0.166	0.137	0.086	0.108	0.059	0.107	0.068	0.091
	↓ Range	0.188	0.208	0.356	0.376	0.290	0.500	0.328	0.274
	↓ SD	0.055	0.061	0.088	0.088	0.085	0.125	0.093	0.077
	↓ MAD	0.067	0.076	0.091	0.087	0.099	0.142	0.112	0.090
	↓ Gini	0.130	0.161	0.255	0.226	0.248	0.275	0.255	0.195
	↓ Atk	0.015	0.020	0.036	0.025	0.033	0.049	0.053	0.039
	↓ CV	0.233	0.285	0.540	0.502	0.465	0.525	0.460	0.365
	↓ FStat	0.468	0.714	0.841	0.654	0.767	1.278	1.645	1.220
	↓ KL	1.121	1.138	1.674	1.640	0.866	1.671	1.198	1.063
	↓ GCE	0.028	0.050	0.112	0.104	659.844	659.741	0.239	0.198
Fair (Ind.)	↓ SD	0.330	0.320	0.262	0.272	0.309	0.322	0.308	0.324
	↓ Gini	0.705	0.721	0.799	0.793	0.736	0.736	0.721	0.715
	↓ Atk	0.636	0.655	0.751	0.742	0.672	0.673	0.652	0.644
		JobRec
Eff	↑ HR	0.054	0.033	0.044	0.021	0.057	0.044	0.065	0.057
	↑ MRR	0.037	0.023	0.017	0.009	0.048	0.033	0.046	0.045
	↑ NDCG	0.041	0.025	0.024	0.012	0.050	0.036	0.050	0.048
Fair (Grp.)	↑ Min	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
	↓ Range	0.500	0.083	0.500	0.500	0.333	0.170	0.333	0.333
	↓ SD	0.093	0.023	0.086	0.082	0.074	0.042	0.072	0.071
	↓ MAD	0.066	0.019	0.053	0.035	0.065	0.035	0.064	0.056
	↓ Gini	0.828	0.834	0.857	0.939	0.757	0.790	0.740	0.809
	↓ Atk	0.547	0.518	0.677	0.834	0.522	0.499	0.473	0.500
	↓ CV	2.396	2.102	2.886	4.549	1.760	1.928	1.709	2.107
	↓ FStat	1.035	0.546	1.613	2.137	1.005	0.611	0.908	0.928
	↓ KL	3.218	1.428	3.979	5.861	1.881	1.754	2.069	2.288
	↓ GCE	1685.926	1979.103	1685.994	2052.498	1612.574	1759.177	1539.278	1759.185
Fair (Ind.)	↓ SD	0.183	0.147	0.121	0.090	0.211	0.174	0.204	0.203
	↓ Gini	0.956	0.974	0.966	0.985	0.949	0.963	0.947	0.951
	↓ Atk	0.948	0.969	0.958	0.980	0.944	0.957	0.937	0.944
		LFM-1B
Eff	↑ HR	0.658	0.661	0.609	0.618	0.451	0.317	0.375	0.317
	↑ MRR	0.409	0.408	0.347	0.357	0.266	0.174	0.199	0.160
	↑ NDCG	0.462	0.463	0.406	0.415	0.310	0.208	0.241	0.198
Fair (Grp.)	↑ Min	0.240	0.273	0.264	0.265	0.107	0.058	0.090	0.043
	↓ Range	0.604	0.604	0.884	0.931	0.993	1.000	0.525	0.331
	↓ SD	0.149	0.130	0.158	0.140	0.179	0.178	0.117	0.091
	↓ MAD	0.138	0.121	0.158	0.130	0.169	0.157	0.125	0.103
	↓ Gini	0.162	0.139	0.184	0.161	0.285	0.358	0.262	0.298
	↓ Atk	0.002	0.002	0.003	0.004	0.006	0.013	0.007	0.017
	↓ CV	0.363	0.309	0.383	0.356	0.625	0.841	0.510	0.545
	↓ FStat	2.188	1.809	2.859	3.841	2.855	3.382	2.337	4.385
	↓ KL	2.645	2.783	3.497	3.189	3.011	3.508	2.878	2.318
	↓ GCE	338.842	225.899	112.987	112.974	451.826	451.875	451.822	564.765
Fair (Ind.)	↓ SD	0.380	0.378	0.370	0.371	0.377	0.334	0.342	0.320
	↓ Gini	0.462	0.461	0.510	0.502	0.638	0.750	0.703	0.753
	↓ Atk	0.361	0.358	0.409	0.400	0.563	0.694	0.638	0.695

Full text

Stairway to Fairness: Connecting Group and Individual Fairness

Theresia Veronika Rampisela (ORCID: 0000-0003-1233-7690), University of Copenhagen, Copenhagen, Denmark

Maria Maistro (ORCID: 0000-0002-7001-4817), University of Copenhagen, Copenhagen, Denmark

Tuukka Ruotsalo (ORCID: 0000-0002-2203-4928), University of Copenhagen, Copenhagen, Denmark; LUT University, Lahti, Finland

Falk Scholer (ORCID: 0000-0001-9094-0810), RMIT University, Melbourne, Australia

Christina Lioma (ORCID: 0000-0003-2600-2701), University of Copenhagen, Copenhagen, Denmark

(2025)

Abstract.

Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems. Our code is available at: https://github.com/theresiavr/stairway-to-fairness.

Keywords: group fairness, individual fairness, fairness evaluation

Journal year: 2025. Copyright: cc. Conference: Proceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys ’25), September 22–26, 2025, Prague, Czech Republic. DOI: 10.1145/3705328.3748031. ISBN: 979-8-4007-1364-4/2025/09.

CCS concepts: Information systems → Evaluation of retrieval results; General and reference → Evaluation; Information systems → Recommender systems.

1. Introduction

With recent legislation mandating responsible artificial intelligence development, Recommender System (RS) fairness evaluation has become increasingly important to ensure that users are not systematically disadvantaged. Fairness in RSs can be evaluated for groups and for individuals. Group fairness typically refers to having equitable outcomes across groups (e.g., similar effectiveness between groups of users (Ekstrand et al., 2018)), while individual fairness is commonly defined as treating similar users/items equally (Wang et al., 2023) (e.g., similar effectiveness for all users (Wu et al., 2021)). Conceptually, prior work (Li et al., 2023; Do, 2023; Singh and Joachims, 2018) discusses how RSs can be fair to groups and at the same time unfair to individuals, or vice versa, but no work has empirically studied how this occurs in practice in RS fairness evaluation.

Prior work either: (i) evaluates fairness exclusively for groups or individuals (Deldjoo et al., 2024); or (ii) evaluates both, but with two different families of measures (Ferraro et al., 2024; Pastor and Bonchi, 2024; Pellegrini et al., 2023) or for two fairness subjects/objectives (Rastegarpanah et al., 2019; Wu et al., 2021). Evaluating group and individual fairness with different families of measures makes comparison difficult, as the measure scores may differ in sensitivity, or in theoretical and empirical ranges (Rampisela et al., 2024b, a; Schumacher et al., 2025). Likewise, it is not possible to properly compare group and individual fairness when each is evaluated for a distinct fairness objective, e.g., recommendation effectiveness disparity across individual users vs. exposure disparity between item groups (Wu et al., 2021). To address this gap, we evaluate user-side group and individual fairness with the same families of measures that can quantify both. An example of such measures is the Gini Index (Gini) (Gini, 1912). In Fig. 1, we evaluate group and individual fairness with Gini on real data and exemplify how an RS can be very fair towards user groups and at the same time much more unfair towards individual users.

In this paper, we study the relationship between evaluation measures of user-side group and individual fairness. This work is the first empirical study that compares the 9 existing user-side fairness evaluation measures for groups with those for individuals. We ask the following research questions (RQs):

RQ1. To what extent do group and individual fairness evaluation measures differ in their conclusions?

RQ2. For the same family of measures, how different are the group and individual fairness scores?

RQ3. How do different ways of grouping users affect between- and within-group fairness?

RQ4. How do between- and within-group fairness relate to individual fairness?

Our results show that group fairness measures often hide unfairness within groups and between individuals, highlighting the importance of evaluating fairness beyond the between-group level.

2. Methodology

We compare evaluation measures of user-side individual and group fairness in RSs, considering multiple ways of grouping users.

Datasets

To enable group fairness analysis, three datasets with $\geq 3$ user profile features are selected (see Tab. 1 for statistics).

ML-1M (Harper and Konstan, 2015) has 1,000,029 movie ratings (1–5) from 6K users. Users with no/unspecified self-reported gender, age, or occupation are removed, and we exclude users under 18 years to avoid processing the data of minors. We focus on recommending preferred movies, so ratings <3 are discarded, and the levels 4 and 5 are mapped to 1.

JobRec (Hamner et al., 2012) has 1.6M job applications from 321K users. Given a user’s application history, we focus on recommending job titles that may suit them, keeping only users with information for degree, major, and years of experience. Users with more than 60 years of experience are removed (as this likely indicates erroneous entries).

LFM-1B (Schedl, 2016) has $1{,}088{,}161{,}692$ music playcounts from $\sim$120K users. We focus on recommending new track artists for a user to listen to, other than the ones they have listened to in the past, using the dataset after deduplication based on the artist, with 65M interactions (provided by RecBole (Zhao et al., 2021)). The deduplication summarises the total playcount per artist and keeps the last event timestamp. Users without country, age, or gender information are removed, as are minors (as in ML-1M) and users with age >100 years (as this likely indicates erroneous entries).

Items without name/title are removed from all datasets. To reduce data sparsity, which may affect LLMRec performance (Jiang et al., 2025), we keep users and items with $\geq$ 5 interactions (5-core filtering) for ML-1M and JobRec. We apply 50-core filtering (Makhneva et al., 2023; Wen et al., 2023; Zhao et al., 2023) to LFM-1B, as it is highly sparse with 5-core filtering. The data is temporally split for train/val/test with a ratio of 3:1:1 using a global timeline (Meng et al., 2020). From all splits, users and items with $\leq t$ interactions in the train set are removed. A high $t$ can result in very few unique users in the test set, so we choose $t$ such that at least 500 test users remain. For ML-1M and LFM-1B, we set $t=5$ (Xu et al., 2024). For JobRec, we use $t=2$ . We remove users/items in the val and test sets that are not in the train set.
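The filtering and splitting steps above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, interactions are assumed to be `(user, item, timestamp)` tuples, the k-core filter is assumed to iterate until no more users or items drop out, and the global-timeline split is assumed to cut the time-ordered interactions by count. The additional removal of users and items with $\leq t$ train interactions is not shown.

```python
from collections import Counter


def k_core(interactions, k):
    """Repeatedly drop users and items with fewer than k interactions.

    Assumption: the filter is applied until the data stops changing.
    """
    data = list(interactions)
    while True:
        user_counts = Counter(u for u, _, _ in data)
        item_counts = Counter(i for _, i, _ in data)
        kept = [(u, i, t) for u, i, t in data
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(data):
            return kept
        data = kept


def global_temporal_split(interactions, ratio=(3, 1, 1)):
    """Split on one shared timeline (not per user) into train/val/test.

    Assumption: cut points are chosen by interaction count.
    """
    ordered = sorted(interactions, key=lambda x: x[2])
    total = sum(ratio)
    n = len(ordered)
    train_end = n * ratio[0] // total
    val_end = n * (ratio[0] + ratio[1]) // total
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def drop_unseen(train, other):
    """Remove val/test interactions whose user or item is not in train."""
    users = {u for u, _, _ in train}
    items = {i for _, i, _ in train}
    return [(u, i, t) for u, i, t in other if u in users and i in items]
```

A per-user split would instead cut each user's own history; the global timeline keeps every test interaction later than every train interaction.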

User grouping

To study group fairness, we cluster users based on their sensitive attributes (see Tab. 1 and App. A.1 in the code repository). Users cannot belong to two groups at the same time, e.g., age <50 and age $\geq$ 50.

For ML-1M, we use gender, age, and occupation as sensitive attributes. Gender is used as is. Age is grouped into: 18–24 years, 25–49 years, and $\geq$ 50 years (Office for National Statistics, 2023). User occupation is grouped into: non-working (student (U.S. Bureau of Labor Statistics, 2024; Eurostat, 2018), homemaker, retired, and unemployed) and working (14 occupations ranging from farmer to executive (U.S. Bureau of Labor Statistics, 2023)).

For JobRec, we consider the user’s academic degree, years of working experience, and study major as the sensitive attributes. Degree is grouped into: high school, college (associate or vocational degree), and university (bachelor’s, master’s, and PhD). Years of experience are grouped into: $\leq$ 5 years, >5–10 years, and >10 years. We group study majors into six fields of study, as per Xu et al. (Xu et al., 2024), using manual annotation and fuzzy string matching. [Footnote 1: Details are provided in App. A.1 in the code repository.]

For LFM-1B, the sensitive attributes are gender, age, and country. Gender and age are processed as for ML-1M, and the user’s country is mapped to the continent. [Footnote 2: We use the country-continent mapping from https://gist.github.com/achuhunkin/6cb1cbceb23395300aa209aad09e6e5d, and manually group transcontinental countries.] Users from the North/South Americas are grouped together with Antarctica (Xu et al., 2024; Gómez et al., 2024).
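As an illustration of the grouping described in this subsection, the sketch below bins age and years of experience with the thresholds given in the text and forms intersectional groups as the combination of a user's attribute values. The function names, group labels, and the dictionary-based user profile are assumptions made for this example. Only combinations that actually occur are created, which is one way to obtain the non-empty groups counted in Tab. 1.

```python
def age_group(age):
    """Age bins used for ML-1M and LFM-1B: 18-24, 25-49, and 50 or older."""
    if age < 18:
        raise ValueError("minors are excluded from the data")
    if age <= 24:
        return "18-24"
    if age <= 49:
        return "25-49"
    return "50+"


def experience_group(years):
    """Experience bins used for JobRec: <=5, >5-10, and >10 years."""
    if years <= 5:
        return "<=5"
    if years <= 10:
        return ">5-10"
    return ">10"


def intersectional_groups(profiles, attributes):
    """Map each observed attribute combination to the users who have it.

    profiles: {user_id: {attribute_name: group_label}} (hypothetical layout).
    Each user lands in exactly one group; empty combinations never appear.
    """
    groups = {}
    for user_id, profile in profiles.items():
        key = tuple(profile[name] for name in attributes)
        groups.setdefault(key, []).append(user_id)
    return groups
```

For example, with gender (2 values), age (3), and occupation (2), at most 12 intersectional groups can occur, which matches the ML-1M count in Tab. 1.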

LLM-based Recommenders

Recent work has utilised Large Language Models (LLMs) as recommenders (LLMRecs), with promising results (Hou et al., 2024). [Footnote 3: We also experiment with two collaborative filtering RSs, i.e., UserKNN and NeuMF. See also Footnote 10.] Unlike collaborative filtering models, LLMRecs can easily handle fine-grained user attributes (e.g., users’ study major, which can be important for job recommendation) with their world knowledge, although including sensitive attributes in the prompt can impact effectiveness and fairness (Deldjoo and Di Noia, 2025; Xu et al., 2024; Zhang et al., 2023). We therefore study the effectiveness and fairness of LLMRecs under few-shot learning. To ensure comparable performance, we use four open-source, similar-sized LLMs released between July and November 2024: Llama-3.1-8B-Instruct (Grattafiori et al., 2024), Qwen2.5-7B-Instruct (Team, 2024), GLM-4-9B-chat (GLM, 2024), and Ministral-8B-Instruct-2410 (team, 2024). The temperature is fixed at 0 for each LLM to obtain deterministic output.

The LLMs are prompted using the in-context learning (ICL) strategy (Hou et al., 2024), as this has been shown to outperform sequential and recency-focused prompting. [Footnote 4: The full prompt templates and examples are provided in App. A.2.] We only prompt for users that exist in the test set, as otherwise it is not possible to evaluate the recommendation effectiveness. In the prompt, we provide the users’ train items as the interaction history and the val items as the few-shot samples. To guide the recommendation generation with the ICL strategy, val items are used as examples of what should be recommended to a user, considering their historical interactions. A maximum of 10 most recent train and val items each are provided, as having too many items in the prompt may reduce effectiveness (Jiang et al., 2025; Hou et al., 2024). To avoid inflated performance, we do not prompt with a sampled candidate item list (Krichene and Rendle, 2022). Instead, we add restrictions in the prompt to narrow down the item search space, e.g., for ML-1M, the movies should be between certain years, based on the movie release year in the metadata file. For LFM-1B, the prompt also includes the playcount, which is important in music recommendation (Gómez et al., 2024).

Based on the inclusion/exclusion of user sensitive attributes, we create two prompt types (Deldjoo and Di Noia, 2025; Zhang et al., 2023; Tommasel, 2024; Deldjoo and Nazary, 2024): Sensitive (S), which has both interaction history and all three sensitive attributes, and Non-Sensitive (NS), which has only the interaction history.
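The difference between the two prompt types can be illustrated with a small sketch. The wording of the prompt below is invented for this example and is not the authors' template (their templates are in App. A.2); the function name and arguments are hypothetical. What the sketch does take from the text is the structure: up to 10 most recent train items as history, up to 10 most recent val items as few-shot examples, and sensitive attributes only in the S prompt.

```python
def build_prompt(history, few_shot, attributes=None, max_items=10):
    """Assemble an NS prompt (attributes=None) or an S prompt.

    history, few_shot: item names in chronological order (oldest first).
    attributes: optional {name: value} dict of sensitive attributes.
    All prompt wording here is a placeholder, not the original template.
    """
    lines = []
    if attributes:
        # Only the Sensitive (S) prompt carries the user profile.
        profile = ", ".join(f"{name}: {value}" for name, value in attributes.items())
        lines.append(f"User profile: {profile}")
    lines.append("Interaction history:")
    lines += [f"- {item}" for item in history[-max_items:]]
    lines.append("Examples of items this user interacted with afterwards:")
    lines += [f"- {item}" for item in few_shot[-max_items:]]
    lines.append("Recommend 10 further items for this user.")
    return "\n".join(lines)
```

Calling the function twice for the same user, once with and once without `attributes`, yields the paired S and NS prompts whose recommendations are compared.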

To evaluate LLMRecs, we perform fuzzy string matching between the list of recommended items and item names in the test set (Deldjoo, 2024; Jiang et al., 2025; He et al., 2023; Liang et al., 2024; Palma et al., 2024), by using the TF-IDF (Jones, 1972) of the items’ character-based n-grams (Lian et al., 2024). [Footnote 5: Metrics based on n-grams have been used to evaluate the performance of (conversational) RSs (Ravaut et al., 2024).] If the item name similarity exceeds a pre-set threshold ($\geq 0.75$), we count it as a match (i.e., the LLMRec successfully recommends an item that exists in a user’s test set). [Footnote 6: To our knowledge, no existing work has evaluated the effect of various similarity thresholds for this context.] Our LLMRecs experiments are carried out with vllm (Kwon et al., 2023) and RecLM-eval (Lian et al., 2024).
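The matching step can be sketched in plain Python as below. This is not the RecLM-eval implementation: the n-gram length (3), the lower-casing, the smoothed IDF variant, the cosine similarity, and the choice to fit IDF on the recommended and test names together are all assumptions made for illustration. Only the 0.75 threshold comes from the text.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lower-cased name (the whole name if shorter than n)."""
    text = text.lower().strip()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def tfidf_vectors(names, n=3):
    """L2-normalised TF-IDF vectors over character n-grams."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
               for g, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def match_recommendations(recommended, test_items, threshold=0.75, n=3):
    """Return one True/False flag per recommended name, in rank order."""
    vectors = tfidf_vectors(list(recommended) + list(test_items), n)
    rec_vecs = vectors[:len(recommended)]
    test_vecs = vectors[len(recommended):]
    return [any(cosine(rv, tv) >= threshold for tv in test_vecs)
            for rv in rec_vecs]
```

The resulting list of hit flags is the input that rank-based effectiveness measures such as HR or NDCG need for each user.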

Evaluation

Recommendation effectiveness (Eff) and fairness (Fair) are measured at $k=10$ for all LLMRecs. For Eff, the mean Hit Rate (HR), MRR, P@ $k$ (P), and NDCG@ $k$ (Järvelin and Kekäläinen, 2002) are computed over all users. Group and individual Fair measures are computed in two steps: first, computing an Eff score per user/group as a ‘base score’; and second, aggregating the ‘base score’ between users/groups with a Fair measure. P and NDCG are used as base scores to represent set- and rank-based measures.
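The first of the two steps, computing a base score per user and per group, might look like the sketch below. It assumes binary relevance, a standard log2-discounted NDCG, and that a group's base score is the mean of its members' scores; the text does not spell these details out here (they are in App. A.3), and the function names are hypothetical.

```python
import math
from collections import defaultdict


def precision_at_k(hits, k=10):
    """Set-based base score: share of the top-k positions that are hits."""
    return sum(hits[:k]) / k


def ndcg_at_k(hits, n_relevant, k=10):
    """Rank-based base score with binary relevance.

    hits: 0/1 flags in rank order; n_relevant: size of the user's test set.
    """
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits[:k]))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(n_relevant, k)))
    return dcg / ideal if ideal > 0 else 0.0


def group_base_scores(user_scores, user_to_group):
    """Assumed group base score: mean of the member users' scores."""
    by_group = defaultdict(list)
    for user, score in user_scores.items():
        by_group[user_to_group[user]].append(score)
    return {group: sum(v) / len(v) for group, v in by_group.items()}
```

In the second step, a Fair measure is applied either to the per-user scores (individual fairness) or to the per-group scores (group fairness).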

Group fairness. We compute all existing fairness measures for two or more user groups in RSs that were published up to March 2025: the average score of the worst 25% of groups (Min (Wang et al., 2024)), Range (Liu et al., 2024), SD (Zhang et al., 2023; Liu et al., 2024), MAD (Fu et al., 2020), Gini (Pastor and Bonchi, 2024; Ferraro et al., 2024; Ghosh et al., 2024), CV (Zhu et al., 2020), FStat (Wan et al., 2020), KL (Amigó et al., 2023), and GCE (Deldjoo et al., 2021, 2019). [Footnote 7: Measures that can only be used for exactly two groups are excluded.] We also compute the Atkinson Index (Atk (Atkinson, 1970)), an income inequality measure that considers within-group variations. [Footnote 8: This measure can be transformed into Generalised Entropy (Speicher et al., 2018; Shorrocks, 1980).] The between- and within-group fairness versions of the measures are denoted as $\cdot_{\text{b-group}}$ and $\cdot_{\text{w-group}}$, respectively. [Footnote 9: The term group fairness refers to between-group fairness; the latter is used when we compare fairness between and within groups.] We provide the measure formulations and technical details in App. A.3.
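A few of the simpler group measures can be sketched as below, applied to a list of per-group base scores. These are textbook forms written for illustration; the exact formulations used in the paper are in App. A.3 and may differ, for instance in how the worst 25% is rounded (rounded up here) or whether the population or sample standard deviation is used (population here). The function names are hypothetical.

```python
import math
import statistics


def worst_quartile_mean(scores):
    """Min: average score of the worst 25% of groups (count rounded up)."""
    n_worst = max(1, math.ceil(0.25 * len(scores)))
    return statistics.fmean(sorted(scores)[:n_worst])


def score_range(scores):
    """Range: gap between the best and the worst group."""
    return max(scores) - min(scores)


def std_dev(scores):
    """SD: population standard deviation of the group scores."""
    return statistics.pstdev(scores)


def coeff_variation(scores):
    """CV: SD divided by the mean; undefined (NaN) when the mean is zero."""
    mean = statistics.fmean(scores)
    if mean == 0:
        return float("nan")
    return statistics.pstdev(scores) / mean
```

Because CV divides by the mean, it is not bounded by 1 when scores are small and dispersed, which is consistent with CV values above 1 in Tab. 2.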

Individual fairness. Fairness for individual users is quantified with SD (Patro et al., 2020), Gini (Leonhardt et al., 2018), and Atk. The subscript $\cdot_{\text{ind}}$ indicates the individual fairness version of the measure. SD and Gini are the only Fair measures that have been used for both individual and group user fairness, while Atk$_{\text{ind}}$ can be decomposed into between- and within-group fairness with no residuals (Blackorby et al., 1999; Dayioğlu and Başlevent, 2006; Bourguignon, 1979). While other group Fair measures can also be used to measure individual fairness, their scores may not be informative, e.g., Min may be zero for most models, as having most users scoring P=0 or NDCG=0 is common.
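A toy example shows how the same measure can give very different answers at the group and individual level. The sketch uses the standard mean-absolute-difference form of the Gini Index without any small-sample normalisation, which may differ from the formulation in App. A.3; the scores and group assignment are made up for illustration.

```python
def gini(scores):
    """Gini Index: mean absolute difference over all pairs, divided by 2 * mean."""
    n = len(scores)
    mean = sum(scores) / n
    if mean == 0:
        return 0.0  # everyone scores zero: no disparity to measure
    total_diff = sum(abs(a - b) for a in scores for b in scores)
    return total_diff / (2 * n * n * mean)


# Hypothetical per-user base scores (e.g., NDCG) for two groups of four users.
group_a = [1.0, 0.0, 0.0, 0.0]
group_b = [0.0, 1.0, 0.0, 0.0]

# Group level: both groups average 0.25, so the measure sees no disparity.
group_means = [sum(group_a) / len(group_a), sum(group_b) / len(group_b)]
gini_group = gini(group_means)        # 0.0

# Individual level: two users get everything, six get nothing.
gini_ind = gini(group_a + group_b)    # 0.75
```

Here the groups look perfectly fair while the individual-level score is high, because averaging within each group hides the within-group disparity.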

3. Empirical Analysis

Evaluation of all LLMRecs

Bibliography

Amigó et al. (2023) Enrique Amigó, Yashar Deldjoo, Stefano Mizzaro, and Alejandro Bellogín. 2023. A unifying and general account of fairness measurement in recommender systems. Information Processing & Management 60, 1 (1 2023), 103115. https://doi.org/10.1016/J.IPM.2022.103115
Atkinson (1970) Anthony B Atkinson. 1970. On the measurement of inequality. Journal of Economic Theory 2, 3 (1970), 244–263. https://doi.org/10.1016/0022-0531(70)90039-6
Blackorby et al. (1999) Charles Blackorby, Walter Bossert, and David Donaldson. 1999. Income Inequality Measurement: The Normative Approach. Springer Netherlands, Dordrecht, 133–161. https://doi.org/10.1007/978-94-011-4413-1_4
Bourguignon (1979) Francois Bourguignon. 1979. Decomposable Income Inequality Measures. Econometrica 47, 4 (1979), 901–920. http://www.jstor.org/stable/1914138
Cohere (2025) Team Cohere. 2025. Command A: An Enterprise-Ready Large Language Model. arXiv:2504.00698 [cs.CL] https://arxiv.org/abs/2504.00698
Dayioğlu and Başlevent (2006) Meltem Dayioğlu and Cem Başlevent. 2006. Imputed Rents and Regional Income Inequality in Turkey: A Subgroup Decomposition of the Atkinson Index. Regional Studies 40, 8 (2006), 889–905. https://doi.org/10.1080/00343400600984395
Deldjoo (2024) Yashar Deldjoo. 2024. Understanding Biases in ChatGPT-based Recommender Systems: Provider Fairness, Temporal Stability, and Recency. ACM Trans. Recomm. Syst. (Aug. 2024). https://doi.org/10.1145/3690655 Just Accepted.