Enhancing the Demand for Labour survey by including skills from online job advertisements using model-assisted calibration

Maciej Beręsewicz; Greta Białkowska; Krzysztof Marcinkowski; Magdalena Maślak; Piotr Opiela; Robert Pater; Katarzyna Zadroga

arXiv:1908.06731 · econ.GN · June 8, 2021

Open Access

TL;DR

This paper enhances the Demand for Labour survey by integrating online job ad skills data using model-assisted calibration, improving skill demand estimates and reducing bias compared to traditional methods.

Contribution

It introduces a novel data integration approach combining calibration with LASSO to correct online data biases without unit-level survey data.

Findings

1. LASSO-assisted calibration reduces bias and standard errors.

2. Online data overestimates interpersonal and managerial skills.

3. Under-representation of certain occupations affects skill estimates.

Abstract

In the article we describe an enhancement to the Demand for Labour (DL) survey conducted by Statistics Poland, which involves the inclusion of skills obtained from online job advertisements. The main goal is to provide estimates of the demand for skills (competences), which is missing in the DL survey. To achieve this, we apply a data integration approach combining traditional calibration with the LASSO-assisted approach to correct representation error in the online data. Faced with the lack of access to unit-level data from the DL survey, we use estimated population totals and propose a bootstrap approach that accounts for the uncertainty of totals reported by Statistics Poland. We show that the calibration estimator assisted with LASSO outperforms traditional calibration in terms of standard errors and reduces representation bias in skills observed in online job ads. Our empirical…

Tables (15)

Table 1: Estimated total number of vacancies at the end of Q1 based on the DL survey

2011	2013	2014
71 775	42 889	52 725

Table 2: Sample sizes and the coding precision in the DL survey in the online job ads module

Year	2010	2011	2012	2013	2014
Day	10th Sep	28th Mar	26th Mar	25th Mar	28th Mar
Initial Sample	21 195	22 243	23 366	22 795	23 452
Final sample	20 009	20 634	21 594	20 081	21 456
DEOs	8 198	7 018	7 253	5 614	8 542
the Internet (Careerjet.pl)	11 811	13 618	14 342	14 467	12 914
Coding quality	0.72	0.89	0.96	0.96	0.96

Table 3: Share of skills included in job offers by data source based on pooled data for 2011, 2013 and 2014

Skill	Careerjet.pl	DEOs	Both
Artistic	15.8	2.2	11.2
Availability	21.0	2.9	14.8
Cognitive	20.8	1.5	14.3
Computer	33.2	8.9	25.0
Interpersonal	55.9	6.9	39.3
Managerial	29.2	2.0	20.0
Mathematical	0.3	0.1	0.2
Office	3.8	1.5	3.0
Physical	6.0	2.0	4.7
Self-organization	59.1	7.6	41.6
Technical	4.3	5.1	4.6

Table 4: Cramer’s V between skills and occupation, NACE section and province, based on the HC survey pooled data for 2011, 2013 and 2014

Skill	Occupation (2 digits)	NACE	Province
Artistic	0.22	0.11	0.05
Availability	0.15	0.14	0.05
Cognitive	0.21	0.06	0.06
Computer	0.45	0.23	0.10
Interpersonal	0.42	0.23	0.06
Managerial	0.34	0.15	0.04
Mathematical	0.05	0.02	0.03
Office	0.11	0.06	0.03
Physical	0.17	0.09	0.04
Self-organization	0.34	0.19	0.04
Technical	0.31	0.11	0.07

Table 5: Basic idea of data integration when variables are available at unit-level or domain-level

Data source	$𝑿$	$𝒀$	$𝒅$	$𝑻^{X}$	${\hat{𝑻}}^{X}$
Population data	✓	–	–	✓	–
Online data (A)	✓	✓	–	–	–
Sample survey (B)	✓	–	✓	–	✓

Table 6: Relative standard errors of estimators for vacancies in the Demand for Labour survey for Q4 of 2011, 2013 and 2014

Section	2011	2013	2014
Total	3.40	4.01	3.98
C - Manufacturing	5.50	5.27	5.64
F - Construction	13.86	19.21	15.12
G - Trade; repair of motor vehicles	13.69	15.75	16.33
H - Transportation and storage	8.07	9.93	9.17
I - Accommodation and catering	15.99	20.78	18.26
J - Information and communication	6.30	7.04	11.50
K - Financial and insurance activities	7.00	8.36	7.43
M - Professional, scientific and technical activities	8.12	8.71	12.01
N - Administrative and support service activities	23.09	12.89	17.76
O - Public administration and defence; compulsory social security	3.19	3.50	2.56
P - Education	8.85	10.65	12.06
Q - Human health and social work activities	5.53	6.88	6.00
R - Arts, entertainment and recreation	7.08	8.68	9.28
S - Other service activities	18.09	21.29	20.77

Table 7: Relative standard errors of estimators for vacancies by occupation (2-digit code) in Q4 of 2011, 2013 and 2014 based on the proposed bootstrap procedure

Year	Min	Q1	Median	Mean	Q3	Maximum
2011	2.38	4.78	6.42	7.84	10.20	20.05
2013	3.49	4.94	6.50	8.18	10.12	20.40
2014	2.32	5.50	6.55	8.26	10.70	17.56

Table 8: Point estimates of the fraction of skills for the pooled sample for 2011, 2013 and 2014

SKILLS	HTSRS	ECGREG	ECMC	ECLASSO1	ECLASSO2	ECALASSO1
Artistic	15.8	12.3	12.4	12.5	13.0	12.5
Availability	20.9	19.8	19.7	19.6	21.5	19.5
Cognitive	20.9	14.3	14.3	14.6	14.0	14.6
Computer	33.0	22.2	22.0	22.3	23.0	22.6
Interpersonal	53.8	34.5	34.5	35.1	35.0	34.9
Managerial	26.2	16.7	16.5	16.8	17.7	16.8
Mathematical	0.4	0.4	0.4	0.4	0.4	0.4
Office	3.9	3.1	3.1	3.2	3.4	3.2
Physical	5.4	7.4	7.6	7.5	8.2	7.6
Self-organization	58.6	43.8	43.5	43.9	46.2	43.8
Technical	4.3	7.5	7.7	7.7	8.3	7.7

Table 9: Average estimates of relative standard errors for skills over 2011, 2013 and 2014

SKILLS	MCGREG	ECMC	ECLASSO1	ECLASSO2	ECALASSO1
Artistic	11.1	3.5	3.4	3.4	3.4
Availability	22.0	1.0	0.9	1.5	1.0
Cognitive	25.4	8.5	8.1	9.3	8.2
Computer	24.9	12.9	12.4	12.7	12.4
Interpersonal	17.6	6.6	6.3	6.6	6.4
Managerial	15.3	5.6	5.3	5.5	5.4
Mathematical	15.6	4.1	4.0	3.2	4.1
Office	33.5	4.7	4.4	4.4	4.6
Physical	32.6	4.1	4.2	4.7	4.3
Self-organization	16.7	3.8	3.6	3.5	3.6
Technical	25.1	5.3	5.2	7.8	5.2

Table 10: Eleven general skills categories used in the Study of Human Capital in Poland

Skill	Behavior dimension	Behavior sub-dimension
Artistic	artistic and creative skills	–
Availability	availability	readiness to travel frequently; flexible working hours (no fixed slots)
Cognitive	seeking and analysing information, and drawing conclusions	quick summarising of large volumes of text; logical thinking, analysis of facts; continuous learning of new things
Computer	working with computers and using the Internet	basic knowledge of MS Office-type package; knowledge of specialist software, ability to write applications and author websites; using the Internet: browsing of websites, handling e-mail
Interpersonal	contacts with other people (with colleagues, clients, people in the care)	cooperation within the group; ease in establishing contacts with colleagues and/or clients; being communicative and sharing ideas clearly; solving conflicts between people
Managerial	managerial skills and organisation of work	assigning tasks to other members of staff; coordination of work of other staff; disciplining other staff – taking them to task
Mathematical	performing calculations	performing simple calculations; performing advanced mathematical computations
Office	organisation and conducting office works	–
Physical	physical fitness	–
Self-organization	self-organisation of work and showing initiative (planning and timely execution of tasks at work, efficiency in pursuing a goal)	independent making of decisions; entrepreneurship and showing initiative; creativity (being innovative, inventing new solutions); resilience to stress; timely completion of planned actions
Technical	technical imagination and handling technical devices	handling technical devices; repairing technical devices

Table 11: Coding precision, measured by the number of job offers with codes of differing accuracy (different number of digits) based on pooled data from 2011, 2013 and 2014 for occupation

Number of digits	Job offers
6 digits	33 966
5 digits	2 663
4 digits	715
3 digits	614
2 digits	142
1 digit	138

Table 12: Information available in published datasets from the Study of Human Capital in Poland, regardless of its quality

Variable	2011	2012	2013	2014
Occupation (up to 6 digits)	X	–	X	X
Occupation (only 1 digit)	X	X	X	X
NACE (up to 3 digits)	X	X	X	X
Industry	X	X	X	X
Province	X	X	X	X
Subregion	X	X	X	X
Education	X	X	X	X
Foreign languages	X	X	X	X
Work experience	X	X	X	X

Table 13: Percentage of missing data in selected variables in each wave of the Human Capital in Poland survey

Variables	2011	2013	2014
Occupation	0.33	0.40	0.49
NACE	6.04	56.86	41.98
Voivodeship	1.06	0.01	0.21

Table 14: Distribution of Occupation (ISCO-08 2-digit codes) in Population and HC data (average over 2011, 2013 and 2014)

Occupation	Population	Online data	CV
11 - Chief executives, senior officials and legislators	0.49	1.68	4.50
12 - Administrative and commercial managers	1.70	2.22	4.53
13 - Production and specialized services managers	1.27	2.02	9.56
14 - Hospitality, retail and other services managers	0.29	2.78	11.84
21 - Science and engineering professionals	4.45	4.09	4.64
22 - Health professionals	3.33	1.47	6.45
23 - Teaching professional	0.91	2.00	10.75
24 - Business and administration professionals	6.73	14.65	3.74
25 - Information and communications technology professionals	3.94	8.17	8.10
26 - Legal, social and cultural professionals	0.71	0.91	5.54
31 - Science and engineering associate professionals	1.51	1.50	6.03
32 - Health associate professionals	0.79	0.58	6.31
33 - Business and administration associate professional	4.37	19.33	3.67
34 - Legal, social cultural and related associate professionals	0.97	0.53	5.66
35 - Information and communications technicians	1.73	0.78	6.73
41 - General and Keyboard Clerks	1.41	1.82	2.90
42 - Customer Services Clerks	5.20	2.53	6.70
43 - Numerical and Material Recording Clerks	1.28	1.46	6.63
44 - Other Clerical Support Workers	2.52	0.51	4.09
51 - Personal Services Workers	2.41	3.10	16.14
52 - Sales Workers	8.79	16.49	15.35
54 - Protective Services Workers	1.28	1.16	13.77
71 - Building and Related Trades Workers (excluding Electricians)	9.53	1.71	17.09
72 - Metal, Machinery and Related Trades Workers	7.09	2.36	5.27
73 - Handicraft and Printing workers	0.72	0.25	5.76
74 - Electrical and Electronics Trades Workers	2.09	1.23	13.99
75 - Food Processing, Woodworking, Garment and Other Craft and Related Trades Workers	5.64	1.18	5.35
81 - Stationary Plant and Machine Operators	2.71	0.35	5.19
82 - Assemblers	2.72	0.21	6.27
83 - Drivers and Mobile Plant Operators	7.99	1.67	9.23
91 - Cleaners and Helpers	1.35	0.19	10.09
93 - Labourers in Mining, Construction, Manufacturing and Transport	2.20	0.49	8.52
94 - Food Preparation Assistants	1.26	0.26	19.34
96 - Refuse Workers and Other Elementary Workers	0.60	0.31	5.41

Table 15: Quality of the model measured by Area Under Curve (AUC; average over 500 bootstrap replicates)

SKILLS	ECLASSO1	ECLASSO2	ECALASSO1
Technical	0.829	0.846	0.829
Mathematical	0.784	0.818	0.784
Artistic	0.665	0.672	0.665
Computer	0.748	0.755	0.748
Cognitive	0.644	0.654	0.644
Managerial	0.722	0.731	0.722
Interpersonal	0.731	0.750	0.731
Self-organization	0.695	0.708	0.695
Physical	0.687	0.713	0.687
Availability	0.605	0.635	0.604
Office	0.671	0.681	0.670

Equations

E_{A}\left[\sum_{i \in s_{A}} g(w_{i}, d_{i}^{A}) / q_{i}\right],

\sum_{i \in s_{A}} w_{i} x_{i}^{T} = T^{X},

w^{GREG} = d^{A} + D^{A} X (X^{T} D^{A} X)^{-1} (T^{X} - (d^{A})^{T} X)^{T}.

\overline{T}_{y_{k}}^{GREG} = \sum_{i}^{n_{A}} w_{i}^{GREG} y_{ki} \Big/ \sum_{i}^{n_{A}} w_{i}^{GREG}.

w^{ECGREG} = d^{A} + D^{A} X (X^{T} D^{A} X)^{-1} (T^{X} - (d^{A})^{T} X)^{T},

\overline{T}_{y_{k}}^{ECGREG} = \sum_{i}^{n_{A}} w_{i}^{ECGREG} y_{ki} \Big/ \sum_{i}^{n_{A}} w_{i}^{ECGREG}.

E_{\xi}(y_{ki} \mid x_{ki}) = \mu(x_{ki}, \beta_{k}), \quad V_{\xi}(y_{ki} \mid x_{ki}) = v_{ki}^{2} \sigma^{2},

w_{k}^{MC} = d^{A} + D^{A} M (M^{T} D^{A} M)^{-1} (T^{M} - (d^{A})^{T} M)^{T},

\overline{T}_{y_{k}}^{MC} = \sum_{i}^{n_{A}} w_{ki}^{MC} y_{ki} \Big/ \sum_{i}^{n_{A}} w_{ki}^{MC}.

\overline{T}_{y_{k}}^{ECMC} = \sum_{i}^{n_{A}} w_{ki}^{ECMC} y_{ki} \Big/ \sum_{i}^{n_{A}} w_{ki}^{ECMC}.

\beta_{k} = \underset{\beta_{k}}{\arg\min} \sum_{i=1}^{n^{A}} \left[-y_{ki} (x_{ki}^{T} \beta_{k}) + \log(1 + \exp(x_{ki}^{T} \beta_{k}))\right] + \lambda_{n^{A} k} \sum_{j=1}^{p} \alpha_{kj}^{\gamma_{k}} |\beta_{kj}|,

y_{ikt} = \begin{cases} 1 & \text{if the } i\text{-th job offer contains the } k\text{-th skill in year } t, \\ 0 & \text{otherwise.} \end{cases}

var(\hat{\theta}_{y_{k}}) = \frac{1}{B - 1} \sum_{b=1}^{B} (\hat{\theta}_{y_{ki}}^{*} - \overline{\hat{\theta}}_{y_{k}}^{*})^{2}, \quad \overline{\hat{\theta}}_{y_{k}}^{*} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_{y_{ki}}^{*}

CV(\hat{\theta}_{y_{k}}) = var(\hat{\theta}_{y_{k}}) / \overline{\hat{\theta}}_{y_{k}}^{*} \times 100\%.
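
The GREG weight formula and the weighted-mean estimator listed above can be illustrated with a minimal sketch. It assumes NumPy is available; the function names `greg_weights` and `calibrated_share` and the toy data (six ads, two occupation groups, totals of 40 and 60) are hypothetical and are not taken from the paper.

```python
import numpy as np


def greg_weights(d, X, totals):
    """Compute w = d + D X (X^T D X)^{-1} (T^X - d^T X)^T, with D = diag(d)."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    totals = np.asarray(totals, dtype=float)
    DX = d[:, None] * X          # D X without building the diagonal matrix
    gap = totals - d @ X         # T^X - d^T X: distance from the target totals
    return d + DX @ np.linalg.solve(X.T @ DX, gap)


def calibrated_share(w, y):
    """Weighted mean: sum_i w_i y_i / sum_i w_i."""
    w = np.asarray(w, dtype=float)
    return float(w @ np.asarray(y, dtype=float) / w.sum())


# Hypothetical toy data: 6 online ads, occupation group coded as two dummies.
X = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # 1 if the ad mentions the skill
d = np.ones(6)                                  # initial weights (assumed equal)
totals = np.array([40.0, 60.0])                 # hypothetical vacancy totals

w = greg_weights(d, X, totals)
naive = float(y.mean())               # unweighted share in the online sample
calibrated = calibrated_share(w, y)   # share after calibration

print(w, naive, calibrated)
```

In this toy case the second group is under-represented among the ads, so its weights grow and the skill share moves from the unweighted value towards the share implied by the totals. The weights reproduce the totals exactly, which is the calibration constraint stated in the second equation.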

Taxonomy

Topics: Survey Methodology and Nonresponse · Consumer Market Behavior and Pricing · COVID-19 epidemiological studies

Full text

Beręsewicz Maciej (corresponding author: [email protected]), Białkowska Greta, Marcinkowski Krzysztof, Maślak Magdalena, Opiela Piotr, Pater Robert, Zadroga Katarzyna

This work was supported by the Polish Ministry of Science and Higher Education [DIALOG 0127/2016 to B.M. and P.R.].

Abstract

In the article we describe an enhancement to the Demand for Labour (DL) survey conducted by Statistics Poland, which involves the inclusion of skills obtained from online job advertisements. The main goal is to provide estimates of the demand for skills (competences), which is missing in the DL survey. To achieve this, we apply a data integration approach combining traditional calibration with the LASSO-assisted approach to correct representation error in the online data. Faced with the lack of access to unit-level data from the DL survey, we use estimated population totals and propose a bootstrap approach that accounts for the uncertainty of totals reported by Statistics Poland. We show that the calibration estimator assisted with LASSO outperforms traditional calibration in terms of standard errors and reduces representation bias in skills observed in online job ads. Our empirical results show that online data significantly overestimate interpersonal, managerial and self-organization skills while underestimating technical and physical skills. This is mainly due to the under-representation of occupations categorised as Craft and Related Trades Workers and Plant and Machine Operators and Assemblers.

Word count: 5 869.

1 Introduction

The process of matching job seekers with job offers is becoming increasingly complicated given technological development and wider access to knowledge. The structural mismatch between labour supply and demand is one of the most challenging problems that need addressing in the labour market. It requires continuous attention of labour market and educational institutions. This mismatch can be determined by various factors. Boudarbat and Chernoff (2012) have observed that it depends more on educational characteristics than demographic and socioeconomic factors. Educational mismatch has been widely analysed from the perspective of levels and fields of education, as well as occupations (Somers et al., 2019).

So far, it has been possible to analyse and address such problems as under-education, over-education, occupational and educational mismatch. However, problems of skills (competences) mismatch, to a large extent, remain unresolved. They encompass the problems of skills shortage, gap and obsolescence (McGuiness and Pouliakas, 2018). Chevalier (2011) shows that the gap in mean salaries between graduates of different fields of study is smaller than the range of salaries of graduates of the same field of study. Hershbein and Kahn (2018) argue that by looking directly at the skill requirements in job offers rather than relying on assumptions about the skills associated with a particular occupation, it is possible to document the evolution in skill requirements for this occupation over time. Even though the demand for job-related skills can, to some extent, be evaluated by looking at the occupational and educational composition of jobs, it is impossible to make inferences about the demand for transversal skills.

Workers’ skills have been measured from a macroeconomic perspective in a number of studies, including the OECD’s Survey of Adult Skills, part of the Programme for the International Assessment of Adult Competencies (McGowan and Andrews, 2015). However, continuous research on the demand for skills is scarce. Surveys conducted by national statistical institutions (NSIs) provide representative data on vacancies across occupational groups or economic sectors, but they lack detailed information, including measures of skills. Pater et al. (2019) show different classifications of skills that can be used to measure skills demand, though no international standard has been established. International Labour Organization (2014) states that “In contrast to unemployment, however, which is generally measured according to international standards, a uniform typology or measurement framework regarding skills mismatch and related issues, such as skills shortages, is lacking”. One potential source of information on skills demand is online job offers placed by employers or entities that work on their behalf.

Big data and the Internet as a data source have become an important issue in statistics, particularly in official statistics. There are a number of multinational initiatives (e.g. ESSnet on Big Data; AAPOR Task Force on Big Data; European Centre for the Development of Vocational Training, CEDEFOP) that focus on the quality and suitability of estimates based on new data sources to complement or supplement existing statistical information. For instance, Cedefop (2019a, b) provides an overview of job vacancies and trends in EU countries. The ESSnet on Big Data included a work package devoted to job vacancies (WP 1: Web-scraping – job vacancies). Its aim was to produce statistical estimates of online job vacancies using suitable techniques and specific methodologies. The intention was to explore a mix of sources including job search sites, job adverts on enterprise websites, and job vacancy data from third party sources. ESSnet on Big Data (2017, 2018) was devoted to web-scraping, text mining, classification and comparison with official statistics. The comparison was made either at the level of statistical units (companies) or based on NACE and occupation variables. The project is being continued with three main tasks devoted to 1) methodological framework, 2) statistical output and 3) implementation requirements of prototypes in the relevant statistical production processes at European and national level (ESSnet on Big Data, 2019).

However, before online data can be used for official statistics, it is crucial to explore potential sources of representation errors (Zhang, 2012; Reid et al., 2017). In this context, Daas et al. (2015), Beręsewicz (2017) and Citro (2014) discuss coverage, non-response and measurement errors. Japec et al. (2015), Pfeffermann (2015) and Beręsewicz et al. (2018) address coverage and non-response errors, which can lead to significant bias in big data sources, in particular if the error is non-ignorable.

Recently, there has been a growing interest in research on the use of non-probability samples, including big data, together with probability samples. Kim et al. (2018), Yang and Kim (2018), Yang et al. (2019) and Yang and Kim (2019) have developed a rigorous approach to data integration by means of mass imputation using the nearest neighbour approach and doubly robust estimation. Chen et al. (2018) and McConville et al. (2017) proposed using LASSO regression to conduct model-assisted estimation assuming data are missing at random, and Chen et al. (2019) extended this approach to the case where only estimated totals are known. Elliott and Valliant (2017), Valliant (2019), Beręsewicz et al. (2018) and Buelens et al. (2018) give a general overview of possible approaches to dealing with non-probability samples, including pseudo-randomization and the model-based approach, while Japec et al. (2015), Citro (2014) and Couper (2013) provide a general discussion of modern data sources for statistical purposes.

The online job market has been rapidly developing. Thus, online job offers provide interesting research possibilities. There is a growing body of economic literature on the use of online job offers, with increasing attention paid to skills (see, e.g., Kuhn and Skuterud (2004), Deming and Kahn (2018), Colombo et al. (2019), and Pater et al. (2019)). Hershbein and Kahn (2018), Marinescu and Rathelot (2018), and Colombo et al. (2019) recognize the issue of online data representativeness and compare the online data they use with data from representative surveys. However, none of these articles attempts to correct the bias in online data.

Currently, the DL survey in Poland is used to produce estimates of vacancies by occupation, economic activity sector (NACE) or company size. The goal of our study was to enhance the DL survey by including information about skills obtained from job advertisements. We reused online data collected for the purpose of The Study of Human Capital (HC) 2011-2014 in the module devoted to job ads. While different approaches to counteract selection bias are found in the literature, we applied pseudo-randomization (calibration) with modern assisting models (LASSO and Adaptive LASSO). We assumed that the selection bias was ignorable given auxiliary variables.

The article has the following structure. Section 2 is devoted to data sources about the Polish vacancy market, including official statistics and selected non-official sources. This section also describes data used in this study. In section 3 we describe methods of inference based on non-probability data including the bootstrap procedure to estimate variance when only limited population level data is reported. Section 4 presents empirical results for 11 skills obtained from online data. The article ends with conclusions.

2 Data

2.1 The demand for labour survey

Currently, the DL survey conducted by Statistics Poland (Statistics Poland, 2018) is the main source of information about job vacancies in Poland. It is designed to obtain information on the satisfied and unsatisfied demand, i.e. the employed (occupied jobs), the vacancies, the newly created jobs, and the liquidated jobs. In 2005 the format of the survey on labour demand was changed in accordance with the Eurostat requirements in order to keep the survey content and methodology uniform across all EU Member States. Since 2007, the survey has been carried out as a sample survey and covers entities of the national economy employing at least one person.

The DL survey is carried out as a probability sample survey. The survey sample of 100,000 units is selected separately for units employing more than 9 persons (50,000) and for units employing up to 9 persons (50,000). With regard to large and medium-sized units, the sample is stratified by activity (19 NACE sections) and by province (16 NUTS2 regions), resulting in 304 separate subpopulations. Inside each of the subpopulations, the units are sorted in descending order according to the number of employees. The largest units in each subpopulation that meet the threshold of the number of employees are included in the survey without sampling. Then a sample of the previously determined size is selected from the remaining part of each subpopulation. (NACE, the Statistical classification of economic activities in the European Community, is the classification of economic activities in the European Union (EU); the term is derived from the French Nomenclature statistique des activités économiques dans la Communauté européenne. Various NACE versions have been developed since 1970.)
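
The selection scheme for large and medium-sized units can be sketched as follows. This is only an illustration under stated assumptions: the take-all threshold, the sample size per stratum and the use of simple random sampling without replacement for the remaining units are hypothetical choices, since the text does not specify them, and `select_stratum` is an illustrative name.

```python
import random


def select_stratum(units, take_all_threshold, n_sampled, seed=0):
    """Split one stratum into take-all units and a random sample of the rest.

    units: list of (unit_id, number_of_employees) tuples.
    take_all_threshold and n_sampled are hypothetical parameters.
    """
    # Sort in descending order by number of employees, as described above.
    ordered = sorted(units, key=lambda u: u[1], reverse=True)
    take_all = [u for u in ordered if u[1] >= take_all_threshold]
    rest = [u for u in ordered if u[1] < take_all_threshold]
    # Assumption: simple random sampling without replacement for the rest.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_sampled, len(rest)))
    return take_all, sampled


# 19 NACE sections crossed with 16 provinces give the number of strata.
n_strata = 19 * 16

# Hypothetical stratum with six units.
stratum = [("a", 1200), ("b", 35), ("c", 800), ("d", 12), ("e", 60), ("f", 15)]
take_all, sampled = select_stratum(stratum, take_all_threshold=500, n_sampled=2)

print(n_strata, take_all, sampled)
```

Units above the threshold enter with certainty, so they carry a weight of one before any non-response adjustment, while the sampled units represent the remainder of their stratum.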

As regards small units, employing up to 9 persons, the main purpose of the survey is to obtain results by NACE section. Allocation is carried out between different NACE sections in order to obtain the same expected precision. Within sections, units are stratified by province and then the sample is selected using the stratified, proportional sampling scheme.

Following a significant change made in the Polish classification of occupations in 2011, data collected in the DL survey before 2011 are not comparable to those collected afterwards. In 2018, a question was added to the survey questionnaire asking whether the responding entity placed job offers in district employment offices (DEOs). Each NUTS4 district (Pol. powiat) in Poland has its own DEO.

The survey suffers from non-response, which amounted to 35.2% in 2011, 36.6% in 2012, 38.1% in 2013 and 38.2% in 2014. Correction for this error involves multiplying sampling weights by the inverse of response rates within particular strata and calibration to meet the known population totals.
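
The first step of this correction, dividing each respondent's sampling weight by the response rate of its stratum, can be sketched as below. The strata labels, weights and response rates are hypothetical, `adjust_for_nonresponse` is an illustrative name, and the subsequent calibration step is not shown.

```python
def adjust_for_nonresponse(respondents, response_rates):
    """Multiply each sampling weight by the inverse of its stratum response rate."""
    adjusted = []
    for unit in respondents:
        rate = response_rates[unit["stratum"]]
        adjusted.append({**unit, "weight": unit["weight"] / rate})
    return adjusted


# Hypothetical respondents and stratum-level response rates.
respondents = [
    {"id": 1, "stratum": "A", "weight": 10.0},
    {"id": 2, "stratum": "A", "weight": 10.0},
    {"id": 3, "stratum": "B", "weight": 25.0},
]
response_rates = {"A": 0.8, "B": 0.5}

adjusted = adjust_for_nonresponse(respondents, response_rates)

# If the 2011 non-response of 35.2% were uniform across strata (an assumption),
# every weight would be inflated by the same factor.
overall_factor_2011 = 1 / (1 - 0.352)

print([u["weight"] for u in adjusted], round(overall_factor_2011, 3))
```

Under the uniform-rate assumption the factor for 2011 is about 1.54, so each responding unit would stand in for roughly one and a half sampled units.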

The survey defines the following terms for the measurement of labour demand:

• Vacancies are positions or jobs unoccupied owing to labour turnover or newly created that simultaneously meet the following three conditions: (1) were actually unoccupied on the survey day, (2) the employer had made efforts to find people willing to take up the job, (3) if adequate candidates were found to occupy the vacancies, the employer would readily take them in.

• Newly created jobs are jobs created in the course of organizational changes, expansion or change of business activity, as well as all jobs available in newly established companies.

The DL survey provides quarterly estimates of the number of job vacancies by 1) occupation (9 major groups and sub-groups of more detailed occupations denoted by 2-digit codes; with the exclusion of the 10th major group - occupations in the armed forces), 2) NACE, 3) company size (1-9, 10-49 and 50+ employees), 4) ownership type (public, private) and 5) province (16 units). In addition to marginal distributions, published estimates of joint distributions of job offers are limited to two-way interactions. Information about precision, measured by relative standard error, is published on a yearly basis only for marginal distributions of auxiliary variables and varies from 2% to 20%.

In the study we examined occupation (2-digit codes), NACE and province as potential auxiliary variables to reduce the selection bias in skills described in job offers. We used only estimated totals reported by Statistics Poland as we did not have access to micro-data from the survey. We decided to disregard Skilled agricultural, forestry and fishery workers as this group accounts for less than 1% of all job vacancies. Table 1 contains information about estimated vacancies for the first quarters of 2011, 2013 and 2014 based on the DL survey.

2.2 Online job advertisements

2.2.1 The Study of Human Capital in Poland

The Study of Human Capital in Poland (HC), a cross-sectional survey to monitor the labour market, was carried out between 2010 and 2015. The survey was resumed in 2017 but with a narrower scope. The survey was conducted by the Polish Agency for Enterprise Development (PAED) and the Centre for Evaluation and Analysis of Public Policies at the Jagiellonian University (CEAPP). The survey consisted of four modules: (1) survey of employers, (2) survey of job offers, (3) working-age population survey and (4) survey of representatives of training institutions. The aim was to keep track of the situation in the Polish labour market, monitor the supply and demand for skills as well as the system of education and professional training in Poland in the period 2010-2015. The data from the survey are freely available from the HC survey website (https://bkl.parp.gov.pl), the methodology of the study is described in Czarnik (2011) and the data collection procedure is described in the report of Polish Agency for Enterprise Development (2011). The focus of the survey is limited to the working-age population.

The goal of the HC survey of job offers was to provide characteristics of skills and occupations included in job offers, not to produce estimates of these characteristics for all job offers in Poland.

The statistical unit defined in the job ads module was a unique job offer for a single position, published on a given day, excluding internships for students and pupils and jobs in foreign countries. The survey did not distinguish between seasonal, part-time or full-time job offers. This definition differs from the one used in official statistics because it is related to the job description rather than the vacancy. However, we assumed that information included in the job ad can be taken to reflect the job vacancy.

A mixed mode of data collection was used. Job offers were obtained from a random sample of 160 public employment offices (DEOs; stratified by 16 provinces) in 2010 and the job search engine www.Careerjet.pl. Job offers had to meet specific requirements for the day of the survey, which was the 4th Monday of March of every year (except for 2010, when the data collection was conducted in September). Therefore, the survey was designed to be comparable between successive years.

The survey was carried out in three stages. In the case of DEOs, data were collected from the Central Job Offers Database (CBOP), an online service maintained by the Ministry of Family, Labour and Social Policy. According to the reports, the coverage of selected DEOs was insufficient, which is why DEOs’ staff was contacted to collect all current job ads.

Data from the Careerjet.pl website were collected in a semi-automated manner. Interviewers took a screenshot of a displayed job offer or saved the page as an html file, and then entered the data using the copy-and-paste method according to a preset format. Each offer was a separate text file with a corresponding identifier. Then specially prepared software transformed the dataset for the coding process.

The job offers were coded according to a categorization key containing a list of skills, occupations and other features. Each offer was coded independently by two coders. Table 11 in the Appendix contains information about the coding precision for occupations and NACE sections indicated by the number of digits in a given code; the higher the number of digits, the more detailed the occupation specification is. Each year, the coding reliability index was calculated based on a sample of 100 job offers, which represents the total number of codes used and coding consistency of coders. These ratios are presented in Table 2.

Verification of offers consisted in removing duplicate ads and those that did not meet the adopted selection criteria. The database did not include low-quality offers (those for which it was impossible to determine the place of work and the recruitment area, as well as offers with insufficient information). The uniqueness of offers from the second survey edition in 2011 was verified at the level of the database, not at the stage of obtaining job offers. Duplicates were identified by comparing (1) publication date, (2) source, (3) city, (4) province, (5) job offer reference number (not the ad’s ID), (6) company name and (7) occupation. Without access to the raw data, we assumed that publicly available databases contained unique job offers.
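The duplicate check described above can be illustrated with a minimal sketch. It is not the survey administrator's procedure: the field names are hypothetical, and the normalisation of case and whitespace is an added assumption.

```python
# Fields compared when looking for duplicates, following the list in the text.
# The dictionary keys are hypothetical column names, not the survey's own.
KEY_FIELDS = ("publication_date", "source", "city", "province",
              "reference_number", "company_name", "occupation")

def deduplicate(offers):
    """Keep the first offer for every distinct combination of KEY_FIELDS."""
    seen = set()
    unique = []
    for offer in offers:
        # Assumption: trivial case/whitespace differences should not hide a duplicate.
        key = tuple(str(offer.get(f, "")).strip().lower() for f in KEY_FIELDS)
        if key not in seen:
            seen.add(key)
            unique.append(offer)
    return unique
```

Two offers that agree on all seven fields are treated as one; an offer that differs in any field (for example, occupation) is kept.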

Job offers from DEOs selected for the sample were assumed to be valid for the day of the data collection. In the case of Careerjet.pl, first the job offers registered on the day of data collection were downloaded and coded. The target sample size for each year was set at 20000, which included all job offers collected from DEOs plus as many ads from Careerjet.pl as necessary to reach the target. Table 2 presents the initial sample size (before deduplication) and final sample size for each survey year, including the collection date.

The coding precision was lowest in 2010, which is not surprising as this was the first year of the study. The index increases in the subsequent years. Ads from Careerjet.pl accounted for about 60% of all job offers collected in the survey.

Results reported in the article are based on the survey of job offers conducted in 2011, 2013 and 2014 (three waves). For 2012, the publicly available dataset contained only 1-digit occupations (9 groups, see Table 12) and it was not possible to obtain the full dataset from the survey administrator. Therefore, we decided to take the following steps regarding the final dataset:

• avoid the underrepresentation of skills in job offers from DEOs by focusing only on the data from the Internet (Careerjet.pl),

• disregard occupations with a single-digit code (143 records),

• disregard the 6th occupation category (i.e. skilled agricultural, forestry and fishery workers) because of the small number of job vacancies reported in the DL survey,

• disregard the following NACE sections: A (Agriculture, Forestry And Fishing), B (Mining And Quarrying), D (Electricity, Gas, Steam And Air Conditioning Supply), E (Water Supply; Sewerage, Waste Management And Remediation Activities), L (Real Estate Activities) for lack of population totals with estimated standard errors.

As a result, the final dataset for the waves in 2011, 2013 and 2014 consisted of a total of 38 100 observations. There were 34 two-digit occupation codes, 16 provinces and 16 NACE sections (we collapsed underrepresented NACE sections and occupation codes for job ads and did the same for the DL survey data; see the supplementary materials for the whole data processing report).

2.2.2 Skills measured in the study

The HC survey proposes a classification of skills for the analysis of the vacancy market. It was prepared after reviewing various skills classifications used by different international institutions, including institutions dealing with statistical data (e.g. the Australian Bureau of Statistics), those that develop skills standards (e.g. National Classification of Professional Standards), and enterprises responsible for the development of professional skills (e.g. O*NET, the Occupational Information Network). For more details see Czarnik (2011, chap. 2) attached in the Online Supplementary Materials. The survey distinguished the following skills:

1. Artistic – artistic and creative skills,
2. Availability – availability to work for the employer,
3. Cognitive – finding and analyzing information, drawing conclusions,
4. Computer – working with computers and using the Internet,
5. Interpersonal – contacts with others,
6. Managerial – managerial skills and organization of work,
7. Mathematical – performing calculations,
8. Office – organization of and conducting office tasks,
9. Physical – physical fitness,
10. Self-organization – self-organisation, initiative, punctuality,
11. Technical – handling, assembling and repairing equipment.

A detailed description of the skills categories is presented in Table 10. During the coding process, 1 was used if a given skill was included in the job description and 0 otherwise. There were almost no missing data in variables denoting skills, as the lack of a given skill was indicated by 0.

Table 3 presents the share of given competences included in job offers according to the data source – Careerjet.pl (the Internet) and DEOs. For example, self-organisation was included in 59.1% of all job offers on the Internet but only in 7.6% of offers in DEOs. The Spearman correlation coefficient between the shares of competences measured in online offers and in DEOs was equal to 0.74.

Data from the Internet were much richer in terms of published content in job offers, compared to DEO data. Employers placing job offers online tended to prepare much more detailed descriptions and therefore managed to better specify their requirements. As regards DEOs, the content (and form) of ads was limited by the input format, which allows the employer to enter the sought-after occupation and any preferences regarding education or knowledge of a foreign language.

The design of the HC survey did not include imputation of missing data; for instance 254 records had missing values in the occupation and 268 in the province. The highest number of missing values was recorded for NACE (over 22,000), which is mainly due to the lack of information about the company in the ads. The share of missing data varied between the survey waves as presented in Table 13. Therefore, we imputed missing data in occupation, NACE and province based on one nearest neighbour with Gower distance and weights assigned to columns that are based on variable importance from random forest. This approach is implemented in the VIM package (Kowarik and Templ, 2016) and was applied to the original dataset.
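The imputation rule (one nearest neighbour under a weighted Gower distance) can be sketched as follows for categorical variables. This is an illustration only, not the VIM implementation: the variable weights below are hypothetical stand-ins for the random-forest importance scores, and the records are invented.

```python
def gower_distance(a, b, variables, weights):
    """Weighted Gower distance for categorical variables.

    Each variable contributes 0 (equal) or 1 (different); pairs in which
    either value is missing (None) are skipped.
    """
    num = den = 0.0
    for v in variables:
        if a[v] is None or b[v] is None:
            continue
        num += weights[v] * (a[v] != b[v])
        den += weights[v]
    return num / den if den > 0 else float("inf")

def impute_1nn(records, target, predictors, weights):
    """Fill missing `target` values with the value of the nearest donor (k = 1)."""
    donors = [r for r in records if r[target] is not None]
    completed = []
    for r in records:
        r = dict(r)  # copy, so the input records are left untouched
        if r[target] is None:
            nearest = min(donors, key=lambda d: gower_distance(r, d, predictors, weights))
            r[target] = nearest[target]
        completed.append(r)
    return completed

# Invented example: the third offer has no NACE section.
records = [
    {"occupation": "25", "province": "mazowieckie", "nace": "J"},
    {"occupation": "52", "province": "mazowieckie", "nace": "G"},
    {"occupation": "25", "province": "wielkopolskie", "nace": None},
]
weights = {"occupation": 0.7, "province": 0.3}  # hypothetical importance weights
completed = impute_1nn(records, "nace", ["occupation", "province"], weights)
```

In this example the third offer is at distance 0.3 from the first record and 1.0 from the second, so it receives the NACE section of the first.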

2.2.3 Correlation with auxiliary variables

Given limited access to totals estimated in the DL survey, the correlation of auxiliary variables was assessed only for occupation, NACE section and province. Cramer’s V correlation coefficients are presented in Table 4. The most correlated variable is occupation and the least correlated is province. This is reasonable because skills are specified for occupation rather than for the place of work or the company’s type of activity.

The highest correlations are observed for interpersonal, computer and managerial skills, which suggests that the use of auxiliary variables could reduce selection bias in the case of these skills. The weakest relationship is observed for office, physical and mathematical skills, which means that any correction for selection bias based on these variables is not likely to be effective.
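For reference, Cramér's V between a skill indicator and a categorical auxiliary variable can be computed as in the sketch below. It uses the plain chi-square statistic without bias correction; whether the values in Table 4 were computed with a correction is not stated in the text, so this is an assumption.

```python
from collections import Counter
from math import sqrt

def cramers_v(x, y):
    """Cramér's V between two categorical sequences of equal length."""
    n = len(x)
    joint = Counter(zip(x, y))
    row = Counter(x)
    col = Counter(y)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0
```

The coefficient equals 1 when the skill is fully determined by the auxiliary variable and 0 when the two are independent in the sample.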

3 Methods

3.1 Data integration approach

Enhancing a probability survey with online data (i.e. a non-probability sample) may be achieved by data integration. Table 5 presents the case when population, sample survey and online data are considered. The first three columns denote variables available in unit-level data: $\boldsymbol{X}$ denotes auxiliary variables such as occupation or NACE section, $\boldsymbol{Y}$ denotes the target variable(s) and $\boldsymbol{d}$ denotes the weights used for inference based on the sample survey. The last two columns contain either known totals $\boldsymbol{T}^{\boldsymbol{X}}$ or estimated totals $\widehat{\boldsymbol{T}}^{\boldsymbol{X}}$ for the auxiliary variables $\boldsymbol{X}$ . Note that we assume that $\boldsymbol{X}$ is available in all sources, while $\boldsymbol{Y}$ is available only in the online data. For simplicity, we assume that the weights available in the sample survey are already corrected for coverage and non-response errors. That is often the case when National Statistical Institutions provide unit-level data with only one set of weights. Note that the setting presented in Table 5 also covers the case when totals for some domains created by $\boldsymbol{X}$ are available (either known from the population data or estimated from the sample survey).

The goal of data integration is to estimate some quantity (e.g. mean, total) of the target variables $\boldsymbol{Y}$ present only in the online data. Elliott and Valliant (2017) summarised possible approaches, which rely on either pseudo-randomization (i.e. calibration) or a model-based approach. In addition, Kim and Wang (2018) consider mass imputation and doubly robust estimation that takes into account propensity score weighting.

In the paper we consider the pseudo-randomization approach, in which pseudo-weights from the non-probability sample are calibrated to estimated totals $\widehat{\boldsymbol{T}}^{X}$ or to the estimated total of $\boldsymbol{Y}$ , based on the approach introduced by Wu and Sitter (2001) and further developed for non-probability samples by Chen (2016). A detailed description is presented in the sections below.

3.2 Traditional calibration

Calibration was proposed by Deville and Särndal (1992) and is a method of searching for so-called calibrated weights by minimizing a distance measure between the sampling weights and the new weights, which satisfy certain calibration constraints. As a consequence, when the new weights are applied to the auxiliary variables in the sample, they reproduce the known population totals of the auxiliary variables exactly. It is also important that the new weights should be as close as possible to the sampling weights in the sense of the selected distance measure (Särndal and Lundström, 2005).

Following the notation in Chen et al. (2019), let us define the online (non-probability) sample as $s_{A,t}$ of size $n_{A,t}$ where $t=1,...,T$ denotes the wave. For simplicity, we drop subscript $t$ . This sample contains variables of interest $Y_{k}$ , where $k=1,...,K$ . Further, let $\boldsymbol{d}^{A}_{n_{A}\times 1}$ be a vector of pseudo-weights that are typically set to $N/n_{A}$ for all units $i\in s_{A,t}$ , where $N$ is the size of the target population. In this approach we assume simple random sampling design for sample $s_{A}$ .

Let $\boldsymbol{D}^{A}$ be a diagonal matrix of pseudo-design weights and $\boldsymbol{w}_{n_{A}\times 1}$ be calibrated weights that minimize an expected distance measure with respect to the design of $A$

$$E_{A}\left[\sum_{i\in s_{A}}g(w_{i},d_{i}^{A})\right]\qquad(1)$$

under the constraint:

$$\sum_{i\in s_{A}}w_{i}\boldsymbol{x}_{i}=\boldsymbol{T}^{\boldsymbol{X}},\qquad(2)$$

where $\boldsymbol{T}^{\boldsymbol{X}}$ is a row vector of estimated population totals (e.g. from the reference, external probability sample) of sample calibration variables $\boldsymbol{X}$ and $g(w_{i},d_{i}^{A})$ is a differentiable function with respect to $w_{i}$ , strictly convex on an interval containing $d_{i}^{A}$ and $g(d_{i}^{A},d_{i}^{A})=0$ . The commonly used generalized regression (GREG) estimator uses the $\chi^{2}$ distance $g(w_{i},d_{i}^{A})=(w_{i}-d_{i}^{A})^{2}/d_{i}^{A}$ . For this distance measure:

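$$\boldsymbol{w}^{\text{GREG}}=\boldsymbol{d}^{A}+\boldsymbol{D}^{A}\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{D}^{A}\boldsymbol{X}\right)^{-1}\left(\boldsymbol{T}^{\boldsymbol{X}}-(\boldsymbol{d}^{A})^{T}\boldsymbol{X}\right)^{T}.\qquad(3)$$

The estimate of the population mean of outcome $\boldsymbol{y}_{k}$ , assuming that we have $K$ target variables, is based on the calibrated weights:

$$\widehat{\overline{T}}^{\text{GREG}}_{y_{k}}=\frac{\sum_{i\in s_{A}}w_{i}^{\text{GREG}}y_{ik}}{\sum_{i\in s_{A}}w_{i}^{\text{GREG}}}.$$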

The calibrated weights defined do not rely on any outcome variable. Thus the same set of weights can be applied to all variables in the survey.
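A minimal sketch of calibration under the $\chi^{2}$ distance is given below. It uses the standard closed form $w_i = d_i(1+\boldsymbol{x}_i^{T}\boldsymbol{\lambda})$ with $\boldsymbol{\lambda}=(\boldsymbol{X}^{T}\boldsymbol{D}^{A}\boldsymbol{X})^{-1}(\boldsymbol{T}^{\boldsymbol{X}}-\boldsymbol{X}^{T}\boldsymbol{d}^{A})$; the data, the totals and the function names are invented for illustration, and $\boldsymbol{X}$ is assumed to have full column rank.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def greg_weights(X, d, totals):
    """Chi-square-distance calibration: w_i = d_i * (1 + x_i' lambda)."""
    n, p = len(X), len(totals)
    xtdx = [[sum(d[i] * X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
            for a in range(p)]
    gap = [totals[a] - sum(d[i] * X[i][a] for i in range(n)) for a in range(p)]
    lam = solve(xtdx, gap)
    return [d[i] * (1 + sum(X[i][a] * lam[a] for a in range(p))) for i in range(n)]

# Invented example: five ads, two occupation groups coded as indicator columns,
# pseudo-weights N / n_A = 100 / 5 and (estimated) totals of 40 and 60 vacancies.
X = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
d = [20.0] * 5
w = greg_weights(X, d, totals=[40.0, 60.0])
y = [1, 1, 0, 0, 1]                       # skill indicator
estimate = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
```

With a single categorical auxiliary variable the result coincides with post-stratification: the over-represented group is weighted down and the under-represented group is weighted up, and the same weights can be reused for every skill.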

In the case when only estimates of totals $\widehat{\boldsymbol{T}}^{X}$ are known, Dever and Valliant (2010) introduced estimated control calibration. In this framework, we replace ${\boldsymbol{T}}^{X}$ in (3) with $\widehat{\boldsymbol{T}}^{X}$ , which results in

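$$\boldsymbol{w}^{\text{ECGREG}}=\boldsymbol{d}^{A}+\boldsymbol{D}^{A}\boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{D}^{A}\boldsymbol{X}\right)^{-1}\left(\widehat{\boldsymbol{T}}^{X}-(\boldsymbol{d}^{A})^{T}\boldsymbol{X}\right)^{T}$$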

and thus the estimated mean is given by

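$$\widehat{\overline{T}}^{\text{ECGREG}}_{y_{k}}=\frac{\sum_{i\in s_{A}}w_{i}^{\text{ECGREG}}y_{ik}}{\sum_{i\in s_{A}}w_{i}^{\text{ECGREG}}}.$$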

Following Chen et al. (2019) we denote this estimator as ECGREG (Estimated control GREG) to distinguish it from GREG with known population totals.

3.3 Model-assisted calibration

Following results obtained by Chen (2016) and Chen et al. (2018), we consider a model-assisted calibration approach using a plausible model. Model-assisted calibration was proposed by Wu and Sitter (2001) and further extended by the above-mentioned authors. The basic idea of model-assisted calibration is as follows. We build $K$ separate models, one for each target variable $\boldsymbol{y}_{k}$ , using the same set of covariates denoted by $\boldsymbol{x}_{k}$ :

$$E_{\xi}(y_{ki}\mid\boldsymbol{x}_{ki})=\mu(\boldsymbol{x}_{ki},\boldsymbol{\beta}_{k}),\qquad V_{\xi}(y_{ki}\mid\boldsymbol{x}_{ki})=v_{ki}^{2}\sigma^{2},$$

where $\boldsymbol{\beta}_{k}=(\beta_{k1},...,\beta_{kp})^{T}$ and $\sigma$ are unknown superpopulation parameters. $\mu(\boldsymbol{x}_{ki},\boldsymbol{\beta}_{k})$ is a known function of $\boldsymbol{x}_{ki}$ and $\boldsymbol{\beta}_{k}$ , and $v_{ki}$ is a known function of $\boldsymbol{x}_{ki}$ or $\mu(\boldsymbol{x}_{ki},\boldsymbol{\beta}_{k})$ . $E_{\xi}$ and $V_{\xi}$ are expectation and variance with respect to the model $\xi$ .

Let $\boldsymbol{B}_{k}$ be the finite population (or census) estimate of $\boldsymbol{\beta}_{k}$ and $\hat{\mu}_{ik}=\mu(\boldsymbol{x}_{ki},\hat{\boldsymbol{B}}_{k})$ , where $\hat{\boldsymbol{B}}_{k}$ is the sample estimate of $\boldsymbol{B}_{k}$ . Then, the model-assisted calibrated weights $\boldsymbol{w}$ minimize a distance measure $E_{A}\left[\sum_{i\in s_{A}}g(w_{i},d_{i}^{A})/q_{i}\right]$ under constraints $\sum_{i=1}^{n}w_{i}=N$ and $\sum_{i=1}^{n}w_{i}\hat{\mu}_{ik}=\sum_{i=1}^{N}\hat{\mu}_{ik}$ . Under $\chi^{2}$ distance measure with $q_{i}=1$ , the model-assisted calibrated weights are:

$$\boldsymbol{w}^{MC}=\boldsymbol{d}^{A}+\boldsymbol{D}^{A}\boldsymbol{M}\left(\boldsymbol{M}^{T}\boldsymbol{D}^{A}\boldsymbol{M}\right)^{-1}\left(\boldsymbol{T}^{M}-(\boldsymbol{d}^{A})^{T}\boldsymbol{M}\right)^{T},$$

where $\boldsymbol{D}^{A}=diag(\boldsymbol{d}^{A})$ , ${\boldsymbol{T}}^{M}=(N,\sum_{i}^{N}\hat{\mu}_{i})$ and $\boldsymbol{M}=(\boldsymbol{1}^{A},(\hat{\mu}^{A})_{i\in s_{A}})$ . Note that in this approach we obtain $K$ sets of weights, one for each $\boldsymbol{y}_{k}$ variable. In this setting the population mean is given by

$$\widehat{\overline{T}}^{\text{MC}}_{y_{k}}=\frac{\sum_{i\in s_{A}}w_{i}^{MC}y_{ik}}{\sum_{i\in s_{A}}w_{i}^{MC}}.$$

If the totals are estimated from the reference, independent probability sample of size $n_{B}$ , then constraints are $\sum_{i=1}^{n^{A}}w_{i}=\sum_{i=1}^{n_{B}}d_{i}^{B}$ and $\sum_{i=1}^{n^{A}}w_{i}\hat{\mu}_{ik}=\sum_{i=1}^{n_{B}}d_{i}^{B}\hat{\mu}_{ik}$ , where $\boldsymbol{d}^{B}$ are weights from probability sample $B$ . Similarly, as in the case of GREG, we replace $\boldsymbol{w}^{MC}$ with the $\boldsymbol{w}^{ECMC}$ obtained from the estimated totals and get

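$$\widehat{\overline{T}}^{\text{ECMC}}_{y_{k}}=\frac{\sum_{i\in s_{A}}w_{i}^{ECMC}y_{ik}}{\sum_{i\in s_{A}}w_{i}^{ECMC}}.$$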

Further, we assume that $\mu(\cdot)$ is specified as either a generalized linear model (i.e. logistic regression), LASSO or adaptive LASSO regression, as described in the following section.
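The two calibration constraints can be illustrated with a small sketch. It assumes that the fitted values $\hat{\mu}$ depend only on occupation, so that the control total $\sum\hat{\mu}$ can be computed from (estimated) occupation totals as $\sum_{c}\widehat{T}_{c}\hat{\mu}_{c}$; the fitted probabilities, the totals and the function name are invented for illustration.

```python
def model_calibrated_weights(mu_sample, d, N, mu_total):
    """Chi-square-distance calibration to the two controls (N, sum of mu_hat).

    Solves the 2 x 2 system (M'DM) lambda = T^M - M'd for M = (1, mu_hat)
    in closed form and returns w_i = d_i * (1 + lambda_0 + lambda_1 * mu_hat_i).
    """
    s0 = sum(d)
    s1 = sum(di * m for di, m in zip(d, mu_sample))
    s2 = sum(di * m * m for di, m in zip(d, mu_sample))
    g0, g1 = N - s0, mu_total - s1
    det = s0 * s2 - s1 * s1          # zero only if all mu_hat are equal
    l0 = (s2 * g0 - s1 * g1) / det
    l1 = (s0 * g1 - s1 * g0) / det
    return [di * (1 + l0 + l1 * m) for di, m in zip(d, mu_sample)]

# Invented example: occupation "A" is over-represented online (3 of 4 ads),
# while the estimated totals are 40 vacancies in "A" and 60 in "B".
mu_by_occupation = {"A": 0.8, "B": 0.2}      # hypothetical fitted P(skill | occupation)
occupation = ["A", "A", "A", "B"]
totals_hat = {"A": 40.0, "B": 60.0}

mu_sample = [mu_by_occupation[o] for o in occupation]
N_hat = sum(totals_hat.values())
mu_total = sum(totals_hat[o] * mu_by_occupation[o] for o in totals_hat)
w = model_calibrated_weights(mu_sample, [25.0] * 4, N_hat, mu_total)
```

Because the weights depend on $\hat{\mu}$, a separate set of weights is obtained for every skill, unlike in traditional calibration.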

3.4 Model-assisted calibration using adaptive LASSO

Least Absolute Shrinkage and Selection Operator (LASSO) is a regularized regression method that performs both variable selection and parameter estimation (Tibshirani, 1996); it gained popularity because it prevents model over-fitting by selecting more accurate and parsimonious models. An adaptive LASSO was proposed by Zou (2006), which in the case of logistic regression for the $k$-th target variable is given as

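$$\widehat{\boldsymbol{\beta}}_{k}=\arg\min_{\boldsymbol{\beta}_{k}}\sum_{i\in s_{A}}\left[-y_{ki}\,\boldsymbol{x}_{ki}^{T}\boldsymbol{\beta}_{k}+\ln\left(1+\exp(\boldsymbol{x}_{ki}^{T}\boldsymbol{\beta}_{k})\right)\right]+\lambda_{{n^{A}}k}\sum_{j=1}^{p}\alpha_{kj}^{\gamma_{k}}|\beta_{kj}|,$$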

where $\alpha_{kj}^{\gamma_{k}}$ is an adjustable weight and $\lambda_{{n^{A}}k}$ is a penalty used to optimize a model fit measure, while other parameters remain as defined previously. Given $\lambda_{{n^{A}}k}$ and $\gamma_{k}$ , one can estimate $\widehat{\boldsymbol{\beta}}_{k}$ through iterative procedures. A common choice for $\alpha_{kj}$ is $1/|\widehat{\beta}_{kj}^{\text{MLE}}|$ , where $\widehat{\beta}_{kj}^{\text{MLE}}$ is the maximum likelihood estimate of $\beta_{kj}$ , or $1/|\widehat{\beta}_{kj}^{\text{RIDGE}}|$ , where $\widehat{\beta}_{kj}^{\text{RIDGE}}$ is obtained from ridge regression. If $\alpha_{kj}^{\gamma_{k}}=1$ , we get the standard LASSO model. The power of the weight parameter, $\gamma_{k}$ , is a constant greater than 0 that interacts with $\alpha_{kj}$ to control whether LASSO selects or excludes parameters. LASSO can be estimated using the glmnet package (Simon et al., 2011).

Then, to obtain the population mean we need to replace $\boldsymbol{w}^{MC}_{k}$ with corresponding $\boldsymbol{w}^{\text{ECLASSO}}_{k}$ from the standard LASSO model or $\boldsymbol{w}^{\text{ECALASSO}}_{k}$ obtained under the adaptive LASSO model. To obtain $\widehat{\boldsymbol{\beta}}_{k}$ we followed the approach proposed by Chen et al. (2018) and used the cross-validation procedure. For more details refer to Chen et al. (2019).
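To make the role of the weights $\alpha_{kj}^{\gamma_{k}}$ concrete, the sketch below fits a penalised logistic regression by proximal gradient descent for a fixed penalty. It is an illustration only: it is neither the coordinate-descent algorithm of glmnet nor the code of Chen et al. (2019), the penalty is taken as given rather than chosen by cross-validation, the intercept is left unpenalised by assumption, and the data are invented.

```python
from math import exp

def sigmoid(eta):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + exp(-eta)) if eta >= 0 else exp(eta) / (1.0 + exp(eta))

def soft_threshold(z, t):
    # Proximal operator of t * |.|: shrinks z towards zero, small values become 0.
    return z - t if z > t else z + t if z < -t else 0.0

def adaptive_lasso_logit(X, y, lam, alpha, gamma=1.0, iters=2000):
    """Minimise sum_i[-y_i*eta_i + log(1 + exp(eta_i))] + lam * sum_j alpha_j**gamma * |beta_j|.

    alpha_j = 1 for all j gives the standard LASSO.
    """
    n, p = len(X), len(X[0])
    penalty = [lam * a ** gamma for a in alpha]
    # Step 1/L, where L = (n + sum of squared entries) / 4 bounds the Lipschitz
    # constant of the gradient of the logistic loss (intercept column included).
    step = 4.0 / (n + sum(v * v for row in X for v in row))
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        resid = [sigmoid(b0 + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) for j in range(p)]
        b0 -= step * sum(resid)
        beta = [soft_threshold(beta[j] - step * grad[j], step * penalty[j])
                for j in range(p)]
    return b0, beta

# Invented data: the first column is related to y, the second is not.
X = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0, 0, 1]
b0, beta = adaptive_lasso_logit(X, y, lam=0.4, alpha=[1.0, 10.0])
```

A large weight on the second coefficient keeps it at exactly zero, while the first coefficient is shrunk but retained; the fitted probabilities would then play the role of $\hat{\mu}_{ik}$ in the calibration constraints.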

3.5 Estimators used in the paper

The outcome variable of interest is whether the description of a job offer ( $i=1,...,n_{A,t}$ ) contained a given skill. Let us define the binary indicator for the outcome variable $\boldsymbol{y}_{kt}$ for the $k$-th skill, $k=1,...,11$ , and for each wave $t\in\{2011,2013,2014\}$ :

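$$y_{ikt}=\begin{cases}1&\text{if the $i$-th job offer in wave $t$ contained the $k$-th skill,}\\0&\text{otherwise.}\end{cases}$$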

For each variable $k$ we calculate the following estimators:

• $\widehat{\overline{T}}^{\text{HTSRS}}_{y_{kt}}=\sum_{i\in s_{A,t}}(N_{A,t}/n_{A,t})y_{ikt}$ , which is the Horvitz-Thompson estimator using pseudo-weights,

• $\widehat{\overline{T}}^{\text{ECGREG}}_{y_{kt}}=\sum_{i\in s_{A,t}}w_{it}^{\text{ECGREG}}y_{ikt}/\sum_{i\in s_{A,t}}w_{it}^{\text{ECGREG}}$ , where we use estimated totals for occupation (2-digit code; 34 levels). See Table 14.

• $\widehat{\overline{T}}^{\text{ECMC}}_{y_{kt}}=\sum_{i\in s_{A,t}}w_{it}^{\text{ECMC}}y_{ikt}/\sum_{i\in s_{A,t}}w_{it}^{\text{ECMC}}$ , where we use a logistic regression model for each $y_{k}$ separately based on pooled data from all periods and one auxiliary variable denoting occupation (2-digit code; 34 levels).

• $\widehat{\overline{T}}^{\text{ECLASSO1}}_{y_{kt}}=\sum_{i\in s_{A,t}}w_{ikt}^{\text{ECLASSO1}}y_{ikt}/\sum_{i\in s_{A,t}}w_{ikt}^{\text{ECLASSO1}}$ , where we use LASSO regression for each $y_{k}$ separately based on pooled data from all periods and one auxiliary variable denoting occupation (2-digit code; 34 levels).

• $\widehat{\overline{T}}^{\text{ECLASSO2}}_{y_{kt}}=\sum_{i\in s_{A,t}}w_{ikt}^{\text{ECLASSO2}}y_{ikt}/\sum_{i\in s_{A,t}}w_{ikt}^{\text{ECLASSO2}}$ , where we use LASSO regression for each $y_{k}$ separately based on pooled data from all periods and two auxiliary variables denoting occupation (2-digit code; 34 levels) and NACE (14 levels).

• $\widehat{\overline{T}}^{\text{ECALASSO1}}_{y_{kt}}=\sum_{i\in s_{A,t}}w_{ikt}^{\text{ECALASSO1}}y_{ikt}/\sum_{i\in s_{A,t}}w_{ikt}^{\text{ECALASSO1}}$ , where we use adaptive LASSO regression with the same settings as ECLASSO1.

3.6 Variance estimation

Chen et al. (2018, 2019) proposed analytical formulas for the asymptotic design variance, which consists of two parts: 1) variance with respect to the non-probability sample $A$ , and 2) variance with respect to the probability sample $B$ . However, this approach requires access to unit-level data from the $s_{B}$ sample, which is not always available. For example, the data may not be released owing to the risk of disclosure, or the cost of purchasing them may be very high.

Moreover, the estimated totals and their uncertainties can only be published in a limited form. For instance, the DL survey reports standard errors for the estimated totals of vacancies by size, type of company and NACE section separately. In addition, estimated errors are only published for the last quarter of each year in the annual report. There are no estimates of uncertainty measures for vacancies by occupation; fortunately, there are cross-classification estimates of vacancies for occupation by NACE. Table 6 presents estimated relative standard errors reported by Statistics Poland for the DL survey for 2011, 2013 and 2014. The precision varies between domains defined by NACE section but, in almost all cases, is lower than 20%. The highest standard errors are for Accommodation and Catering, and Administrative and Support Service Activities, while the lowest – for Manufacturing and Public Administration and Defence. Also, the estimates and relative standard errors for vacancies within NACE sections are stable over time.

In view of the limitations of the reporting procedure in the DL survey, we made the following assumption: standard errors are similar in a given year and we can approximate standard errors from the 1st quarter based on information from the 4th quarter. Without access to unit-level data, we could not verify the validity of this assumption but as the estimates of vacancies by NACE (and also by occupation) are stable over time this assumption is likely to be valid.

We used the bootstrap method to account for uncertainty in estimating the model based on $s_{A,t}$ and estimated totals from $s_{B,t}$ , which is described in Algorithm 1 below (for simplicity we drop subscript $t$ and also assume that the same totals are used for all $K$ variables).

In Algorithm 1, $\text{SD}()$ denotes standard errors derived from Table 6, $N()$ is the normal distribution, and $\widehat{\boldsymbol{T}}^{\text{NACE, OCCUP}}$ denotes estimated totals for the cross-classification of NACE and occupation (2-digit codes). Note that the term $\widehat{\boldsymbol{T}}^{\text{NACE}*}\times\widehat{\boldsymbol{T}}^{\text{NACE,OCCUP}}/\widehat{\boldsymbol{T}}^{\text{NACE}}$ assumes that we can split $\widehat{\boldsymbol{T}}^{\text{NACE}*}$ according to the estimated share of vacancies by occupation in a given NACE section. Table 7 presents information about relative standard errors of the estimated $\widehat{\boldsymbol{T}}^{\text{OCCUP}*}$ in the bootstrap procedure. Detailed results are presented in Table 14 in the supplementary materials. Uncertainty varies from 2% to 20%, which is in line with errors for NACE or other variables reported in the DL survey.
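The perturbation of totals described above can be sketched as follows. The sketch is a reading of the text, not a transcription of Algorithm 1: resampling $s_{A}$ by simple random sampling with replacement is an assumption, `estimator` is a hypothetical callback standing for any of the estimators in Section 3.5, and all numbers are invented.

```python
import random

def draw_occupation_totals(nace_totals, nace_sd, cross_totals, rng):
    """One bootstrap draw of occupation totals.

    T^{NACE*} ~ N(T^{NACE}, SD(T^{NACE})) is drawn for each NACE section and
    split by the estimated occupation shares T^{NACE,OCCUP} / T^{NACE}.
    """
    occup_star = {}
    for nace, total in nace_totals.items():
        total_star = rng.gauss(total, nace_sd[nace])
        for occup, cell in cross_totals[nace].items():
            occup_star[occup] = occup_star.get(occup, 0.0) + total_star * cell / total
    return occup_star

def bootstrap(sample, nace_totals, nace_sd, cross_totals, estimator, B=500, seed=1):
    """Return B replicates of estimator(resampled_units, perturbed_totals)."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(B):
        # Assumption: simple random sampling with replacement from s_A.
        resampled = [rng.choice(sample) for _ in sample]
        totals_star = draw_occupation_totals(nace_totals, nace_sd, cross_totals, rng)
        replicates.append(estimator(resampled, totals_star))
    return replicates

# Invented totals: two NACE sections and three 2-digit occupation codes.
nace_totals = {"C": 100.0, "G": 50.0}
nace_sd = {"C": 10.0, "G": 8.0}
cross_totals = {"C": {"72": 60.0, "81": 40.0}, "G": {"52": 45.0, "72": 5.0}}
```

Each replicate therefore reflects both the sampling variability of the online data and the uncertainty of the totals reported for the DL survey; with all standard errors set to zero the draw reproduces the original occupation totals.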

The following steps were taken to calculate variance in the bootstrap procedure. Let $\hat{\theta}^{*}_{y_{k}}$ denote a bootstrap replicate of any of the estimators $\widehat{\overline{T}}^{\text{HTSRS}*}_{y_{k}},\widehat{\overline{T}}^{\text{ECGREG}*}_{y_{k}},\widehat{\overline{T}}^{\text{ECMC}*}_{y_{k}},\widehat{\overline{T}}^{\text{ECLASSO1}*}_{y_{k}},\widehat{\overline{T}}^{\text{ECLASSO2}*}_{y_{k}},\widehat{\overline{T}}^{\text{ECALASSO1}*}_{y_{k}}$ . The replicates are then used to derive the variance and relative standard error (CV) given by the following equations:

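• Variance

$$\widehat{V}(\hat{\theta}_{y_{k}})=\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*(b)}_{y_{k}}-\bar{\theta}^{*}_{y_{k}}\right)^{2},$$

where $\hat{\theta}^{*(b)}_{y_{k}}$ is the $b$-th of $B$ bootstrap replicates and $\bar{\theta}^{*}_{y_{k}}$ is their mean,

• Relative Standard Error (CV)

$$\text{CV}(\hat{\theta}_{y_{k}})=\frac{\sqrt{\widehat{V}(\hat{\theta}_{y_{k}})}}{\hat{\theta}_{y_{k}}}.$$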

To estimate variances for the aforementioned estimators we used bootstrap with 500 replicates. We compared the results in terms of relative standard errors. All calculations were done in R statistical software (R Core Team, 2018) using codes written by the authors and LASSO procedure provided in Chen et al. (2019). Data and R scripts to reproduce all calculations (including estimated models), tables and figures are available at https://github.com/BERENZ/job-offers-bkl or can be obtained on request.
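Given a list of bootstrap replicates, the two summary measures can be computed as in the sketch below. The divisor $B-1$ and the use of the point estimate in the denominator of the CV are assumptions about the exact definitions; the replicate values are invented.

```python
from math import sqrt
from statistics import mean

def bootstrap_variance(replicates):
    """Sample variance of the bootstrap replicates (divisor B - 1)."""
    centre = mean(replicates)
    return sum((r - centre) ** 2 for r in replicates) / (len(replicates) - 1)

def relative_standard_error(replicates, point_estimate):
    """Bootstrap standard error divided by the point estimate (as a ratio)."""
    return sqrt(bootstrap_variance(replicates)) / point_estimate

replicates = [0.30, 0.32, 0.34]          # invented replicates of one skill share
cv = relative_standard_error(replicates, point_estimate=0.32)
```

For these three replicates the standard error is 0.02, so the relative standard error is 0.0625, i.e. 6.25%.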

4 Estimation of the demand for skills

Table 8 presents point estimates produced by means of the estimators presented in section 3.5. Column HTSRS is used for comparison to verify whether the models corrected the bias resulting from the specificity of online data. All bias-corrected estimates show similar demand for skills. The biggest differences between the bias-uncorrected (HTSRS) and corrected estimates are visible for the skills with high Cramer’s V correlation presented in Table 4, i.e. interpersonal, managerial or computer skills. For almost all categories, online job ads overestimate the share of skills required by employers.

The biggest difference (almost 54% vs 35%) between all estimators can be observed for interpersonal competences. Other groups where there is a large difference after adjusting for known population totals are managerial skills (almost 10 p.p.) and computer skills (over 10 p.p.). There are two groups that online job ads underestimate: technical and physical competences. This is mainly due to the underrepresentation of two categories of occupations: (7) Craft and related trades workers and (8) Plant and machine operators and assemblers. See Table 14 in the supplementary materials.

These results show that studies which do not take into account the extent of selection bias across skill requirements in online job postings may overvalue or undervalue some skills. For example, Deming and Kahn (2018) show higher relative demand, especially for cognitive and interpersonal (social) skills, and lower demand only for managerial skills in the US economy. Similarly, Hershbein and Kahn (2018) report a higher percentage of online postings containing cognitive and computer skills requirements than we do, also for the US economy. However, these differences may to a large extent result from differences between the economies analysed.

As can be seen, the estimates based on ECGREG, ECLASSO1, ECLASSO2 and ECALASSO1 are similar. This suggests that the variables used for the estimation provide comparable information regardless of the underlying model. Table 15 provides information about the Area Under the Curve (AUC) for each skill for the ECLASSO1, ECLASSO2 and ECALASSO1 models. Based on this table, it can be concluded that the inclusion of two variables – occupation and NACE – results in a better model for each skill. The AUC varies from 0.644 for cognitive skills to 0.829 for technical competences, which indicates that the standard LASSO model is better than the adaptive one. Also, there are almost no differences between ECLASSO1 and ECALASSO1, which suggests that despite the additional penalty the estimated parameters are close. Figure 1 provides a more detailed comparison of the estimated share of skills over the reference period.

Table 9 provides information about estimated relative standard errors for skills estimates for 2011, 2013 and 2014. The ECMC and ECLASSO estimators are more efficient than ECGREG, and ECMC is less efficient than the estimators with LASSO. This is because ECGREG assumes a linear model and the auxiliary variables are high-dimensional. Note that despite the higher AUC for ECLASSO2, it provides less precise estimates, mainly due to the high number of dimensions of the auxiliary variables and variability in totals from the DL survey. Moreover, there are almost no differences between adaptive and non-adaptive LASSO, which suggests that the estimated parameters are probably correctly specified. Based on this result, we can choose the estimates based on ECLASSO1 as the final ones.

5 Conclusion

In the article we described our attempt to enhance the Demand for Labour survey conducted by Statistics Poland by including information about skills listed in online job advertisements. We treated the online data as a non-probability sample and applied methods developed for the purpose of integrating probability and non-probability samples. In particular, we applied model-assisted estimators including generalized linear, LASSO and adaptive LASSO models. Based on these results we conclude that the application of these methods reduced bias in online data for several skills but not for all. This can be explained mainly by the weak correlation with the auxiliary variables used.

To our knowledge this is the first attempt to extend labour market surveys conducted by National Statistical Agencies with data from the Internet. Previous applications were devoted to non-probability samples based on web surveys or opt-in panels. Our approach shows that methods developed for non-probability samples may be applied to modern data sources such as big data. The latter are currently discussed in terms of auxiliary variables for small area estimation, nowcasting of selected indicators or creating new official statistics. However, there are some issues that should be discussed in detail.

The main limitation involved in the use of online data and combining them with existing surveys is the lack of auxiliary variables. For example, occupations or NACE information need to be extracted from the ad description or may not even be provided by employers. On the other hand, official statistics about the demand for labour are based on probability samples with restricted access to unit-level data (which limits possible approaches) or estimated totals for a certain level or cross-classification (often without uncertainty measures).

More generally, research on non-probability samples shows that using these data for statistics requires the availability of good independent data sources. The main sources are either probability samples or administrative records. Official statistics do not always collect the data required for the purpose of data integration.

Another issue is measurement and unit error. In our study we associated a job advertisement with a job vacancy, which may not always be the case. We also assumed that the description included in job ads may be related to job vacancy occupations reported by Statistics Poland. This should be verified in the future by investigating job descriptions reported by entities in the DL survey.

Finally, in our study we used online data from 2011-2014 that were already coded and did not require text mining to extract occupation or skills. These data may actually be used for preparing training data for machine learning, because the original descriptions of job advertisements are associated with labels suited for machine learning purposes. However, one should keep in mind that patterns observed in past data may not necessarily hold for future job advertisements.

Despite these problems we conclude that online data combined with official statistics can provide a better picture of competences, education and other requirements made by employers and can be used to monitor changes by interested entities. In a time of decreasing response rates and budget cuts, using data that is already "out there on the Internet" is tempting but requires attention to its quality and the selection of appropriate methods of inference.

Appendix A Appendix

A.1 Skills measured in the online data

A.2 Details about the online data

A.3 Estimation process and results

Bibliography

Beręsewicz, M. (2017). A two-step procedure to measure representativeness of internet data sources. International Statistical Review 85 (3), 473–493.
Beręsewicz, M., R. Lehtonen, F. Reis, L. Di Consiglio, and M. Karlberg (2018). An overview of methods for treating selectivity in big data sources. Statistical working papers, Eurostat.
Boudarbat, B. and V. Chernoff (2012). Education–job match among recent Canadian university graduates. Applied Economics Letters 19 (18), 1923–1926.
Buelens, B., J. Burger, and J. A. van den Brakel (2018). Comparing inference methods for non-probability samples. International Statistical Review 86 (2), 322–343.
Cedefop (2019a). Online job vacancies and skills analysis – A Cedefop pan-European approach.
Cedefop (2019b). The online job vacancy market in the EU. Driving forces and emerging trends.
Chen, J. K. T. (2016). Using LASSO to Calibrate Non-probability Samples using Probability Samples. Ph.D. thesis.
Chen, J. K. T., M. R. Elliott, and R. Valliant (2018). Inference for nonprobability samples. Survey Methodology 44 (1), 117–144.