
Methodology: wage variables

Three key wage variables in the EPI CPS extracts contain additional imputations and restrictions to improve their usefulness: wage, wageotc, and weekpay. This section discusses these modifications.

If you'd like to use variables that don't include these adjustments, consider using wage_noadj, wageotc_noadj, and weekpay_noadj.

For a shorter overview of the wage variables, see this discussion.

Top-code adjustments

Weekly earnings in the underlying source data are top-coded (right-censored). Beginning in April 2023, the top-code changed from a single fixed value for weekly earnings ($2,884.61) to a value that is adjusted dynamically each month. Census calculates this dynamic value as the weighted average of the reported earnings of the top 3% of earners. Between April 2023 and March 2024, the new top-coding procedure is phased in, so that only respondents who entered the survey in 2023 or later are top-coded "dynamically." During this period, only respondents coded as 4 in minsamp are subject to the new top-coding procedure.
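The phase-in described above can be sketched in Stata as follows. This is a minimal illustration on a made-up six-row dataset, not EPI's production code: the variable names mdate and dynamic_tc are hypothetical, while year, month, and minsamp are assumed to be coded as in the extracts.

```stata
* Toy data: survey year, survey month, and month-in-sample
clear
input year month minsamp
2023 3 4
2023 4 4
2023 4 8
2024 3 8
2024 3 4
2024 4 8
end

* Monthly date (hypothetical helper variable)
gen mdate = ym(year, month)

* Flag observations subject to the dynamic top-code (hypothetical variable)
gen byte dynamic_tc = 0
* Phase-in period: only minsamp == 4 respondents
replace dynamic_tc = 1 if inrange(mdate, tm(2023m4), tm(2024m3)) & minsamp == 4
* From April 2024 onward: all respondents
replace dynamic_tc = 1 if mdate >= tm(2024m4) & mdate < .

list year month minsamp dynamic_tc
```

In this toy example, the second, fifth, and sixth rows are flagged.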

For data prior to April 2023, the EPI extracts variable weekpay replaces top-coded values with a gender- and year-specific imputed mean above the top-code, assuming that the earnings distribution is Pareto above the 80th percentile. For non-hourly workers, these imputed values are also incorporated in the hourly wage variables wage and wageotc. To use a weekly earnings variable without this adjustment, consider weekpay_noadj and the associated hourly wage variables wage_noadj and wageotc_noadj.
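As a sketch of the reasoning behind a Pareto-based imputation (the notation is introduced here for illustration and does not appear in the EPI code): suppose weekly earnings X have a Pareto upper tail with shape parameter α > 1, and let T be the top-code. Then the log of the share of workers earning more than x is linear in log x, and the mean of earnings at or above T has a closed form:

```latex
\Pr(X > x) = C\,x^{-\alpha}
\quad\Longrightarrow\quad
\ln \Pr(X > x) = \ln C - \alpha \ln x ,
\qquad
\mathbb{E}[X \mid X \ge T]
  = \int_T^{\infty} x \,\frac{\alpha T^{\alpha}}{x^{\alpha+1}}\,dx
  = \frac{\alpha}{\alpha - 1}\,T .
```

For example, α = 2 implies a mean above the top-code of twice the top-code. The function below estimates α as the negative of the slope from a regression of the log of the weighted count of workers at or above each earnings bin on the log of the bin value, and then computes the imputed mean as T·α/(α − 1). The formula yields a finite mean only when α > 1.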

Function used to impute the mean above the topcode
capture program drop topcode_impute
program define topcode_impute, rclass
syntax varname [if] [pweight], generate(name) method(string) topcodeval(real) [threshold(real 80)]

marksample touse

local method = lower("`method'")
if "`method'" == "pareto" local methodname Pareto

if "`method'" ~= "pareto" {
    noi di "Only Pareto-distribution based imputation is available."
    error 1
}

di _n "Assume `varlist' has topcode = `topcodeval'"

* Pareto imputation
noi if "`method'" == "pareto" {
    * threshold check
    if `threshold' < 0 | `threshold' >= 100 {
        noi di "Threshold must be a percentile between 0 and 100"
        error 1
    }
    * need non-missing values of `varlist' for this to make sense
    capture assert `varlist' ~= . `if'
    if _rc ~= 0 {
        noi di "`varlist' needs to be nonmissing `if'"
        error 1
    }

    * grab threshold percentile
    if "`weight'" ~= "" _pctile `varlist' `if' [aw `exp'], p(`threshold')
    if "`weight'" == "" _pctile `varlist' `if', p(`threshold')
    local thresholdval = r(r1)
    if `thresholdval' == . {
        noi di "Unable to calculate value of percentile `threshold' for `varlist'"
        error 1
    }
    if `thresholdval' <= 0 {
        noi di "Percentile `threshold' value of `varlist' needs to be positive"
        error 1
    }

    preserve

    keep if `touse'
    tempvar n bin running
    replace `varlist' = `topcodeval' if `varlist' >= `topcodeval'
    gen `bin' = round((`varlist' / 50))*50
    gen `n' = 1
    collapse (sum) `n' [`weight' `exp'], by(`bin')
    gsort -`bin'
    gen `running' = sum(`n')

    * restrict to wage distribution above percentile threshold
    keep if `bin' >= round(`thresholdval' * 50) / 50

    * estimate shape parameter
    tempvar logrunning logbin
    gen `logrunning' = log(`running')
    gen `logbin' = log(`bin')
    reg `logrunning' `logbin'
    local alpha = -_b[`logbin']
    local newmeanabove = `topcodeval' * `alpha' / (`alpha' - 1)

    restore

    noi di "Assume `varlist' should be `methodname' distributed above percentile `threshold', i.e., `varlist' = " %6.2f `thresholdval'

    generate `generate' = `varlist' if `touse'
    replace `generate' = `newmeanabove' if `touse' & `generate' >= `topcodeval' & `generate' ~= .

    noi di "Below topcode `topcodeval', new variable `generate' = `varlist'"
    noi di "Above topcode `topcodeval', new variable `generate' = " %6.2f `newmeanabove'

    noi di "Ratio of new mean above to old top-code = " %4.2f `newmeanabove' / `topcodeval'

    return scalar topcodeval = `topcodeval'
    return scalar newmeanabove = `newmeanabove'

}

end
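To see why a Pareto-implied mean is a reasonable replacement for top-coded values, the following is a minimal, self-contained simulation. It is a sketch under stated assumptions, not part of the EPI code: the data are simulated rather than drawn from the CPS, the shape parameter (3), scale (500), and top-code (1500) are made up, and the shape parameter is treated as known rather than estimated by regression as in topcode_impute.

```stata
* Simulated data only: all names and parameter values here are hypothetical
clear
set seed 20240401
set obs 200000

* Draw Pareto earnings with shape alpha = 3 and scale 500 by inverting the CDF
local alpha 3
gen double earn = 500 * runiform()^(-1/`alpha')

* Right-censor at a made-up top-code of 1500
local topcodeval 1500
gen double earn_tc = min(earn, `topcodeval')

* Mean of the uncensored draws at or above the top-code
summarize earn if earn >= `topcodeval', meanonly
local truemean = r(mean)

* Pareto-implied mean above the top-code: T * alpha / (alpha - 1)
local impliedmean = `topcodeval' * `alpha' / (`alpha' - 1)

* Replace censored values with the implied mean
gen double earn_adj = earn_tc
replace earn_adj = `impliedmean' if earn_tc >= `topcodeval'

di "Simulated mean above top-code:      " %8.2f `truemean'
di "Pareto-implied mean above top-code: " %8.2f `impliedmean'

* Compare overall means: uncensored, censored, and adjusted
summarize earn earn_tc earn_adj
```

With these parameters the implied mean is 1500 × 3 / 2 = 2250, and the simulated mean above the top-code should be close to it, up to sampling noise.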
Specific top-code adjustment used for weekly earnings
********************************************************************************
* Weekly earnings (top-code adjusted)
********************************************************************************

capture confirm variable weekpay, exact
if _rc == 0 {
    drop weekpay
}
gen weekpay = .

if $monthlycps == 1 | $maycps == 1 {
    if $earnerinfo == 1 {
        * determine top-code thresholds
        if tm(1973m1) <= $date & $date <= tm(1988m12) {
            local topcodeval 999
        }
        if tm(1989m1) <= $date & $date <= tm(1997m12) {
            local topcodeval 1923
        }
        if tm(1998m1) <= $date & $date <= tm(2024m3) {
            * going to use 2884.60 instead of actual topcode of 2884.61 to avoid precision issues
            local topcodeval 2884.60
        }

        * determine sample weights to use
        if $monthlycps == 1 & $maycps == 0 local weightvar orgwgt
        if $monthlycps == 0 & $maycps == 1 local weightvar basicwgt



        * Do top-code adjustment
        * account for topcoding change beginning in April 2024
        if tm(2024m4) <= $date {
            replace weekpay = weekpay_noadj
        }

        * there seems to be something wrong with ernwk and ernwkc in 1980 may data
        * coding weekpay in these data as missing for now
        else if $maycps == 1 & tm(1980m1) <= $date & $date <= tm(1980m12) {
            replace weekpay = .
        }

        else {
            * males: generate top-code adjusted weekly earnings
            topcode_impute weekpay_noadj if weekpay_noadj ~= . & female == 0 & age >= 16 & age ~= . [pw=`weightvar'], generate(weekpay_male) method(Pareto) threshold(80) topcodeval(`topcodeval')

            * females: generate top-code adjusted weekly earnings
            topcode_impute weekpay_noadj if weekpay_noadj ~= . & female == 1 & age >= 16 & age ~= . [pw=`weightvar'], generate(weekpay_female) method(pareto) threshold(80) topcodeval(`topcodeval')

            replace weekpay = weekpay_male if female == 0
            replace weekpay = weekpay_female if female == 1

            drop weekpay_male weekpay_female
        }
        * account for topcoding change in 2023/2024, when outgoing rotation groups are making their way through the changed procedure
        replace weekpay = weekpay_noadj if (year == 2023 & month >= 4 | year == 2024) & minsamp == 4 
    }
}

lab var weekpay "Weekly pay (top-code adjusted)"
notes weekpay: Dollars per week for nonhourly and hourly workers
notes weekpay: Includes overtime, tips, commissions
notes weekpay: Original top-code values replaced with Pareto-distribution implied mean above top-code
notes weekpay: Separate imputations for men and women
notes weekpay: Original top-code: 1973-88: 999; 1989-97: 1923; 1998-2023: 2884.61
notes weekpay: Beginning in 2023, top-code value is the weighted average of the top 3% of earners in a given month
notes weekpay: Derived from weekpay_noadj and tc_weekpay

Hours imputations

For data from 1994 to the present, respondents can report that their usual weekly hours at work vary ("hours vary"). The EPI extracts contain an imputed usual weekly hours variable, hoursu1i, that is equal to usual hours worked at the primary job, hoursu1, except when "hours vary", in which case it is based on the following regression-based predictions:

Demographic & industry-based usual hours prediction
*******************************************************************************
* Usual hours worked per week, primary job
*******************************************************************************
capture confirm variable hoursu1i, exact
if _rc == 0 {
    drop hoursu1i
}
gen hoursu1i = .

if tm(1994m1) <= $date {

    forvalues i = 2/5 {
        gen age`i' = age^`i'
    }

    local indepvars age age2 age3 age4 age5 i.educ i.wbho i.citistat i.married i.statefips i.union i.pubsec i.mind16

    gen orgsample = orgwgt > 0 & orgwgt ~= . & age >= 16 & age ~= . & (minsamp == 4 | minsamp == 8)

    * Regression: female, full-time
    reg hoursu1 `indepvars' if hoursu1 > 0 & orgsample == 1 & female == 1 & (3 <= hoursuint & hoursuint <= 6) [aw=orgwgt]
    predict hourpred_ft_f if orgsample == 1 & female == 1 & hoursuint == 7, xb

    * Regression: female, part-time
    reg hoursu1 `indepvars' if hoursu1 > 0 & orgsample == 1 & female == 1 & (1 <= hoursuint & hoursuint <= 2) [aw=orgwgt]
    predict hourpred_pt_f if orgsample == 1 & female == 1 & hoursuint == 8, xb

    * Regression: male, full-time
    reg hoursu1 `indepvars' if hoursu1 > 0 & orgsample == 1 & female == 0 & (3 <= hoursuint & hoursuint <= 6) [aw=orgwgt]
    predict hourpred_ft_m if orgsample == 1 & female == 0 & hoursuint == 7, xb

    * Regression: male, part-time
    reg hoursu1 `indepvars' if hoursu1 > 0 & orgsample == 1 & female == 0 & (1 <= hoursuint & hoursuint <= 2) [aw=orgwgt]
    predict hourpred_pt_m if orgsample == 1 & female == 0 & hoursuint == 8, xb

    * assign predicted values
    replace hoursu1i = hoursu1 if hoursu1 ~= .
    replace hoursu1i = hourpred_ft_f if female == 1 & hoursuint == 7
    replace hoursu1i = hourpred_pt_f if female == 1 & hoursuint == 8
    replace hoursu1i = hourpred_ft_m if female == 0 & hoursuint == 7
    replace hoursu1i = hourpred_pt_m if female == 0 & hoursuint == 8

    * round imputed values for consistency with original hoursu1
    replace hoursu1i = round(hoursu1i,1)

    * top-code at 99, like hoursu1
    replace hoursu1i = 99 if hoursu1i > 99 & hoursu1i ~= .

    replace hoursu1i = . if hoursu1i < 0

    * clean up
    drop hourpred_ft_f hourpred_pt_f hourpred_ft_m hourpred_pt_m
    drop age2-age5
    drop orgsample
}

lab var hoursu1i "Imputed usual weekly hours, main job (ORG only)"
capture label drop hoursu1i
lab def hoursu1i 99 "99+"
lab val hoursu1i hoursu1i
notes hoursu1i: Only available 1994-present ORG
notes hoursu1i: Same as hoursu1, except for those whose usual hours vary at main job

The imputed usual weekly hours variable is used when calculating hourly wages for nonhourly workers in wage and wageotc. If you do not want to include these hours imputations in your wage analysis, you can simply set wage values to missing for nonhourly (paidhre == 0) workers whose hours vary (hoursvary == 1).
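A minimal sketch of that restriction, shown on a made-up four-row dataset so it can be run on its own. The new variable name wage_nohoursimp is hypothetical; wage, paidhre, and hoursvary are assumed to be coded as described above.

```stata
* Toy data standing in for the extracts
clear
input wage paidhre hoursvary
25.00 0 0
31.50 0 1
18.75 1 0
22.10 1 1
end

* Copy wage, then drop values that rely on imputed hours:
* nonhourly workers (paidhre == 0) whose hours vary (hoursvary == 1)
gen wage_nohoursimp = wage
replace wage_nohoursimp = . if paidhre == 0 & hoursvary == 1

list
```

Only the second row is set to missing. Hourly workers whose hours vary keep their wage, because their hourly wage does not depend on usual weekly hours.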

Trimming outliers

The hourly wage variables wage and wageotc are trimmed of outliers. Specifically, these hourly wage values are set to missing if they are below 50 cents per hour or above $100 per hour in 1989 dollars.

For hourly wage variables that do not have this modification, see wage_noadj and wageotc_noadj.
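The trimming logic can be sketched as follows: wages are dropped when they fall below the lower bound or above the upper bound, with both bounds expressed in 1989 dollars. This section does not say which price index converts the bounds to other years, so cpi is a hypothetical index set to 100 in 1989, with made-up values. The names wage_min, wage_max, and wage_trim are also hypothetical.

```stata
* Toy data: cpi is a hypothetical price index equal to 100 in 1989
clear
input year wage_noadj cpi
1989   0.40 100
1989  12.00 100
2020   0.90 200
2020 150.00 200
2020 210.00 200
end

* Convert the 1989-dollar bounds (0.50 and 100) into each year's dollars
gen double wage_min = 0.50 * cpi / 100
gen double wage_max = 100  * cpi / 100

* Set wages outside the bounds to missing
gen double wage_trim = wage_noadj
replace wage_trim = . if wage_noadj < wage_min | wage_noadj > wage_max

list
```

In this example the first, third, and fifth rows are trimmed. The 2020 wage of 150 is kept because the upper bound in that year's dollars is 200 under the made-up index.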

Hourly wage limits for wage and wageotc


BLS wage allocations

In recent years, a large number of observations have weekly earnings or hourly wages imputed by the BLS. If you want to use a wage variable without any weekly or hourly earnings imputations by EPI or BLS, you can incorporate the allocation flags a_weekpay and a_earnhour.

Generate wage variable that excludes all imputations
* Stata code to restrict hourly wages to data not allocated by BLS
* Be aware that the allocation indicators are not consistent over time.
* In particular, there is no allocation information at all during Jan 1994 - August 1995.

gen wage_noimpute = wage_noadj
replace wage_noimpute = . if paidhre == 1 & a_earnhour == 1
replace wage_noimpute = . if paidhre == 0 & a_weekpay == 1
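Because no allocation information exists for January 1994 through August 1995, the flags cannot identify imputed wages in those months. One possible way to handle this, shown here as an assumption rather than as part of the EPI code, is to also set wage_noimpute to missing for that period. The sketch runs on a made-up six-row dataset; the helper variable mdate is hypothetical, and the flag values in the affected months are placeholders.

```stata
* Toy data standing in for the extracts
clear
input year month wage_noadj paidhre a_earnhour a_weekpay
1993 12 10.00 1 0 0
1994  6 11.00 1 0 0
1995  8 12.00 0 0 0
1995  9 13.00 0 0 1
1996  1 14.00 1 1 0
1996  2 15.00 0 0 0
end

* Same restriction as in the code above
gen wage_noimpute = wage_noadj
replace wage_noimpute = . if paidhre == 1 & a_earnhour == 1
replace wage_noimpute = . if paidhre == 0 & a_weekpay == 1

* Optional extra step (an assumption, not EPI's method):
* drop months in which allocation status cannot be determined
gen mdate = ym(year, month)
replace wage_noimpute = . if inrange(mdate, tm(1994m1), tm(1995m8))

list year month wage_noadj wage_noimpute
```

Only the first and last rows keep a wage in this example. Whether to drop the 1994-1995 months depends on the analysis, since doing so leaves a gap in any time series.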