<aside> 📖

This glossary explains key methodological concepts that come up frequently when working with economic policy datasets. Understanding these distinctions will help you choose the right data source for your research question and interpret results correctly.

</aside>

Types of Data Sources

Survey Data vs. Administrative Data

Survey data is collected by asking people (or businesses) to self-report information. Examples include the Current Population Survey (CPS), the American Community Survey (ACS), and the Survey of Consumer Finances (SCF). Surveys let researchers collect the specific variables they care about, and a properly designed survey can be representative of the population. Limitations: surveys are subject to non-response bias and recall errors, and public-use files often top-code high incomes.

Administrative data is collected as a byproduct of government or institutional processes, such as tax records, unemployment insurance filings, payroll data, and corporate filings. Examples include IRS Statistics of Income data and Social Security earnings records. Administrative data often covers the full population, in which case there is no sampling error, but it is usually limited to the variables relevant to the administrative purpose.

Cross-Sectional Data vs. Panel Data

Cross-sectional data captures a snapshot of a population at a single point in time. The ACS Public Use Microdata Sample (PUMS) is cross-sectional: each respondent is observed once.

Panel data (also called longitudinal data) tracks the same individuals or units over time. The Survey of Income and Program Participation (SIPP) follows respondents for several years. Panel data is essential for studying mobility, transitions (e.g., job loss, wealth changes), and causal questions that require before/after comparisons.
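To make the distinction concrete, here is a minimal sketch of what panel structure allows that a cross-section does not. The records, field names (`person_id`, `wave`, `employed`, `earnings`), and helper functions are hypothetical and are not the layout of any actual SIPP file.

```python
from collections import defaultdict

# Hypothetical long-format panel: one row per person per survey wave.
panel = [
    {"person_id": 1, "wave": 1, "employed": True,  "earnings": 3200},
    {"person_id": 1, "wave": 2, "employed": False, "earnings": 0},
    {"person_id": 2, "wave": 1, "employed": True,  "earnings": 4100},
    {"person_id": 2, "wave": 2, "employed": True,  "earnings": 4300},
    {"person_id": 3, "wave": 1, "employed": False, "earnings": 0},
    {"person_id": 3, "wave": 2, "employed": True,  "earnings": 2800},
]


def index_by_person(rows):
    """Group rows as {person_id: {wave: row}} so waves can be compared."""
    by_person = defaultdict(dict)
    for row in rows:
        by_person[row["person_id"]][row["wave"]] = row
    return by_person


def earnings_changes(rows, first_wave, second_wave):
    """Within-person change in earnings for people observed in both waves."""
    changes = {}
    for pid, waves in index_by_person(rows).items():
        if first_wave in waves and second_wave in waves:
            changes[pid] = (
                waves[second_wave]["earnings"] - waves[first_wave]["earnings"]
            )
    return changes


def job_losers(rows, first_wave, second_wave):
    """IDs of people employed in the first wave but not in the second."""
    losers = []
    for pid, waves in index_by_person(rows).items():
        if first_wave in waves and second_wave in waves:
            if waves[first_wave]["employed"] and not waves[second_wave]["employed"]:
                losers.append(pid)
    return sorted(losers)


print(earnings_changes(panel, 1, 2))  # {1: -3200, 2: 200, 3: 2800}
print(job_losers(panel, 1, 2))        # [1]
```

With a single cross-section (only wave 2, say) you could still compute the employment rate, but you could not tell who lost a job, because no one is observed twice.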

Sample Surveys vs. Universe Data

Sample surveys interview a subset of the population and use statistical weights to generalize to the full population. Most Census Bureau surveys work this way. Always use the provided survey weights when computing population estimates.
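As a rough illustration of how weights generalize a sample to the population, the sketch below treats each weight as the number of people a respondent represents. The records and the `in_poverty` field are made up for the example.

```python
# Hypothetical microdata: each row is one respondent; "weight" is the number
# of people in the population that the respondent is taken to represent.
respondents = [
    {"weight": 1200.0, "in_poverty": True},
    {"weight": 800.0,  "in_poverty": False},
    {"weight": 1500.0, "in_poverty": False},
    {"weight": 500.0,  "in_poverty": True},
]


def weighted_count(rows, flag):
    """Estimated number of people in the population for whom `flag` is true."""
    return sum(row["weight"] for row in rows if row[flag])


# The weights sum to the estimated population size.
population = sum(row["weight"] for row in respondents)
poor = weighted_count(respondents, "in_poverty")

weighted_share = poor / population
unweighted_share = sum(row["in_poverty"] for row in respondents) / len(respondents)

print(population)        # 4000.0
print(poor)              # 1700.0
print(weighted_share)    # 0.425
print(unweighted_share)  # 0.5
```

The unweighted share simply counts respondents, so it answers a question about the sample rather than the population.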

Universe (or census) data attempts to cover all units in a population. The Economic Census aims to cover employer businesses in most U.S. industries; the Decennial Census aims to count every person. Even universe data has coverage limitations: some businesses or households are missed.


Key Statistical Concepts

Top-Coding

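Many public-use survey datasets "top-code" high values, replacing any income or wealth above a threshold with the threshold value. This is done to protect respondent confidentiality. Because every value above the threshold looks the same, top-coding biases means and top shares downward and makes it difficult to study inequality at the top of the distribution. The Survey of Consumer Finances takes a different approach to the top of the distribution: it oversamples high-wealth households specifically to improve their coverage. (The SCF also uses multiple imputation, but that addresses missing responses rather than coverage.)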
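A small sketch of the effect, using invented incomes and an invented threshold of 250,000 (real thresholds vary by survey, variable, and year). The helper names `top_code` and `top_share` are illustrative only.

```python
def top_code(values, threshold):
    """Replace every value above `threshold` with the threshold itself."""
    return [min(value, threshold) for value in values]


def top_share(values, fraction):
    """Share of the total held by the top `fraction` of observations."""
    ordered = sorted(values, reverse=True)
    n_top = max(1, round(len(ordered) * fraction))
    return sum(ordered[:n_top]) / sum(ordered)


# Hypothetical incomes for ten households.
incomes = [20_000, 35_000, 50_000, 65_000, 80_000,
           95_000, 120_000, 150_000, 400_000, 2_000_000]
coded = top_code(incomes, 250_000)

print(sum(incomes) / len(incomes))  # 301500.0
print(sum(coded) / len(coded))      # 111500.0

# Share of total income going to the top 10 percent (one household here).
print(round(top_share(incomes, 0.10), 3))  # 0.663
print(round(top_share(coded, 0.10), 3))    # 0.224
```

Only two of the ten values change, yet the mean falls by more than half and the measured top share by roughly two thirds. Medians and other statistics below the threshold are unaffected.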

Survey Weights

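Surveys use complex sampling designs (for example, stratification, clustering, and oversampling of certain groups) and provide weights that correct for the design and for non-response. Always apply survey weights when computing means, medians, or totals from survey microdata. Weights alone give correct point estimates; correct standard errors also require the design information (strata and clusters, or replicate weights). In Stata, use svyset and the svy: prefix. In R, use the survey package.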
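The sketch below shows why weighting matters when a group is oversampled. It covers point estimates only and is not a substitute for design-aware tools such as Stata's svy: prefix or R's survey package, which also handle variance estimation. The data and function names are invented for illustration.

```python
from statistics import median


def weighted_mean(values, weights):
    """Mean of `values`, with each value counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total weight."""
    if not values:
        raise ValueError("no observations")
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    return max(values)  # fallback for floating-point edge cases


# Hypothetical sample in which high-income households were oversampled,
# so each of them carries a much smaller weight.
incomes = [30_000, 40_000, 50_000, 200_000, 250_000, 300_000]
weights = [3_000, 3_000, 3_000, 100, 100, 100]

print(sum(incomes) / len(incomes))              # 145000.0 (unweighted mean)
print(round(weighted_mean(incomes, weights)))   # 46774
print(median(incomes))                          # 125000.0 (unweighted median)
print(weighted_median(incomes, weights))        # 40000
```

The unweighted figures describe the sample, which is half high-income households by construction; the weighted figures describe the population the sample stands for.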

Margin of Error and Statistical Significance

Estimates from sample surveys have margins of error that reflect sampling uncertainty. The ACS publishes a margin of error with each estimate, at the 90 percent confidence level; always report it alongside the point estimate, especially for small geographies or subgroups. Treat an estimate with caution when its margin of error is large relative to the estimate itself.
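One way to judge whether a margin of error is "large relative to the point estimate" is to convert it to a standard error and a coefficient of variation. The sketch below assumes ACS-style margins of error at the 90 percent confidence level (z = 1.645) and uses the common approximation for the margin of error of a sum, which ignores any covariance between the estimates. The example numbers are invented, and any cutoff for an acceptable coefficient of variation is a judgment call rather than a fixed rule.

```python
import math

Z_90 = 1.645  # z-value for a 90 percent confidence level


def standard_error(moe, z=Z_90):
    """Convert a margin of error back to a standard error."""
    return moe / z


def coefficient_of_variation(estimate, moe, z=Z_90):
    """Standard error as a fraction of the estimate (smaller is more reliable)."""
    return standard_error(moe, z) / estimate


def moe_of_sum(moes):
    """Approximate margin of error for a sum of estimates.

    Treats the component estimates as independent (no covariance term).
    """
    return math.sqrt(sum(moe ** 2 for moe in moes))


# Hypothetical small-area estimate: 1,200 people, margin of error 450.
estimate, moe = 1_200, 450
print(round(standard_error(moe), 1))                      # 273.6
print(round(coefficient_of_variation(estimate, moe), 3))  # 0.228

# Combining two hypothetical areas with margins of error 300 and 400.
print(moe_of_sum([300, 400]))  # 500.0
```

Note that the combined margin of error (500) is smaller than the sum of the two (700), which is why aggregating small geographies or subgroups tends to give more reliable estimates.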

Seasonal Adjustment