Weightipy¶
Weightipy is a modernized, lightweight, and high-performance library for weighting survey data using the RIM (iterative raking) algorithm. It is a streamlined fork of Quantipy3.
Why Weightipy?¶
- Fast: Runs up to 6x faster than Quantipy.
- Modern: Supports Python 3.7+ and the latest Pandas/Numpy versions.
- Flexible: Supports simple raking, segmented (nested) weighting, and loading targets from various census data formats.
- Lightweight: Removed heavy dependencies and reporting overhead to focus purely on the weighting engine.
Installation¶
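Weightipy is distributed on PyPI; in a standard Python 3.7+ environment it can be installed with pip:

```
pip install weightipy
```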
Quick Start¶
Weightipy creates a new column of weights that aligns your dataset's distributions with the target distributions you specify.
1. Simple Weighting (Manual Dictionary)¶
If you have a simple list of percentages, you can define them in a dictionary.
```python
import weightipy as wp
import pandas as pd

# Your survey data
df = pd.read_csv("my_survey.csv")

# Define targets (percentages)
targets = {
    "age_group": {"18-24": 10.0, "25+": 90.0},
    "gender": {"Male": 49.0, "Female": 51.0}
}

# Create scheme and weight
scheme = wp.scheme_from_dict(targets)
df_weighted = wp.weight_dataframe(df, scheme, weight_column="weights")

# Check efficiency
eff = wp.weighting_efficiency(df_weighted["weights"])
print(f"Weighting Efficiency: {eff:.2f}%")
```
2. Segmented Weighting (Nested)¶
A common requirement is to weight specific groups differently (e.g., weight Age and Gender within Region, while also correcting the size of the Regions themselves).
You can now do this in a single step using a Segmented Scheme:
```python
targets = {
    "segment_by": "region",
    "segment_targets": {"North": 40.0, "South": 60.0},  # Global proportions
    "segments": {
        "North": {
            "age_group": {"18-24": 15.0, "25+": 85.0},
            "gender": {"Male": 50.0, "Female": 50.0}
        },
        "South": {
            "age_group": {"18-24": 10.0, "25+": 90.0},
            "gender": {"Male": 48.0, "Female": 52.0}
        }
    }
}

scheme = wp.scheme_from_dict(targets)
df_weighted = wp.weight_dataframe(df, scheme)
```
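A quick way to confirm the segmented result is to compare the weighted region shares against `segment_targets`. The sketch below assumes the weight column is named "weights" (i.e. `weight_column="weights"` was passed, as in the first example) and that the survey has a `region` column matching the names used in the targets.

```python
# Weighted region shares should land close to segment_targets (North: 40, South: 60)
total = df_weighted["weights"].sum()
region_share = df_weighted.groupby("region")["weights"].sum() / total * 100
print(region_share.round(1))
```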
Working with Census Data¶
Manually typing targets is tedious. Weightipy provides tools to generate schemas directly from census tables or reference datasets.
Scenario A: You have "Tidy/Long" Aggregates¶
Common with US Census API, Eurostat, tidycensus, or SQL exports.
If your target data looks like this:
| Region | Variable | Category | Count |
|---|---|---|---|
| East | Age | 18-24 | 500 |
| East | Gender | Male | 450 |
Use scheme_from_long_df:
```python
df_census = pd.read_csv("census_long_format.csv")

scheme = wp.scheme_from_long_df(
    df=df_census,
    col_variable="Variable",  # Column containing 'Age', 'Gender'
    col_category="Category",  # Column containing '18-24', 'Male'
    col_value="Count",        # Column containing the population count
    col_filter="Region"       # Optional: Split schema by Region
)
df_weighted = wp.weight_dataframe(df, scheme)
```
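For a quick experiment without a CSV file, the same long-format targets can be built in-memory; the values below are purely illustrative.

```python
# Hypothetical long-format targets, matching the table above (illustrative counts)
df_census = pd.DataFrame({
    "Region":   ["East", "East", "East", "East"],
    "Variable": ["Age", "Age", "Gender", "Gender"],
    "Category": ["18-24", "25+", "Male", "Female"],
    "Count":    [500, 1500, 450, 1550],
})
```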
Scenario B: You have Reference Data (Wide/Detailed)¶
Common when you have a "gold standard" dataset, a detailed frequency table of all combinations, or raw microdata.
If your target data looks like this (one row per combination of demographic variables):
| Region | Age | Gender | Population_Count |
|---|---|---|---|
| East | 18-24 | Male | 250 |
| East | 18-24 | Female | 260 |
Use scheme_from_df. Weightipy will group and sum the data to calculate the correct distributions.
```python
df_reference = pd.read_csv("census_detailed.csv")

scheme = wp.scheme_from_df(
    df=df_reference,
    cols_weighting=["Age", "Gender"],
    col_freq="Population_Count",
    col_filter="Region"  # Optional: Weight Age/Gender within Region
)
df_weighted = wp.weight_dataframe(df, scheme)
```
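Conceptually, the grouping step amounts to computing marginal distributions for each weighting variable within each Region. The sketch below shows a roughly equivalent pandas computation; it is only an illustration of the idea, not the library's internals.

```python
# Marginal distributions of Age and Gender within each Region (illustration only)
for col in ["Age", "Gender"]:
    counts = df_reference.groupby(["Region", col])["Population_Count"].sum()
    pct = counts / counts.groupby(level="Region").transform("sum") * 100
    print(pct.round(1))
```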
Data Validation¶
Before applying weights, it is highly recommended to validate that your survey data aligns with your schema. Weightipy can detect critical errors (e.g., a category exists in the census but is missing in the survey) and warnings (e.g., targets not summing to 100%).
```python
# Get a report of all issues (does not raise exception)
report = wp.validate_scheme_dict(df, targets, raise_error=False)
if not report.empty:
    print(report)
    # Columns: [group, variable, issue_type, severity, details]

# Or strict validation (raises ValueError on Critical errors)
wp.validate_scheme_dict(df, targets, raise_error=True)
```
Serialization & Advanced Usage¶
For advanced workflows—such as manual overrides, multi-threading, or network transmission—it is often better to work with the raw configuration dictionary rather than the Rim class directly.
Weightipy exposes the intermediate extraction functions for this purpose. These return a JSON-serializable dictionary.
```python
# 1. Extract raw dictionary from data
config = wp.scheme_dict_from_df(df_census, cols_weighting=..., col_freq=...)

# 2. Modify manually (e.g., fix a specific target)
config['segments']['North']['age_group']['18-24'] = 12.5

# 3. Serialize (safe for network or threading)
import json
payload = json.dumps(config)

# 4. Reconstruct the scheme later/elsewhere from the payload
scheme = wp.scheme_from_dict(json.loads(payload))
```
API Reference¶
| Function | Description |
|---|---|
| `weight_dataframe` | Main entry point. Weights data by a scheme and appends a weight column. |
| `weight_df` | Alias for `weight_dataframe`. |
| `weighting_efficiency` | Calculates the efficiency of the weights (Kish's effective sample size). |
| `scheme_from_dict` | Creates a scheme from a Python dictionary. Supports both simple (flat) and segmented (nested) structures. |
| `scheme_from_long_df` | Creates a scheme from "Tidy" aggregate data (Variable/Category/Value columns). |
| `scheme_from_df` | Creates a scheme from a reference dataframe (microdata or detailed aggregates). |
| `scheme_dict_from_df` | Extracts the raw configuration dictionary from a reference dataframe. Useful for debugging, manual adjustments, or serialization. |
| `scheme_dict_from_long_df` | Extracts the raw configuration dictionary from Tidy/Long data. |
| `validate_scheme_dict` | Validates a survey dataframe against a scheme dictionary. Checks for missing categories, NaNs, and target sums. |
| `validate_scheme` | Validates a survey dataframe against a compiled `Rim` object. |
| `Rim` | The underlying class for defining complex weighting schemes. |
| `WeightEngine` | The engine that runs the iterative raking algorithm. Useful for advanced manipulation. |
Contributing¶
We welcome volunteers!
- Run Tests: `python3 -m pytest tests`
- Development: Please include a test case with any pull request.
Origins & Credits¶
Weightipy is based on Quantipy.
- Quantipy Creator: Gary Nelson (Datasmoothie)
- Contributors: Alexander Buchhammer, Alasdair Eaglestone, James Griffiths, Kerstin Müller (YouGov), Birgir Hrafn Sigurðsson, Geir Freysson.
- Weightipy: Remi Sebastian Kits