https://indiana-my.sharepoint.com/:f:/g/personal/zhu11_iu_edu/EkFJ17EHX59LsO1Ekfc0TPkBi7CipeIHccd4wjb0CRhjzQ?e=qlt6aH
Task 4 contains 500,000 hashed records in dataset A and 500,000 hashed records in dataset B. In our published training dataset, there are 11 columns:
First Name | Last Name | Gender | SSN | Birth Date | Phone | Address | Share | error place |
---|---|---|---|---|---|---|---|---|
In the CSV file, the first 9 columns are the features of the record. Each feature has been hashed with SHA-256.
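Since every feature is stored as a SHA-256 digest, two fields can only be compared for exact equality: identical inputs give identical digests, while a one-character error gives an unrelated digest. The following is a minimal sketch using Python's `hashlib`. The UTF-8 encoding and the absence of any salt or normalization are assumptions, and the helper name `sha256_hex` is hypothetical; the exact preprocessing used to build the dataset is not stated here.

```python
import hashlib

def sha256_hex(value: str) -> str:
    """Return the hex SHA-256 digest of a string (UTF-8 encoding is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Identical values produce identical digests, so exact matches survive hashing.
same = sha256_hex("Smith") == sha256_hex("Smith")

# A single-character error produces a completely different digest.
typo = sha256_hex("Smith") == sha256_hex("Smyth")

print(same, typo)
```

This is why a field with an error cannot be matched approximately after hashing; a linkage method has to rely on the remaining error-free fields of the record.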
10% of the records in datasets A and B are common to both. Please note that, except for “Gender” and “SSN”, each feature may have a different error rate (from 2% to 35%), and similarly a different missing-value rate. Details are shown below:
Feature | % missing value | % error rate |
---|---|---|
First name | 0 | 2 |
Last name | 0 | 2 |
SSN | 70 | 0 |
Birth Date | 15 | 5 |
Email address | 40 | 20 |
Tel phone # | 25 | 35 |
Address | 5 | 15 |
State | 10 | 10 |
Gender | 1 | 0 |
The last two columns will not appear in the final competition dataset; they are provided only so that participants can validate their algorithms.
Column “Share”:
Each entry in this column is either True or False. True means the record is a common record present in both dataset A and dataset B. False means it is not.
Column “error”: Each entry in the “error” column may list features of the record, indicating which features contain an error in that record.
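Because the “Share” column labels each training record as common or not, a linkage result can be scored at the record level. The sketch below is one possible way to do that, not an official scoring method: `predicted` is a hypothetical list of flags produced by your own algorithm, and the case-insensitive parsing of the “Share” cell is an assumption about how the values are written in the file.

```python
def parse_share(cell: str) -> bool:
    """Interpret a "Share" cell as a boolean (case-insensitive parsing is an assumption)."""
    return cell.strip().lower() == "true"

def precision_recall(predicted, share):
    """Record-level precision and recall of predicted flags against Share labels."""
    tp = sum(1 for p, s in zip(predicted, share) if p and s)
    fp = sum(1 for p, s in zip(predicted, share) if p and not s)
    fn = sum(1 for p, s in zip(predicted, share) if not p and s)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: five records with their Share cells and hypothetical predictions.
share = [parse_share(c) for c in ["True", "false", "True", "False", "True"]]
predicted = [True, True, True, False, False]

precision, recall = precision_recall(predicted, share)
print(precision, recall)
```

Note that this only checks whether a record was correctly flagged as common; it does not check which record in the other dataset it was paired with.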
Missing values have also been hashed! You can find the hash value of “empty” by looking for the most common element in the .csv file.
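The most common element can be found with a simple frequency count over the feature columns. The sketch below uses Python's `csv` module and `collections.Counter` on a tiny in-memory sample with short placeholder strings instead of real digests. The function name `most_common_value` and the parameter `n_features` are hypothetical; the default of 9 follows the statement that the first 9 columns are features.

```python
import csv
import io
from collections import Counter

def most_common_value(rows, n_features=9):
    """Return (value, count) for the most frequent cell in the first n_features columns."""
    counts = Counter()
    for row in rows:
        counts.update(row[:n_features])
    return counts.most_common(1)[0]

# Tiny stand-in for the real file: 3 feature columns, then Share and error.
sample = io.StringIO(
    "h1,h2,h3,True,\n"
    "h4,h2,h2,False,first name\n"
    "h5,h6,h2,False,\n"
)
value, count = most_common_value(csv.reader(sample), n_features=3)
print(value, count)
```

For the real data, the same function can be applied to `csv.reader(f)` on a file opened with `newline=""`. Cells equal to the returned value can then be treated as missing rather than as matching values when comparing records.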