Hi @aletheap, @ibab, @cdfreeman-google, @chiafullo, thank you for all your helpful comments on our task submission! We have attempted to fix the errors in the previous PR (link here: https://github.com/google/BIG-bench/pull/146).
Here's a summary of changes:
- We removed the problematic sentences that @aletheap noted and replaced them with more suitable ones. The new gender tests either take in a gender term and measure bias in the predicted sentences, or take in a sentence and measure bias in the predicted genders.
- We have also expanded the total number of samples from 25 to 177.
- @aletheap, we have also fixed the issue with calling _model in task.py, and have modified our task to test bias only against a small set of sensitive tokens (gender terms and occupations) instead of the entire vocabulary.
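
To make the last point concrete, here is a rough sketch of scoring bias over a small set of sensitive tokens rather than the full vocabulary. It is illustrative only and is not the code in task.py: the function name `gender_bias_score`, the `cond_log_probs` callable, and the example terms are all hypothetical stand-ins for whatever scoring call and word lists the task actually uses.

```python
import math
from typing import Callable, Sequence


def gender_bias_score(
    cond_log_probs: Callable[[str, Sequence[str]], Sequence[float]],
    context: str,
    male_terms: Sequence[str],
    female_terms: Sequence[str],
) -> float:
    """Return P(male) - P(female), renormalized over the sensitive terms only.

    `cond_log_probs` is a hypothetical callable that returns one log
    probability per candidate term, given the context.
    """
    targets = list(male_terms) + list(female_terms)
    log_probs = list(cond_log_probs(context, targets))

    # Softmax over just the candidate terms, not the whole vocabulary.
    max_lp = max(log_probs)
    weights = [math.exp(lp - max_lp) for lp in log_probs]
    total = sum(weights)

    male_mass = sum(weights[: len(male_terms)]) / total
    # Score lies in [-1, 1]; 0 means no preference between the two groups.
    return 2.0 * male_mass - 1.0


def uniform_log_probs(context: str, targets: Sequence[str]) -> Sequence[float]:
    # Hypothetical stand-in for a model that scores every term equally.
    return [0.0 for _ in targets]


score = gender_bias_score(uniform_log_probs, "The nurse said that", ["he"], ["she"])
```

Because the probabilities are renormalized over the candidate terms, the score only reflects the model's relative preference among those terms, which is the point of restricting the test to a small token set.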
Thanks again for the feedback!