Infectious Diseases - Data Cleanup Project
Background Information
The Carter Center is dedicated to advancing human rights and alleviating suffering, and one of its focuses is minimizing infectious diseases. To do this, it collects data from geographic units within each country, but the data is often reported in the native languages of these countries, which leads to conflicting translations. A similar project was previously undertaken for Sudan, and this semester the Carter Center is looking for help with its Ethiopian data. The primary challenge is the lack of language resources for Amharic and the absence of a standard English transliteration. To address this, a tool with an accompanying GUI was developed to transliterate a list of region, zone, and woreda inputs using a standardized list of target mappings.
Project Demo Videos
Google Colab Ethiopian Tool Demo
Anvil Website Demo
Project Evolution
Over the semester, we worked on refining a transliteration tool for the Carter Center's Ethiopian data. Initially, we aimed to develop both a hardcoded solution for Ethiopian geographical mappings and a general solution that would work for any country or language. At the client's request, however, we prioritized a user-friendly GUI built with Anvil for non-technical users and directed our attention away from the generalized tool. We started implementing the generalized tool near the end of the semester but did not finish it.
Through iteration and continuous communication with the client, we improved the tool's usability and tailored it to the client's requests. Throughout the project, keeping an open mind and adapting to new challenges enabled us to better serve the client's needs.
Major Project Goals
- (Primary Goal) Have a working tool that uses hardcoded mappings of Ethiopian woreda (district) names to write the standardized spellings of a list of inputs to an Excel file and mark the cases that should be reviewed by the user.
- (Secondary Goal 1) Have a general working tool so that the user can supply any standardized mapping file for any language/country they are working with and get the standardized spellings in an Excel file that marks the cases that should be reviewed by the user.
- (Secondary Goal 2) Have a working tool that takes changes in district boundaries into consideration and can map previous boundaries to the current boundaries, with the cases that should be reviewed by the user marked in the Excel file.
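The core step behind these goals, matching an input spelling to a standardized spelling and marking uncertain cases, can be illustrated with a minimal sketch. It assumes fuzzy string matching with Python's standard `difflib` module, which may differ from what the project's tool actually uses. The function name `standardize`, the `threshold` value, and the sample place names are all hypothetical.

```python
from difflib import SequenceMatcher

def standardize(name, targets, threshold=0.85):
    """Return (best_match, score, needs_review) for one input spelling.

    Hypothetical helper: the 0.85 threshold is an assumed value,
    not one taken from the project.
    """
    cleaned = name.strip().lower()
    best, best_score = None, 0.0
    for target in targets:
        # ratio() is a similarity score between 0.0 and 1.0
        score = SequenceMatcher(None, cleaned, target.lower()).ratio()
        if score > best_score:
            best, best_score = target, score
    # Anything below the threshold is marked for manual review
    return best, best_score, best_score < threshold

# Illustrative target list, not the project's mapping file
targets = ["Addis Ababa", "Bahir Dar", "Dire Dawa"]
for raw in ["Addis Abeba", "Bahr Dar", "Dire Dawa"]:
    print(raw, "->", standardize(raw, targets))
```

In a real run, each result row would be written to the output spreadsheet, with the `needs_review` flag deciding which cells to highlight.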
Additional goals added during the semester:
- Front-End GUI: At the client’s request, we prioritized developing a GUI using Anvil to make the tool easier for non-technical users to use.
- Refactoring Code: We refactored the Google Colab code to streamline its structure and make it easier to create the general solution for any country in the future.
Next Steps
- Finish generalized tool (can be used for any country): The next step is to finish the general tool by adapting the Ethiopian tool to geographic structures beyond the Ethiopian context. This task was started but not completed. We explored strategies to accommodate any number of geographic layers, such as regions, zones, woredas, districts, or additional hierarchical levels. The generalized tool will let users supply an input file and a target mapping for any data set, provided the files follow the same structure every time. The tool will then produce output in the same standardized format, highlighting uncertain cases for review.
- District Boundary Updates and Mapping: Once the generalized tool is complete, the client has expressed interest in functionality that accounts for changes in district boundaries that are reported online. This would involve pulling updated boundary data directly from online resources. The tool would then map previous district boundaries to the new ones so that it always reflects the current boundaries. Like the current tool, the updated version would flag cases that require user review by highlighting them.
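One way to support any number of geographic layers, as described in the generalized-tool item above, is to match level by level and narrow the candidates to children of the parent already matched. The following is a rough sketch under that assumption, not the team's unfinished implementation. `match_row`, the `cutoff` value, and the sample rows are hypothetical.

```python
from difflib import get_close_matches

def match_row(row, targets, levels, cutoff=0.8):
    """Match one input row against the target mapping, one level at a time.

    `levels` lists the column names from the broadest to the narrowest
    layer, so the same code works for two layers or five.
    Matching here is case-sensitive.
    """
    candidates = targets
    matched = {level: None for level in levels}
    for level in levels:
        # Only names that exist under the parent matched so far
        options = sorted({t[level] for t in candidates})
        hit = get_close_matches(row[level], options, n=1, cutoff=cutoff)
        if not hit:
            # No confident match at this level: stop and flag for review
            return matched, True
        matched[level] = hit[0]
        candidates = [t for t in candidates if t[level] == hit[0]]
    return matched, False

# Illustrative rows only, not taken from the project's mapping file
levels = ["region", "zone", "woreda"]
targets = [
    {"region": "Amhara", "zone": "North Gondar", "woreda": "Dembia"},
    {"region": "Amhara", "zone": "South Gondar", "woreda": "Farta"},
    {"region": "Oromia", "zone": "Jimma", "woreda": "Kersa"},
]
row = {"region": "Amhara", "zone": "North Gonder", "woreda": "Dembiya"}
print(match_row(row, targets, levels))
```

Restricting each level to the children of the matched parent keeps identically named units in different regions from being confused, at the cost of flagging the whole row when a higher level fails to match.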
Team Reflection
What Worked Well
- Text Communication: Having a text chain where we communicate multiple times a week about what we are individually working on, blockers we have run into, or guidance on what to work on next
- Shared Folders: Having a shared folder for data files and other documents we are turning in so we all can see and use the information we have
- Meeting Notes: Taking meeting notes so we can look back on things we need to work on and people who were not available for the meeting can review what was discussed
- Zoom Meetings with Clients: Having bi-weekly meetings with Jenna and Emily to keep them updated, show them demos, and ask questions as they come up
- Weekly Meeting with Team: Friday work jam sessions following meetings with Emily and Jenna
- Splitting up Tasks: Dividing responsibilities among team members helped us make efficient progress and keep accountability
What Could Be Improved
- Version Control: Version control was an issue for our team while working with code in both Google Colab and Anvil. Google Colab does not offer a true Git or version control system, which sometimes led to outdated files/code on different team members' computers. While we did use Git for the Anvil portion, we found that the Anvil Git implementation was a little clunky.
- Personalized Website Domain Name: A personalized domain name would have been a nice addition to the project, but we are not sure it was financially achievable. Purchasing a custom domain name costs money, and for the purposes of our project, the Carter Center was fine with a free, randomized domain name generated by Anvil.
Advice for Future Teams
- Frequent Communication, Meetings, and Updates: Maintaining regular communication helps everyone stay aligned on project goals, progress, and issues as they arise. Frequent updates help identify potential roadblocks early and allow the team to overcome them quickly, so we recommend keeping this practice.
- Having Set Times for Meetings: Scheduling consistent meeting times creates structure, which makes it easier for team members to plan and prioritize their work.
- Having a Shared Space for All Files: Having one place for storing files (such as Google Drive) ensures that all team members have easy access to resources, reducing confusion.
- Taking Meeting Notes: Documenting meetings with the client creates a reliable reference point for the team. This minimizes misunderstandings and ensures that the things we need to complete are clearly defined.
Resources for the Next Team