Identifying More Opioid Epidemic Service Provider Relationships: Mining Internal Revenue Service XML Data
Jay Colbert
The Polis Center
The Polis Center is leading the opioid work on the Indiana Data Partnership, using social network analysis and cluster mapping to discover networks of connected organizations, as well as to determine the strength of those connections. Our aims are to discover organizations that serve as relationship brokers and identify potential new partners for coalitions to combat the opioid epidemic. We use several methods to obtain this information including interviews, surveys, and web listings of coalitions and board members. Another source is administrative data from government agencies such as the Family and Social Services Administration (FSSA), the Indiana Management Performance Hub (MPH), and the Internal Revenue Service (IRS).
What is IRS 990 data?
The IRS data comes from Form 990, which tax-exempt organizations must file annually unless an exception applies. According to the IRS:
“A tax-exempt organization must file an annual information return or notice with the IRS, unless an exception applies. Form 990 is the IRS' primary tool for gathering information about tax-exempt organizations, educating organizations about tax law requirements and promoting compliance. Organizations also use the Form 990 to share information with the public about their programs. Additionally, most states rely on the Form 990 to perform charitable and other regulatory oversight and to satisfy state income tax filing requirements for organizations claiming exemption from state income tax.” (https://www.irs.gov/charities-non-profits/form-990-resources-and-tools)
Did you notice the part about “share information with the public about their programs”? For this project, we access lists of an organization’s board members and the organizations it funds. (As a side note, there is no place on Form 990 to list organizations that fund you, only organizations that you fund.)
If you really need to make your eyes water, check out the form in its entirety. And don’t forget Schedule A, which lists funded organizations, or Schedule I, which has more info on funded organizations.
How do we use IRS 990 data to identify relationships?
We use 990 data to identify relationships between organizations based on shared board members and funding sources. If two organizations share a board member, it might not indicate a super-close relationship, but we can be very confident that the organizations are at least aware of each other. If you are familiar with our Kumu relationship maps (aka “The Bubbles”), you would see this as a “Networking” relationship. And if one organization funds another, we can be quite sure there is a close relationship between the two, because funders tend to keep an eye on where their money goes and what it is doing.
How do we get the 990 data?
In recent years, IRS 990 data has become more accessible. You can get lists of organizations, but these do not include some of the fun stuff like board members and funders. To get the fun stuff, you have to go to the XML files, which the IRS provides through Amazon and Google. We use a representation built upon the Amazon structure. It gives us an index of organizations with pointers to their annual XML files, and those files hold the more interesting data such as employees, board members, and funded organizations.
If you’re still in need of some eye watering, a good example is United Way of Central Indiana’s 2014 XML file.
One of our interns, Saket Talware, wrote some Python scripts that extract the pieces we wanted and place the data in an easier-to-consume CSV file format.
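To give a feel for what that kind of extraction involves, here is a minimal sketch. It is not the actual script. The sample filing is made up, and the element names (`Form990PartVIISectionAGrp`, `PersonNm`, `TitleTxt`, `BusinessNameLine1Txt`) are assumptions modeled on recent e-file schema versions; element names differ across schema years, so a real script would need to handle several variants.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by IRS e-file XML returns.
NS = {"irs": "http://www.irs.gov/efile"}

# Made-up filing for illustration only. Element names are assumptions
# modeled on recent schema versions, not copied from a real return.
SAMPLE_XML = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <Filer>
      <EIN>000000000</EIN>
      <BusinessName>
        <BusinessNameLine1Txt>EXAMPLE ORG INC</BusinessNameLine1Txt>
      </BusinessName>
    </Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990>
      <Form990PartVIISectionAGrp>
        <PersonNm>PERSON ONE</PersonNm>
        <TitleTxt>DIRECTOR</TitleTxt>
      </Form990PartVIISectionAGrp>
      <Form990PartVIISectionAGrp>
        <PersonNm>PERSON TWO</PersonNm>
        <TitleTxt>TREASURER</TitleTxt>
      </Form990PartVIISectionAGrp>
    </IRS990>
  </ReturnData>
</Return>
"""


def extract_people(xml_text):
    """Return one dict per person listed in the filing."""
    root = ET.fromstring(xml_text)
    filer = "irs:ReturnHeader/irs:Filer/"
    ein = root.findtext(filer + "irs:EIN", default="", namespaces=NS)
    org = root.findtext(
        filer + "irs:BusinessName/irs:BusinessNameLine1Txt",
        default="",
        namespaces=NS,
    )
    rows = []
    # Walk every officer/director/employee group anywhere in the return.
    for grp in root.iter("{http://www.irs.gov/efile}Form990PartVIISectionAGrp"):
        rows.append({
            "EIN": ein,
            "Org_Name": org,
            "Name": grp.findtext("irs:PersonNm", default="", namespaces=NS),
            "Title": grp.findtext("irs:TitleTxt", default="", namespaces=NS),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the extracted rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["EIN", "Org_Name", "Name", "Title"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


people = extract_people(SAMPLE_XML)
print(rows_to_csv(people))
```

Running the same function over every XML file the index points to, and appending the rows to one CSV, would produce a person-to-organization table like the one shown in the results.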
What were the results?
Our work identified over 48,000 Form 990 filings (2011 – 2018) from almost 12,000 unique organizations in Indiana. These include over 68,000 unique people identified as employees, trustees, or board members.
Here’s an example:
Name | Org_Name |
---|---|
confidential | CHILDREN'S BUREAUINC |
confidential | METROPOLITAN INDIANAPOLIS PUBLIC BROADCASTING INC |
confidential | THE JULIAN CENTER INC |
With this information, we draw connections between these organizations on the assumption that they are at least aware of each other.
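Here is a rough sketch of how rows shaped like the Name / Org_Name table above could be turned into organization-to-organization connections. The rows are hypothetical placeholders, and counting the number of shared people as a connection weight is our illustrative assumption, not necessarily how the relationship maps score a link.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (Name, Org_Name) rows; not real filing data.
rows = [
    ("PERSON A", "ORG ONE"),
    ("PERSON A", "ORG TWO"),
    ("PERSON A", "ORG THREE"),
    ("PERSON B", "ORG TWO"),
    ("PERSON B", "ORG THREE"),
    ("PERSON C", "ORG FOUR"),
]


def shared_board_edges(rows):
    """Map each pair of organizations to the set of people they share."""
    orgs_by_person = defaultdict(set)
    for person, org in rows:
        orgs_by_person[person].add(org)

    edges = defaultdict(set)
    for person, orgs in orgs_by_person.items():
        # Sorting keeps each pair in one canonical order, so (A, B)
        # and (B, A) are never stored as separate edges.
        for a, b in combinations(sorted(orgs), 2):
            edges[(a, b)].add(person)
    return dict(edges)


edges = shared_board_edges(rows)
for (a, b), people in sorted(edges.items()):
    print(f"{a} <-> {b}: {len(people)} shared")
```

An organization whose people appear nowhere else (ORG FOUR here) gets no edges at all, which is what you would expect from a "Networking" relationship built purely on shared names.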
Are there any limitations?
There certainly are.
First, we only get XML files for forms that are electronically filed. If an organization is doing it with pen and paper, then we are out of luck.
Second, organization names can be messy. In the example below, the three names almost certainly refer to the same organization, just spelled slightly differently. Employer Identification Numbers (EINs) help in many cases but not all, so we create a de-duplicated list of organizations. There are also complications in matching these orgs to the existing orgs in our database.
Name |
---|
ALPHA CHI SIGMA EDUCATIONAL |
ALPHA CHI SIGMA EDUCATIONAL FOUNDATION |
ALPHA CHI SIGMA EDUCATIONAL FOUNDATION INC |
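For cases where the EIN does not settle it, one possible approach is to normalize the names and group the ones that are nearly identical. The sketch below is only an illustration of that idea, not our actual de-duplication process. The suffix list and the 0.8 similarity threshold are assumptions you would need to tune, and any automatic grouping like this should be reviewed by a person.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"INC", "INCORPORATED", "CORP", "CORPORATION", "LLC"}


def normalize(name):
    """Uppercase, strip punctuation, and drop legal suffixes."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def cluster_names(names, threshold=0.8):
    """Greedily group names whose normalized forms are similar.

    Each name is compared against the first member of every existing
    cluster and joins the first cluster that meets the threshold.
    """
    clusters = []  # list of (representative normalized name, members)
    for name in names:
        norm = normalize(name)
        for rep, members in clusters:
            if SequenceMatcher(None, rep, norm).ratio() >= threshold:
                members.append(name)
                break
        else:
            clusters.append((norm, [name]))
    return [members for _, members in clusters]


names = [
    "ALPHA CHI SIGMA EDUCATIONAL",
    "ALPHA CHI SIGMA EDUCATIONAL FOUNDATION",
    "ALPHA CHI SIGMA EDUCATIONAL FOUNDATION INC",
    "THE JULIAN CENTER INC",
]
for group in cluster_names(names):
    print(group)
```

With these inputs, the three Alpha Chi Sigma spellings land in one group and The Julian Center stays on its own. A threshold this loose will also merge genuinely different organizations with similar names, which is one reason the EIN is still the better key whenever it is available.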
Third, people names can be messy. James Smith looks like a busy fellow with how many boards he is on, but maybe there are multiple people in the data with that name. And is Jim Smith the same guy or maybe somebody else? So we still have some work to do creating a de-duplicated list of people.
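We can't tell two James Smiths apart from the name alone, but we can at least flag the names that deserve a second look. The sketch below counts how many distinct organizations each name appears with and reports the ones above a cutoff. The rows and the cutoff of three boards are hypothetical, and the function only flags names for manual review; it does not decide whether Jim and James are the same person.

```python
from collections import defaultdict


def flag_busy_names(rows, max_boards=3):
    """Return names linked to more than max_boards distinct organizations."""
    orgs_by_name = defaultdict(set)
    for name, org in rows:
        # Collapse case and extra whitespace so trivial variants match.
        key = " ".join(name.upper().split())
        orgs_by_name[key].add(org)
    return {
        name: sorted(orgs)
        for name, orgs in orgs_by_name.items()
        if len(orgs) > max_boards
    }


# Hypothetical rows; not real filing data.
rows = [
    ("James Smith", "ORG ONE"),
    ("JAMES SMITH", "ORG TWO"),
    ("James  Smith", "ORG THREE"),
    ("James Smith", "ORG FOUR"),
    ("Jim Smith", "ORG FIVE"),
    ("Person B", "ORG ONE"),
]
flagged = flag_busy_names(rows)
for name, orgs in flagged.items():
    print(name, "appears with", len(orgs), "organizations")
```

Here only JAMES SMITH is flagged. Jim Smith stays a separate entry, since linking nicknames to full names is exactly the kind of judgment call that still needs a person.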
What’s next?
This is a work in progress. A future post will discuss what we uncovered when looking at shared funders.