Sharing taxpayer data with ICE: The risks to Americans are greater than they appear
On April 8, many media outlets reported that there's now a data sharing agreement between the Internal Revenue Service (IRS) and Immigration and Customs Enforcement (ICE) to share data from taxpayers who have Individual Taxpayer Identification Numbers (ITINs). ITINs allow U.S. residents who lack a Social Security number and are ineligible to receive one to pay taxes on their earnings. We can safely assume that many or most of these people are, or were at one time, undocumented. This sort of data sharing is generally against the law.
Breaking all this down:
- Taxpayers who are undocumented don't have access to Social Security, Medicare, and most other benefits, even though they follow the law and pay tens of billions of dollars in taxes each year.
- The IRS database of taxpayers with ITINs also has other personal details, including address and other contact information.
- ICE will now have access to these data.
- ICE deports people to an El Salvador prison and claims it can't retrieve people it sent there in error. Thankfully, the Supreme Court has said the agency must "facilitate" the release of people wrongfully imprisoned.
ICE will use this database to find and deport undocumented people who choose to follow the law by paying taxes. But this dataset (like all big datasets) is riddled with technical flaws, and there's a high risk that ICE will end up deporting the wrong people—then leaving them in an international prison indefinitely, in blatant violation of their Fifth Amendment rights.
The risk is in the details
First, let's reiterate the stakes here: ICE is deporting people to El Salvador and claims it can't retrieve people it sent there in error. Any errors they make in matching records here could result in the wrong people being sent to a prison in another country. This approach is haphazard and dangerous because there are so many opportunities for errors in processing:
Risk: These databases may not be able to store your real name
Data stored in mainframes or older databases tend to have different constraints than data in modern systems. The IRS created ITINs in 1996, so we can assume that they are stored in an older database. Years ago, databases limited the number of characters in a data field, sometimes to as few as 64 characters for your whole name. If you've ever seen a long name cut off on a form, this may be why.
Also, some systems don't support accents, apostrophes, or other special characters. They may not preserve upper and lower case, either. So José may be stored as Jose or JOSE or JOSÉ. Jon'll may become Jonll or JONLL.
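To make that concrete, here's a rough sketch in Python of what an older system might do to a name before storing it. I don't know how the IRS or ICE systems actually handle names, so the function `legacy_store` and its rules (strip accents, drop anything that isn't a plain letter, space, or hyphen, uppercase everything, cut off at 64 characters) are assumptions for illustration only.

```python
import unicodedata


def legacy_store(name: str, max_len: int = 64) -> str:
    """Hypothetical model of how a legacy name field might store a name.

    The rules here are assumptions for illustration, not the behavior
    of any real government system.
    """
    # Split accented letters into a base letter plus a combining mark,
    # then throw the combining marks away (é becomes e, ñ becomes n).
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Keep only plain ASCII letters, spaces, and hyphens. Apostrophes
    # and any remaining non-ASCII letters are silently dropped.
    kept = "".join(
        ch for ch in no_accents
        if ch.isascii() and (ch.isalpha() or ch in " -")
    )
    # Uppercase everything and cut the name off at the field limit.
    return kept.upper()[:max_len]


print(legacy_store("José"))    # JOSE
print(legacy_store("Jon'll"))  # JONLL

# Two differently spelled names become indistinguishable once stored.
print(legacy_store("José Peña") == legacy_store("Jose Pena"))  # True
```

The last line is the worrying one: once both spellings collapse into the same stored value, nothing in the database can tell you which person the record originally described.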
Finally, for names that don't fit these restrictions, sometimes government employees are the ones trying to figure out the best way to enter an unsupported name into the database, and different systems may store the same name in different ways. This all means that the name that is ultimately recorded in the database may not be easily matched to the person's ID.
These limitations most often affect long names and names with characters that aren't in the English alphabet. You might ask why these systems are still in service, and you're right to ask. These databases should be updated, and often they are included in agency roadmaps for modernization.
Risk: Matching data in databases is hard, period
In this ICE-IRS matching example, someone will need to match people from one database to another. This can be hard even in the best of times. How many people do you know with a common last name like Jones, Lee, or Gonzalez?
Now, let's take what I bet is a large dataset of people from both systems. How do you match the many people with the same name in one system to the many people with the same name in the other? My name is reasonably common, and at one point another person with my name, spelled differently, worked at the same company as me. We got emails intended for each other all the time, and one time, I got her W-2 at my house! We knew each other by then, so we worked it out. But in the context of a large, national database, possibly with incomplete data or incorrect names, we can't expect ICE agents to make this kind of manual correction, person by person.
Good database practice is to assign a "unique identifier" for each entry, a field or fields that can be relied on to be unique. What happens if the ICE and IRS databases have different unique identifiers? What if the only overlapping field is the names? Even if they have additional information like address, it's common for parents and children with the same name to live in the same house. How will they match people accurately?
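Here's a small Python sketch of that problem. Every record below is made up, and the field names (`tax_id`, `case_id`, `name`, `address`) are hypothetical, since I don't know what either agency's schema looks like. The point is only to show what happens when two datasets share no unique identifier and you fall back on name, or name plus address.

```python
# Entirely fabricated records for illustration. No real schema is implied.
tax_records = [
    {"tax_id": "T-001", "name": "MARIA GONZALEZ", "address": "12 OAK ST"},
    # A parent and child with the same name living in the same house.
    {"tax_id": "T-002", "name": "MARIA GONZALEZ", "address": "12 OAK ST"},
    {"tax_id": "T-003", "name": "MARIA GONZALEZ", "address": "98 ELM AVE"},
    {"tax_id": "T-004", "name": "DAVID LEE", "address": "5 PINE RD"},
]

enforcement_records = [
    {"case_id": "C-77", "name": "MARIA GONZALEZ", "address": "12 OAK ST"},
    {"case_id": "C-78", "name": "DAVID LEE", "address": "5 PINE RD"},
]


def candidates(target, records, keys):
    """Return the tax_ids of every record that agrees with target on keys."""
    return [
        r["tax_id"]
        for r in records
        if all(r[k] == target[k] for k in keys)
    ]


for case in enforcement_records:
    by_name = candidates(case, tax_records, ["name"])
    by_name_and_address = candidates(case, tax_records, ["name", "address"])
    print(case["case_id"], "name only:", by_name)
    print(case["case_id"], "name + address:", by_name_and_address)
```

In this toy data, case C-77 matches three tax records by name and still matches two after adding the address. Nothing in the data says which of those people is the right one, so someone (or some piece of software) has to guess.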
Risk: They are being asked to do this fast
We've seen declarations that DOGE is going to modernize a mainframe or build a better data sharing method in 30 days — these are hopelessly naive goals for systems so complex. Knowing that the data may not reflect a person's real name, and the two databases may not support easy matching, what do you expect to happen when we add time pressure to the mix? It's human nature to perform worse under pressure, so working quickly raises the possibility of errors.
Risk: They are likely using AI to help with linking data
Because DOGE is infatuated with AI and they want to do this fast, I expect that they're using AI agents or large language models (LLMs) to help. We all know these models make errors, misinterpret directions, and create content that is harmful or incorrect. They can still be enormously helpful in applications where you can afford errors or time spent fact checking, but DOGE isn't known for measuring twice before they cut.
The other challenge with LLMs like ChatGPT, Claude, Gemini, and Copilot is that while it's possible to test the content and solutions created by these algorithms, it's time-consuming and requires manual work.
Risk: They might not match the data at all
There's a chance they might not even care about matching data between IRS and ICE databases and will start going door to door based on taxpayer data, regardless of immigration status. Immigration actions have been carried out without care or caution so far, so in this worst case scenario, everything I've said about technical errors may be moot, and even legal US residents would be at risk of deportation.
The bottom line
To repeat, it's against the law to share data like this. Complicating matters, the risks are multiplied by the databases and the speed and tools these people choose to use.
This is a huge danger, a horrible precedent, and something we cannot sit by and let happen.