On Siri and Recognitive Violence
By Joshua A. Halstead
As a disabled person who relies on speech recognition software to complete a range of daily writing tasks (from emails to book chapters), I am no stranger to the universe of voice assistants, having cut my teeth on Dragon Dictate in the ’90s.
Though I’m used to the software knowing my voice, that it now knows my location is uncanny. The discovery came on a morning stroll, when Siri spelled “Urth Caffé” correctly rather than transcribing “earth café,” as I expected. That was when I realized my assistant had turned into a stalker.
In this short article, I argue that Apple’s decision to integrate user location data into Siri’s speech recognition system created a product that contributes to gentrification and could further marginalize disabled people.
Before we continue, it is important to provide a general sketch of how Siri converts speech into text. Like other popular voice assistants, Siri leverages automatic speech recognition (ASR) technology. At its most basic level, ASR makes words out of waveforms. When a phrase like “Hey Siri” is spoken, ASR slices the audio into 10-millisecond chunks. Each chunk is then mapped onto a spectrogram, which visualizes its frequency content relative to the others. Because waveforms vary with accent, dialect, and speech rhythm, these representations are passed through an encoder, which learns to sort them according to contextual patterns. Finally, the encoded waveforms are handed to a decoder that methodically assigns characters and words to these patterns.
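To make that pipeline concrete, here is a minimal, illustrative sketch in Python. The 10-millisecond framing follows the description above; the encoder and decoder are crude stand-ins for the neural networks a real system would use, and none of this reflects Apple’s actual (unpublished) implementation.

```python
# A minimal sketch of the ASR pipeline described above, using NumPy only.
# The encoder/decoder below are stand-ins, not a real speech recognizer.
import numpy as np

SAMPLE_RATE = 16_000                          # samples per second
FRAME_MS = 10                                 # each chunk covers 10 milliseconds
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 160 samples per chunk

def spectrogram(audio: np.ndarray) -> np.ndarray:
    """Slice audio into 10 ms chunks and map each to a frequency spectrum."""
    n_frames = len(audio) // FRAME_LEN
    frames = audio[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    # Magnitude of the FFT of each windowed frame: one spectrogram column per chunk.
    return np.abs(np.fft.rfft(frames * np.hanning(FRAME_LEN), axis=1))

def encode(spec: np.ndarray) -> np.ndarray:
    """Stand-in for the learned encoder that sorts frames by contextual patterns."""
    return (spec - spec.mean(axis=0)) / (spec.std(axis=0) + 1e-8)

def decode(encoded: np.ndarray) -> str:
    """Stand-in for the decoder that assigns characters and words to patterns."""
    return "hey siri"  # a real decoder searches over a vocabulary and language model

if __name__ == "__main__":
    audio = np.random.randn(SAMPLE_RATE)      # one second of fake audio
    print(decode(encode(spectrogram(audio))))
```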
ASR has become far more accurate over the last decade thanks to advances in machine learning. However, its effectiveness in recognizing small, local entities is still limited. Correctly translating X (a waveform) into Y (e.g., an entity) is among the challenges in artificial intelligence that have produced today’s multi-billion-dollar industry (with annual growth projected at nearly 15%), and multinationals like Google, Microsoft, and Apple are vying for victory.
It was deemed necessary, then, for Apple’s Speech Recognition (SR) team to mobilize users’ location data to improve the Siri product and pull ahead of the competition. To do this, the team studied the distances between users and their searched points of interest (POIs) by reviewing over 4.6 million Siri logs in 2018. Focusing exclusively on the United States, Apple researchers found that approximately 70% of user queries were directed within a 50-mile radius of users’ locations. Users’ reliance on Siri for local queries raised concern about its unreliability in recognizing small businesses, which are often absent from the training data. Apple SR therefore decided to construct region-specific language models (Geo-LMs) based on combined statistical areas (CSAs) as designated by the US Census Bureau. CSAs were chosen because they capture socioeconomic links within regions, which suited the team’s focus on improving Siri’s recognition of small businesses. In regular ASR systems, as explained above, encoded waveforms are passed through a decoder to produce words from audio patterns. In Siri’s Geo-LM system, user location data tells ASR which region-specific language model to load in order to increase the likelihood of small-business recognition. With this approach, Apple improved Siri’s comprehension of local POIs by about 18%, tying Apple’s continued success as a company not just to the development of technology, but to the development of land through capital.
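In code, that selection step might look roughly like the following sketch. The CSA lookup, the model names, and the fallback to a general model are all hypothetical; Apple has not published its implementation, so this is only a way of picturing how location data could steer which language model the decoder uses.

```python
# A hedged sketch of location-based language model (Geo-LM) selection.
# The CSA lookup, model names, and coordinates here are illustrative assumptions.
from typing import Optional

GEO_LMS = {
    "Los Angeles-Long Beach CSA": "geo_lm_los_angeles",
    "New York-Newark CSA": "geo_lm_new_york",
}
GLOBAL_LM = "general_lm"

def csa_for_location(lat: float, lon: float) -> Optional[str]:
    """Map device coordinates to a Combined Statistical Area (stubbed here)."""
    if 33.5 < lat < 34.5 and -119.0 < lon < -117.5:
        return "Los Angeles-Long Beach CSA"
    return None

def select_language_model(lat: float, lon: float) -> str:
    """Load the region-specific Geo-LM when one exists, else the global model."""
    csa = csa_for_location(lat, lon)
    return GEO_LMS.get(csa, GLOBAL_LM)

# A query made near downtown Los Angeles loads the LA Geo-LM, raising the
# decoder's likelihood of producing local names like "Urth Caffé".
print(select_language_model(34.05, -118.25))
```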
What does it mean when our devices recognize us more accurately when we utter the dialect of consumerism? Further, what might be the costs of being “recognized,” when recognition is no longer a merely social interaction but a multi-layered sociotechnical one between subjects, land, and multinational corporations tied up in market forces? Such questions deserve more attention than can be afforded in blog format. Suffice it to say that recognitive violence, that is, a systematic categorization, valuation, and transformation of voice, identity, and land, motivated by capital gain, could plausibly be advanced by voice assistants like Siri. Insofar as Apple benefits from Siri’s recognitive accuracy, and this accuracy is conditioned on the enmeshment of language and commerce, it is possible to underscore the company’s vested interest, whether explicit or tacit, in gentrification.
This interest becomes apparent in Apple’s partnership with Yelp. Announced with the launch of Siri in 2011, the partnership has long shaped how Siri’s ASR Geo-LM system recognizes local businesses. For instance, when the SR team announced improvements to the system in 2018, the training data behind the 18% increase in comprehension integrated Yelp’s top 1,000 entities, ranked by number of reviews, in eight large cosmopolitan regions. As Los Angeles was one of those regions, it’s no wonder that Siri recognized my uttering “earth café” as “Urth Caffé,” the most highly reviewed restaurant on my block. Yelp-trained Siri paired my location with its ASR Geo-LM and correctly matched waveforms to a local business.
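As a rough illustration of what a “top 1,000 entities by review count” cutoff implies for training data, consider the sketch below. The field names and the sample businesses are invented (only Urth Caffé’s review count comes from this article), and the actual Apple–Yelp pipeline is not public.

```python
# Illustrative only: how a "top N entities by review count" criterion might
# select training data per region. Data and field names are assumptions.
from collections import defaultdict

def top_entities_per_region(businesses, n=1000):
    """Group businesses by region and keep the n most-reviewed in each."""
    by_region = defaultdict(list)
    for b in businesses:
        by_region[b["region"]].append(b)
    return {
        region: sorted(items, key=lambda b: b["review_count"], reverse=True)[:n]
        for region, items in by_region.items()
    }

sample = [
    {"name": "Urth Caffé", "region": "Los Angeles", "review_count": 3804},
    {"name": "Corner Taquería", "region": "Los Angeles", "review_count": 212},
]
print(top_entities_per_region(sample, n=1))  # only the most-reviewed business survives
```

Even in this toy version, the cutoff means that a business with few reviews simply never enters the model’s vocabulary.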
However, when I nervously tested Siri on other locations around my neighborhood, Urth Caffé, a restaurant with 3,804 Yelp reviews, was the only one Siri recognized. That Siri did not recognize the restaurants with fewer reviews made me question how Apple’s partnership with Yelp might impact the trajectory of local development in my neighborhood, which is already quite gentrified.
Pascale Joassart-Marcelli is a Professor of Geography at San Diego State University whose work focuses on the intersections of food, ethnicity, and place. For the past decade, her attention has been attuned to the transformation of San Diego neighborhoods like Barrio Logan and City Heights, which host large immigrant communities, through social media platforms like Yelp.
In a recent op-ed, which draws on her analysis of 2,400 Yelp reviews of the best- and worst-rated restaurants in these neighborhoods, Joassart-Marcelli underscores that most highly rated ethnic restaurants in Barrio Logan and City Heights are located “in areas where the foodscape is in the process of becoming more cosmopolitan and oriented toward white consumers.” As predominantly white Yelpers flood ethnic foodscapes in search of “exotic” food, Joassart-Marcelli concludes, food prices rise, rent skyrockets, and cultural expression stagnates, slowly corralling the ethnic into a sociotechnical, normative order. It follows, then, that Apple’s partnership with Yelp, justified as giving “voice” to small businesses, is far from innocent and shows how speech recognition can be critically examined as a place-making practice. It should also go without saying that gentrification is a disability issue. The confluence of aggressive urban development policies and the financial limitations imposed on disabled people by public programs such as Social Security Disability Insurance (SSDI) can lead to socioeconomic disparities in housing, safety, food security, and more.
This brief analysis seeks to spotlight only a few sociopolitical aspects of speech recognition systems through the lens of Siri. By integrating user location data into Siri’s ASR platform, Apple had to “bring under one roof” ideologies of population management, a social media platform, and market demands, amid a much broader human and non-human actor network. What can be observed is the recognitive violence this program has inflicted on identity formation, turning the “ear” of Siri overwhelmingly toward the subject position of the consumer. What can be speculated is the potential contribution this orientation may make to the representation and transformation of public space. If Siri affords me a certain degree of access in my daily life, it is not without disaffording many of my neighbors in the near future. Today, when I found out that a beloved local business just two doors down from Urth is closing, I couldn’t help but wonder whether that future is closer than we think.
Joshua A. Halstead is an epistemic activist working at the intersection of critical disability studies, design pedagogy, and community organizing.