Try this simple fix if Siri keeps getting your name wrong

I’ve had an iPhone for ten years, and I love it. Unlike some people, I genuinely enjoy Siri and use it frequently. But after ten years, Siri still hasn’t figured out that when it transcribes my texts, my wife’s name is not Aaron, it’s Erin. I forgive the speech-to-text implementation, which is resource-intensive, but after I corrected that mistake once and sent a revised text, that correction should have been stored in a correction history on my phone: a small file that a post-processing transformer model could use, along with other clues, to make this mistake far less likely. I know that calling the iPhone’s speech-to-text functionality Siri is an oversimplification, but that’s how my kids think of the ‘AI in my iPhone.’

Speech-to-text systems often struggle with homophones: words that sound the same but have different spellings and meanings. These errors can be frustrating, especially when they affect personal names or commonly used terms. The key to fixing this problem lies not in overhauling the speech recognition engine but in a lightweight, post-transcription text-processing layer that adapts to user corrections over time. Here’s the Python code I designed to address this, which uses a PyTorch-backed Hugging Face model for context analysis.

It’s compact and easy to deploy on a phone once compiled for mobile. I know that behind Siri sits a highly complex set of chained models, so this code could simply provide a new feature as input to those models: a score that helps personalize the transcription when particular homophones arise. But it would be simpler to use it as a post-processing layer.

This doesn’t have to wait for a new phone release to be deployed. It would make life better for me in the next update Apple releases for my iPhone.

The Core Idea

This approach focuses on three main elements:

  • Correction History: Stores previous user corrections, prioritizing words the user has explicitly fixed before.
  • Frequent Contacts: Tracks frequently used words or names, assigning a higher likelihood to those more commonly used.
  • Contextual Analysis: Uses Natural Language Processing (NLP) to analyze the surrounding text for clues that help disambiguate homophones.

The system calculates a likelihood score for each homophone candidate based on these three factors and selects the most likely correction. Below is the Python implementation broken into sections with explanations.

Loading the Homophones Database

The first step is creating or loading a database of homophones. These are word pairs (or groups) that are likely to be confused during transcription.

# Homophones database
homophones_db = {
    "Aaron": ["Erin"],
    "bare": ["bear"],
    "phase": ["faze", "Faye's"],
    "affect": ["effect"],
}

This is a simple dictionary where the key is the incorrectly transcribed word, and the value is a list of homophone alternatives. For example, “phase” can be confused with “faze”. Later, this database will be queried when an ambiguous word is encountered.

Tracking Correction History

The code tracks user corrections in a dictionary where each key is a tuple of (original_word, corrected_word) and the value is the count of times the user corrected that error.

# Correction history tracker
correction_history = {
    ("phase", "Faye's"): 3,
    ("bear", "bare"): 2,
}

If the user corrects “phase” to “Faye’s” three times, the system prioritizes this correction for future transcriptions.

Frequent Contacts

Another factor influencing homophone selection is how often a particular word is used. This could be personal names or terms the user frequently types.

# Frequent contact tracker
frequent_contacts = {
    "Faye's": 15,
    "phase": 5,
    "Erin": 10,
    "Aaron": 2,
}

The system gives more weight to frequently used words when disambiguating homophones. For instance, if “Faye's” appears 15 times but “phase” appears only 5 times, “Faye's” will be preferred. The keys use the same spellings as the candidates in the homophones database, so the scoring code’s lookups can find them.

Contextual Analysis

Context clues are extracted from the surrounding sentence to further refine the selection. For example, if the sentence contains the pronoun “she”, the system might favor “Erin” over “Aaron”.

from transformers import pipeline

# Load an NLP model for context analysis (loaded here for richer future
# scoring; the simple pronoun check below does not call it)
context_analyzer = pipeline("fill-mask", model="bert-base-uncased")

def detect_context(sentence):
    """Detect context-specific clues in the sentence."""
    pronouns = ["he", "she", "his", "her", "their"]
    tokens = sentence.lower().split()
    return [word for word in tokens if word in pronouns]

This function scans the sentence for gender-specific pronouns or other clues that might indicate the intended meaning of the word.
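A quick check of the pronoun detector, reproduced here so the snippet runs on its own:

```python
def detect_context(sentence):
    """Detect context-specific clues in the sentence."""
    pronouns = ["he", "she", "his", "her", "their"]
    tokens = sentence.lower().split()
    return [word for word in tokens if word in pronouns]

print(detect_context("Tell her that she won the race"))
# → ['her', 'she']
```

Note that this matches whole whitespace-separated tokens, so a pronoun with trailing punctuation (e.g., “her,”) would be missed; a production version would strip punctuation first.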

Calculating Likelihood Scores

Each homophone candidate is assigned a likelihood score based on:

  1. Past Corrections: Higher weight (e.g., 3x).
  2. Frequent Usage: Medium weight (e.g., 2x).
  3. Context Matching: Lower weight (e.g., 1x).

def calculate_likelihood(word, candidate, sentence):
    """Calculate a likelihood score for a homophone candidate."""
    correction_score = correction_history.get((word, candidate), 0) * 3
    frequency_score = frequent_contacts.get(candidate, 0) * 2
    context = detect_context(sentence)
    # Clue words associated with each candidate (e.g., gendered pronouns);
    # the homophones database holds alternative spellings, not clues
    candidate_clues = {"Erin": ["she", "her"], "Aaron": ["he", "his"]}.get(candidate, [])
    context_score = sum(1 for clue in context if clue in candidate_clues)
    return correction_score + frequency_score + context_score

This score combines the three factors to determine the most likely homophone.

Disambiguating Homophones

With the likelihood scores calculated, the system selects the homophone with the highest score.

def prioritize_homophones(word, candidates, sentence):
    """Prioritize homophones based on their likelihood scores."""
    likelihoods = {
        candidate: calculate_likelihood(word, candidate, sentence) for candidate in candidates
    }
    return max(likelihoods, key=likelihoods.get)

def disambiguate_homophone(word, sentence):
    """Disambiguate homophones using likelihood scores."""
    candidates = homophones_db.get(word, [])
    if not candidates:
        return word
    return prioritize_homophones(word, candidates, sentence)

This process ensures the most appropriate word is chosen based on history, frequency, and context.
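
To see the selection end to end, here is a compact, self-contained sketch of the two functions above using the sample data. Context analysis is omitted for brevity, and the database here includes “Faye's” as a candidate for “phase” so the personalized correction is reachable.

```python
# Sample data from the earlier sections
homophones_db = {"phase": ["faze", "Faye's"]}
correction_history = {("phase", "Faye's"): 3}
frequent_contacts = {"Faye's": 15}

def calculate_likelihood(word, candidate):
    # 3x weight for past corrections, 2x for frequency (context omitted)
    return (correction_history.get((word, candidate), 0) * 3
            + frequent_contacts.get(candidate, 0) * 2)

def disambiguate_homophone(word):
    """Return the highest-scoring candidate, or the word itself."""
    candidates = homophones_db.get(word, [])
    if not candidates:
        return word
    return max(candidates, key=lambda c: calculate_likelihood(word, c))

print(disambiguate_homophone("phase"))  # → Faye's  (score 3*3 + 15*2 = 39 vs. 0 for "faze")
print(disambiguate_homophone("plan"))   # → plan    (not in the database)
```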

Processing Full Transcriptions

The system processes an entire sentence, applying the disambiguation logic to each word.

def process_transcription(transcription):
    """Process the transcription to correct homophones."""
    words = transcription.split()
    corrected_words = [disambiguate_homophone(word, transcription) for word in words]
    return " ".join(corrected_words)
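
One caveat: split() leaves punctuation attached, so a word like “plan.” at the end of a sentence would never match a database key. Below is a sketch of a punctuation-preserving variant; the replacement table is a hypothetical stand-in for the full scoring pipeline described above.

```python
import re

# Stand-in for disambiguate_homophone: a direct replacement table.
# The full system would score candidates as in the sections above.
replacements = {"phase": "Faye's"}

def disambiguate(word):
    return replacements.get(word.lower(), word)

def process_transcription(transcription):
    """Correct homophones while leaving punctuation and spacing intact."""
    # \w+(?:'\w+)? matches words, including contractions like "Faye's"
    return re.sub(r"\w+(?:'\w+)?", lambda m: disambiguate(m.group(0)), transcription)

print(process_transcription("This is phase one plan."))
# → This is Faye's one plan.
```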

Full Example Workflow

# Example transcription and correction
raw_transcription = "This is phase one plan."
corrected_transcription = process_transcription(raw_transcription)

print("Original Transcription:", raw_transcription)
print("Corrected Transcription:", corrected_transcription)

# Simulate user feedback (update_correction_history is defined in the next section)
update_correction_history("phase", "Faye's")
print("Updated Correction History:", correction_history)
print("Updated Frequent Contacts:", frequent_contacts)

Updating Feedback

When the user corrects a mistake, the correction history and frequent contacts are updated to improve future predictions.

def update_correction_history(original, corrected):
    """Update correction history and frequent contacts."""
    correction_history[(original, corrected)] = correction_history.get((original, corrected), 0) + 1
    frequent_contacts[corrected] = frequent_contacts.get(corrected, 0) + 1
    frequent_contacts[original] = max(0, frequent_contacts.get(original, 0) - 1)

Running the workflow above produces:

Original Transcription: This is phase one plan.
Corrected Transcription: This is Faye's one plan.
Updated Correction History: {('phase', "Faye's"): 4, ('bear', 'bare'): 2}
Updated Frequent Contacts: {"Faye's": 16, 'phase': 4, 'Erin': 10, 'Aaron': 2}

Conclusion

This lightweight text-processing layer enhances the accuracy of speech-to-text applications by learning from user corrections, leveraging frequent usage, and analyzing context. It’s compact enough to run on mobile devices and adaptable to individual user needs, offering a smarter alternative to traditional static models. With minimal effort, Apple—or any other company—could integrate this functionality to make virtual assistants like Siri more responsive and personalized.


This article was originally published by Philip Hopkins on HackerNoon.
