
Toward Digital Well-Being: Using Generative AI to Detect and Mitigate Bias in Social Networks

Artificial Intelligence (AI) dominates today’s headlines—hailed as a breakthrough one day, warned against as a threat the next. Yet much of this debate happens in a bubble, focused on abstract hopes and fears rather than concrete solutions. Meanwhile, one urgent and often overlooked challenge is the rise of mental health issues in online communities, where biased or hostile exchanges erode trust and psychological safety.

This article introduces a practical application of AI aimed at that problem: a machine learning pipeline designed to detect and mitigate bias in user-generated content. The system combines deep learning models for classification with generative large language models (LLMs) for crafting context-sensitive responses. Trained on more than two million Reddit and Twitter comments, it achieved high accuracy (F1 = 0.99) and generated tailored moderation messages through a virtual moderator persona.

Unlike much of the hype surrounding AI, this work demonstrates a tangible, deployable tool that supports digital well-being. It shows how AI can serve not just business efficiency or profit, but the creation of fairer, more inclusive spaces where people connect online. In what follows, I outline the pipeline, its performance, and its broader implications for online communities and digital well-being. For readers interested in exploring the research in more depth, including a poster presentation video that walks through the code and the full-length research report, resources are available on GitHub [1].

A machine learning pipeline that employs generative artificial intelligence to address bias in social networks can contribute to society’s mental well-being. This matters because people increasingly trust the answers that large language models provide in conversational, reasoning dialogue.

Method

The system was designed as a three-phase pipeline: collect, detect, and mitigate. Each phase combined established natural language processing (NLP) techniques with modern transformer models to capture both the scale and subtlety of biased language online.

Step 1. Data Collection and Preparation

I sourced 1 million Twitter posts from the Sentiment140 dataset [2] and 1 million Reddit comments from a curated Pushshift corpus (2007–2014) [3]. Comments were cleaned, anonymized, and deduplicated. Preprocessing included tokenization, lemmatization, stopword removal, and phrase matching using NLTK and spaCy.
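The repository [1] contains the full preprocessing code; the snippet below is only a minimal sketch of the steps named above, using spaCy for tokenization, lemmatization, and stopword removal (the pipeline name and the example output are illustrative assumptions).

```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def preprocess(text: str) -> list[str]:
    """Tokenize, lemmatize, and drop stopwords and punctuation from one comment."""
    doc = nlp(text.lower())
    return [
        token.lemma_
        for token in doc
        if not token.is_stop and not token.is_punct and token.lemma_.strip()
    ]

print(preprocess("Grandpa Biden fell up the stairs"))
# e.g. ['grandpa', 'biden', 'fall', 'stair']
```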

To train the models effectively, I engineered metadata features—such as bias_terms, has_bias, and bias_type—that allowed stratification across biased and neutral subsets. Table 1 summarizes these features, while Figure 1 shows the frequency of bias terms across the datasets.

Table 1. Columns used for bias analysis.

Addressing data leakage and model overfitting is important in the early data preparation stages.
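For instance, deduplicating before the split and stratifying on the bias label guard against both problems. The snippet below is a minimal sketch assuming a pandas DataFrame whose columns include the comment text and the has_bias flag from Table 1.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def make_splits(df: pd.DataFrame):
    """Deduplicate before splitting so identical comments cannot leak across splits,
    then stratify on the bias label to preserve the biased/neutral ratio."""
    df = df.drop_duplicates(subset="text")
    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df["has_bias"], random_state=42
    )
    return train_df, test_df
```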

Figure 1. Bias term occurrences (entire dataset vs. stratified dataset vs. training dataset).

Supervised learning techniques are used to label bias terms and classify them as implicit or explicit.

Step 2. Bias Annotation and Labeling

Bias was annotated on two axes: presence (biased vs. non-biased) and form (implicit, explicit, or none). Implicit bias was defined as subtle or coded language (e.g., stereotypes), while explicit bias covered overt slurs or threats. For example, “Grandpa Biden fell up the stairs” was coded as ageist, while “Biden is a grandpa who loves his family” was not. This contextual coding reduced false positives.
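As a rough illustration only (the term lists are hypothetical, and keyword matching alone cannot make the contextual distinction above, since both Biden examples contain “grandpa”), a first-pass labeler might flag candidates on the two axes before contextual review makes the final call.

```python
# Hypothetical, abbreviated term lists; the study used curated lexicons
# plus contextual review, which made the final presence/form decision.
EXPLICIT_TERMS = {"<overt slur or threat>"}      # placeholder entries
IMPLICIT_TERMS = {"grandpa", "sweetheart"}       # coded or stereotyping language

def first_pass_label(tokens: list[str]) -> dict:
    """Flag candidate bias on two axes (presence, form) for contextual review."""
    if any(t in EXPLICIT_TERMS for t in tokens):
        return {"has_bias": True, "bias_type": "explicit", "needs_review": True}
    if any(t in IMPLICIT_TERMS for t in tokens):
        return {"has_bias": True, "bias_type": "implicit", "needs_review": True}
    return {"has_bias": False, "bias_type": "none", "needs_review": False}
```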

Step 3. Sentiment and Classification Models

Two transformer models powered the detection stage:

– RoBERTa [4] was fine-tuned for sentiment classification. Its outputs (positive, neutral, negative) helped infer the tone of biased comments.

– DistilBERT [5] was trained on the enriched dataset with implicit/explicit labels, enabling precise classification of subtle cues.
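The following is a hedged sketch of how the two detectors might be invoked at inference time with the Hugging Face transformers pipeline; the checkpoint paths are placeholders for the fine-tuned weights in the project repository [1], and the printed labels are illustrative.

```python
from transformers import pipeline

# Placeholder checkpoint paths; substitute the fine-tuned weights from the repository [1].
sentiment = pipeline("text-classification", model="path/to/finetuned-roberta-sentiment")
bias_form = pipeline("text-classification", model="path/to/finetuned-distilbert-bias")

comment = "Grandpa Biden fell up the stairs"
tone = sentiment(comment)[0]   # e.g. {'label': 'negative', 'score': 0.97}
form = bias_form(comment)[0]   # e.g. {'label': 'implicit', 'score': 0.94}

if form["label"] != "none":
    print(f"Flagged as {form['label']} bias with a {tone['label']} tone")
```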

Once the detection model was trained to its best accuracy, flagged comments were passed to a large language model, which produced a response.

Step 4. Mitigation Strategy

Bias detection was followed by real-time mitigation. Once a biased comment was identified, the system generated a response tailored to the bias type:

– Explicit bias: direct, assertive corrections.
– Implicit bias: softer rephrasings or educational suggestions.

Responses were generated by ChatGPT [6], chosen for its flexibility and context sensitivity. All responses were framed through a fictional moderator persona, JenAI-Moderator™, which maintained a consistent voice and tone (Figure 3).
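A minimal sketch of this tone-adaptive prompt pattern with the OpenAI Python client appears below; the model name, persona wording, and tone instructions are illustrative stand-ins, not the exact production prompts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = (
    "You are JenAI-Moderator, a fair and consistent community moderator. "
    "Respond in a single short paragraph."
)

TONE = {
    "explicit": "Give a direct, assertive correction of the biased language.",
    "implicit": "Offer a softer rephrasing and a brief educational suggestion.",
}

def moderate(comment: str, bias_form: str) -> str:
    """Generate a bias-type-tailored moderation message for a flagged comment."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": f"{TONE[bias_form]}\n\nComment: {comment}"},
        ],
    )
    return response.choices[0].message.content
```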

Figure 3. Mitigation Responses to Social Network Comments

Step 5. System Architecture

The full pipeline is illustrated in Figure 4. It integrates preprocessing, bias detection, and generative mitigation. Data and model outputs were stored in a PostgreSQL relational schema, enabling logging, auditing, and future integration with human-in-the-loop systems.

Figure 4. Methodology Flow from Bias Detection to Mitigation
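The sketch below shows, in simplified and hypothetical form, how detection and mitigation events could be logged to PostgreSQL for auditing and later human review; the actual schema in the repository is richer than this single table.

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS moderation_log (
    id            SERIAL PRIMARY KEY,
    comment_text  TEXT NOT NULL,
    bias_form     TEXT NOT NULL,        -- 'explicit', 'implicit', or 'none'
    sentiment     TEXT NOT NULL,
    response_text TEXT,                 -- JenAI-Moderator reply, if any
    created_at    TIMESTAMPTZ DEFAULT now()
);
"""

def log_event(conn, comment, bias_form, sentiment, response):
    """Persist one detection/mitigation event for auditing and review."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO moderation_log (comment_text, bias_form, sentiment, response_text) "
            "VALUES (%s, %s, %s, %s)",
            (comment, bias_form, sentiment, response),
        )
    conn.commit()

conn = psycopg2.connect("dbname=bias_pipeline")  # connection string is illustrative
with conn.cursor() as cur:
    cur.execute(DDL)
conn.commit()
```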

Results

The system was evaluated on a dataset of over two million Reddit and Twitter comments, focusing on accuracy, nuance, and real-world applicability.

Feature Extraction

As shown in Figure 1, terms related to race, gender, and age appeared disproportionately in user comments. In the first pass of data exploration, the full datasets were examined, and bias was identified in about 4 percent of comments. Stratification was used to address the imbalance between non-biased and biased comments. Bias terms in categories such as brand and bullying appeared infrequently, while political bias appeared as prominently as other equity-related biases.

Model Performance

– RoBERTa achieved 98.6% validation accuracy by the second epoch. Its loss curves (Figure 5) converged quickly, with a confusion matrix (Figure 6) showing strong class separation.

– DistilBERT, trained on implicit/explicit labels, reached a 99% F1 score (Figure 7). Unlike raw accuracy, F1 better reflects the balance of precision and recall in imbalanced datasets [7] (a short evaluation example follows the figure captions below).

Figure 5. RoBERTa Models | Training vs. Validation Loss, Model Performance over Epochs
Figure 6. RoBERTa Model Confusion Matrix
Figure 7. DistilBERT Models | Training vs. Validation Loss, Model Performance over Epochs
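To make the metric choice concrete, consider a toy example: on an imbalanced test set, a classifier that always predicts the majority (non-biased) class looks accurate but earns an F1 of zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# 95 non-biased vs. 5 biased samples, and a classifier that never flags bias.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks strong
print(f1_score(y_true, y_pred))        # 0.0  -- exposes the failure on the minority class
```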

Bias Type Distribution

Figure 8 shows boxplots of bias types distributed over predicted sentiment record counts. For negative comments (very negative and negative combined), the boxplots spanned roughly 20,000 records of the stratified dataset. For positive comments, that is, comments reflecting affectionate or non-biased sentiment, the boxplots spanned about 10,000 records, as did those for neutral comments. This breakdown of bias type by predicted sentiment validates the sentiment-informed classification logic.

Figure 8. Bias Type by Predicted Sentiment Distribution

Mitigation Effectiveness

Generated responses from JenAI-Moderator, depicted in Figure 3, were evaluated by human reviewers. Responses were judged linguistically accurate and contextually appropriate, especially for implicit bias. Table 2 provides examples of system predictions alongside the original comments, showing sensitivity to subtle cases.

Table 2. Model Test with Example Comments (selected).

Discussion

Moderation is often framed as a technical filtering problem: detect a banned word, delete the comment, and move on. But moderation is also an interaction between users and systems. In HCI research, fairness is not only technical but experiential [8]. This system embraces that perspective, framing mitigation as dialogue through a persona-driven moderator: JenAI-Moderator.

Moderation as Interaction

Explicit bias often requires firm correction, while implicit bias benefits from constructive feedback. By reframing rather than deleting, the system fosters reflection and learning [9].

Fairness, Tone, and Design

Tone matters. Overly harsh corrections risk alienating users; overly polite warnings risk being ignored. This system varies tone: assertive for explicit bias, educational for implicit bias (Figure 4, Table 2). This aligns with research showing fairness depends on context [10].

Scalability and Integration

The modular design supports API-based integration with platforms. Built-in logging enables transparency and review, while human-in-the-loop options ensure safeguards against overreach.
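As one illustration (the framework and endpoint shape are assumptions, not part of the published system), the pipeline could be exposed to a host platform as a single HTTP endpoint and run with uvicorn in the usual way.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Comment(BaseModel):
    text: str

def detect_and_respond(text: str) -> dict:
    """Placeholder for the detection and mitigation steps sketched earlier.
    A real deployment would call the DistilBERT classifier and, for biased
    comments, the JenAI-Moderator response generator, then log the event."""
    return {"bias_form": "none", "response": None}

@app.post("/moderate")
def moderate_endpoint(comment: Comment) -> dict:
    """Expose the pipeline to a host platform as one HTTP endpoint."""
    return detect_and_respond(comment.text)
```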

Ethical and Sociotechnical Considerations

Bias detection risks false positives or over-policing marginalized groups. Our approach mitigates this by stripping personally identifiable information, avoiding demographic labels, and storing reviewable logs. Still, oversight is essential. As Mehrabi et al. [7] argue, bias is never fully eliminated but must be continually managed.

Conclusion

This project demonstrates that AI can be deployed constructively in online communities—not just to detect bias, but to mitigate it in ways that preserve user dignity and promote digital well-being.

Key contributions:
– Dual-pipeline architecture (RoBERTa + DistilBERT). 
– Tone-adaptive mitigation engine (ChatGPT). 
– Persona-based moderation (JenAI-Moderator). 

The models achieved near-perfect F1 scores (0.99). More importantly, mitigation responses were accurate and context-sensitive, making them practical for deployment.

Future directions:
– User studies to evaluate reception. 
– Pilot deployments to test trust and engagement. 
– Strengthening robustness against evasion (e.g., coded language). 
– Expanding to multilingual datasets for global fairness.

At a time when AI is often cast as hype or hazard, this project shows how it can be put to socially beneficial use. By embedding fairness and transparency, it promotes healthier online spaces where people feel safer and respected.

Images, tables, and figures illustrated in this report were created solely by the author.

Acknowledgements

This project fulfilled the Milestone II and Capstone requirements for the Master of Applied Data Science (MADS) program at the University of Michigan School of Information (UMSI). The project’s poster received a MADS Award at the UMSI Exposition 2025 Poster Session. Dr. Laura Stagnaro served as the Capstone project mentor, and Dr. Jinseok Kim served as the Milestone II project mentor.

About the Author

Celia B. Banks is a social and data scientist whose work bridges human systems and applied data science. Her doctoral research in Human and Organization Systems explored how organizations evolve into virtual environments, reflecting her broader interest in the intersection of people, technology, and structures. Dr. Banks is a lifelong learner, and her current focus builds on this foundation through applied research in data science and analytics.

References

[1] C. Banks, Celia Banks Portfolio Repository: University of Michigan School of Information Poster Session (2025) [Online]. Available: https://celiabbanks.github.io/ [Accessed 10 May 2025]

[2] A. Go, Twitter sentiment analysis (2009), Entropy, p. 252

[3] Watchful1, 1 billion Reddit comments from 2005-2019 [Data set] (2019), Pushshift via The Eye.  Available: https://github.com/Watchful1/PushshiftDumps [Accessed 1 September 2024]

[4] Y. Liu et al., RoBERTa: A robustly optimized BERT pretraining approach (2019), arXiv preprint arXiv:1907.11692

[5] V. Sanh et al., DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019), arXiv preprint arXiv:1910.01108

[6] B. Zhang, Mitigating unwanted biases with adversarial learning (2018), in AAAI/ACM Conference on AI, Ethics, and Society, pp. 335-340

[7] N. Mehrabi et al., A survey on bias and fairness in machine learning (2021), ACM Computing Surveys, vol. 54, no. 6, pp. 1-35

[8] R. Binns, Fairness in machine learning: Lessons from political philosophy (2018), in PMLR Conference on Fairness, Accountability and Transparency, pp. 149-159

[9] S. Jhaver, A. Bruckman, and E. Gilbert, Human-machine collaboration for content regulation: The case of reddit automoderator (2019), ACM Transactions on Computer-Human Interaction (TOCHI), vol. 26, no. 5, pp. 1-35, 2019

[10] N. Lee, P. Resnick, and G. Barton, Algorithmic bias detection and mitigation: Best practices and policies to reduce consumer harms (2019), Brookings Institution, Washington, DC
