tracking my gender transition through computational linguistics and machine learning

I wrote 299 blog posts in the last decade, roughly half on Badass Data Science and half on Gender Punk 360. I produced most of the Badass Data Science content while publicly expressing as a man, and most of the Gender Punk 360 content as a woman. Some articles appear on both blogs (this one, for example), and in the analysis described below I account for such duplication.

My speech therapist observed that I successfully employ feminine language in my recent video “radical forgiveness”. This led me to wonder: has the language I use in my prose evolved as I blossomed into femininity? Below I detail my attempt to answer this question using mathematical analysis.

Two Caveats

I make two major assumptions in this analysis, assumptions I will address in future work:

First, I assume my writing skill remained constant throughout the last ten years. Not a great assumption in the long haul but necessary to simplify the math for this “back of the envelope” analysis.

Second, the two blogs cover different subjects, and the first one even contains source code on occasion. This may distort the clustering process described below. Again, ignoring this concern proves acceptable for this “quick-and-dirty” calculation to enable exploration of the problem domain.


I downloaded each of my blog posts and then tagged each word in the post with its part of speech (POS). After that I computed the frequency distribution of the POS tags for each post. I then performed hierarchical clustering using a similarity matrix defined by the dot product of each pair of posts’ POS frequency distribution vectors. The resulting dendrogram looks like:

I recommend downloading the image to view it at full size.

Each vertical line represents a blog post, and the trees linking the vertical lines indicate the degree of similarity between any two blog posts. For example, in the above image, the cyan and magenta colored posts prove similar but the green and black posts diverge significantly in terms of their POS use frequency distributions. The asterisks indicate posts created after I started expressing publicly as a woman full-time. The colors divide the tree into sections that group similar blog posts. Please note that I chose the grouping threshold manually (but rationally).
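The steps described above (POS frequencies, dot-product similarity, hierarchical clustering, a hand-picked grouping threshold) can be sketched roughly as follows. This is a minimal illustration, not the code behind the dendrogram: the post names and tags are invented, the posts are assumed to be already tagged, and average linkage is an assumption since the linkage method is not stated.

```python
from collections import Counter

# Hypothetical input: each post already reduced to its list of POS tags.
# A real run would get these from a POS tagger; names and tags are invented.
TAGGED_POSTS = {
    "post_a": ["NOUN", "VERB", "NOUN", "ADJ"],
    "post_b": ["NOUN", "VERB", "NOUN", "ADJ"],
    "post_c": ["PRON", "VERB", "ADV", "ADV"],
    "post_d": ["PRON", "ADV", "ADV", "ADV"],
}


def pos_frequencies(tags, tagset):
    """Relative frequency of each tag in `tagset` within one post."""
    counts = Counter(tags)
    total = len(tags)
    return [counts[tag] / total if total else 0.0 for tag in tagset]


def similarity_matrix(tagged_posts):
    """Dot product of every pair of posts' POS frequency vectors."""
    tagset = sorted({tag for tags in tagged_posts.values() for tag in tags})
    vectors = {
        name: pos_frequencies(tags, tagset)
        for name, tags in tagged_posts.items()
    }
    return {
        a: {b: sum(x * y for x, y in zip(vectors[a], vectors[b])) for b in vectors}
        for a in vectors
    }


def average_similarity(group_a, group_b, sim):
    """Mean pairwise similarity of two groups (average linkage, an assumption)."""
    total = sum(sim[a][b] for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


def cluster(sim, threshold):
    """Merge the two most similar groups until no pair reaches `threshold`."""
    groups = [[name] for name in sorted(sim)]
    while len(groups) > 1:
        score, i, j = max(
            (average_similarity(groups[i], groups[j], sim), i, j)
            for i in range(len(groups))
            for j in range(i + 1, len(groups))
        )
        if score < threshold:
            break
        merged = sorted(groups[i] + groups[j])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return sorted(groups)


SIMILARITY = similarity_matrix(TAGGED_POSTS)
GROUPS = cluster(SIMILARITY, threshold=0.3)  # threshold picked by hand
print(GROUPS)
```

Note that a plain dot product is not length-normalized, so a post's similarity with itself varies from post to post. Cosine similarity would be one alternative, but the sketch keeps the dot product to match the description.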


By visually inspecting the density of these asterisks for the different color groups we derive an indication of how “feminine” or how “masculine” we might regard each group of blog posts. For example, we see sparse femininity in the green, yellow, and black groups, while we see enriched femininity in the cyan and purple groups. The algorithm clearly found little distinction between the posts within the large red group, but even there we visually recognize sections of diminished femininity and sections of enhanced femininity.
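The visual reading of asterisk density could also be put into numbers by computing, for each color group, the share of posts written after the transition. The sketch below assumes the groups and the set of post-transition posts are already known. The group labels and post IDs are invented for illustration and are not the actual data.

```python
def post_transition_share(groups, post_transition_ids):
    """Fraction of posts in each group written after the transition."""
    return {
        label: sum(post in post_transition_ids for post in posts) / len(posts)
        for label, posts in groups.items()
        if posts  # skip empty groups to avoid dividing by zero
    }


# Hypothetical groups and post IDs, invented for illustration only.
GROUPS = {
    "green": ["p1", "p2", "p3", "p4"],
    "cyan": ["p5", "p6"],
}
POST_TRANSITION_IDS = {"p4", "p5", "p6"}

SHARES = post_transition_share(GROUPS, POST_TRANSITION_IDS)
print(SHARES)
```

A share well above the overall post-transition fraction would correspond to an "enriched" group, and one well below it to a "sparse" group. Whether such a difference is significant is a separate question.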

So a linguistic difference between my pre- and post-transition writing appears to exist. But is it real? Can we conclude that my prose grew more feminine after my public transition? Not so fast! We must build a model that includes time as a variable to cancel out the possible influence of improvement in my writing skill, and then test that model for significance. I’ll save this work for a later date.


2 thoughts on “tracking my gender transition through computational linguistics and machine learning”

  1. Okay, so you’re looking at frequency of different parts of speech to determine differences between pre- and post-transition? Is there any previous research on POS as an indicator, or is this just the way you decided to look at differences?

    I think written communication will not be an accurate way to look at gender-based language, since it’s not usually on-the-fly. Plus the different topics means you may be intentionally “masculinizing” when you’re, say, writing for science. Like a type of codeswitching. So perhaps it might be interesting to analyze in that sense?

    I’d be mostly interested in seeing how your actual speaking has changed, tho THAT would be skewed too since you have a speech therapist, I’m assuming leading you towards a certain type of language use.

  2. Hi June!

    Thank you for the wonderful response! You are correct on all counts!

    I was in a little bit of denial about whether my scientific writing had a “masculine” bias simply due to the subject. (I knew this was a risk but couldn’t bring myself to admit it to myself). From the beginning, I worked to make the writing in Badass Data Science gender neutral in style and substance. Moreover, my style was that of scientific journalism (e.g. Wired Magazine) rather than that of a technical journal. I was afraid of facing the idea that the very nature of science—which I deeply love—as I learned to express it within the patriarchy that is academia and the workplace would naturally come out as masculine in my writing about it.

    Thank you for pointing out that the “emperor has no clothes”!

    I also agree that verbal communication, a day-to-day sample, would make better research data for this test. Unfortunately I don’t have the data though.

    I know there are many metrics used in such a comparison by academics. I only chose part of speech because I’m learning Natural Language Processing analysis for a new job and just learned this particular method. So I chose this method for my comparison as an experiment. It fits within the realm of “exploratory analysis” as per our discussion during the Grrl on Grrl Podcast interview. I apologize that I can’t give you a better answer!


