encoding fashion rules into mathematical data structures (part one)

As we build our fashion recommendation engine, we seek rules to populate it with. With few exceptions (e.g. [1]), we find these rules encoded in prose or infographic form, rather than a semantic web form suitable for computation. For example, [2] provides written advice on dressing fabulously for a “rectangular” women’s body type. The writers meant this document for a human reader, not a computer program.

However, we can’t scale a process consisting of manual extraction of rules to the level we would like to achieve in this project, so we turn to natural language processing to extract rules from texts in an automated fashion. We begin by identifying parts of speech and the syntax relationships between words in sentences. For example, consider the following two fashion rules from [2]:

  • If you are a heavy or tall rectangle, choose a big bag.
  • If you are a petite rectangle, choose a petite bag.

We then create a directed graph with words as nodes, each with an attribute indicating its part of speech, and edges indicating the syntactic relationships between the nodes (e.g., “heavy” is a modifier of “rectangle”). We also add edges to specify the direction of sentence flow. Visualizing the above two sentences in this form using Neo4j [3] yields:
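As a rough plain-Python sketch of the structure described above (not the Neo4j graph itself), the second rule can be held as nodes and typed edges. The POS tags, relation names, and the `modifiers_of` helper are illustrative assumptions written by hand, not the output of any particular parser.

```python
# Hand-annotated tokens for "If you are a petite rectangle, choose a petite bag."
# Tags and relations are illustrative assumptions, not real parser output.
tokens = [
    (0, "If", "SCONJ"),
    (1, "you", "PRON"),
    (2, "are", "AUX"),
    (3, "a", "DET"),
    (4, "petite", "ADJ"),
    (5, "rectangle", "NOUN"),
    (6, "choose", "VERB"),
    (7, "a", "DET"),
    (8, "petite", "ADJ"),
    (9, "bag", "NOUN"),
]

# Each node carries its word and a part-of-speech attribute.
nodes = {i: {"word": word, "pos": pos} for i, word, pos in tokens}

# Syntax edges: (source index, target index, relation).
syntax_edges = [
    (4, 5, "MODIFIES"),   # "petite" modifies "rectangle"
    (8, 9, "MODIFIES"),   # "petite" modifies "bag"
    (9, 6, "OBJECT_OF"),  # "bag" is the object of "choose"
]

# Sentence-flow edges link each word to the word that follows it.
flow_edges = [(i, i + 1, "NEXT") for i in range(len(tokens) - 1)]

edges = syntax_edges + flow_edges


def modifiers_of(word):
    """Return the words that modify any node whose text equals `word`."""
    return [
        nodes[src]["word"]
        for src, dst, rel in edges
        if rel == "MODIFIES" and nodes[dst]["word"] == word
    ]


print(modifiers_of("rectangle"))  # ['petite']
```

Keeping the flow edges separate from the syntax edges makes it possible to walk the sentence in reading order while still querying modifier relationships.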

Next Steps

In the next phase, we plan to automatically derive computationally useful IFTHENELSE rules from such mappings. For example, the two sentences above can be expressed in IFTHENELSE form as:

  • IF rectangle AND (heavy OR tall) THEN choose a big bag
  • IF rectangle AND petite THEN choose a petite bag
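One way such rules might be held in code is sketched below, assuming a person is described by a set of trait strings. The `Rule` class and `recommend` function are hypothetical names chosen for illustration; the derivation of rules from the graph is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Rule:
    """A single IF ... THEN ... rule (hypothetical structure)."""
    condition: Callable[[Set[str]], bool]  # the IF part
    recommendation: str                    # the THEN part


rules = [
    # IF rectangle AND (heavy OR tall) THEN choose a big bag
    Rule(lambda t: "rectangle" in t and ("heavy" in t or "tall" in t),
         "choose a big bag"),
    # IF rectangle AND petite THEN choose a petite bag
    Rule(lambda t: "rectangle" in t and "petite" in t,
         "choose a petite bag"),
]


def recommend(traits: Set[str]) -> List[str]:
    """Return the recommendation of every rule whose condition holds."""
    return [r.recommendation for r in rules if r.condition(traits)]


print(recommend({"rectangle", "tall"}))    # ['choose a big bag']
print(recommend({"rectangle", "petite"}))  # ['choose a petite bag']
```

This sketch uses crisp true/false conditions only; fuzzy reasoning would need graded trait memberships instead of set membership.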

Once we form a comprehensive set of such rules, we will load them into an expert system or a related system that supports fuzzy reasoning over the rules, enabling custom fashion recommendations!

After this, we will come up with a way to reconcile similar recommendations. For example, suppose we find the following two IFTHENELSE rules from two different sources:

  • IF rectangle AND (heavy OR tall) THEN choose a big bag
  • IF rectangle AND (heavy set OR tall) THEN select a big bag

These say the same thing. We will devise a way to combine them into one recommendation such that the weight (value) of the recommendation doubles due to its backing by two distinct sources.
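A minimal sketch of one possible reconciliation approach follows, assuming a hand-built synonym table. `SYNONYMS`, `canonical`, and the tuple layout for rules are hypothetical; a real system would need a curated or learned vocabulary.

```python
from collections import defaultdict

# Hypothetical synonym table mapping variant wording to one canonical term.
SYNONYMS = {"heavy set": "heavy", "select": "choose"}


def normalize(term):
    """Map a term to its canonical form, leaving unknown terms unchanged."""
    return SYNONYMS.get(term, term)


def canonical(rule):
    """Reduce a rule to a hashable, wording-independent key.

    A rule is (shape, set of alternative traits, verb, item).
    """
    shape, traits, verb, item = rule
    return (shape, frozenset(normalize(t) for t in traits), normalize(verb), item)


# The two rules from the two different sources.
rules = [
    ("rectangle", {"heavy", "tall"}, "choose", "big bag"),
    ("rectangle", {"heavy set", "tall"}, "select", "big bag"),
]

# Each source that backs a canonical rule adds 1 to its weight.
weights = defaultdict(int)
for rule in rules:
    weights[canonical(rule)] += 1

for key, weight in weights.items():
    print(key[0], sorted(key[1]), key[2], key[3], "weight:", weight)
```

Here the two rules collapse to a single key with weight 2, which matches the doubling described above.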



References

  1. D. Vogiatzis, D. Pierrakos, G. Paliouras, S. Jenkyn-Jones, B.J.H.H.A. Possen, Expert and community based style advice, Expert Systems with Applications, Volume 39, Issue 12, 2012, Pages 10647-10655, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2012.02.178
  2. http://www.styled247.com/rectangle-body-shape
  3. https://neo4j.com

tracking my gender transition through computational linguistics and machine learning

I wrote 299 blog posts in the last decade, roughly half on badassdatascience.com and half on genderpunk360.com. I produced most of the Badass Data Science content while publicly expressing as a man, and most of the Gender Punk 360 content as a woman. Some articles appear on both blogs (this one, for example), and in the analysis described below I account for such duplication.

My speech therapist observed that I successfully employ feminine language in my recent video “radical forgiveness”. This led me to thinking: Has the language I use in my prose evolved as I blossomed into femininity? I detail my attempt to answer this question using mathematical analysis below.

Two Caveats

I make two major assumptions in this analysis, assumptions I will address in future work:

First, I assume my writing skill remained constant throughout the last ten years. Not a great assumption in the long haul but necessary to simplify the math for this “back of the envelope” analysis.

Second, the two blogs cover different subjects, and the first one even contains source code on occasion. This may distort the clustering process described below. Again, ignoring this concern proves acceptable for this “quick-and-dirty” calculation to enable exploration of the problem domain.



Method

I downloaded each of my blog posts and then calculated the part of speech (POS) for each word in the post. After that I computed the frequency distribution of the POS tags in each post. I then performed hierarchical clustering using a similarity matrix defined by the dot product of each pair of posts’ POS frequency distribution vectors. The resulting dendrogram looks like:

I recommend downloading the image to view it at full size.
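The method can be sketched in plain Python as follows. The tag sequences are invented toy data standing in for tagged posts, and average linkage is an assumption, since the linkage method used for the actual dendrogram is not stated here.

```python
from collections import Counter
from itertools import combinations

# Toy POS tag sequences standing in for tagged posts (invented for illustration).
posts = {
    "post_a": ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"],
    "post_b": ["NOUN", "VERB", "NOUN", "ADJ", "VERB"],
    "post_c": ["PRON", "VERB", "ADV", "ADV", "PRON"],
}

# Fixed tag order so every post maps to a vector of the same shape.
tagset = sorted({tag for tags in posts.values() for tag in tags})


def pos_frequencies(tags):
    """Relative frequency of each tag in `tagset` for one post."""
    counts = Counter(tags)
    return [counts[t] / len(tags) for t in tagset]


vectors = {name: pos_frequencies(tags) for name, tags in posts.items()}


def similarity(u, v):
    """Dot product of two frequency vectors."""
    return sum(a * b for a, b in zip(u, v))


def cluster_similarity(c1, c2):
    """Average pairwise similarity between two clusters (average linkage)."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(similarity(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Agglomerative clustering: repeatedly merge the two most similar clusters.
clusters = [(name,) for name in posts]
merges = []
while len(clusters) > 1:
    c1, c2 = max(combinations(clusters, 2), key=lambda p: cluster_similarity(*p))
    merges.append((c1, c2))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]

for left, right in merges:
    print("merged", left, "with", right)
```

The order of merges is what a dendrogram draws: posts merged early sit close together in the tree, and posts merged late sit far apart.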

Each vertical line represents a blog post, and the trees linking the vertical lines indicate the degree of similarity between any two blog posts. For example, in the above image, the cyan and magenta colored posts prove similar but the green and black posts diverge significantly in terms of their POS use frequency distributions. The asterisks indicate posts created after I started expressing publicly as a woman full-time. The colors divide the tree into sections that group similar blog posts. Please note that I chose the grouping threshold manually (but rationally).

Results

By visually inspecting the density of these asterisks for the different color groups we derive an indication of how “feminine” or how “masculine” we might regard each group of blog posts. For example, we see sparse femininity in the green, yellow, and black groups, while we see enriched femininity in the cyan and purple groups. The algorithm clearly found little distinction between the posts within the large red group, but even there we visually recognize sections of diminished femininity and sections of enhanced femininity.
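The visual inspection could be backed by a simple count of the share of post-transition posts in each color group. The sketch below uses invented toy data, not the real counts from the dendrogram, and `post_transition_share` is a hypothetical helper.

```python
from collections import defaultdict

# (color group, written after transition?) pairs: invented toy data only.
labeled_posts = [
    ("green", False), ("green", False), ("green", True),
    ("cyan", True), ("cyan", True), ("cyan", False), ("cyan", True),
]


def post_transition_share(items):
    """Fraction of posts in each group that carry an asterisk."""
    totals = defaultdict(int)
    marked = defaultdict(int)
    for group, after_transition in items:
        totals[group] += 1
        if after_transition:
            marked[group] += 1
    return {group: marked[group] / totals[group] for group in totals}


shares = post_transition_share(labeled_posts)
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.2f}")
```

A share well above or below the overall fraction of post-transition posts would correspond to the "enriched" or "sparse" groups described above.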

So a linguistic difference between my pre- and post-transition writing appears to exist. But is it real? Can we conclude that my prose grew more feminine after my public transition? Not so fast! We must build a model that includes time as a variable to cancel out possible influence of improvement in my writing skill, and then test that model for significance. I’ll save this work for a later date.