
CRV-Playground

"You shall know a word by the company it keeps" - J. R. Firth

Composite Representation Vectors (CRVs) are a simple and powerful human-understandable word embedding developed with the intent of creating explainable AI models. This repository includes tools for explaining and working with them.

CRVs:

CRVs are an effective word-vector representation (though they can encode more than words) that stores each word as a normalized co-occurrence table. They are built by a fixed, transparent process that lets you inspect why any value is the way it is, and they encourage that kind of inspection into their behavior.

CRVs are created by generating a table of co-occurrence counts for each word and then normalizing it, producing a table of proportions that sum to 1.

Some notes on generating CRVs:

  • For smaller datasets, a window of 2 words on each side of the target word seems to give the most consistent results, and different window sizes reveal different representations of a word.

    A whole-document window will produce the 'theme' of the word, i.e., whether it appears mostly in recipes, news articles, etc., while a smaller window will lean more towards the "meaning" of the word.
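As a concrete sketch of the process described above, CRV generation might look like the following. This is an illustration, not the repository's actual code; `build_crvs` is a hypothetical name, and the `window` parameter corresponds to the window-size note above.

```python
from collections import Counter, defaultdict

def build_crvs(tokens, window=2):
    """Build a CRV for each word: co-occurrence counts within +/-`window`
    tokens, normalized so each word's values sum to 1."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the word itself
                counts[word][tokens[j]] += 1
    # normalize raw counts into proportions
    return {
        word: {ctx: n / sum(c.values()) for ctx, n in c.items()}
        for word, c in counts.items()
    }

tokens = "bake at 3 5 0 until golden then bake uncovered".split()
crvs = build_crvs(tokens, window=2)
print(crvs["bake"])  # each context word mapped to its proportion
```

A larger `window` would pull in more distant, theme-level context, matching the observation that window size changes what the representation captures.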

CRV Operations

Similarity:

The best methods discovered so far for finding the similarity of two CRVs (and thus two ideas) are min() and a more complicated square-root operation. Both produce nearly identical orderings of words by similarity, but differ in the values they assign. Minimum is preferred because:

  • It is always normalized
  • It is quick to compute element-wise
  • It is always positive.

Since a CRV's values all sum to 1, the largest value that min() can reach is 1: this happens when the two CRVs are exactly equal, so the element-wise minimum is exactly the same as either one and sums to 1. Any increase in one of a CRV's values forces a decrease in others, so the similarity falls below 1 as soon as the CRVs differ.
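The min() similarity described above can be sketched in a few lines, with CRVs represented as dicts mapping context words to proportions (`min_similarity` is a hypothetical name, not this repository's API):

```python
def min_similarity(a, b):
    """Sum of element-wise minimums of two CRVs.
    Only keys present in both can contribute a nonzero minimum."""
    return sum(min(a[k], b[k]) for k in a.keys() & b.keys())

# toy CRVs (values are illustrative, not from the corpus)
hi        = {"there": 0.5, "!": 0.25, "france": 0.25}
greetings = {"!": 0.5, "from": 0.25, "there": 0.25}

print(min_similarity(hi, greetings))  # 0.5
print(min_similarity(hi, hi))         # identical CRVs -> 1.0
```

Because both inputs sum to 1 and every contribution is a minimum of two non-negative values, the result is always in [0, 1], matching the properties listed above.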

similarity of 'chicken' (recipe corpus): [image]

similarity of 'hi' (diplomacy corpus): [image]

similarity of '2' (recipe corpus): [image]

similarity of 'betray' (diplomacy corpus): [image]

similarity of '🥳' (diplomacy corpus): [image] (it can classify emoji, too!)

Difference

Suppose you want to know how two words are different (i.e. how are they used differently?). CRVs make inspecting a word's particular attributes incredibly simple.

For example, here are 'hi' and 'greetings' from the diplomacy corpus (a strategy game where country leaders sent chat messages asking for alliances and so on):

'hi':

{ +0.32⋅ +0.13⋅, +0.1⋅! +0.04⋅there +0.04⋅.
+0.03⋅germany +0.03⋅russia +0.03⋅austria +0.03⋅england +0.03⋅france +0.02⋅turkey +0.02⋅— +0.02⋅i +0.02⋅italy +0.01⋅- +0.01⋅and +0.01⋅queen +0.01⋅: +0.01⋅ +0.01⋅fellow +38 others}

'greetings':

{ +0.3⋅ +0.14⋅! +0.13⋅, +0.04⋅germany +0.03⋅\n
+0.03⋅ +0.03⋅from +0.03⋅russia +0.02⋅austria +0.02⋅england +0.02⋅italy +0.02⋅my +0.02⋅queen +0.01⋅and +0.01⋅good +0.01⋅her +0.01⋅hope +0.01⋅how +0.01⋅i +0.01⋅kaiser +6 others}

How specifically are 'hi' and 'greetings' different? By subtracting 'greetings' from 'hi', we get this:

{ -0.04⋅! +0.04⋅there +0.04⋅. -0.03⋅\n -0.03⋅from
-0.03⋅ +0.03⋅france +0.02⋅turkey -0.02⋅my
+0.02⋅— +0.02⋅ -0.01⋅queen -0.01⋅good -0.01⋅her
-0.01⋅hope -0.01⋅how -0.01⋅neighbor -0.01⋅the -0.01⋅to
-0.01⋅we +50 others}

Separating negative values as 'more like "greetings"' and positive values as 'more like "hi"', we get these datapoints:

more like 'hi':

{ +0.04⋅there +0.04⋅. +0.03⋅france +0.02⋅turkey +0.02⋅—
+0.02⋅}

more like 'greetings':

{ +0.04⋅! +0.03⋅\n +0.03⋅from +0.03⋅ +0.02⋅my +0.01⋅queen +0.01⋅good +0.01⋅her
+0.01⋅hope +0.01⋅how +0.01⋅neighbor +0.01⋅the +0.01⋅to
+0.01⋅we}

From these two CRVs, we can gain quite a lot of information:

  • 'hi' is more often used to greet a country.
  • 'greetings' is most often used to greet a queen, and its tone is more formal overall.
  • 'hi' appears near the start of a chat message more often, and is the more common greeting.
  • 'greetings' is more often near a newline, meaning 'greetings' sits on its own line, like the greeting in a letter.
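The subtract-and-split operation used above can be sketched as follows (hypothetical helper names, not this repository's API):

```python
def crv_diff(a, b):
    """Element-wise difference of two CRVs; missing keys count as 0."""
    keys = a.keys() | b.keys()
    return {k: a.get(k, 0.0) - b.get(k, 0.0) for k in keys}

def split_diff(diff):
    """Positive values are 'more like a'; negated negatives are 'more like b'."""
    more_a = {k: v for k, v in diff.items() if v > 0}
    more_b = {k: -v for k, v in diff.items() if v < 0}
    return more_a, more_b

# toy CRVs (illustrative values, not from the corpus)
hi        = {"there": 0.5, "!": 0.25, "france": 0.25}
greetings = {"!": 0.5, "from": 0.5}

more_hi, more_greetings = split_diff(crv_diff(hi, greetings))
print(more_hi)         # keys where 'hi' has the larger proportion
print(more_greetings)  # keys where 'greetings' has the larger proportion
```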

CRV Interpretation

CRVs are not like ordinary vector representations, which are optimized for compression efficiency rather than for being easy for humans to understand. Because CRVs are made by a fixed process, the meaning of their values can be inspected directly.

Consider the CRV of "bake", obtained from the recipe corpus:
{ +0.17⋅\n +0.16⋅- +0.1⋅at +0.1⋅3 +0.04⋅and +0.04⋅.
+0.03⋅for +0.03⋅1 +0.03⋅in +0.02⋅4 +0.02⋅0 +0.02⋅2
+0.02⋅5 +0.01⋅, +0.01⋅a +0.01⋅uncovered +0.01⋅pan
+0.01⋅until +0.01⋅about +0.01⋅cover +352 others}

Here we see, as expected, words that would be around the word 'bake'.
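The bracketed `{ +p⋅word … +N others }` displays shown throughout can be produced by a small renderer like the sketch below (`render_crv` is a hypothetical name, not part of this repository):

```python
def render_crv(crv, top=5):
    """Render a CRV's largest entries in '{ +p⋅word ... +N others }' style."""
    items = sorted(crv.items(), key=lambda kv: -kv[1])
    shown = " ".join(f"+{round(v, 2)}⋅{k}" for k, v in items[:top])
    rest = max(0, len(items) - top)
    return "{ " + shown + (f" +{rest} others" if rest else "") + " }"

# toy CRV (illustrative values)
crv = {"at": 0.4, "for": 0.2, "in": 0.2, "pan": 0.1, "until": 0.05, "cover": 0.05}
print(render_crv(crv, top=3))  # { +0.4⋅at +0.2⋅for +0.2⋅in +3 others }
```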

These representations can even be guessed at; take for example this CRV, from the same corpus:

{ +0.24⋅\n +0.09⋅. +0.07⋅dish +0.05⋅- +0.04⋅
+0.04⋅in +0.04⋅quart +0.02⋅a +0.02⋅and +0.01⋅,
+0.01⋅into +0.01⋅greased +0.01⋅of +0.01⋅broccoli
+0.01⋅chicken +0.01⋅with +0.01⋅potato
+0.01⋅large +0.01⋅put +0.01⋅buttered +237 others}

This one is harder to guess, but you can get at the mood of the word: it has to do with broccoli, chicken, butter, and grease, and it is a 'dish'. The word happens to be 'casserole', and the CRV above probably makes more sense in light of that information.

But why is "quart" so common next to 'casserole'? I certainly didn't see the connection at first. This is where the trail would stop for most types of word vectors: maybe you'd check what activates parts of the vector, or maybe you'd check similarity, but both are approximations. With CRVs, we can plainly see where a value came from by looking into the corpus, and have it then make sense to us.

A quick search reveals:

'all ingredients in 3 quart casserole . cover'

'cheese into 2 - quart casserole dish .'

'a greased 3 - quart casserole at 4'

'1 / 2 - quart casserole , pyrex'

'1 / 2 - quart casserole dish ,'

'1 / 2 - quart casserole . '

It appears (and I have learned many other things from CRVs) that the standard measurement of a casserole is in quarts, which was certainly unexpected for me. I'd expect the measurement to be in pan-inches.
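The "quick search" above can be sketched as a scan over the tokenized corpus for places where the context word falls inside the target word's window (`find_contexts` is a hypothetical helper, not this repository's code):

```python
def find_contexts(tokens, target, context, window=2, span=3):
    """Return short snippets around `target` wherever `context`
    appears within +/-`window` tokens of it."""
    hits = []
    for i, word in enumerate(tokens):
        if word == target:
            neighbors = tokens[max(0, i - window): i + window + 1]
            if context in neighbors:
                hits.append(" ".join(tokens[max(0, i - span): i + span + 1]))
    return hits

# tiny stand-in for the recipe corpus
tokens = "a greased 3 - quart casserole at 4 5 0".split()
print(find_contexts(tokens, "casserole", "quart"))
# ['3 - quart casserole at 4 5']
```

This turns "why is this value large?" into a grep-style question the corpus can answer directly.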

CRVs' meanings are not completely discrete, though. Meaning can be spread across values, albeit in a sensible way. For example, the phrase '- off', as in dance-off, bake-off, etc., consists of two words that together give the idea of a competition.

It is the groups of words a word appears near, not single words in isolation, that give the word its meaning.
