• Welcome to The Cave of Dragonflies forums, where the smallest bugs live alongside the strongest dragons.

Butterfree
Reaction score
2,222


  • re: tumblr: not to make you feel worse about that whole thing (hopefully) but I was terrified of you for the longest time lmao
    Okay, so I actually went and read the damn paper describing the algorithm:

    In any suffix stripping program for IR work, two points must be borne in
    mind. Firstly, the suffixes are being removed simply to improve IR
    performance, and not as a linguistic exercise. This means that it would not
    be at all obvious under what circumstances a suffix should be removed, even
    if we could exactly determine the suffixes of a word by automatic means.

    Perhaps the best criterion for removing suffixes from two words W1 and W2 to
    produce a single stem S, is to say that we do so if there appears to be no
    difference between the two statements `a document is about W1' and `a
    document is about W2'. So if W1=`CONNECTION' and W2=`CONNECTIONS' it seems
    very reasonable to conflate them to a single stem. But if W1=`RELATE' and
    W2=`RELATIVITY' it seems perhaps unreasonable, especially if the document
    collection is concerned with theoretical physics.

    So the intent is to find "related words," rather than words of any particular grammatical relation. I guess my beef with body/bodily is that they aren't related enough, or not in the way that I'd be interested in; like, I can get behind bold/boldly both getting changed to "bold," since they strike me as being more or less the same. You could say that "boldly" just means "in a bold way." But you couldn't do the same thing with body/bodily; bodily doesn't mean "in a body way." But whether or not correcting bodily to "body" is correct may change depending on the application or the intent of the person working with the data.
    re: tumblr post: no ok every time i reflect on tqftl nowadays i wonder "why the fuck does may have a butterfree" like I gathered pretty early on that all of the Pokémon between Mark and May were your favorites but in retrospect having a Butterfree totally doesn't mesh up with May's established character especially when she released Quilava lmao
    Huh, interesting. It sounds like a pretty trippy language to learn, not gonna lie. :P

    Oh... no, the idea behind stemming isn't to try and identify what words have the same root; it's to group together different forms of the same word--dealing with plurals and verb conjugations, basically. (I guess that explains why it doesn't do anything with contractions, derp.) "Body" and "bodily" aren't different forms of the same word; they're obviously related, but they aren't even the same part of speech. Conflating "bodily" and "body" definitely isn't the desired behavior.
    Apparently the OED has around 59 million words; a ton of those are going to be archaic/super technical/otherwise rare, but even so, it looks like English is maybe one to two orders of magnitude bigger. I can see a lookup table being more appealing in light of that... but only a little more appealing, honestly. :P

    Hmm, so are there easy ways to distinguish whether a word is an adjective, noun, verb, etc.? Would recognizing the word is, say, a masculine adjective before stemming help with the context issue you mentioned for the -ur suffix? Or are there no consistent rules even within particular classes? Overall it sounds like the problem is a lot of redundancy and context-dependent transformations?

    I dunno, I don't think that would really solve the problem, though. So let's say you remember that you changed "body" to "bodi." You also changed "bodily" to "bodi," though, and now there's no way to distinguish between the two. No matter what you complete the "bodi" stem to, you're going to misclassify some of the words, assuming both "body" and "bodily" appear in the original document. You can always choose the most abundant expansion in order to reduce the error introduced as much as possible, and in this case it doesn't matter much since "bodily" is used so rarely compared to "body," but I'm not sure that the improvement in accuracy would be worth the overhead of storing all the transformations done on each word. It would definitely solve the problem of unrepairable stems (e.g. the "tri" that ends up in the Salvage cloud).
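
    To make the "most abundant expansion" idea concrete, here's a minimal sketch in Python. Everything in it is made up for illustration: `TOY_STEMS` is a hard-coded stand-in for a real stemmer, and `most_common_expansions` is a hypothetical helper, not something from the pipeline I'm actually using.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a real stemmer: a tiny hard-coded lookup.
TOY_STEMS = {"body": "bodi", "bodily": "bodi", "try": "tri", "tries": "tri"}


def toy_stem(word):
    """Return the toy stem for a word, or the word itself if it is unknown."""
    return TOY_STEMS.get(word, word)


def most_common_expansions(words, stem=toy_stem):
    """Remember which original words collapsed onto each stem, then expand
    every stem to whichever original occurred most often."""
    seen = defaultdict(Counter)
    for word in words:
        seen[stem(word)][word] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in seen.items()}


words = ["body", "body", "body", "bodily", "try", "try", "tries"]
expansions = most_common_expansions(words)
print(expansions)
```

    Here "bodi" expands back to "body" and "tri" to "try," so the one "bodily" and the one "tries" are the words that get misclassified. The `seen` table is the storage overhead I mentioned: one counter per stem, holding every original form that mapped onto it.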
    Kind of. Sometimes it just chops, sometimes it chops and replaces (see below). The drawbacks of the algorithm are well-known, but for most quick and dirty applications it's considered good enough.

    Haha, is Icelandic just so irregular that you can't write a sensible pretty-good heuristic? Is it a smaller language than English? (My hunch is that English is one of the larger languages because of how much it cribs off others, but I really don't know, especially relatively speaking--like, same order of magnitude?) My understanding is that a lookup table for English would be so huge and onerous (and in particular difficult to maintain; "welp guess we've got to add that rule for iPads -> iPad now") that an algorithmic approach is greatly preferred.

    A very cursory look suggests that contractions are actually handled at a different point in the preprocessing step, but none of the pipelines I've looked at so far do anything special for them. The assumption is that you've already done some transformation on the contractions, I think. I cribbed the basis of my code off a demo used to analyze Reddit posts, so formality definitely isn't an expected prerequisite.

    It's not! That's "body," actually, although "bodily" would also be lumped into that same bin. You can look back at the non-stemmed word clouds to see it in all its correct glory. You recall that I said "tri" in one of the Salvage clouds is actually "try/tries"? Words ending in "y" generally have it replaced with an "i," which handles things like the aforementioned "try/tries," "fly/flies," "cry/cries," etc. This also causes a lot of hilarity on words that end in "y" but aren't verbs; you end up with a lot of creative spellings of character names, like "Luci" and "Sparki."
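
    Very roughly, the y-to-i behavior looks like the Python sketch below. This is a toy with made-up function names, and it's a big simplification of what a real stemmer does (the real rule set has many more steps and conditions); it's only meant to show why "try" and "tries" land in the same bin and why names get mangled.

```python
VOWELS = set("aeiouy")


def replace_final_y(word):
    """Swap a trailing 'y' for 'i' when the letter before it is a consonant
    and the word is longer than two letters (so 'by' is left alone)."""
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word


def toy_stem(word):
    """Toy stemmer: lowercase, turn a trailing 'ies' into 'i', otherwise
    apply the final-y rule. Not a full suffix-stripping algorithm."""
    word = word.lower()
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "i"
    return replace_final_y(word)


for w in ["try", "tries", "fly", "flies", "body", "Lucy", "Sparky", "say"]:
    print(w, "->", toy_stem(w))
```

    Note that the rule has no idea whether a word is a verb, a noun, or a name, which is exactly where "luci" and "sparki" come from.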

    For this reason, after you've stemmed the text, you then have to go back and try to repair the stems. This is done by taking the stem and looking for words in the original corpus that are just the stem + some suffix, then assigning the stem to one of them (in this case the shortest possible suffix). Some examples of stems that were correctly repaired were "suicun" to "suicune" and "gyarado" to "gyarados." However, because letters got replaced in the conversion from "body" to "bodi," there's no way to get back the actual, original word. "Bodily" was apparently the shortest valid extension of "bodi" in the corpus, so that's what "body" became. "Sparky" comes out as "sparking" for the same reason.
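
    The repair step can be sketched in Python like this. It's an illustration under my own assumptions, not the actual code: `complete_stems` is a hypothetical name, the corpus is a handful of toy words, and a stem with no extension in the corpus is simply left as it is.

```python
def complete_stems(stems, corpus_words):
    """Map each stem to the shortest corpus word that begins with it."""
    # Sort by length (then alphabetically) so the first match is the shortest.
    vocabulary = sorted(set(corpus_words), key=lambda w: (len(w), w))
    completed = {}
    for stem in stems:
        match = next((w for w in vocabulary if w.startswith(stem)), None)
        # Leave the stem untouched if nothing in the corpus extends it.
        completed[stem] = match if match is not None else stem
    return completed


corpus = ["suicune", "gyarados", "body", "bodily", "sparky", "sparking", "try"]
stems = ["suicun", "gyarado", "bodi", "sparki", "tri"]
repaired = complete_stems(stems, corpus)
print(repaired)
```

    On this toy corpus "suicun" and "gyarado" are repaired correctly, but "bodi" comes back as "bodily" and "sparki" as "sparking," because "body" and "sparky" don't actually start with their own stems anymore. "tri" has nothing to extend to, so it stays "tri."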
    Okay, I got the stemmer working. It had more impact on Salvage than either of your 'fics, which was to be expected, and on the enrichment clouds more than on the tf-idf clouds. You can, however, try spotting the stemming artifacts ("tri" is "try" and family, for example).

    TQFTL: (tfidf) (enrichment)
    Morphic: (tfidf) (enrichment)

    Obviously the stemmer doesn't take care of contractions properly, since we've still got "hed" and "didnt" and so on showing up very prominently. It also can't handle different forms of "said," since it's an irregular verb. There may be another algorithm that handles these cases individually/better.

    Salvage: (tfidf) (enrichment)

    The stemming really did make some significant changes to these clouds. Although they're still fairly verb-y, you definitely get more interesting nouns. You also see the cursing showing up a bit more. And the verbs that are left are generally more interesting/unusual ones, like "scatter." I'm particularly intrigued by "online" showing up in the enrichment cloud, since I can't even remember using that word once...

    Also "fucking" and "fuck" got binned together into one gigantic "FUCK," which I think might be an even better description of the story as a whole.
    It really does. I write sophisticated fanfiction for a discerning audience. :P

    That's just me faffing about, applying different color palettes to the output. The gradient-looking ones are, yes, just gradients, where the intensity corresponds to the degree of enrichment/tf-idf score. The others use palettes intended for qualitative data: the colors indicate no relationship, but for this particular visualization the words are binned so that everything with the same color has roughly the same value, just as with size. It's rather gauche to do that, but I think it makes some of the clouds look prettier. The best data scientist.

    That was my initial thought re: how common "pokemon" (with accent) is in tQftL. However, while it's true that only around 2/3 of the stories use it, even when I calculate the background frequency only from those stories where it appears, tQftL still uses it about twice as often. That's a pretty substantial increase over the background: most of the words in the clouds are something like 1.02x background. To put it in perspective, tQftL uses "pokemon" more than three times as often as Salvage uses "fucking."
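
    In case the comparison isn't clear, here's a rough Python sketch of what I mean by "how often relative to the background." The definition (a word's share of one story's tokens divided by its share of the background's tokens) is a simplification I'm assuming for illustration, and the function names and toy data are made up; the numbers have nothing to do with the real corpus.

```python
from collections import Counter


def relative_frequency(word, tokens):
    """Fraction of the tokens that are the given word."""
    return Counter(tokens)[word] / len(tokens) if tokens else 0.0


def enrichment(word, story_tokens, background_stories, only_where_present=True):
    """Ratio of a word's frequency in one story to its background frequency.

    With only_where_present=True, the background is built only from the
    stories that use the word at all.
    """
    if only_where_present:
        background_stories = [s for s in background_stories if word in s]
    background_tokens = [t for story in background_stories for t in story]
    background = relative_frequency(word, background_tokens)
    if background == 0:
        return None  # the word never appears in the background
    return relative_frequency(word, story_tokens) / background


# Toy data: the word is 20% of the story's tokens, and only two of the
# three background stories use it at all.
story = ["pokemon"] * 4 + ["other"] * 16
background = [
    ["pokemon"] + ["filler"] * 9,
    ["pokemon"] + ["filler"] * 9,
    ["filler"] * 20,
]
restricted = enrichment("pokemon", story, background)
unrestricted = enrichment("pokemon", story, background, only_where_present=False)
print(restricted, unrestricted)
```

    Restricting the background to stories that actually use the word shrinks the ratio, since the stories that never use it no longer dilute the background frequency. That's why I checked it that way before concluding the word is still unusually common.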

    Like I said, I did try to stem the documents, and that *should* have taken care of the verb variants, "he'd" getting mapped to "he," and so on. After a bit of digging, though, it looks like the stemmer may have been failing silently because of an unfulfilled dependency. I'll try installing that tomorrow and see if it gives better results.
    Yup, I removed all stop words from the corpus--things like "the," "and," "of," what have you.
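
    Stop-word removal is about as simple as it sounds. A minimal Python sketch, with a tiny made-up stop list (real lists are much longer) and a hypothetical helper name:

```python
import re
from collections import Counter

# Tiny illustrative stop list; an assumption, not the list actually used.
STOP_WORDS = {"the", "and", "of", "a", "to", "in"}


def count_content_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, drop stop words, and count
    what is left."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


counts = count_content_words("The pain of the dragon and the pain of the cave")
print(counts.most_common())
```

    Without the filter, "the" would be the biggest word in the cloud; with it, "pain" is.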

    Anyway, I got carried away and spent altogether too much time playing with this. I fixed the frequency-only problem, made a bunch more maps, and chattered about them a bit here.
    I made a QftL word cloud!

    [img]

    You can see "pain" in the upper-right corner there.

    I did one for Morphic as well:

    [img]

    Unfortunately because of the way the data was processed, these are just raw word frequency counts, which is why you have "workhorse" words like "wasn't," "can," and so on making up a lot of the data. I think you can still see some interesting things in them, though, especially if you compare the two images.

    As a note, I had to strip "pokémon" out of the QftL one because otherwise it completely crushed all the other words and made them really difficult to see. I also removed all character names (that I spotted; there are probably some more minor ones I missed floating around in there) because, again, they'd show up a ton otherwise.
    I don't know if you've noticed this already, but I figured I'd point out that the favicon reverted to the default.
    You don't still have that Yellow save file, do you? Regardless, you need to figure out the glitched Jolteon.
    I wasn't sure why Duane was back in the khert considering he like fell out of it a few pages ago

    (Unrelated: You have to - this is a Life Requirement - put "Oh hi, Mark" somewhere in TQftL)