Cluster Analysis and Strings

Tidy Text Mining with R.

A bit of benchmarking with string distances. After my last post about the stringdist package, Zachary Mayer pointed out to me that the implementations of the Levenshtein and Jaro-Winkler distances in the RecordLinkage package are about two to three times faster.

A bit of benchmarking with string distances

His benchmark compares randomly generated character strings of 5-25 characters, which probably covers many use cases involving natural language. That is, if your language happens to be written in a single-byte encoding, but more on that below. Here's Zachary's benchmark (rerun by myself). Ouch, indeed. As we all know, premature optimization is the root of all evil, so I felt it was time for some mature optimization.
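The single-byte caveat can be made concrete with a small sketch. It uses base R's `adist()` and `nchar()` rather than stringdist or RecordLinkage, and the example words are assumptions chosen for illustration. Code that walks over raw bytes sees a multibyte character as several units, while character-aware code sees one.

```r
# Levenshtein (generalized edit) distance with base R's adist().
# This is a stand-in for illustration, not either package's implementation.
lev <- adist("kitten", "sitting")[1, 1]

# "\u00e9" is e-acute, stored as two bytes in UTF-8.
word <- "caf\u00e9"
n_chars <- nchar(word, type = "chars")  # characters as a reader counts them
n_bytes <- nchar(word, type = "bytes")  # units seen by byte-oriented C code

print(lev)
print(c(chars = n_chars, bytes = n_bytes))
```

A byte-oriented distance would therefore count an edit to the accented letter as more than one operation, which is the difference the post refers to.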

Checking the differences between stringdist and RecordLinkage, I found that RecordLinkage uses C code that works directly on the char representation of strings, while I designed stringdist to make sure that multibyte characters are treated as a single character. And indeed, the relative difference decreased from a factor of 1.8 to 1.25. Yowza!

Clustering Search Keywords Using K-Means Clustering. One of the key tenets of doing impactful digital analysis is understanding what your visitors are trying to accomplish.

Clustering Search Keywords Using K-Means Clustering

One of the easiest ways to do this is to analyze the words your visitors use to arrive on the site (search keywords) and the words they use while on the site (on-site search). Although Google has made it much more difficult to analyze search keywords over the past several years (because it passes “(not provided)” instead of the actual keywords), we can create customer intent segments from the keywords that are still being passed, using unsupervised clustering methods such as k-means clustering.
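As a rough sketch of the idea, keywords can be turned into a term-presence matrix and passed to base R's `kmeans()`. The keyword list, the choice of two clusters, and the binary encoding are all assumptions made for illustration; the original post may prepare its data differently.

```r
# Hypothetical search keywords (made up for this sketch).
keywords <- c("buy running shoes", "cheap running shoes", "running shoes sale",
              "return policy", "shipping return policy", "store return policy")

# Split each keyword into lower-case words and build the vocabulary.
tokens <- strsplit(tolower(keywords), "[[:space:]]+")
vocab <- sort(unique(unlist(tokens)))

# One row per keyword, one column per word: 1 if the word occurs, else 0.
dtm <- t(vapply(tokens,
                function(tk) as.numeric(vocab %in% tk),
                numeric(length(vocab))))
dimnames(dtm) <- list(keywords, vocab)

# k-means depends on random starting points, so fix the seed and
# try several starts to get a stable result.
set.seed(42)
fit <- kmeans(dtm, centers = 2, nstart = 20)

# Group the keywords by the cluster each was assigned to.
segments <- split(keywords, fit$cluster)
print(segments)
```

The cluster numbers themselves are arbitrary; only the grouping is meaningful, so each segment still needs a human-readable label such as "shopping intent" or "returns".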

Concept: K-Means Clustering/Unsupervised Learning.

Is it possible to list files with two pattern options?

Count occurrences of a character within a string.

Counting defined character within String. On Jul 5, 2010, at 9:04 AM, Kunzler, Andreas wrote:

Counting defined character within String
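One common base R way to count a given character in each element of a character vector is sketched below. The helper name `count_char` is hypothetical, and this is not necessarily the answer given in the mailing-list thread.

```r
# Count how often the literal string `ch` occurs in each element of `x`.
# fixed = TRUE treats `ch` literally, so "." or "*" need no escaping.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

counts_a <- count_char(c("banana", "cherry", ""), "a")
counts_dot <- count_char("a.b.c", ".")
print(counts_a)
print(counts_dot)
```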

Truncate by Delimiter in R. Sometimes, you only need to analyze part of the data stored as a vector.

Truncate by Delimiter in R

In this example, there is a list of patents. Each patent has been assigned to one or more patent classes. Let's say that we want to analyze the dataset based on only the first patent class listed for each patent.

Raw text strings for file paths in R.

Import All Text Files in A Folder with Parallel Execution.

Handling and Processing Strings in R. Posted on September 22, 2013.
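Returning to the patent-class example, keeping only the first class could be sketched as follows. The data values and the semicolon delimiter are assumptions, since the original dataset is not shown here.

```r
# Hypothetical patent classes, several per patent, separated by ";".
patent_classes <- c("438/107; 257/678", "705/14", "382/128; 600/407; 378/4")

# Option 1: delete everything from the first delimiter onward.
first_class <- trimws(sub(";.*$", "", patent_classes))

# Option 2: split on the delimiter and take the first piece of each element.
pieces <- strsplit(patent_classes, ";", fixed = TRUE)
first_class_alt <- trimws(vapply(pieces, `[`, character(1), 1))

print(first_class)
```

Both options leave elements without a delimiter unchanged, so patents with a single class pass through as they are.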

Handling and Processing Strings in R

Paste, paste0, and sprintf. I find myself pasting URLs and lots of little pieces together lately.

paste, paste0, and sprintf
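A minimal sketch of how `sep` and `collapse` differ in `paste()`, alongside `paste0()` and `sprintf()`. The vectors and the example.com URL are placeholders chosen for illustration.

```r
first <- c("a", "b", "c")
second <- c("x", "y", "z")

# sep joins the arguments element by element; the result is still a vector.
by_sep <- paste(first, second, sep = "-")

# collapse joins the elements of the result into one single string.
by_collapse <- paste(first, collapse = "-")

# Both together: join element-wise first, then collapse into one string.
both <- paste(first, second, sep = "-", collapse = "+")

# paste0() is paste() with sep = "".
glued <- paste0("page", 1:3)

# sprintf() fills a template, which is often easier to read for URLs.
url <- sprintf("https://%s/%s?page=%d", "example.com", "search", 2L)

print(by_sep)
print(by_collapse)
print(both)
print(glued)
print(url)
```

A rule of thumb: use `sep` when several vectors go in and a vector should come out, and `collapse` when one string should come out.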

Now paste is a standard go-to guy when you wanna glue some stuff together. But often I find myself pasting and getting stuff like this: Rather than the desired… When I get into those situations I think, “Oh, better use collapse instead”; but I never really think before using paste (that is, whether I want collapse or sep, and why).

Escape Characters. Frequently Asked Questions on R. Version 3.1.2014-04-05. Table of Contents. 1 Introduction. This document contains answers to some of the most frequently asked questions about R. 1.1 Legalese. This document is copyright © 1998–2014 by Kurt Hornik.

Escape Characters

This document is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version. This document is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Copies of the GNU General Public License versions are available at

1.2 Obtaining this document. The latest version of this document is always available from

From there, you can obtain versions converted to plain ASCII text, GNU info, HTML, PDF, as well as the Texinfo source used for creating all these formats using the GNU Texinfo system.

1.4 Notation.

3 Ways to Format R Code for Blogger. If you are like I was originally, you might not think it is worth spending the extra energy to format your code.

3 Ways to Format R Code for Blogger

After all, people can just copy what you have and paste it into their preferred editor, which will do its own formatting, no sweat. Well, this might be true for some code in some languages, but in R it really is not going to fly. The problem is the ubiquitous <- which is very easily confused with HTML tags, which take the form <tag>.

Clustering · OpenRefine. How to edit cells by clustering.
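To illustrate the `<-` problem in R code posted as HTML, one generic workaround is to replace the special characters with HTML entities before pasting. The helper `escape_html` below is hypothetical and is not claimed to be one of the three approaches the post describes.

```r
# Replace the characters that HTML treats specially with their entities.
# "&" must be handled first, otherwise the "&" inside "&lt;" and "&gt;"
# would be escaped a second time.
escape_html <- function(code) {
  code <- gsub("&", "&amp;", code, fixed = TRUE)
  code <- gsub("<", "&lt;", code, fixed = TRUE)
  gsub(">", "&gt;", code, fixed = TRUE)
}

escaped <- escape_html("x <- c(1, 2); y <- x > 1 & TRUE")
print(escaped)
```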

Clustering · OpenRefine

OpenRefine's text facets are a great mechanism for surfacing patterns from your data. Consider a data set that contains people's names entered in two different ways: "first_name middle_initial last_name", and "last_name, first_name". A text facet on that column might reveal entries in both formats. Say you want to change every name to "first_name last_name"; then, using that facet, you would need to edit "Anderson, Andy" to "Andy Anderson", "Beaufort, Beatrice" to "Beatrice Beaufort", and so forth.
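For comparison, the "last_name, first_name" rewrite described above can be sketched outside OpenRefine with a regular expression in R. This is not OpenRefine's clustering feature, and it assumes names contain at most one comma and no middle initials.

```r
names_raw <- c("Anderson, Andy", "Beaufort, Beatrice", "Andy Anderson")

# Capture the text before the comma (last name) and after it (first name),
# then swap the two groups. Names without a comma do not match the pattern
# and are returned unchanged.
names_fixed <- sub("^([^,]+),[[:space:]]*(.+)$", "\\2 \\1", names_raw)
print(names_fixed)
```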

And of course there are the occasional middle initials to worry about. The Clustering feature can be accessed in 2 different ways.

Regular expressions in R. Doing extensive text manipulation in R would be painful; the R language was developed for analyzing data sets, not for munging text files.

Regular expressions in R

However, R does have some facilities for working with text using regular expressions. This comes in handy, for example, when selecting rows of a data set according to regular expression pattern matches in some columns. R supports two regular expression flavors: POSIX 1003.2 and Perl.

How to Use Regular Expressions in R. R supports the concept of regular expressions, which allows you to search for patterns inside text.
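Selecting rows by a pattern match, in both flavors, might look like the sketch below. The data frame and its column names are made up for illustration; `grepl()` uses the POSIX extended flavor by default and the Perl flavor with `perl = TRUE`.

```r
# Hypothetical data set with an identifier column to match against.
df <- data.frame(id = 1:4,
                 name = c("alpha-01", "beta-02", "alpha-xx", "gamma-03"),
                 stringsAsFactors = FALSE)

# POSIX extended syntax (the default): "alpha-" followed only by digits.
ere_rows <- df[grepl("^alpha-[0-9]+$", df$name), ]

# Perl syntax: a negative lookahead, which the POSIX flavor does not offer.
# Keeps names of the form letters-digits that do not start with "alpha".
perl_rows <- df[grepl("^(?!alpha)[a-z]+-\\d+$", df$name, perl = TRUE), ]

print(ere_rows)
print(perl_rows)
```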

You may never have heard of regular expressions, but you’re probably familiar with the broad concept. If you’ve ever used an * or a ? to indicate any letter in a word, then you’ve used a form of wildcard search. Regular expressions support the idea of wildcards and much more.

Regular expressions in R vs RStudio. The 'regex' family of languages and commands is used for manipulating text strings. More specifically, regular expressions are typically used for finding specific patterns of characters and replacing them with others.
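The link between wildcards and regular expressions, and the find-and-replace use just mentioned, can be sketched in base R. The file names and the sample sentence are made up; `glob2rx()` from the utils package converts a wildcard pattern into an equivalent regular expression.

```r
files <- c("notes.txt", "data.csv", "note.txt")

# "?" in a wildcard pattern stands for exactly one character.
pattern <- glob2rx("note?.txt")
is_match <- grepl(pattern, files)

# Find every run of digits and replace it with a placeholder.
masked <- gsub("[0-9]+", "#", "room 101, floor 3")

print(pattern)
print(is_match)
print(masked)
```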