Language

Automatic text evaluation and lexical cohesion. Answering an open question in an exam or writing an essay are evaluative exercises, but also particularly important training tools in the field of education. Not only do they allow a finer and richer assessment of a student's skills than a closed-response questionnaire (Magliano and Graesser, 2012), but above all, the highlighting of the strengths and weaknesses of these texts through evaluation is formative for the student. The burden of evaluation, and the frequent disagreements observed when the judgements of several evaluators are compared, often limit the use of these assessment methods (Miller, 2003). In this context, automatic text evaluation has become a key challenge.

Depending on the task given to the student, two main types of approaches have been developed (Table 1). The same computations can be carried out on the vectors that represent the analysed documents (Figure 1). LanguageIdentifierBenchs - Nutch Wiki. Introduction This page provides some performance (code speed) and precision (identification accuracy) benchmarks of the LanguageIdentifierPlugin. These benchmarks were produced by analyzing results from the previous version (nutch-0.7-dev) and the patches NUTCH-60-050526.patch, NUTCH-60-050605.patch, NUTCH-60-050607.patch (see NewLanguageIdentifier for more details). These data can be useful if you want to contribute to improving the LanguageIdentifierPlugin's performance and/or precision, or if you want to tune your Nutch configuration precisely. Performance Data set These performance benchmarks were produced by testing the LanguageIdentifierPlugin on a set of 492 French files representing a total size of 171.3 MB.

These files were extracted from the European Parliament Proceedings Parallel Corpus 1996-2003 Release v2. Raw results The following matrix shows the LanguageIdentifierPlugin processing time in ms for several versions. Graphical representation Graphical representation (log axis) TextCat Language Guesser. TextCat is an implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization", in Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994. The paper is no longer available at its original location; now you can download it here. I have applied the technique to implement a written language identification program. At the moment, the system knows about 69 natural languages (counting Esperanto as a natural language). The textcat programme is no longer actively maintained by me.
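The Cavnar & Trenkle algorithm behind TextCat ranks character n-grams by frequency and compares a document's ranking against per-language profiles with an "out-of-place" distance. A minimal Python sketch of the idea (function names and the tiny training texts are illustrative, not TextCat's actual code):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Rank character n-grams (length 1..n_max) by frequency, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"          # pad words, as Cavnar & Trenkle do
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common(top)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, penalty=None):
    """Sum of rank differences; n-grams missing from the language profile
    get a fixed maximum penalty."""
    if penalty is None:
        penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(g, penalty))
               for g, rank in doc_profile.items())

def guess_language(text, profiles):
    """Pick the language whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice the profiles are trained on large corpora, and keeping roughly the top 300-400 n-grams per language is usually enough for robust identification.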

Installation Edit the text_cat script to have the first line point to your Perl binary. Usage text_cat -h displays usage information. Remotely related links The Survey on the State of the Art in Human Language Technology contains a chapter on language identification (both for spoken and written language). Interesting test cases.
saffsd/langid.py.
chromium-compact-language-detector - C++ library and Python bindings for detecting language from UTF8 text, extracted from the Chromium browser.
language-detection - Language Detection Library for Java.

This is a language detection library implemented in plain Java (aliases: language identification, language guessing). It can generate language profiles from Wikipedia abstract XML and detect the language of a text using a naive Bayesian filter, with over 99% precision for 53 languages. Available packages are on Downloads.
03/03/2014 Distributed a new package with short-text profiles (47 languages); built the latest code; removed the Apache Nutch plugin (API deprecation).
01/12/2012 Migrated the language-detection repository from Subversion to Git for Maven support.
09/13/2011 Added language profiles for Estonian, Lithuanian, Latvian and Slovene.
import java.util.ArrayList;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.Language;
Copyright (c) 2010-2014 Cybozu Labs, Inc. All rights reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
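The naive Bayesian filter the library describes can be illustrated with a toy Python sketch over character trigrams. This is a deliberate simplification for illustration only, not the Cybozu implementation (which trains smoothed n-gram probabilities from Wikipedia profiles); all names here are my own:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesLangDetector:
    """Toy naive Bayes language classifier over character trigrams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)   # lang -> trigram counts
        self.totals = Counter()              # lang -> total trigram count

    def _grams(self, text):
        text = f" {text.lower()} "
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def train(self, lang, text):
        grams = self._grams(text)
        self.counts[lang].update(grams)
        self.totals[lang] += len(grams)

    def detect(self, text):
        """Score each language by summed log-probabilities of the text's
        trigrams, with add-one smoothing and a uniform language prior."""
        vocab = len({g for c in self.counts.values() for g in c})
        best, best_lp = None, float("-inf")
        for lang in self.counts:
            lp = sum(math.log((self.counts[lang][g] + 1) /
                              (self.totals[lang] + vocab))
                     for g in self._grams(text))
            if lp > best_lp:
                best, best_lp = lang, lp
        return best
```

Real systems get their "99% over 53 languages" figures from large training corpora and careful smoothing; the mechanics, though, are exactly this scoring loop.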

Language Detection Library for Java. Language Identification System: how to recognize languages other than English. When you are trying to understand humans, the first step is to be able to hear/read them, and the second is to be able to identify the language they use. You can appreciate how difficult this is the first time you are in a place where nobody speaks your language. What's the problem? When you have to process and understand a text with text-engineering tools, the first step is to identify the language in order to use the right set of data and tools. You won't use the same tokenizer and processing tools if you are reading English or Hebrew. The problem gets more complicated if you are trying to recognize languages that look the same but are different, like Italian and Spanish, or even worse, Catalan and Portuguese. E.g.: "Presunto" is written exactly the same in Spanish and Portuguese but means completely different things in the two languages.
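One cheap way to separate closely related languages is to count hits against per-language stopword lists, since function words differ even when content words overlap. A hedged Python sketch (the tiny stopword sets below are illustrative samples, nowhere near complete lists):

```python
# Hypothetical mini stopword lists; real systems use hundreds of entries.
STOPWORDS = {
    "es": {"el", "la", "los", "las", "y", "es", "un", "una", "que", "de"},
    "pt": {"o", "a", "os", "as", "e", "é", "um", "uma", "que", "de"},
    "ca": {"el", "la", "els", "les", "i", "és", "un", "una", "que", "de"},
}

def detect_by_stopwords(text):
    """Score each language by how many of its stopwords occur in the text
    and return the language with the most hits."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```

Note how this resolves the "Presunto" example: the ambiguous content word contributes nothing, and the surrounding function words decide the language.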

But how is it done? And here are the results… those are the times:

        Snowball   StopWords   Google   Hybrid
Time    4-5 ms     1-2 ms      110 ms   5 ms

en = "Ingles"

Primary Language in HTML. World Wide Web Consortium Note 13-March-1998 This version: Latest Version: Editor: M.T. Carrasco Benitez [CAR] <manuel.carrasco@emea.eudra.org> Status of this document This document is a NOTE made available by the W3 Consortium for discussion only. This document recommends how to mark the primary language(s) in an HTML document. Abstract In HTML elements, the lang attribute specifies the natural language. Overview Most existing documents are monolingual. Some documents are bilingual, and few are trilingual or n-lingual. The main reason for the existence of n-lingual documents is political; i.e., in certain situations it is not politically correct to assume a base language. Another approach to choosing the language is to set the client (e.g., the browser) to the preferred language(s).
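The lang attribute the note describes can be read programmatically. A small sketch using Python's standard html.parser module (class and variable names are my own):

```python
from html.parser import HTMLParser

class LangAttrParser(HTMLParser):
    """Collect lang attributes; the one on <html> is the primary language."""

    def __init__(self):
        super().__init__()
        self.primary = None      # lang of the <html> element, if any
        self.langs = []          # every (tag, lang) pair seen

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if lang:
            self.langs.append((tag, lang))
            if tag == "html" and self.primary is None:
                self.primary = lang

doc = '<html lang="fr"><body><p>Bonjour</p><p lang="en">Hello</p></body></html>'
p = LangAttrParser()
p.feed(doc)
# p.primary == "fr"; p.langs == [("html", "fr"), ("p", "en")]
```

This mirrors the note's recommendation: a single primary language on the document element, with finer-grained lang attributes on inner elements for mixed-language content.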

Where to specify the primary language(s) There should be one recommended place to specify the primary language(s). References. Understanding The SEO Challenges Of Language Detection. Last time, I reviewed how to effectively manage multilingual content segmentation by looking at ways to use directories, parameters and other methods to optimize local market content.

Once we have our content sorted, the next challenge is how to direct users, and more importantly search spiders (crawlers), to this content. The purpose of this article is not to argue the user experience or even the philosophical issues of your method of matching visitors to specific country or language content, but rather to ensure that whatever means you choose, you do not negatively impact your search performance. The problem with language detection and redirection. So what's the big deal? It makes sense that a person from Germany gets redirected to our German content, doesn't it? Let's review some of the more common detection and redirection processes used by sites, and their negative implications for search spiders: dynamic detection and redirecting by IP location, browser language preference, and language detection. How to detect which language a text is written in? Or when science meets human! « The Nameless One.
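Browser language preference, one of the detection signals listed above, arrives in the Accept-Language request header as language tags with optional quality values. A simplified Python sketch of parsing it (the function name is illustrative, and a production server should use a hardened parser):

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (language, q) pairs,
    highest preference first. Simplified: no validation of tags."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        lang = parts[0].strip()
        q = 1.0                       # default quality when q= is absent
        for p in parts[1:]:
            p = p.strip()
            if p.startswith("q="):
                q = float(p[2:])
        prefs.append((lang, q))
    return sorted(prefs, key=lambda lq: lq[1], reverse=True)

# e.g. parse_accept_language("de-DE,de;q=0.9,en;q=0.8")
# -> [('de-DE', 1.0), ('de', 0.9), ('en', 0.8)]
```

The SEO caveat from the article applies regardless of how cleanly you parse this header: crawlers may send no Accept-Language at all, so redirecting solely on it can hide whole language versions from search engines.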

As I mentioned earlier in my spam attack analysis, I wanted to know which language the spam I receive is written in. My first brute-force idea was to take each word one by one and look it up in English/French/German/… dictionaries to see whether it appeared there. But with this approach I would miss all the conjugated verbs (unless I had a really good dictionary, like the one I now have in a Firefox plugin). Then I remembered that languages can differ in the distribution of their alphabet letters, but I had no statistics about that… That was it for my own brainstorming; I decided to have a look at what Google thinks about this problem.
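The letter-distribution idea brainstormed above can be sketched in a few lines of Python: estimate relative letter frequencies per language, then pick the language whose distribution is closest to the sample's (function names and the tiny sample texts are illustrative):

```python
from collections import Counter

def letter_distribution(text):
    """Relative frequency of each alphabetic character in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    return {c: n / total for c, n in Counter(letters).items()}

def distance(d1, d2):
    """Sum of absolute per-letter frequency differences."""
    keys = set(d1) | set(d2)
    return sum(abs(d1.get(k, 0.0) - d2.get(k, 0.0)) for k in keys)

def nearest_language(text, lang_dists):
    """Return the language whose letter distribution is closest."""
    sample = letter_distribution(text)
    return min(lang_dists, key=lambda lang: distance(sample, lang_dists[lang]))
```

This is exactly the "statistics about letter distributions" the author lacked; with reference distributions trained on a few paragraphs per language, it already separates distant languages, though it struggles on the short texts and similar languages that n-gram methods handle better.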

I first landed on some online language detector… The easy solution would have been to use this service, which must have some cool algorithms behind it, but I needed to know what kind of algorithm it could be, and I didn't want to rely on any third-party web service. I found the N-gram approach on page 8 (chapter 4) rather interesting. LanguageIdentifier.com -- Automatic Language Detection Software. Have you ever come across documents or websites and not known what language they were written in? The Lextek Language Identifier can identify not only the language a document is written in, but also its character encoding and which other languages are most similar. No more wondering what languages you are working with or looking at. Free Download!

Free For Non-Commercial Use! The Lextek Language Identifier is free for personal use on your non-commercial projects. Please feel free to share it with your friends and those you work with. The Language Identifier is free to download and free to share with others for non-commercial purposes. Easy To Use An easy-to-use interface lets you quickly and easily identify the language of your documents. More Languages! The Lextek Language Identifier offers more language and encoding modules than any other language identifier. Identify the Encoding / Character Set Software Developers Version Available.