Lxml - Processing XML and HTML with Python


Pattern
Pattern is a web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization. The module is free, well-documented and bundled with 50+ examples and 350+ unit tests.
Download
Installation
Pattern is written for Python 2.5+ (no support for Python 3 yet). To install Pattern so that the module is available in all Python scripts, run from the command line:
    > cd pattern-2.6
    > python setup.py install
If you have pip, you can automatically download and install from the PyPI repository: If none of the above works, you can make Python aware of the module in three ways:
Quick overview
pattern.web
pattern.en
The pattern.en module is a natural language processing (NLP) toolkit for English.
pattern.vector
Case studies

Requests: HTTP for Humans — Requests 2.1.0 documentation

karrigell - A web framework for Python 3.2+
Karrigell is a Pythonic web framework, very simple to learn and to use. Karrigell's design is about simplicity for the programmer and integration of all the web environment in the scripts namespace. All the HTML tags are available as classes in the scripts namespace:
    def index():
        return HTML(BODY("Hello world"))
To build an HTML document as a tree, the HTML tag objects support the operators + (add sibling) and <= (add child):
    def index():
        form = FORM(action="insert", method="post")
        form <= INPUT(name="foo") + BR() + INPUT(name="bar")
        form <= INPUT(Type="submit", value="Ok")
        return HTML(BODY(form))
The scripts can be served by a built-in web server, or through the Apache server, either in CGI mode or using the WSGI interface. This project is a rewrite of the Python 2.x version, adapted to Python 3.2 and later, with some incompatibilities. BuanBuan is a wiki application based on Karrigell 4. kforum is a forum application. kftp is an FTP-like program to manage files and folders online.

How to Setup Your Own Web Proxy Server For Free with Google App Engine [Video Tutorial]
12 Nov 2013
Learn how you can easily create your own online proxy server for free using Google App Engine without requiring any hosting plan or even a domain name. Do a Google search like “proxy servers” and you’ll find dozens of PHP proxy scripts on the Internet that will help you create your own proxy servers in minutes for free. If you don’t have a web domain or haven’t rented any server space, you can still create a personal proxy server for free, without requiring any technical knowledge. Here’s one such proxy site that you can build for your friends in China or even for your personal use (say, for accessing blocked sites from the office). Go to and sign-in using your Google Account.
Next Steps – Setting up a Free Proxy with Google
You can edit the main.html file to change the appearance of your proxy website. This proxy works with Flash videos (like YouTube and ABC News) though not with Hulu.

An Introduction to Compassionate Screen Scraping
Screen scraping is the art of programmatically extracting data from websites. If you think it's useful: it is. If you think it's difficult: it isn't. And if you think it's easy to really piss off administrators with ill-considered scripts, you're damn right. This is a tutorial on not just screen scraping, but socially responsible screen scraping. It's an amalgam of getting the data you want and the Golden Rule, and reading it is going to make the web a better place. We're going to be doing this tutorial in Python, and will use the httplib2 and BeautifulSoup libraries to make things as easy as possible. Websites crash. For my blog, the error reports I get are all generated by overzealous webcrawlers from search engines (perhaps the most ubiquitous species of screenscraper). This brings us to my single rule for socially responsible screen scraping: screen scraper traffic should be indistinguishable from human traffic. Cache fervently. Setup Libraries Choosing a Scraping Target Ending Thoughts
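The two habits named in that excerpt, caching every page and keeping request rates close to what a human would produce, can be sketched roughly as follows. This is only an illustration, not the tutorial's own code: it swaps in the standard library's urllib.request for httplib2, and the class name PoliteFetcher, the min_delay parameter and the injectable fetch, clock and sleep hooks are all assumptions made for this sketch.

```python
import time
import urllib.request


class PoliteFetcher:
    """Hypothetical helper: caches responses and spaces out real requests."""

    def __init__(self, fetch=None, min_delay=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        # fetch, clock and sleep are injectable so the class can be
        # exercised without touching the network or actually waiting.
        self._fetch = fetch or self._http_get
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._cache = {}            # url -> response body
        self._last_request = None   # time of the last real request

    @staticmethod
    def _http_get(url):
        # Plain standard-library GET; the tutorial itself uses httplib2.
        with urllib.request.urlopen(url) as response:
            return response.read()

    def get(self, url):
        # Serve repeated requests from the cache: no traffic at all.
        if url in self._cache:
            return self._cache[url]
        # Otherwise wait until at least min_delay seconds have passed
        # since the previous real request.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

An in-memory dictionary only helps within a single run. A scraper that is restarted often would want the cache on disk, so that a crash or a code change does not trigger a fresh round of requests against the same site.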

Python Script - Plugin for Notepad++

HTML Scraping
Web Scraping
Web sites are written using HTML, which means that each web page is a structured document. Sometimes it would be great to obtain some data from them and preserve the structure while we’re at it. Web sites don’t always provide their data in comfortable formats such as csv or json. This is where web scraping comes in.
lxml and Requests
lxml is a pretty extensive library written for parsing XML and HTML documents very quickly, even handling messed up tags in the process. Let’s start with the imports:
    from lxml import html
    import requests
Next we will use requests.get to retrieve the web page with our data, parse it using the html module and save the results in tree:
    page = requests.get('…')
    tree = html.fromstring(page.content)
(We need to use page.content rather than page.text because html.fromstring implicitly expects bytes as input.) XPath is a way of locating information in structured documents such as HTML or XML documents.
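To make the XPath step concrete, here is a minimal sketch that runs two XPath queries against an inline document instead of a downloaded page. The markup, the buyer-name and item-price attribute values, and the names in it are invented for illustration; only html.fromstring and the xpath method come from lxml, which must be installed.

```python
from lxml import html

# Hypothetical page content, inlined as bytes so that no network
# request is needed (html.fromstring is given bytes, as with page.content).
PAGE = b"""
<html>
  <body>
    <div title="buyer-name">Alice Example</div>
    <span class="item-price">$10.00</span>
    <div title="buyer-name">Bob Example</div>
    <span class="item-price">$25.50</span>
  </body>
</html>
"""

tree = html.fromstring(PAGE)

# Select the text of every <div> whose title attribute is "buyer-name".
buyers = tree.xpath('//div[@title="buyer-name"]/text()')

# Select the text of every <span> whose class attribute is "item-price".
prices = tree.xpath('//span[@class="item-price"]/text()')

for buyer, price in zip(buyers, prices):
    print(buyer, price)
```

Each query returns a plain Python list of strings in document order, which is why the two lists can be zipped together here.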

CubicWeb Semantic Web Framework

Hello Susana, we are decoding you...
Each key on a telephone is assigned a distinct sound. If our ears were able to tell these sounds apart, we could know which number was dialed just by listening to the sound the key emits when pressed. That is very difficult, if not impossible. It is what I try to do with the "Diva de los Teléfonos". The following table shows how the sound associated with each key of an ordinary telephone is composed: Frequency components of each telephone number. For example, the number 7 is the sum of one sinusoid at 852 Hz and another at 1209 Hz.
Susana's audio
What I did was search YouTube for a video of Susana dialing, then keep the audio of that video, in particular the seconds in which she dials. Audacity configured according to the guide above. The "alsamixer" program (run in a console), which controls the sound card. Audio recorded with Audacity from the YouTube video.
Mathematical tools
Code
Results
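The idea that each key is a sum of two sinusoids, and that the key can be recovered by finding which two frequencies carry the most energy, can be sketched as below. This is not the author's code: it uses the Goertzel algorithm as one common way to measure energy at a single frequency, it tests on a synthesized tone rather than recorded audio, and the function names, the 8000 Hz sample rate and the 0.1 s duration are assumptions. The frequency table is the standard DTMF layout, of which the excerpt quotes only the entry for 7.

```python
import math

# Standard DTMF layout: one row frequency plus one column frequency per key.
ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477)
KEYPAD = ("123", "456", "789", "*0#")


def make_tone(key, sample_rate=8000, duration=0.1):
    """Synthesize the two-sinusoid tone for one key (illustrative helper)."""
    for r, row in enumerate(KEYPAD):
        c = row.find(key)
        if c != -1:
            f_row, f_col = ROW_FREQS[r], COL_FREQS[c]
            break
    else:
        raise ValueError("unknown key: %r" % (key,))
    n = int(sample_rate * duration)
    return [math.sin(2 * math.pi * f_row * t / sample_rate)
            + math.sin(2 * math.pi * f_col * t / sample_rate)
            for t in range(n)]


def goertzel_power(samples, freq, sample_rate):
    """Signal power at a single frequency, via the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def decode_key(samples, sample_rate=8000):
    """Pick the strongest row and column frequency and look up the key."""
    row = max(range(len(ROW_FREQS)),
              key=lambda i: goertzel_power(samples, ROW_FREQS[i], sample_rate))
    col = max(range(len(COL_FREQS)),
              key=lambda i: goertzel_power(samples, COL_FREQS[i], sample_rate))
    return KEYPAD[row][col]


print(decode_key(make_tone("7")))
```

Real recordings, such as audio captured from a video, add noise, speech and silence between key presses, so a practical decoder would also need to split the signal into short windows and reject windows where no pair of frequencies clearly dominates.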

TxtRoo Launches A Yelp For The Feature Phone Market TxtRoo, a company that was invented and coded into existence from the Stanford StartupBus headed to SXSW in Austin, Texas last week, can best be described as a Yelp for the feature phone market. Like Yelp, which delivers user reviews and other local business information to both web and smartphone users, TxtRoo aims to do the same using SMS text messages. What’s most interesting about TxtRoo, however, is how quickly it was able to sign up customers. The team had commitments from half a dozen local businesses before the prototype was even finished – about 6 hours into its development. The company was one of the many teams on this year’s StartupBus event, a 72-hour hackathon that challenges entrepreneurs to design, build and pitch a business idea while traveling from their hometowns to SXSW in Austin, Texas. “The majority of Latino businesses we pitched to were so excited about our offering that then and there they decided to partner with us,” he says. Here’s how TxtRoo works:

mechanize
Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize. The examples below are written for a website that does not exist, so cannot be run. There are also some working examples that you can run.
    import re
    import mechanize

    br = mechanize.Browser()
    br.open("…")
    # follow second link with element text matching regular expression
    response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
    assert br.viewing_html()
    print(br.title())
    print(response1.geturl())
    print(response1.info())  # headers
    print(response1.read())  # body

    br.select_form(name="order")
    # Browser passes through unknown attributes (including methods)
    # to the selected form.
    br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
    # Submit current form.
    response2 = br.submit()

    # print currently selected form (don't call .submit() on this, use br.submit())
    print(br.form)
mechanize exports the complete interface of urllib2: