
Screen Scraping with Node.js

You may have used NodeJS as a web server, but did you know that you can also use it for web scraping? In this tutorial, we'll review how to scrape static web pages - and those pesky ones with dynamic content - with the help of NodeJS and a few helpful NPM modules. Web scraping has always had a negative connotation in the world of web development - and for good reason. In modern development, APIs are present for most popular services and they should be used to retrieve data rather than scraping. The inherent problem with scraping is that it relies on the visual structure of the page being scraped. Despite these flaws, it's important to learn a bit about web scraping and some of the tools available to help with this task. Note: If you can't get the information you require through an API or a feed, it's a good sign that the owner does not want that information to be accessible. Scrapers can be written in any language, really. Let's start with the simple use-case: static web pages.
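For the static case, the idea can be sketched roughly as follows. This is a minimal illustration, not the tutorial's own code: `extractLinks` and `scrape` are hypothetical names, the regular expression stands in for a proper HTML parser (the NPM modules the tutorial refers to would normally do this job), and the global `fetch` call assumes Node 18 or later.

```javascript
// Hypothetical helper: pull href/text pairs out of a static HTML string.
// A regex is used only to keep this sketch dependency-free; it will miss
// many real-world cases that an HTML parser module handles correctly.
function extractLinks(html) {
  const links = [];
  const anchor = /<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gis;
  let match;
  while ((match = anchor.exec(html)) !== null) {
    links.push({
      href: match[1],
      // Strip any nested tags from the link text.
      text: match[2].replace(/<[^>]+>/g, '').trim(),
    });
  }
  return links;
}

// Hypothetical wrapper: download a static page and scrape its links.
// Assumes Node 18+ where fetch is available globally.
async function scrape(url) {
  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  return extractLinks(await response.text());
}

// Offline demonstration on an inline snippet of markup.
const sample =
  '<ul><li><a href="/a">First</a></li>' +
  '<li><a class="x" href="/b"><b>Second</b></a></li></ul>';
console.log(extractLinks(sample));
```

Because the extraction depends on the exact markup of the page, any change to the page's structure can silently break it, which is the inherent problem with scraping described above.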

Data mining local radio with Node.js More harpsichord?! Seattle is lucky to have KINGFM, a local radio station dedicated to 100% classical music. As one of the few existent classical music fans in their twenties, I listen often enough. Over the past few years, I've noticed that when I tune to the station, I always seem to hear the plinky sound of a harpsichord. Before I sent KINGFM an email, admonishing them for playing so much of an instrument I dislike, I wanted to investigate whether my ears were deceiving me. Perhaps my own distaste for the harpsichord increased its impact in my memory. This article outlines the details of this investigation and especially the process of collecting the data. If it ain't baroque... A harpsichord is in many ways similar to the piano. The harpsichord can sound tinny to modern ears. At the start of the 18th century, the newly invented fortepiano began to push both the harpsichord and its close relative, the clavichord, out of favor. These eras are: One exception is opera. Collecting the data Cheerio

Web Development Course Online - How To Build A Blog When does the course begin? This class is self paced. You can begin whenever you like and then follow your own pace. It’s a good idea to set goals for yourself to make sure you stick with the course. How long will the course be available? This class will always be available! How do I know if this course is for me? Take a look at the “Class Summary,” “What Should I Know,” and “What Will I Learn” sections above. Can I skip individual videos? Yes! What are the rules on collaboration? Collaboration is a great way to learn. Why are there so many questions? Udacity classes are a little different from traditional courses. What should I do while I’m watching the videos? Learn actively!

V8 javascript VM and Node.js memory management options | O sNAp Memory management behavior is one of the first topics I wanted to understand in node. This will be part one of two articles in which I intend to explore: Memory management / gc options in the V8 VM that runs node.js applications. Debugging / memory leak analysis for running node servers. At a high level V8 uses a generational memory model with a copy collector and incremental mark and sweep. Configuring V8 heap sizes Out of memory errors? --max_new_space_size (in kBytes) Control the size of the new generation. --max_old_space_size (in Mbytes) Control the size of the old generation. --max_executable_size (in Mbytes) The code space size. Controlling when GCs occur in V8 By default, V8 will perform garbage collections on failed allocations, after every N allocations, and on a notification of idle condition by its embedder. --gc_global This controls whether V8 will do automatic garbage collection after every gc_interval allocations. --gc_interval --nouse-idle-notification --expose-gc --trace_gc
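To see what the V8 heap flags above actually change, a script can report the heap limit it is running under. The following is a small sketch rather than anything from the article: `heapSummary` and `collectIfExposed` are hypothetical helper names, built on Node's `v8.getHeapStatistics()` and on the `global.gc` function that only exists when the process is started with `--expose-gc`.

```javascript
const v8 = require('v8');

// Hypothetical helper: summarize the current V8 heap in megabytes.
function heapSummary() {
  const stats = v8.getHeapStatistics();
  const toMB = (bytes) => Math.round(bytes / 1024 / 1024);
  return {
    heapLimitMB: toMB(stats.heap_size_limit), // upper bound on heap growth
    totalHeapMB: toMB(stats.total_heap_size), // currently reserved
    usedHeapMB: toMB(stats.used_heap_size),   // currently in use
  };
}

// Hypothetical helper: force a collection only if --expose-gc was passed.
// Returns true when a collection was requested, false otherwise.
function collectIfExposed() {
  if (typeof global.gc === 'function') {
    global.gc();
    return true;
  }
  return false;
}

console.log(heapSummary());
console.log('forced GC:', collectIfExposed());
```

Running the script once as `node script.js` and once as `node --expose-gc --max_old_space_size=512 script.js` (the file name is a placeholder) should show a different `heapLimitMB` and a different `forced GC` value. Which flags are still accepted depends on the Node and V8 version in use.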

How to Finally Play the Guitar: 80/20 Guitar and Minimalist Music When will you stop dreaming and start playing? (Photo: Musician “Lights”, Credit: Shandi-lee) I’ve always wanted to play the guitar. It started as a kid, listening to my dad play around the fireplace during the holidays. But I never thought I could do it myself. Despite tackling skills as esoteric as Japanese horseback archery, I somehow put music in a separate “does not apply” category until two years ago. My fascination with guitar wasn’t rekindled until Charlie Hoehn, an employee of mine at the time, showed me the 80/20 approach to learning it. This post explains how to get the most guitar mileage and versatility in the least time… Do you have any additional tips, whether for guitar or applying the 80/20 principle to another instrument? Enter Charlie Almost everyone has fantasized about performing music in front of a huge screaming crowd at some point in their life. Comprehensive comes later. The Ground Rules 1. 2. 3. Getting Started Next, you’ll want to buy a capo. Capo on the second fret.

dominictarr/JSONStream style guide Opinions are like assholes, every one has got one. This one is mine. Punctuation: who cares? Punctuation is a bikeshed. This post is concerned with higher-order style. Be obvious Don't do something complex just to make your api simpler. Example, avoid chaining DSLs. This is bad: thing.when('something').then(doThing) It's not really obvious how when relates to doThing. Chaining where you simply return this is acceptable. Be idiomatic, or not. if possible, make your code follow the APIs in node core. If you don't do this, you need documentation. if you can't follow an idiomatic API precisely, do something completely different. ALWAYS PASS ERR IN CALLBACK, (an event listener is not a callback, so this doesn't apply in that case) If you have a function called createServer, it should return a server, and it should have a listen function. createServer should never start the server listening. my favorite API from node is Stream (more on that later) "all you need is lambdas" -- John Lennon now, stream it:
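Two of the rules in this style guide, always passing err in the callback and keeping createServer from listening, could look something like this in practice. It is only an illustrative sketch: `readJson` and the `greeting` parameter are made-up names, not part of JSONStream or of the guide itself.

```javascript
const fs = require('fs');
const http = require('http');

// Error-first callback: err is always the first argument,
// and it is null when the operation succeeded.
function readJson(path, callback) {
  fs.readFile(path, 'utf8', (err, text) => {
    if (err) return callback(err);
    let data;
    try {
      data = JSON.parse(text);
    } catch (parseErr) {
      return callback(parseErr);
    }
    callback(null, data);
  });
}

// createServer returns a server with a listen function,
// but never starts listening itself.
function createServer(greeting) {
  return http.createServer((req, res) => {
    res.end(greeting);
  });
}

const server = createServer('hello');
console.log('listening yet?', server.listening);
// The caller decides when and where to listen, for example:
// server.listen(8000);
```

Following the shape of node core here means a reader who already knows `fs.readFile` and `http.createServer` can guess how these functions behave without extra documentation.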

Botsikas' Blog: Node.js modules cross platform compilation using gyp Update: I have made a pull request where you can find the updated tools discussed in this article, located here Node.js has been using waf (node-waf) to configure and build modules up to version 0.4. From v0.6 and on, the team has moved on to gyp (Generate Your Projects) which seems to be a bit more promising when it comes to cross platform compilation. This post shows how to create a simple gyp file to build your own custom native node.js modules and provides some scripts to automate the project generation process. A bit of history Gyp is a Google project that was created to support cross platform building of the open source Chromium project. Node-waf vs gyp Up to version 0.4 the node.js team offered node-waf (a waf 1.5.3 wrapper script) to configure and build modules for node.js. Node Module’s gyp file I have edited a simple gyp file (see the end of this post for source code) to compile the simple hello world native nodejs module I have used in my previous posts (here and here).

Alexander Luksidadi's Blog » ExpressJS without Jade? Use Underscore template! Many of you must have felt burdened knowing that Express recommends learning another template language (Jade). Don’t worry, you can write all your templates in HTML using underscoreJS! Oh yay? Let’s take a look at how you implement that in your express app. First install the express package and create your express app: $ npm install -g express $ express . Install the underscore package: $ npm install -d underscore If you edit Now, all you need to do is comment out 1 line and register underscorejs: Now, go to routes/index.js: $ vi routes/index.js Change the template name from ‘index’ to ‘index.html’: Next, go to the views directory and create layout.html And last, still in the views directory, create another file called index.html And there you go.. you can write your HTML code in peace =)

Things I wish I knew about MongoDB a year ago I’ve used MongoDB for over a year at scale at both Heyzap and Bugsnag and I’ve found it to be a very capable database. As with all databases, there are some gotchas, and here is a summary of the things I wish someone had told me earlier. Selective counts are slow even if indexed For example, when paginating a user's feed of activity, you might see something like, In MongoDB this count can take orders of magnitude longer than you would expect. There is an open ticket for this, currently slated for 2.4, so here’s hoping they’ll get it out. Inconsistent reads in replica sets When you start using replica sets to distribute your reads across a cluster, you can get yourself in a whole world of trouble. This is compounded if you have performance issues that cause the replication lag between a primary and its secondaries to increase to minutes or even hours in some cases. Range queries are indexed differently I have found that range queries use indexes slightly differently from other queries. Profiler

moshen/node-googlemaps
