20 short tutorials all data scientists should read (and practice)

How to Become a Data Scientist

These days you can get a degree in data science, so you can show a diploma that certifies your credentials. But these degrees are relatively new so, with all due respect, if you only recently got yours you are still a beginner. Those of us who use this title today most likely came from some combination of backgrounds in business, hard science, computer science, operations research, and statistics. What you call yourself is one thing, but what your employer or client is looking for can be quite a different kettle of fish. A lot has been written about data scientists being as elusive as unicorns. Not being a unicorn, I’d say this sets the bar pretty high. All of this confusion over what we’re called and what we actually do can make you downright schizophrenic.

Four Types of Data Scientists

The information here comes from the O’Reilly paper “Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. There are 40 pages of good analysis there, so this will be only the highest-level summary.

Small Pond Science | Research in a teaching institution

District Data Labs - How to Transition from Excel to R

An Intro to R for Microsoft Excel Users
Tony Ojeda

In today's increasingly data-driven world, business people are constantly talking about how they want more powerful and flexible analytical tools, but are usually intimidated by the programming knowledge these tools require and the learning curve they must overcome just to be able to reproduce what they already know how to do in the programs they've become accustomed to using. If you're an Excel user and you're scared of diving into R, you're in luck. Excited? Quick note before we do: There are usually multiple ways to do everything in R.

The Basics

Let's start with the basics. You'll also want to install and load the ggplot2 library, which not only contains the data set we want to use but will also come in handy when we get to creating charts and graphs later.

install.packages("ggplot2")
install.packages("dplyr")
library(ggplot2)
library(dplyr)

OK, so let's take an initial look at the data.

Summaries

Myths and Legends about Big Data / VimpelCom (Beeline) company blog

One of our clusters for pilot tasks (Data node: 18 servers /2 CPUs, 12 Cores, 64GB RAM/, 12 Disks, 3 TB, SATA, HP DL380g)

- What is Big Data, anyway? Everyone knows that it means processing huge data sets. But working with an Oracle database of, say, 20 gigabytes or 4 petabytes is not yet Big Data; it is just a highload database.
- So what is the key difference between Big Data and "ordinary" highload systems? The ability to build flexible queries.
- Where does this new load come from?
- Is there an example of such a task?
- And how is it solved?
- So let's just scale them, and the problem will be solved?
- So what do we end up with?
- But that is monstrously slow, isn't it? Short queries with a small number of joins.
- What are some well-known examples of Big Data in use?
- Then why is everyone at conferences talking about Big Data?
- So one of the goals of Big Data is the ability to get away from long project cycles?
- Are there examples of tasks already solved where this was visible?
- What is the structure of the platform?

The Tree of Life

refsmmat

Data Science Cheat Sheet

I will update this article regularly. An old version can be found here and has many interesting links. None of the material presented here is in the old version.

1. A laptop is the ideal device. Even if you work heavily on the cloud (AWS, or in my case, access to a few remote servers mostly to store data, receive data from clients and backups), your laptop is your core device to connect to all external services (via the Internet).

2. Once you have installed Cygwin, you can type commands or execute programs in the Cygwin console.

Figure 1: Cygwin (Linux) console on Windows laptop

You can open multiple Cygwin windows on your screen(s). To connect to an external server for file transfers, I use the Windows FileZilla freeware rather than the command-line ftp offered by Cygwin. You can run commands in the background using the & operator.

$ notepad VR3.txt &

A few more things about files

Other extensions include

Files are not stored exactly the same way in Windows and UNIX.

File management

3. Examples
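The & operator mentioned in the cheat sheet excerpt above can be tried with any command, not only notepad. What follows is a minimal sketch, not the author's own example: `sleep 1` is a stand-in chosen here for a slow program, and the variable names `bg_pid` and `bg_status` are made up for illustration. It should behave the same way in a Cygwin console as in any other POSIX-style shell.

```shell
# Start a command in the background with the & operator.
# `sleep 1` is a placeholder for any slow program (an assumption for illustration).
sleep 1 &
bg_pid=$!    # $! holds the process ID of the most recent background job

# The shell does not wait for the job, so this line runs while sleep is still going.
echo "started background job"

# Block until the background job finishes, then record its exit status.
wait "$bg_pid"
bg_status=$?
echo "background job exited with status $bg_status"
```

Without the trailing &, the console would be unavailable until the command finished; with it, the prompt comes back right away, and `wait` is only needed when a later step depends on the job being done.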

Dynamic Ecology | Multa novit vulpes

Intuitive Biostatistics - Intro

Big Data Technology Suite of Cloud Services

Big Data doesn’t need to be so hard

We provide value faster and with less complexity with a cloud services approach

Infochimps™ Cloud is a suite of cloud services that makes it faster and far less complex to develop and deploy Big Data applications. Our cloud services handle all of the complex Big Data technologies and processes, giving you a simple, developer-friendly interface. Infochimps Cloud lets you focus on creating the applications that will drive value for your business instead of spending your time managing a Big Data “infrastructure stack.”

Cloud::Streams: Streaming data and real-time analytics
Cloud::Queries: NoSQL database and ad hoc, query-based analytics
Cloud::Hadoop: Elastic Hadoop clusters and batch analytics

Infochimps Cloud eliminates all the implementation headaches caused by Big Data, enabling your Big Data applications to be completed quickly and fully achieve their objectives.

Flexible, cost-effective cloud deployment

blog :: Peccoud Lab

DNAFactory: A Gene Synthesis Game
We present our very first video game! Targeted toward middle- and high-school students, DNAFactory demonstrates the steps a gene synthesis factory or laboratory would go through to produce DNA molecules; specifically, when creating a DNA sample, the target sequence is subdivided into shorter DNA fragments called oligos, which can be synthesized using a chemical process. […]

Developing a culture of security for synthetic biology
I was recently interviewed by Andrew Snyder-Beattie from the Future of Humanity Institute.

Fostering Transdisciplinary Science with Cyberinfrastructures
I want to tell you how cyberinfrastructures can enable large research projects involving specialists from multiple disciplines and why it makes economic sense to develop such an infrastructure.

Chase the dream, not the money: the dirty little secrets of proposal writing

Development of Synthetic Flu Vaccines

Google DNA Map
I owe this gem to Mary Mangan from OpenHelix.
