Best way to learn data crunching?


Grurik

I'm mainly doing bench research. Most of my experiments are nicely outlined in a small Excel file and the statistical analysis is generally simple, so I rarely have to organize any data.

However, I would also like to get involved in the clinical research being conducted: mostly simple studies and some register-based ones. My goal is to be able to take registers or clinical data handed to me and organize them so they are ready for analysis (which will be done with the assistance of a biostatistician).

From talking to some students who don't really do this themselves but know people who do, there seem to be two approaches:
1) Organize the data separately for each analysis by copying and pasting, then copy the result into a statistics program.
2) Work from a fixed data set and perform the analyses using programming.

I guess 1) is the way to go for a short and small project, but if you really want to learn I guess 2) is the way to go. However, I have googled and have not found any guidance on this.
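
As a rough illustration of approach 2), here is a minimal sketch in Python using only the standard library. The thread itself leans toward R, so the language is purely a choice for this example. The column names (`patient_id`, `age`, `sex`, `treated`, `sbp`) and the values are made up. The raw data is parsed once and every analysis subset is derived from it by code, so nothing is copied and pasted by hand.

```python
import csv
import io

# Hypothetical raw register extract; in practice this would be a read-only file.
RAW_CSV = """patient_id,age,sex,treated,sbp
1,54,F,1,132
2,61,M,0,148
3,47,F,1,125
4,70,M,0,155
5,58,M,1,138
"""

def load_rows(text):
    """Parse the raw data once, converting numeric columns to numbers."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "patient_id": int(rec["patient_id"]),
            "age": int(rec["age"]),
            "sex": rec["sex"],
            "treated": rec["treated"] == "1",
            "sbp": float(rec["sbp"]),
        })
    return rows

def subset(rows, predicate):
    """Derive an analysis subset without modifying the raw data."""
    return [r for r in rows if predicate(r)]

rows = load_rows(RAW_CSV)
treated = subset(rows, lambda r: r["treated"])
older_men = subset(rows, lambda r: r["sex"] == "M" and r["age"] >= 60)

print(len(rows), len(treated), len(older_men))
```

Because each subset is defined by a rule rather than by hand, rerunning the script after the register is updated reproduces every subset the same way.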

What would you guys say is the best way to learn the data crunching part? Is it all about programming? Are there any good books? I guess I just don't want to look like a complete fool when someone hands me data to organize and analyze.

@mimelim (I happened to read one of your AMA threads in which you wrote about a monster when it comes to this)

It is hard to do this without a) a mentor and b) practice data sets. There are two components to data analysis. Part 1 is knowing why to use different tools to crunch the data. Part 2 is knowing how to use those different tools to crunch the data. Part 2 is relatively easy: pick your statistics package (I recommend R personally) and start reading about the common techniques. Part 1 is easy to get started on (a basic biostats textbook will get you going), but to become really proficient you need to work with someone who knows what they are doing, because even the most basic data analysis can have subtle nuances in its methodology.

As stupid as it sounds, if you have a particular field/type of study in mind, papers will have semi-detailed descriptions of what they did/why they did what they did when it comes to the data analysis and you can simply learn those techniques, why they are used and also probably just as importantly, when not to use them.

There really isn't much "data organizing" as most of that is done on the back-end within the statistical programs themselves.
 
Thanks a lot - always appreciate your helpful advice on this forum! Some people I have talked to have taken the free courses on R, guess that's a good start. As for the literature, I have looked into "Essential medical statistics" with very good reviews. A statistician will be present throughout the project and hopefully they will give me some good advice. I have some programming experience (basic level - introductory course in C) from earlier, but probably only slightly above the average medical student. Most important is probably to just start programming.

Depending on the project, I guess the ladder looks something like 1) calculating group characteristics, 2) single comparisons between variables, 3) multivariate and regression analysis, 4) creating your own mathematical models. I guess 1-2 can be learned quite fast, whereas for 3-4 a statistician is necessary.
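
Steps 1 and 2 above could look roughly like the following sketch, written in Python with only the standard library. The two groups and their blood pressure values are made up, and Welch's t statistic is just one possible choice of "single comparison", not something the thread prescribes.

```python
import math
import statistics

# Made-up systolic blood pressure values (mmHg) for two hypothetical groups.
treated = [120.0, 125.0, 130.0, 135.0]
control = [140.0, 145.0, 150.0, 155.0]

def describe(values):
    """Step 1: basic group characteristics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }

def welch_t(a, b):
    """Step 2: Welch's t statistic and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

summary_treated = describe(treated)
summary_control = describe(control)
t_stat, dof = welch_t(treated, control)

print(summary_treated)
print(summary_control)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the t statistic into a p-value needs the t distribution, which the Python standard library does not provide. That part is better left to a statistics package, or to the statistician on the project.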

Some questions:
- Do the medical students you supervise have programming skills? What's your impression of how fast students pick up the necessary skills (I guess it varies, but an estimate)?
- What are the typical projects? Chart review? Handed data and "good luck get this done"? Updating clinical trials?
- What is the typical timeline?

Again, thanks @mimelim !
 
I've seen some presentations where the results were produced with the help of MATLAB. It's user-friendly, especially for projects involving modeling, because of its extensive help/F1 documentation. In my experience MATLAB requires little to no programming thanks to its comprehensive help function. Not sure if Coursera offers a quick course on what you're asking about, but it's worth looking into since it's free.
 
There's a book, "Biostatistics with R", published by Springer, that's supposed to be good for people looking for an introduction.
 
There is nothing wrong with using Excel. You can actually do some fairly complicated analyses in Excel if you know what you're doing. Some statistical packages like SPSS and Stata have user interfaces that make data management a little more intuitive. Even if you are really just getting the data into shape for analysis and running very simple descriptive analyses, an entry level stats course is still probably your best bet for learning basic principles. You might try Coursera for some entry-level courses you can do online. A good stats textbook with computer-based exercises would also be a good thing to try.

Once you're comfortable being able to organize and structure a data set you could try some simple programming/syntax to transform data, run simple descriptive stats, do some diagnostics, etc. But that's pretty difficult to self-teach. Doing some sort of course is probably your best bet.
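
To give a feel for what "simple programming to transform data, run simple descriptive stats, do some diagnostics" might mean in practice, here is a small sketch in Python (standard library only). The records, the field names, and the plausibility range for age are all assumptions made up for the example.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 54, "sbp": 132.0},
    {"id": 2, "age": 61, "sbp": None},
    {"id": 3, "age": 470, "sbp": 125.0},  # implausible age, likely a typo
    {"id": 4, "age": 70, "sbp": 155.0},
]

def missing_counts(rows):
    """Diagnostic: count missing values per field."""
    counts = {key: 0 for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return counts

def out_of_range(rows, field, low, high):
    """Diagnostic: ids of records whose value falls outside [low, high]."""
    return [r["id"] for r in rows
            if r[field] is not None and not low <= r[field] <= high]

def age_band(age):
    """Transformation: recode a continuous age into two bands."""
    return "60+" if age >= 60 else "<60"

missing = missing_counts(records)
suspect_ages = out_of_range(records, "age", 0, 120)
bands = [age_band(r["age"]) for r in records if r["id"] not in suspect_ages]
median_sbp = statistics.median(r["sbp"] for r in records if r["sbp"] is not None)

print(missing, suspect_ages, bands, median_sbp)
```

Checks like these only flag records for review. Whether a suspicious value gets corrected, excluded, or kept is a decision to make with the PI or statistician, not something the script should do silently.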
 
(Hopefully) some answers to your questions.

Some questions:
- Do the medical students you supervise have programming skills? What's your impression of how fast students pick up the necessary skills (I guess it varies, but an estimate)?

I believe that most prior posters have answered this nicely. One answer: depends on the statistical package. What follows is only my personal opinion.

R: $0. Free and open source, but requires some programming knowledge. Expect to invest some time getting the desired packages and debugging for more advanced analyses.
Prism: $. For general scientists. Easy to use with simple UI, but no control over more subtle statistical parameters. Makes pretty graphs. Little programming knowledge required.
SPSS: $ to $$ depending on package. Also for general scientists. Again, easy to use, with a few more options. A little more control over statistical parameters. Little programming knowledge required.
Stata: $$ to $$$. Used by many statisticians. Command line type interface. Flexible control over statistical parameters. Downloadable modules enable updates as needed. Programming knowledge requirements can range from simple to intermediate.
MATLAB: $$$. More of an engineering software designed to do everything. Not a dedicated statistical package. Requires an intermediate amount of programming skill.
SAS: $$$$. Industry standard for statistics. Needs at least intermediate programming skill (read: write a basic program) to do even the simplest analysis.
Minitab: $$$$. Looks like a souped-up version of SPSS.

And there are many others. These are the most frequently mentioned in biomedical literature.


- What are the typical projects? Chart review? Handed data and "good luck get this done"? Updating clinical trials?

Projects can start from the very abstract with meta-analyses (which have their own set of dedicated software packages) down to case reports (if you have a very interesting patient). Avoid meta-analyses unless you have a PI who knows what they're doing. I have peer reviewed and destroyed poorly done meta-analyses.

If you're lucky, your PI will have assembled a registry, or better yet trial data, for you to analyze. Your PI should periodically meet with you to go over issues and progress for such projects. Weekly is good. For students with time, I expect no more than updating demographics, lab results, and clinical study reports. An ambitious student (with proper credentialing and training) can contact patients and their clinics for outcome data.

A word about chart reviews. Too many well-meaning junior faculty think, "Oh, I can get a database started with a chart review," and run off to do so. If you're unlucky and trip some privacy screen, an inquiry may lead to the IRB bopping you over the head and asking, "Why didn't you clear this with us?" Repeat violators can be banned from research entirely. Even if the research is IRB-exempt, it is safer to have the IRB tell you so.

Bottom line: Make sure your clinical human-based project has been cleared with an IRB.


- What is the typical timeline?

Depends on how far along a project has gone.

Starting from scratch may take at least a year, depending on how big a database has to be assembled and other factors.

If a database has been fully and properly assembled, then a 6-month window or sooner to manuscript submission is very possible.
 
These are often available as a course at your local university.
I'd say pick your location first. The reason is that, regionally, your mentors will have gone through the same university / MPH program. They will already be analyzing their research using one system. You'll probably be expected to do at least the MPH, and they will invariably have a course for SAS, Stata or such.
 
Getting an MPH to learn a statistical package = overkill.

Also, agree with Excel being suitable for most basic statistical analyses.
 