Thursday, June 6, 2013


I'm starting this blog to publicize what I'm working on, mostly because I'm
not sure how much of it will ever get formally published in a peer-reviewed
journal.  I think the next best thing is putting the software I've developed
out there for others to use, or at least to determine what is useful and what
is not.  I also simply want others to know what I'm working on.  So, hopefully,
whatever I have here will be useful to someone.  If you do use any of this,
please shoot me an email to let me know you are using it and how.  Thanks.

I recently started working on a GWAS experiment involving two candidate gene (CG) chips.  The genomic inflation factor on these is high and I need to find a way to bring it down.  If you have no idea what I'm talking about, welcome to the club.  I had no idea either.  I came across a few papers and tutorials that did a pretty good job of explaining how to do a GWAS analysis.  Three of them are from Nature Protocols (or Nature Methods).

One of the things I need to do is normalize/control for ethnicity.  Doing so from the candidate gene chips meant getting the bead IDs of each sample, determining the blinded sample ID (SID), using the SID to find the original GWAS chip, and mapping the SID to the GWAS chipid+chipsection.  Then I merged the GWAS chip data with HapMap3, did a PCA on the merged data, clustered the results, and determined which ethnicity group each sample falls into relative to the HapMap data.  This wasn't an easy process, as the GWAS data spanned over 12 chips and not all of the data is clean.  There were also some issues converting from Illumina 1/2 to top strand to forward strand.  I will post the full pipeline I used once this is done.

In any case, I got the majority of the ethnicity data from the GWAS data and tied it back to the CG chip data.  I now need to do a test of association using logistic regression and include ethnicity as a covariate.  The plink command I have is 'plink --file cg_data --logistic --covar ethnicity.covar --out data'.  But what is ethnicity.covar?  I can't find any information on this.
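As a rough illustration of the "cluster and see where the samples fall" step, here is a minimal Python sketch that assigns a sample to the nearest HapMap population centroid in PC1/PC2 space. This is a simple nearest-centroid stand-in, not necessarily the clustering used in the actual pipeline. The function names and the coordinates in the usage example are hypothetical, and the PCA is assumed to have been run already.

```python
import math
from collections import defaultdict


def population_centroids(reference):
    """Average the (PC1, PC2) coordinates of reference samples per population.

    `reference` is an iterable of (population_label, (pc1, pc2)) pairs,
    e.g. HapMap samples with known population labels.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # label -> [sum_pc1, sum_pc2, count]
    for label, (pc1, pc2) in reference:
        acc = sums[label]
        acc[0] += pc1
        acc[1] += pc2
        acc[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def nearest_population(point, centroids):
    """Return the label whose centroid is closest (Euclidean) to `point`."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    hapmap = [("CEU", (0.0, 0.0)), ("CEU", (0.2, 0.0)), ("YRI", (1.0, 1.0))]
    centroids = population_centroids(hapmap)
    print(nearest_population((0.05, 0.1), centroids))
```

A nearest-centroid call is only as good as the separation between populations on the first two PCs, so samples that sit between clusters probably deserve a manual look.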

I have now come across http://www.cureffi.org/2012/10/15/population-covariates-using-1000-genomes/, which seems useful, and I am reading through it.  Unfortunately, it's for use with VCF files, which I don't have, so I'm not sure how much of what I need to do changes.  I'm sure the methodology is the same, but the parameters are different.  Maybe this (http://sites.tufts.edu/cbi/files/2013/02/GWAS_Exercise6_Stratification.pdf) will provide what I'm looking for.  Unfortunately, they say to copy a file containing the covariates, but don't describe HOW to generate the covariates.

I think I'm going to copy PC1 and PC2 from the PCA analysis and try to use those as the covariates.
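For what it's worth, a PLINK covariate file is a whitespace-delimited text file whose first two columns are the family ID (FID) and individual ID (IID), followed by one column per covariate, with an optional header line. Below is a minimal Python sketch of writing PC1 and PC2 into that layout. The function name `write_plink_covar`, the sample IDs, and the PC values are all hypothetical placeholders. It assumes the per-sample PCs are already in memory and that the FID/IID values match those in the .ped file.

```python
import io


def write_plink_covar(samples, out, n_pcs=2):
    """Write a PLINK-style covariate file: FID IID PC1 ... PCn.

    `samples` is an iterable of (fid, iid, pcs), where `pcs` is a sequence
    of principal-component scores for that individual.
    `out` is any writable text stream (an open file or io.StringIO).
    """
    header = ["FID", "IID"] + [f"PC{i + 1}" for i in range(n_pcs)]
    out.write(" ".join(header) + "\n")
    for fid, iid, pcs in samples:
        if len(pcs) < n_pcs:
            raise ValueError(f"sample {iid} has fewer than {n_pcs} PCs")
        values = [f"{pc:.6f}" for pc in pcs[:n_pcs]]  # keep only the first n_pcs
        out.write(" ".join([fid, iid] + values) + "\n")


if __name__ == "__main__":
    # Placeholder IDs and scores, for illustration only.
    demo = [("FAM1", "S1", [0.0123, -0.0456, 0.3])]
    buf = io.StringIO()
    write_plink_covar(demo, buf)
    print(buf.getvalue(), end="")
```

To write to disk instead, something like `with open("ethnicity.covar", "w") as fh: write_plink_covar(samples, fh)` should produce a file that can be passed to `--covar`.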

Thursday, February 9, 2012

Forgotten Blog

I completely forgot I had this blog. I will try to start posting interesting tidbits that I come across. For one, I started a new job that involves a lot of work with clinical samples. It's a good place to learn some new things and get some things done.

The frustrating part is dealing with vendor-specific software that fails to work while the vendor drags their feet in fixing the problem(s). It may not be the vendor so much as the tech assigned to work with us. We haven't progressed in three weeks on a specific project because the vendor isn't taking the issue seriously.

Friday, April 23, 2010

Welcome to NGS Bioinformatics

Welcome. For almost 10 years now, I've been working in bioinformatics. At first it started out with basic sequence analysis, but has grown from there. My real interest has always been in DNA sequencing, and now with next-gen sequencing, lots of things are possible. However, the same problems seem to never go away. For me, they have been 1) file format compatibility, 2) application ease-of-use.

File format compatibility is the problem that when someone develops a new application, they sometimes insist on using a file format that no one else uses. While there are standard formats out there, they aren't always used. Anytime I want to incorporate a new tool into my pipelines, I have to write a wrapper to convert my data into a format suitable for said tool. I keep building format converters, but it seems that I never run out. There is always another one to write.

Application ease-of-use is the idea that if you develop an application and put it out there in the open-source world, you should at least make it easy to install and to use, if you want anyone to use it at all. There have been instances where I've spent multiple days trying to get an application to compile or install. Documentation is also important. Without good documentation, figuring out how a program works isn't always fun or easy. Sometimes, I would rather just spend the time writing the software myself.