I spent the better part of last week wrangling MODIS and other data to get them all into the same projections, so I thought I’d write a quick post on how it all ended up working out, mostly so I don’t forget for next time.
First, for MODIS data there’s a happy tool called the MODIS Reprojection Tool (MRT) which you can download and install (good tutorial on that here). In theory you can just use its GUI to process MODIS data, but that’s only reasonable for a few areas/times, so I highly recommend accessing it in R. For the most part this is easy, because someone else has already dealt with this and written a nice part of the rts package (for ‘raster time series’) called ModisDownload(), which is well described here.
The catch is, as always, if you want to do something out of the ordinary. I wanted annual max and range values for LAI and FPAR for 10 years (2001 – 2011) for the continental US in the Albers Equal Area projection. Sooooo… first (after MRT is installed), you just need to figure out what tiles you want, get the files list, download the .hdf files and mosaic them (HDF4, btw). My code to do that is here on git. The only complicated part was figuring out how to get the directory names off the ftp, really. [just occurred to me that I think in the near future (7/20/1016?) this will require your earthexplorer password. Not sure how that will work].
Then the plan was to read in the hdf files as rasters, apply the qc mask from the qc data, and reproject each day, stack them and calculate max and range. The problems being that hdf4 is no joke, and reprojecting takes a LONG time. But I only needed annual numbers in the new projections, so I decided to read in the hdf files as rasters, apply the masks, stack them, calculate max and range, THEN just reproject the 4 x 10 files (or, actually, stack the 4 files per year and reproject the stack). The problem with that is that as far as I know ModisDownload() doesn’t want you to just use it to convert hdf to tif (or raster), even though MRT lets you do it, and I was too lazy to figure out any of the other ways of dealing with HDF4 in R, so I called MRT from the command line within R to write temporary tifs that I could then read back in with raster. There is undoubtedly a faster way to do this, but this only took a few seconds on my desktop. So my code to do all of that is here. In all I think it took about 15 min per year (not bad since I was only doing 10 years).
Now would be a good time to mention that the raster package also does this crabby thing where it likes to fill up a temporary directory if you’re opening lots of raster files. Sometimes you’ll get random messages like ‘Closing XXXXX.grid’. If you do this enough you can fill up your hard drive or really irritate your HPC admins. So, per a conversation on stackoverflow (which I can't seem to find now), I have a little chunk of code that first sets the temp directory (use rasterOptions(tmpdir = “C:/your_trash_here”)) then empties it out at the end of each loop.
If you look in this aquaxterra directory on git you’ll also find a script to ingest monthly PRISM .asc files, calculate bioclimatic variables and reproject to Albers EA. Enjoy!
Yesterday I started teaching R in my very small class - GEO401: Geography of Plants of North America - in order to introduce (1) programming and (2) species distribution modeling. I mentioned on twitter that I'd done so and it had gone well, and I got a lot of positive responses, so I thought I'd actually post what I'm doing here, in case it's useful to y'all. The lab is basically modeling the climate envelopes of rare plants of Michigan using present day climate (1980-2000) and two future climate scenarios (CESM-BGC RCP 4.5 and 8.5 for 2080-2099).
As a side note, the class has a mix of juniors to grad students, with a wide range of backgrounds in computering. We meet for 1:20 twice per week. We've already discussed lots about controls on species distributions (climate and beyond), SDMs, and why coding is worth at least trying. We started off making sure everyone had R and RStudio loaded, then went through the rforcats.net tutorial, minus the 'lists' part, because I think that's just too confusing for new people. And not essential. We're dedicating three class meetings to this - one for getting set up and running through rforcats, then two to work on getting figures made (using red and blue sticky notes, a la software carpentry), including time with me stepping through the code on the projector. The goal is that the students will produce all of a series of figures of their plants' distributions in class, which will then be the figures they use for their final papers.
Below are links to the lab instructions, the R code, and the climate data files for Michigan. If you have any constructive comments or suggestions on these, I'd love to hear them, and if you'd like to know more about how to download and process climate data for your state or for a different climate model from NEX feel free to contact me and I can send you info and/or code.
DISCLAIMER: This is NOT how I would go about species distribution modeling for a scientific publication. If you're thinking about doing that, a good place to start is Elith & Leathwick (2009). The goal of this class project is just to create simple, understandable climate envelope maps for students who've never used R before.
GEO401 SDM Lab Instructions
SDM R Code
Climate tifs (zipped)
And here are some example figures as generated from the code:
On August 26 this paper, written by me with Rosie Fisher and Peter Lawrence from NCAR, was published in Biogeosciences! The 'in a nutshell' story is that I spent much of my time as a postdoc at NCAR trying to understand how the Community Land Model works with respect to drought deciduous phenology, and how it works in comparison with satellite data (using LAI3g from Zhu et al 2013). After lots of false starts, we found that the CLM lets plants in dry areas put on leaves even when it's super dry out because (massive simplification here) CLM assumes that there is always groundwater available everywhere, and this groundwater can move up to where plants can access it. IRL this is likely true in some places, not true in others, but without better soil moisture data (hello, SMAP satellite!), it's hard to say. What's important is that if dryland leaf phenology is weird, that has implications for the fire cycle, the carbon cycle, and, ultimately, the climate, so we should probably try to get a better handle on this remote corner of the CLM, though it is certainly not alone in that respect. One other key point of this paper is that this issue (anomalous dry season green up in CLM) would be really hard to uncover using a standard 'benchmarking' type of system, or even something like zonal means. We actually had a hard time recognizing that this was serious until we looked at daily CLM output (CLM is usually output monthly, though it runs on a 30 minute timestep). I'm not going to summarize the paper too much here, go read it, it's Open Access, but there is more... because tiny maps in publications make me sad, and because I want people to use this paper to convince funders to fund their work (and mine!) I've posted all of the geotiffs used to make the map figures in the paper below (with permission from Biogeosciences and from Dr. Ranga Myneni for the LAI3g-derived maps). My vision is that if you study, say, Tanzania, you could, without too much heartache, download these figures, subset to your area of interest, and then include in a proposal or paper something like "See! We need more $ to study phenology in Tanzania because look how terribly this big important model does at predicting it!?"
I think that with relatively basic skills in GIS or R you should be able to download these and check them out. If you try it and have issues, please email me and I'll try to help. They're all on a grid from -180 to 180 longitude, -90 to 90 latitude, with a grid cell size of 1.25 by 0.9375. Enjoy!
Note, if you do use these, please make sure to cite my paper, Zhu et al 2013 (they're both OA), and Giglio et al 2013 for the fire data. Note that the LAI3g data is available at much higher resolution (1/12 degree), if you need it, as is the GFED4 fire data (0.25 degree).
Figure 1B - % cover of drought deciduous plants in CLM
Figure 6 = Avg. maximum annual LAI from 1982 to 2010 (so, I calculated the max for each year then took the mean of those 29 max values). I'm not providing the difference maps just to save space, but I'm happy to provide them if you don't want to make them yourself.
Figure 6A - LAI3g
Figure 6C - CLM4.5BGC
Figure 6E - CLM-MOD
Figure 7 = Mode # of peaks in each year in the three data sets, not including spots with an annual range of <1 LAI unit.
Figure 7A- LAI3g
Figure 7B - CLM4.5BGC
Figure 7C - CLM-MOD
Figure 8 = Point-wise correlations between the different data sets. Grid cells set to zero here = not significant.
Figure 8A - LAI3g & CLM4.5BGC
Figure 8B - LAI3g & CLM-MOD
Figure 9 = Looking at average burned area fraction per year in GFED4 and CLM.
Figure 9A - GFED4 - the data is from here.
Figure 9B - CLM4.5BGC
Figure 9C - CLM-MOD
By the time this is posted I will have finished my ESA talk, and a link to the slides is posted here. I'm going to attempt to summarize what I talked about (technically what I'm planning on talking about) here.
One of the things I'm interested in is how imaging spectroscopy (== hyperspectral imagery) can inform fundamental questions in community ecology and biogeography. Since I now live in flat, agriculture-dominated Michigan, I'm curious if this particular landscape can tell us anything useful. Hence, this year at #ESA100 I talked about "Hyperspectral imagery for biodiversity mapping in a wildland-agriculture matrix."
Broadly, one of the questions many of us would like an answer to is "Why do plants grow where they grow?" and, critically, "Can we predict where plants will grow in the future?" One way I like to think about this is that when we have an observed landscape, we can divide its influences into a number of different categories, and we'd like to know which ones dominate a given landscape at a given scale. But five influences is too complicated, so I really like this diagram from Weiher & Keddy (1995, Oikos) putting environmental and competitive diversity on a single access. One way to think of this is then a random landscape would just look like noise, while a strongly environmentally filtered landscape would look extremely patchy (see: California. Also, caveat, things other than environmental filtering (like disturbance) can create a patchy landscape. This is not a perfect setup.) From the biogeography side we have the species-area relationship (SAR), which, while much beleaguered, can reveal interesting things about a landscape, especially when compared to some null expectation. Recently Smith et al (2013, Ecology) extended this idea to a functional-diversity-area relationship (FAR) and they suggest that when an actual FAR falls below their null expectation, that means it's a patchier landscape. And lots of other things (go read the paper!) Though subject to many of the same issues as the FAR, we can again use this null expectation or 'model landscape' approach to test hypotheses about the distribution of data in our landscape of interest.
In this talk used my own local landscape, a mix of agriculture and mesic northern forest, as two different but intermixed landscapes - one, the mesic forest, that we'd expect to be near random. It's real wet and real pleasant in Michigan, so these plants are likely more controlled by competition for light, priority effects, etc, than they are stratified by environmental gradients (though there are probably some of those, too), so we would expect to find high alpha diversity, low beta diversity. In contrast, green agricultural fields are an EXTREMELY patchy landscape in that each field is nearly homogeneous, while between field variability undoubtedly exists due to different crops being planted, different times of planting , and different management practices, so, low alpha diversity, high-ish beta diversity.
I located 'plots' (300 50x50 m plots) across this landscape, calculated principal components on the continuum removed spectral data, and looked at how the two landscapes compare. I'm essentially going to treat the first three PCs as traits. I'm fully aware that this negates a lot of the trait-based work that cares about the influence of traits on physiology, but for my purposes I'm just going to worry about spatial heterogeneity for the time being.
So, what to the PCs look like and is their variance different between forest and ag? Yes, indeed. As expected, there's a lot of variation within the forest plots, and very little within the agriculture. This is no surprise, but it should be reassuring to the community ecologists that remote sensing can say something they believe. The FAR curves also match what we would expect: given the size of my plots, the variance within a single plot is essentially equal to the variance across all plots when the data is randomized (gray line on Actual v. Predicted FARs slide looking at variance), so that line is totally flat, similar to the forest line (green) while the ag line (purple) rises steeply. In contrast, the CHV lines keep going up because the range of values for all the points is really big. Enjoy and feel free to ask questions!
ERSAM Lab Thoughts
This is where we'll be posting things that are not research, but more than 140 characters. If you want to know what we're working on now, this would be a good place to look. Code will be linked to here and posted on github.