*BEAST tips

From Crop Genomics Lab.
Revision as of 09:16, 16 April 2014 by Yang Jae Kang (Talk | contribs)

Google group

Dear all,

I have been playing with a big multi-partition dataset for a while, trying to reduce its calculation time from one week to a reasonable level without changing the XML file or the computer hardware. I achieved some quite significant speed improvements and would like to share what I tried with the community.

Summary:

Things that help: BEAGLE, SSE, multiple BEAGLE instances (for single-partition data), no scaling > dynamic > always, a decent GPU device, two or more GPU devices, a deliberately designed BEAGLE resource order, one GPU holding multiple partitions, a newer version of BEAST (1.8.0), XMP (an Intel memory technology), Linux (compared to Windows).

Things that have little effect: using BEAGLE_GPU without careful consideration, multiple BEAGLE instances (for multi-partition data), the way one launches BEAST, the maximum memory allocated to Java, other moderate tasks you are running on your computer.

Things I am not sure about: CPU overclocking, multiple threads, power-saving features.

Now I will show my procedures and benchmark results step by step. I label good things in green and bad or irrelevant things in red.

Hardware specifications:
CPU: Intel Core i7-4820K (Ivy Bridge-E, 4 cores, 8 threads, 3.7-3.9 GHz, 10 MB cache)
Memory: 2x8GB dual-channel DDR3, 1866 MHz, CL9
1st graphics card: GeForce GTX 650 (384 cores, 1058 MHz, 1 GB GDDR5 @ 80 GB/s)
2nd graphics card: GeForce GT 640 (384 cores, 900 MHz, 2 GB DDR3 @ 28.5 GB/s)

  • Comment: one powerful CPU plus two entry-level GPUs constitute the computation resources. I was trying to set up quad-channel memory, but some DIMM slots on the motherboard are bad... so I left it at dual channel. The whole machine cost less than 1000 USD.

Software environment:
Ubuntu 13.10 desktop amd64, with Linux kernel version 3.11.0-15
Oracle Java SE runtime version 1.7.0 update 51
NVIDIA driver version 319.60
BEAGLE version 2.1
BEAST version 1.8.0

Dataset: The big dataset contains 13 DNA partitions plus one binary partition, with a total of 8786 sites and 5301 unique patterns. Partition sizes are uneven: the largest partition has 969 patterns, followed by 788, 766, ... with the smallest being 82. 50 million generations of MCMC are necessary to get sufficient samples after convergence. In the benchmarks, the first 100 thousand generations are executed with the same seed, logging every 1 thousand generations.

Before starting:

1. In my previous experience, the same analysis runs notably faster in Linux than in Windows on the same machine with the same versions of BEAST and BEAGLE. I didn't benchmark this time, as I don't have Windows on this machine. I am not sure if it's a general rule.

2. I turned on XMP in the BIOS. In a previous benchmark on another machine, this switch reduced the running time from 16m54s to 14m57s.

3. I left the default, resource-consuming Ubuntu Unity shell on. In my previous benchmark, switching to the lightweight LXDE did not accelerate the analysis.

4. I tried to overclock the CPU to 4.3 GHz; however, BEAST still ran at the base frequency of 3.7 GHz on this dataset, as if Turbo Boost didn't work. The CPU load was around ~550% (out of 800%). Meanwhile, with other single-partition datasets I tested, BEAST always ran at the maximum frequency. I wonder if it's a bug, or if BEAST just isn't pushing the CPU to full load on multi-partition datasets. Anyway, overclocking isn't typically an option for serious researchers, so I didn't proceed with it.

5. I tried executing beast directly and the indirect way, java -Xmx####m -jar lib/beast.jar, in order to allocate more memory to BEAST. Both ran at the same pace. More memory didn't seem to help speed, but it sometimes helps to overcome overflow problems if the dataset is large.

6. I did not switch off any power-saving options in the operating system or the BIOS, because I noticed that they make little difference. But maybe they help on other machines.

7. I was able to run Firefox, MS Word (yes, under Wine), Adobe Reader and many other things alongside BEAST analyses without even changing the load-meter reading. This also indicates that BEAST does not use up all the CPU power.

8. I kept the BEAST 1.8.0 defaults - dynamic scaling (-beagle_scaling dynamic) and double precision (-beagle_double). In my previous experience, they performed well compared to no scaling and single precision.

Baseline: I simply ran BEAST on this dataset without any additional parameters. It took 12.22 min. The same result was obtained with BEAST version 1.7.5.

BEAGLE (-beagle): Turned on BEAGLE, and the time was reduced to 11.40 min. The performance gain was notable in this test, and was even more dramatic in my other analyses. I think this option should always be on.

SSE (-beagle_sse): Turned on the SSE flag for the CPU, and the time was reduced to 11.18 min. The effect is even larger on other machines I tested. I left this on for the subsequent tests.

Multiple threads (-threads): 2 threads: 12.55 min, 3 threads: 11.17 min, 4 threads: 10.99 min (best performance), 5 threads: 11.20 min, 6 threads: 11.19 min, 7 threads: 11.27 min, 8 threads: 11.18 min. I'm not sure exactly how this option works; my impression is that BEAST uses all cores/threads by default. I didn't keep going with this option.
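For reference, a minimal sketch of how the best-performing thread count above would be requested (the input file name is a placeholder):

```shell
# Combine BEAGLE, SSE and an explicit thread count:
beast -beagle -beagle_sse -threads 4 input.xml
```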

Multiple BEAGLE instances (-beagle_instances): 2 instances: 11.54 min, 4 instances: 12.74 min, 6 instances: 14.02 min, 8 instances: 15.44 min. As I understand it, this option creates the specified number of instances for each partition. So I guess the number of partitions times the number of instances should not exceed the number of cores/threads of the machine. In this test, since there are more partitions (14) than cores/threads (8), I'd better not touch this option.
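On a single-partition dataset, where the summary above says multiple instances do help, a sketch of the invocation on this 8-thread CPU could look like this (the file name is a placeholder):

```shell
# 1 partition x 8 instances = 8, matching the 8 hardware threads:
beast -beagle -beagle_sse -beagle_instances 8 single_partition.xml
```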

GPU: Turned on GPU (-beagle_gpu) along with SSE, and it cost 11.60 min, even slightly worse than not using it. By default the first GPU (GTX 650) was set to handle the last partition in the dataset. I'm not sure how BEAST designates this partition. The other GPU (GT 640) was not used. In my other analyses, especially on a computer with a poor CPU and a decent GPU, this option could mean a huge acceleration (more than 4-fold in my case). I guess whether the GPU speeds up or slows down the calculation depends on the specific machine and the partitioning scheme of the dataset. Obviously BEAST's default setting is not optimal, so I decided to play with it a bit more.

CUDA: By default BEAST uses OpenCL instead of CUDA on the GPU. Switching to CUDA (-beagle_cuda) reduced the time to 11.10 min. The reason to choose OpenCL is that ATI cards also support OpenCL (I didn't try those), while CUDA is only supported by NVIDIA cards.

BEAST 1.7.5 with GPU: The analysis just could not proceed. The screen said "underflow". I switched to single precision and it ran, with lots of "underflow" warnings. But when I fed Java more memory, the analysis went smoothly and took 11.60 min. It seems that BEAST 1.8.0 does a better job than 1.7.5 in GPU support.

Smart assignment of BEAGLE resources (-beagle_order): This is the biggest gain I have come across so far. Highly recommended!

Basic usage: Execute beast -beagle_info, and the program lists all available BEAGLE resources. Typically, the CPU is #0 and the GPU is #1. Suppose there are 5 partitions and one wants the GPU to work on the 3rd partition. Then we should type: beast -beagle_order 0,0,1,0,0 input.xml

Note: -beagle_order by default uses CUDA instead of OpenCL, so just leave it as is. Note: SSE is not compatible with -beagle_order; switch off SSE, or it will override any GPU assignments.

I used this option to assign the GTX 650 to the largest partition instead of the last partition. The calculation time then became 9.68 min.
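As a sketch of what such a command could look like for this 14-partition dataset (the position of the largest partition in the XML is my assumption for illustration; check your own partition order and -beagle_info output):

```shell
# 14 entries, one per partition, in XML order.
# Resource 0 = CPU, resource 1 = first GPU (GTX 650).
# The largest partition is assumed, hypothetically, to be the 5th:
beast -beagle_order 0,0,0,0,1,0,0,0,0,0,0,0,0,0 input.xml
```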

(Pseudo) alternative GPU resource: In the BEAGLE resources displayed by -beagle_info, each graphics card was listed twice, with different descriptions. BEAST uses the first resource by default. When I assigned the second resource to the largest partition, the calculation slowed down to 10.00 min.

Multiple GPUs: Since I have two graphics cards, I let the GTX 650 handle the largest partition and the GT 640 handle the 2nd largest partition. The time further decreased to 8.52 min.

One GPU on multiple partitions: I had previously understood that one GPU can only handle one partition; however, in this version of BEAST and BEAGLE, I found that assigning more than one partition to one GPU gave me a significant boost. I assigned the 1st and 2nd largest partitions to the GTX 650, and the time was 8.38 min. Then I assigned the 1st, 2nd and 3rd largest partitions to the GTX 650, and the time was 8.06 min. When I went further by putting the top 4 partitions on the GTX 650, the performance decreased: 9.94 min.

It appears that a GPU has a capacity limit, but we can squeeze out as much as possible before that limit is reached.

Finally, I combined multiple GPUs with multiple partitions. I tested several combinations, and found that the one with each GPU taking care of two particular partitions (GTX 650 on the 1st and 2nd largest, GT 640 on the 3rd and 4th largest) gave the best performance. The final number of this series is 6.75 min.
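A hedged sketch of what such a -beagle_order line could look like (the resource numbers for the two cards and the positions of the four largest partitions are assumptions for illustration; confirm both with -beagle_info and your own XML):

```shell
# Suppose -beagle_info reports CPU = 0, GTX 650 = 1, GT 640 = 2,
# and the four largest partitions happen to be the first four in the XML.
# GTX 650 takes the 1st and 2nd largest, GT 640 the 3rd and 4th,
# and the CPU handles the remaining ten partitions:
beast -beagle_order 1,1,2,2,0,0,0,0,0,0,0,0,0,0 input.xml
```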

I didn't test every single possible combination, as there are too many. It is very likely that a better combination exists that further reduces the calculation time to a more pleasing level.

Conclusions:

A. Without GPU:
For a single-partition dataset, do: beast -beagle_sse -beagle_instances <number of cores/threads in your CPU> input.xml
For a multi-partition dataset, do: beast -beagle_sse input.xml

B. With GPU:
For a single-partition dataset, do: beast -beagle_sse -beagle_gpu -beagle_cuda input.xml
For a multi-partition dataset, do: beast -beagle_order <a smart scheme based on your tests, if you have time> input.xml

That's it. I hope my experience will be of some relief to other BEAST users who are struggling with the long times they have to spend on Bayesian inference. There may be incorrect, arbitrary or case-specific statements in this article, so please read with caution, and hopefully let me know if I am wrong. Thanks!

Qiyun Zhu