[galaxy-user] Experience with Loading NGS data on standalone instance of galaxy

Greg Von Kuster ghv2 at psu.edu
Fri Oct 2 15:30:32 EDT 2009


Please type the following in your galaxy install directory, and let me 
know what you get:

hg heads


Thanks

Abhishek Pratap wrote:
> Hi Greg
> 
> Unfortunately it is not working for me. I made sure I cleared my
> browser cache before re-viewing it.
> 
> I have set the option as suggested by you in the universe_wsgi.ini file.
> 
> -Abhi
> 
> On Fri, Oct 2, 2009 at 2:53 PM, Greg Von Kuster <ghv2 at psu.edu> wrote:
>> Hello Abhishek,
>>
>> Add this to your universe_wsgi.ini file:
>>
>> allow_library_path_paste = True
>>
>> Then, clicking the down-arrow on the upload form
>>
>> Create new data library datasets  ▼
>>
>> will give you 4 options, 1 of which is:
>>
>> Upload files from file system paths
>>
>> Greg Von Kuster
>> Galaxy Development Team
>>
>>
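
If the new upload option does not appear after editing the config, one thing worth ruling out is that the setting ended up somewhere the server does not read it. Below is a minimal sketch of such a check. It assumes the option belongs in the [app:main] section of universe_wsgi.ini; the function name path_paste_enabled and the sample text are hypothetical, made up for this illustration.

```python
import configparser

# Sample config text standing in for universe_wsgi.ini.
# Assumption: the option is read from the [app:main] section.
SAMPLE = """\
[app:main]
allow_library_path_paste = True
"""


def path_paste_enabled(ini_text, section="app:main"):
    """Return True if allow_library_path_paste is switched on in ini_text."""
    # RawConfigParser avoids interpolating values such as %(here)s.
    parser = configparser.RawConfigParser()
    parser.read_string(ini_text)
    return parser.getboolean(section, "allow_library_path_paste", fallback=False)


enabled = path_paste_enabled(SAMPLE)
disabled = path_paste_enabled("[app:main]\n")
print(enabled, disabled)
```

To check a real file, read its contents and pass the text to the function. The server generally has to be restarted before a changed setting takes effect.
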
>> Abhishek Pratap wrote:
>>> Hi Greg
>>>
>>> I have updated my Galaxy repository to changeset 2825. I don't see the
>>> checkbox on the "Upload File" page. Am I missing something?
>>>
>>> Thanks,
>>> -Abhi
>>>
>>> On Fri, Oct 2, 2009 at 10:21 AM, Greg Von Kuster <ghv2 at psu.edu> wrote:
>>>> Change set 2812 will be included in a release to the distribution today -
>>>> here are details of a new option that we're hoping will provide what is
>>>> needed for most labs.
>>>>
>>>> Add a new option, 'allow_library_path_paste' that adds a new upload page
>>>> ("Upload files from file system paths") to the admin-side library upload
>>>> pages. This form contains a textarea that allows Galaxy admins to paste any
>>>> number of file system paths (files or directories) from which Galaxy will
>>>> import library datasets, saving the directory structure (if desired). Since
>>>> such ability allows admins access to any file on the Galaxy server which is
>>>> readable by Galaxy's system user, this option is disabled by default, and
>>>> system administrators should take care in assigning Galaxy administrators
>>>> when this feature is enabled. Controls on what files are accessible to this
>>>> tool based on ownership or other properties can be added at a later date if
>>>> there is sufficient interest for such features.
>>>>
>>>> This commit also includes a checkbox on the "Upload directory of files" page
>>>> (as well as the new "Upload files from file system paths" page above) that
>>>> will prevent Galaxy from copying data to its files directory (by default,
>>>> 'database/files/'). This is useful for large library datasets that live in
>>>> their own managed locations on the file system: it prevents the existence
>>>> of duplicate copies of datasets, but means administrators must take care to
>>>> manage data, since moving or removing the data from its Galaxy-external
>>>> location will render these datasets invalid within Galaxy.
>>>>
>>>> One unique feature to be aware of: when using the "Copy data into Galaxy?"
>>>> checkbox on the "Upload directory of files" page, any symbolic links
>>>> encountered in the chosen import directory will be made absolute and
>>>> dereferenced ONCE. This allows administrators to link large datasets to the
>>>> import directory, rather than having to make full copies, while being able
>>>> to delete such links after importing. Only the first symlink (the one in
>>>> the import directory itself) is dereferenced; all others remain. See the
>>>> following for an example:
>>>>
>>>> library_import_dir = /galaxy/import
>>>>
>>>> % ls -lR /galaxy/import
>>>> /galaxy/import:
>>>> total 6
>>>> drwxr-xr-x   2 nate     nate         512 Oct  1 11:31 link/
>>>>
>>>> /galaxy/import/link:
>>>> total 10
>>>> lrwxrwxrwx   1 nate     nate          71 Oct  1 10:38 1.bed -> ../../../home/nate/galaxy/test-data/1.bed
>>>> lrwxrwxrwx   1 nate     nate          60 Oct  1 10:38 2.bed -> /home/nate/galaxy/test-data/2.bed
>>>> lrwxrwxrwx   1 nate     nate          11 Oct  1 10:38 3.bed -> ../../3.bed
>>>> lrwxrwxrwx   1 nate     nate          35 Oct  1 11:30 4.bed -> ../../galaxy_symlink/test-data/4.bed
>>>> lrwxrwxrwx   1 nate     nate          41 Oct  1 11:31 5.bed -> /galaxy/galaxy_symlink/test-data/5.bed
>>>>
>>>> % ls -l /galaxy/3.bed
>>>> lrwxrwxrwx   1 nate     nate          60 Oct  1 10:39 /galaxy/3.bed -> /home/nate/galaxy/test-data/3.bed
>>>>
>>>> % ls -l /galaxy/galaxy_symlink
>>>> lrwxrwxrwx   1 nate     nate          44 Oct  1 11:30 /galaxy/galaxy_symlink -> /home/nate/galaxy/
>>>>
>>>> In this example,
>>>>
>>>> 1.bed is a relative symbolic link to the real 1.bed.
>>>>
>>>> 2.bed is an absolute symlink to the real 2.bed.
>>>>
>>>> 3.bed is a relative symlink to ../../3.bed, aka /galaxy/3.bed, which itself
>>>> is a symlink to the real 3.bed.
>>>>
>>>> 4.bed is a relative symlink which follows another symlink
>>>> (/galaxy/galaxy_symlink) to the real 4.bed.
>>>>
>>>> 5.bed is an absolute symlink in the same fashion as 4.bed.
>>>>
>>>> If the 'link' server directory is chosen on the "Upload directory of files"
>>>> page, and "Copy data into Galaxy?" is checked "No", the following files will
>>>> be referenced by Galaxy:
>>>>
>>>> /home/nate/galaxy/test-data/1.bed
>>>> /home/nate/galaxy/test-data/2.bed
>>>> /galaxy/3.bed
>>>> /galaxy/galaxy_symlink/test-data/4.bed
>>>> /galaxy/galaxy_symlink/test-data/5.bed
>>>>
>>>> The Galaxy administrator may now safely delete /galaxy/import/link, but
>>>> should take care not to remove the referenced symbolic links (/galaxy/3.bed,
>>>> /galaxy/galaxy_symlink).
>>>>
>>>> Not all symbolic links are dereferenced because it is assumed that if an
>>>> administrator links to a path in the import directory which itself is (or
>>>> contains) links, that is the preferred path for accessing the data.
>>>>
>>>>
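
The "dereferenced ONCE" behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Galaxy's actual code: the helper name dereference_once is hypothetical, and it assumes that ".." components can be collapsed lexically, which holds when the import directory itself is a real directory, as in the example. The demo recreates the 3.bed case in a scratch directory.

```python
import os
import tempfile


def dereference_once(path):
    """Resolve only the first symlink at path and return an absolute path.

    Hypothetical helper for illustration; not taken from Galaxy.
    """
    if not os.path.islink(path):
        return os.path.abspath(path)
    target = os.readlink(path)
    if not os.path.isabs(target):
        # A relative target is interpreted relative to the link's directory.
        target = os.path.join(os.path.dirname(os.path.abspath(path)), target)
    # normpath collapses ".." lexically and follows no further symlinks,
    # unlike os.path.realpath, which would resolve the whole chain.
    return os.path.normpath(target)


# Recreate the 3.bed case: import/link/3.bed -> ../../3.bed -> test-data/3.bed
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "import", "link"))
    os.makedirs(os.path.join(root, "test-data"))
    real = os.path.join(root, "test-data", "3.bed")
    open(real, "w").close()
    middle = os.path.join(root, "3.bed")
    os.symlink(real, middle)
    first = os.path.join(root, "import", "link", "3.bed")
    os.symlink("../../3.bed", first)

    resolved = dereference_once(first)
    stops_at_middle_link = (resolved == middle)
    still_a_link = os.path.islink(resolved)
    print(stops_at_middle_link, still_a_link)
```

The result is the intermediate link (the counterpart of /galaxy/3.bed), not the real file, which is why that intermediate link must not be removed after the import.
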
>>>>
>>>> Oliver Hofmann wrote:
>>>>> Dear all,
>>>>>
>>>>>
>>>>> to echo what Abhi said: we are also currently looking for ways to
>>>>> automatically import data sets (libraries) into Galaxy without having to
>>>>> manually trigger the import via the administration interface, and ideally
>>>>> while keeping the data in the original place. The idea here is to have
>>>>> multiple tools all point at the original 'source data' without having to
>>>>> replicate terabytes of data.
>>>>>
>>>>> Not quite sure how feasible this is in practice, but it certainly would be
>>>>> incredibly helpful.
>>>>>
>>>>> Best,
>>>>>
>>>>>    Oliver
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 28 Sep 2009, at 14:24, Abhishek Pratap wrote:
>>>>>
>>>>>> Hi Greg
>>>>>>
>>>>>> Thanks for a quick reply and for making some requested changes. However,
>>>>>> I am still not sure whether importing NGS data will help in the long run.
>>>>>>
>>>>>> For centers generating NGS data, which could be 2-3 TB/week depending on
>>>>>> the number of sequencers, I think importing another copy of the raw data
>>>>>> into the Galaxy workspace will require a lot of disk space. I understand
>>>>>> it is a neat way of doing things, as it becomes agnostic of the raw data
>>>>>> location, but it might not be the best way of handling huge data in the
>>>>>> long run for centers like ours.
>>>>>>
>>>>>> Please correct me if I am wrong. I think we could also have a simple
>>>>>> option that does not import the data and just uses it for analysis from
>>>>>> its current location, also storing results at the same location. That way,
>>>>>> even if the data set is moved in the future, the analysis stays with it.
>>>>>>
>>>>>> Let me know what you think. Just out of curiosity, I would be happy to
>>>>>> learn of any other good reasons for importing the data into the Galaxy
>>>>>> workspace.
>>>>>>
>>>>>> Thanks,
>>>>>> -Abhi
>>>>>>
>>>>>> On Mon, Sep 28, 2009 at 9:28 AM, Greg Von Kuster <ghv2 at psu.edu> wrote:
>>>>>> Hello Abhishek,
>>>>>>
>>>>>> The Galaxy distribution includes the enhancements to which I previously
>>>>>> referred for uploading history files. Uploading files to a history now
>>>>>> creates a Galaxy job just like any other tool, and can be run on a cluster
>>>>>> node, allowing upload of very large files. The initial pass of this work
>>>>>> is also completed for uploading to a Data Library, but this enhancement is
>>>>>> still in test, so it should soon be available in the distribution.
>>>>>>
>>>>>> Do you want to avoid having to import at all (e.g. allow Galaxy to refer
>>>>>> to datasets that live in their original locations)? This is not currently
>>>>>> possible, but if this is what you are looking for, we can consider some
>>>>>> additional options on the current upload form, or possibly a new, separate
>>>>>> form.
>>>>>>
>>>>>>
>>>>>> Greg Von Kuster
>>>>>> Galaxy Development Team
>>>>>>
>>>>>>
>>>>>> Abhishek Pratap wrote:
>>>>>> Hi Greg, Anton and all
>>>>>>
>>>>>> Just wondering if there has been any progress made on this end. I am
>>>>>> sorry I was not able to follow up on Assaf's suggestion due to other
>>>>>> things at work.
>>>>>>
>>>>>> I did try the latest version of Galaxy, and it looks like the files are
>>>>>> still transferred over HTTP before they can be used in the Galaxy
>>>>>> workspace. Also, I would again like to highlight that many labs might want
>>>>>> to use a local instance of Galaxy and prefer to point to a local path
>>>>>> where the file is being stored. That way we will have the benefits of both
>>>>>> using a cool GUI and processing data stored locally.
>>>>>>
>>>>>> Let me know if you guys need some feedback or have more questions. I will
>>>>>> be happy to discuss them.
>>>>>>
>>>>>> best,
>>>>>> -Abhi
>>>>>>
>>>>>> On Tue, Jul 21, 2009 at 4:26 PM, Greg Von Kuster <ghv2 at psu.edu> wrote:
>>>>>>
>>>>>>   Hello Abhishek,
>>>>>>
>>>>>>   We are currently in the process of significantly enhancing the
>>>>>>   current Galaxy upload utilities, and the new version should
>>>>>>   eliminate the issue you've raised about the time needed to upload
>>>>>>   large files via HTTP ( not for making an initial copy of the file in
>>>>>>   the Galaxy environment ). However, it will probably not be ready for
>>>>>>   release for a few more weeks, so if you can take advantage of
>>>>>>   Assaf's script in the meantime, that's great. I can't guarantee
>>>>>>   that all Galaxy features will function correctly if you do this
>>>>>>   though.
>>>>>>
>>>>>>   Assaf, have you found that using your script breaks anything?
>>>>>>
>>>>>>   Also, if you upload a file to a library rather than a history,
>>>>>>   multiple users can "import" the library dataset into their history
>>>>>>   for analysis, but there is only 1 file on disk ( users are pointing
>>>>>>   to it from their histories ). But uploading a file to a history
>>>>>>   will create a new copy of the file each time it is uploaded.
>>>>>>
>>>>>>   Greg Von Kuster
>>>>>>   Galaxy Development Team
>>>>>>
>>>>>>
>>>>>>
>>>>>>   Abhishek Pratap wrote:
>>>>>>
>>>>>>       Hi All
>>>>>>
>>>>>>       @Greg : Please find my comments below.
>>>>>>
>>>>>>       On Tue, Jul 21, 2009 at 10:44 AM, Greg Von Kuster <ghv2 at psu.edu> wrote:
>>>>>>
>>>>>>           Hello Abhi,
>>>>>>
>>>>>>           Can you clarify the steps you took that produced the
>>>>>>           behavior? See my comments below.
>>>>>>
>>>>>>           Anton Nekrutenko wrote:
>>>>>>
>>>>>>               Abhishek:
>>>>>>
>>>>>>               Let's talk. This is an area of active current
>>>>>>               development. We are looking at implementing a
>>>>>>               universal fastq-like format or supporting multiple
>>>>>>               formats. Perhaps we should join efforts in ironing
>>>>>>               out specifications.
>>>>>>
>>>>>>
>>>>>>               anton
>>>>>>               galaxy team
>>>>>>
>>>>>>
>>>>>>               On Jul 20, 2009, at 5:18 PM, Abhishek Pratap wrote:
>>>>>>
>>>>>>                   Hi All
>>>>>>
>>>>>>
>>>>>>                   I recently came to know about NGS analysis on Galaxy
>>>>>>                   during ISMB. Getting excited, I tried a couple of
>>>>>>                   things, basically to play with it.
>>>>>>
>>>>>>                   A few comments: I may have interpreted something
>>>>>>                   described below in the wrong way. My apologies
>>>>>>                   beforehand.
>>>>>>
>>>>>>
>>>>>>
>>>>>>                   On a standalone installation of Galaxy, I was trying
>>>>>>                   to explore one FASTQ (sequence) file. It takes
>>>>>>                   considerable time (> 20 min) for a 2 GB FASTQ file
>>>>>>                   to get uploaded.
>>>>>>
>>>>>>           Are you using the Galaxy upload utility to create an item in
>>>>>>           your history that points to the dataset file on disk?
>>>>>>
>>>>>>
>>>>>>       Yes, that is precisely correct. I am trying to upload a Solexa
>>>>>>       FASTQ file, but on a standalone Galaxy installation, from my
>>>>>>       local file system.
>>>>>>
>>>>>>                   I am not sure what is the rationale behind that.
>>>>>>                   Ideally I think there should be no need to upload
>>>>>>                   such heavy files into the workspace.
>>>>>>
>>>>>>           A data file that originates from a place external to Galaxy
>>>>>>           must be uploaded into Galaxy so that the disk file can be
>>>>>>           placed in the location configured in the Galaxy config file.
>>>>>>           Also, when data is uploaded to Galaxy ( either to a history
>>>>>>           or a library ), several database table settings are created
>>>>>>           that are used by various Galaxy features.
>>>>>>
>>>>>>                   They could actually be used straight
>>>>>>
>>>>>>
>>>>>>       Thanks for the clarification, but I am not sure this will help a
>>>>>>       lot of people who are interested in installing and running Galaxy
>>>>>>       locally, mainly for the following reasons. Maybe it is just
>>>>>>       local to me.
>>>>>>
>>>>>>       A. We already have one instance of the data saved on the local
>>>>>>       file system.
>>>>>>       B. Making another copy via Galaxy will eat away a lot of space
>>>>>>       in the long run.
>>>>>>       C. The time needed to import the files into Galaxy space is huge.
>>>>>>
>>>>>>                   away by the path specified.
>>>>>>
>>>>>>           What do you mean by "the path specified"?
>>>>>>
>>>>>>
>>>>>>
>>>>>>       Well, what I meant was a way to specify the path of the file/run
>>>>>>       on the local file system, so that Galaxy could pick it up
>>>>>>       directly from there rather than uploading it into its own space.
>>>>>>       Now I understand this might not work, based on the way the
>>>>>>       system was designed.
>>>>>>
>>>>>>
>>>>>>                   Also, is there any way to access the scripts for
>>>>>>                   analysis on the command line? I know this undermines
>>>>>>                   the main aim of working with Galaxy, but right now I
>>>>>>                   am concerned about the performance/time.
>>>>>>
>>>>>>           You should be able to run any Galaxy tool from the command
>>>>>>           line as long as you have all of the tool's required binaries
>>>>>>           in your path. However, running a tool from within Galaxy
>>>>>>           should generally not be any slower than running it outside
>>>>>>           of Galaxy, depending, of course, on what you are doing.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>       OK, I was under the impression that running from the shell would
>>>>>>       eliminate the step of uploading them into the Galaxy file space.
>>>>>>
>>>>>>
>>>>>>       -Abhi
>>>>>>
>>>>>>                   I will be happy to discuss more about this in case
>>>>>>                   you have some comments/questions for me.
>>>>>>
>>>>>>
>>>>>>
>>>>>>                   Best,
>>>>>>                   -Abhi
>>>>>>
>>>>>>
>>>>>>
>>>>>>                   -----------------------------
>>>>>>
>>>>>>                   Abhishek Pratap
>>>>>>
>>>>>>                   Bioinformatics Software Engineer
>>>>>>
>>>>>>                   Institute for Genome Sciences
>>>>>>
>>>>>>                   School of Medicine, Univ of Maryland
>>>>>>
>>>>>>                   801, W. Baltimore Street, Baltimore, MD 21209
>>>>>>
>>>>>>                   Ph: (+1)-410-706-2296
>>>>>>
>>>>>>                   www.igs.umaryland.edu/
>>>>>>
>>>>>>                   _______________________________________________
>>>>>>                   galaxy-user mailing list
>>>>>>                   galaxy-user at bx.psu.edu
>>>>>>                   http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-user
>>>>>>
>>>>>>               Anton Nekrutenko
>>>>>>               http://nekrut.bx.psu.edu
>>>>>>               http://galaxyproject.org
>>>>>>
>>>>> --
>>>>> Research Associate    Department of Biostatistics
>>>>> Associate Director    Bioinformatics Core
>>>>>                      Harvard School of Public Health
>>>>> Skype: ohofmann       Phone: +1 (617) 365 0984
>>>>>
>>>>>
>>>>>
>>>
>>



