Using grep or awk to print text before and after match, to a specific beginning and ending string
I am trying to extract entries from a large Genbank file containing many thousands of entries. As a search string I'm using a unique gene name, and that works fine. The tricky bit is that I'd like to print the entire entry for that particular gene. Entries begin with the word LOCUS, end with //, and contain the gene name at some point in between.

I understand that I can use grep's flags -A, -B, and -C to print n lines after/before a string match, but the actual entries are variable in length. How would I use grep to search for my string (gene name), and then print all the lines before the match up to and including a line beginning with LOCUS, and all lines after the match up to and including the line indicating the end of the entry, which is just //?

I'm open to all suggestions. Is there a way to have the -A and -B flags match strings (LOCUS and //), or something to that effect? Should I be using awk instead?

Edit: This is a simplified input example. Each record begins with LOCUS and ends with //. This example contains three records:

    LOCUS scaffold1size100 genegene1
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    //
    LOCUS scaffold99size genegene2
    CGTTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    //
    LOCUS scaffold199size1000 genegene3
    AGTTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    AGTTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    AGTTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    //

I would like to search for gene2 and print the text from the first instance of LOCUS before the match through the first // after the match.
Ideally, I would like the following output:

    LOCUS scaffold99size genegene2
    CGTTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
    //

Thanks again for your help!
This is fairly easy in awk:

    awk -v target=fox '
    /^LOCUS/ { in_gene = 1 }
    in_gene {
        if (gene == "")
            gene = $0
        else
            gene = gene ORS $0
    }
    $0 ~ target { found = 1 }
    /^\/\// {
        if (in_gene && found)
            print gene
        gene = ""; in_gene = 0; found = 0
    }'

Set the target variable to the string (gene name) you are searching for. I used fox as an example.

When we see the word LOCUS at the start of a line, we know we're looking at a gene. As long as we're looking at a gene, accumulate its contents. The first line (the LOCUS line) just gets assigned to the gene variable. Thereafter, we append the current line ($0) to the gene variable, with a newline (ORS, the Output Record Separator) between the old value and the added value.

If the current line matches the gene name you're looking for, set the found flag.

We have to use the rather ugly /^\/\// to search for a //. When we see one, we check whether the current gene is the one we're looking for and, if so, print it. Then we reset everything and continue searching. If you're sure that the gene you're looking for occurs only once in the file, or if you want only the first occurrence, you could just exit here.
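To try the approach end to end, including the early exit suggested above, here is a minimal sketch. It assumes every record starts with a line beginning with LOCUS and ends with a line beginning with //. The function name extract_record and the shortened sample data are made up for illustration and are not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: print the first record in FILE that matches TARGET.
# Usage: extract_record TARGET FILE
# Assumes records start with a "LOCUS" line and end with a "//" line.
extract_record() {
    awk -v target="$1" '
        /^LOCUS/ { in_gene = 1; gene = ""; found = 0 }  # a new record starts
        in_gene {
            gene = (gene == "" ? $0 : gene ORS $0)      # accumulate the record
            if ($0 ~ target) found = 1                  # target is treated as a regex
        }
        /^\/\// {                                       # end-of-record marker
            if (in_gene && found) { print gene; exit }  # stop at the first hit
            gene = ""; in_gene = 0; found = 0
        }
    ' "$2"
}

# Made-up sample with three short records.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LOCUS scaffold1 gene1
GATTACA
//
LOCUS scaffold99 gene2
CGTTACA
GATTACA
//
LOCUS scaffold199 gene3
AGTTACA
//
EOF

extract_record gene2 "$sample"
rm -f "$sample"
```

On this sample the script prints only the gene2 record, from its LOCUS line through its closing //. Because the target is matched as a regular expression, gene2 would also match a name such as gene20, so a stricter pattern may be needed if gene names share prefixes.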