function tripal_feature_load_gff3

2.x tripal_feature.gff_loader.inc	`tripal_feature_load_gff3($gff_file, $organism_id, $analysis_id, $add_only = 0, $update = 1, $refresh = 0, $remove = 0, $use_transaction = 1, $target_organism_id = NULL, $target_type = NULL, $create_target = 0, $start_line = 1, $landmark_type = '', $alt_id_attr = '', $create_organism = FALSE, $re_mrna = '', $re_protein = '', $job = NULL)`
1.x gff_loader.inc	`tripal_feature_load_gff3($gff_file, $organism_id, $analysis_id, $add_only = 0, $update = 0, $refresh = 0, $remove = 0, $use_transaction = 1, $target_organism_id = NULL, $target_type = NULL, $create_target = 0, $start_line = 1, $landmark_type = '', $alt_id_attr = '', $create_organism = FALSE, $job = NULL)`

Actually load a GFF3 file. This is the function called by tripal jobs

Parameters

$gff_file: The full path to the GFF file on the filesystem

$organism_id: The organism_id of the organism to which the features in the GFF belong

$analysis_id: The anlaysis_id of the analysis from which the features in the GFF were generated

$add_only: Set to 1 if feature should be added only. In the case where a feature already exists, it will not be updated. Default is 0

$update: Set to 1 to update existing features. New features will be added. Attributes for a feature that are not present in the GFF but which are present in the database will not be altered. Default is 1

$refresh: Set to 1 to update existing features. New features will be added. Attributes for a feature that are not present in the GFF but which are present in the database will be removed. Default is 0

$remove: Set to 1 to remove features present in the GFF file that exist in the database. Default is 0.

$use_transaction: Set to 1 to use a transaction when loading the GFF. Any failure during loading will result in the rollback of any changes. Default is 1.

$target_organism_id: If the GFF file contains a 'Target' attribute then the feature and the target will have an alignment created, but to find the proper target feature the target organism must also be known. If different from the organism specified for the GFF file, then use this argument to specify the target organism. Only use this argument if all target sequences belong to the same species. If the targets in the GFF file belong to multiple different species then the organism must be specified using the 'target_organism=genus:species' attribute in the GFF file. Default is NULL.

$target_type: If the GFF file contains a 'Target' attribute then the feature and the target will have an alignment created, but to find the proper target feature the target organism must also be known. This can be used to specify the target feature type to help with identification of the target feature. Only use this argument if all target sequences types are the same. If the targets are of different types then the type must be specified using the 'target_type=type' attribute in the GFF file. This must be a valid Sequence Ontology (SO) term. Default is NULL

$create_target: Set to 1 to create the target feature if it cannot be found in the database. Default is 0

$start_line: Set this to the line in the GFF file where importing should start. This is useful for testing and debugging GFF files that may have problems and you want to start at a particular line to speed testing. Default = 1

$landmark_type: Use this argument to specify a Sequence Ontology term name for the landmark sequences in the GFF fie (e.g. 'chromosome'), if the GFF file contains a '##sequence-region' line that describes the landmark sequences. Default = ''

$alt_id_attr: Sometimes lines in the GFF file are missing the required ID attribute that specifies the unique name of the feature. If so, you may specify the name of an existing attribute to use for the ID.

$create_organism: The Tripal GFF loader supports the "organism" attribute. This allows features of a different organism to be aligned to the landmark sequence of another species. The format of the attribute is "organism=[genus]:[species]", where [genus] is the organism's genus and [species] is the species name. Check this box to automatically add the organism to the database if it does not already exists. Otherwise lines with an oraganism attribute where the organism is not present in the database will be skipped.

$re_mrna A: regular expression to extract portions from mRNA id

$re_protein A: replacement string to generate the protein id

$job: The tripal job_id. Only used by the Tripal Jobs subsystem.

File

tripal_feature/includes/tripal_feature.gff_loader.inc, line 452: Provides gff3 loading functionality. Creates features based on their specification in a GFF3 file.

Code

function tripal_feature_load_gff3($gff_file, $organism_id, $analysis_id, 
$add_only = 0, $update = 1, $refresh = 0, $remove = 0, $use_transaction = 1, 
$target_organism_id = NULL, $target_type = NULL, $create_target = 0, 
$start_line = 1, $landmark_type = '', $alt_id_attr = '', $create_organism = FALSE, 
$re_mrna = '', $re_protein = '', $job = NULL) {

  $ret = array();
  $date = getdate();

  // An array that stores CVterms that have been looked up so we don't have
  // to do the database query every time.
  $cvterm_lookup = array();

  // An array that stores Landmarks that have been looked up so we don't have
  // to do the database query every time.
  $landmark_lookup = array();

  // empty the temp tables
  $sql = "DELETE FROM {tripal_gff_temp}";
  chado_query($sql);
  $sql = "DELETE FROM {tripal_gffcds_temp}";
  chado_query($sql);
  $sql = "DELETE FROM {tripal_gffprotein_temp}";
  chado_query($sql);

  // begin the transaction
  $transaction = null;
  if ($use_transaction) {
    $transaction = db_transaction();
    print "\nNOTE: Loading of this GFF file is performed using a database transaction. \n" .
      "If the load fails or is terminated prematurely then the entire set of \n" .
      "insertions/updates is rolled back and will not be found in the database\n\n";
  }
  try {

    // check to see if the file is located local to Drupal
    $dfile = $_SERVER['DOCUMENT_ROOT'] . base_path() . $gff_file;
    if (!file_exists($dfile)) {
      // if not local to Drupal, the file must be someplace else, just use
      // the full path provided
      $dfile = $gff_file;
    }
    if (!file_exists($dfile)) {
      tripal_report_error('tripal_feature', TRIPAL_ERROR, "Cannot find the file: %dfile", 
      array('%dfile' => $dfile));
      return 0;
    }

    print "Opening $gff_file\n";

    //$lines = file($dfile,FILE_SKIP_EMPTY_LINES);
    $fh = fopen($dfile, 'r');
    if (!$fh) {
      tripal_report_error('tripal_feature', TRIPAL_ERROR, "cannot open file: %dfile", 
      array('%dfile' => $dfile));
      return 0;
    }
    $filesize = filesize($dfile);

    // get the controlled vocaubulary that we'll be using.  The
    // default is the 'sequence' ontology
    $sql = "SELECT * FROM {cv} WHERE name = :cvname";
    $cv = chado_query($sql, array(':cvname' => 'sequence'))->fetchObject();
    if (!$cv) {
      tripal_report_error('tripal_feature', TRIPAL_ERROR, 
      "Cannot find the 'sequence' ontology", array());
      return '';
    }
    // get the organism for which this GFF3 file belongs
    $sql = "SELECT * FROM {organism} WHERE organism_id = :organism_id";
    $organism = chado_query($sql, array(':organism_id' => $organism_id))->fetchObject();

    $interval = intval($filesize * 0.0001);
    if ($interval == 0) {
      $interval = 1;
    }
    $in_fasta = 0;
    $line_num = 0;
    $num_read = 0;
    $intv_read = 0;

    // prepare the statement used to get the cvterm for each feature.
    $sel_cvterm_sql = "
      SELECT CVT.cvterm_id, CVT.cv_id, CVT.name, CVT.definition,
        CVT.dbxref_id, CVT.is_obsolete, CVT.is_relationshiptype
      FROM {cvterm} CVT
        INNER JOIN {cv} CV on CVT.cv_id = CV.cv_id
        LEFT JOIN {cvtermsynonym} CVTS on CVTS.cvterm_id = CVT.cvterm_id
      WHERE CV.cv_id = :cv_id and
       (lower(CVT.name) = lower(:name) or lower(CVTS.synonym) = lower(:synonym))
     ";

    // If a landmark type was provided then pre-retrieve that.
    if ($landmark_type) {
      $query = array(
        ':cv_id' => $cv->cv_id,
        ':name' => $landmark_type,
        ':synonym' => $landmark_type
      );
      $result = chado_query($sel_cvterm_sql, $query);
      $landmark_cvterm = $result->fetchObject();
      if (!$landmark_cvterm) {
        tripal_report_error('tripal_feature', TRIPAL_ERROR, 
        'cannot find landmark feature type \'%landmark_type\'.', 
        array('%landmark_type' => $landmark_type));
        return '';
      }
    }

    // iterate through each line of the GFF file
    print "Parsing Line $line_num (0.00%). Memory: " . number_format(memory_get_usage()) . " bytes\r";
    while ($line = fgets($fh)) {
      $line_num++;
      $size = drupal_strlen($line);
      $num_read += $size;
      $intv_read += $size;

      if ($line_num < $start_line) {
        continue;
      }

      // update the job status every 1% features
      if ($job and $intv_read >= $interval) {
        $intv_read = 0;
        $percent = sprintf("%.2f", ($num_read / $filesize) * 100);
        print "Parsing Line $line_num (" . $percent . "%). Memory: " . number_format(memory_get_usage()) . " bytes.\r";
        tripal_set_job_progress($job, intval(($num_read / $filesize) * 100));
      }

      // check to see if we have FASTA section, if so then set the variable
      // to start parsing
      if (preg_match('/^##FASTA/i', $line)) {
        print "Parsing FASTA portion...\n";
        if ($remove) {
          // we're done because this is a delete operation so break out of the loop.
          break;
        }
        tripal_feature_load_gff3_fasta($fh, $interval, $num_read, $intv_read, $line_num, $filesize, $job);
        continue;
      }
      // if the ##sequence-region line is present then we want to add a new feature
      if (preg_match('/^##sequence-region (.*?) (\d+) (\d+)$/i', $line, $region_matches)) {
        $rid = $region_matches[1];
        $rstart = $region_matches[2];
        $rend = $region_matches[3];
        if ($landmark_type) {
          tripal_feature_load_gff3_feature($organism, $analysis_id, $landmark_cvterm, $rid, 
          $rid, '', 'f', 'f', 1, 0);
        }
        continue;
      }

      // skip comments
      if (preg_match('/^#/', $line)) {
        continue;
      }

      // skip empty lines
      if (preg_match('/^\s*$/', $line)) {
        continue;
      }

      // get the columns
      $cols = explode("\t", $line);
      if (sizeof($cols) != 9) {
        tripal_report_error('tripal_feature', TRIPAL_ERROR, 'improper number of columns on line %line_num', 
        array('%line_num' => $line_num));
        return '';
      }

      // get the column values
      $landmark = $cols[0];
      $source = $cols[1];
      $type = $cols[2];
      $start = $cols[3];
      $end = $cols[4];
      $score = $cols[5];
      $strand = $cols[6];
      $phase = $cols[7];
      $attrs = explode(";", $cols[8]); // split by a semicolon

      // ready the start and stop for chado.  Chado expects these positions
      // to be zero-based, so we substract 1 from the fmin
      $fmin = $start - 1;
      $fmax = $end;
      if ($end < $start) {
        $fmin = $end - 1;
        $fmax = $start;
      }

      // format the strand for chado
      if (strcmp($strand, '.') == 0) {
        $strand = 0;
      }
      elseif (strcmp($strand, '+') == 0) {
        $strand = 1;
      }
      elseif (strcmp($strand, '-') == 0) {
        $strand = -1;
      }
      if (strcmp($phase, '.') == 0) {
        $phase = '';
      }
      if (array_key_exists($type, $cvterm_lookup)) {
        $cvterm = $cvterm_lookup[$type];
      }
      else {
        $result = chado_query($sel_cvterm_sql, array(':cv_id' => $cv->cv_id, ':name' => $type, ':synonym' => $type));
        $cvterm = $result->fetchObject();
        $cvterm_lookup[$type] = $cvterm;
        if (!$cvterm) {
          tripal_report_error('tripal_feature', TRIPAL_ERROR, 'cannot find feature term \'%type\' on line %line_num of the GFF file', 
          array('%type' => $type, '%line_num' => $line_num));
          return '';
        }
      }

      // break apart each of the attributes
      $tags = array();
      $attr_name = '';
      $attr_uniquename = '';
      $attr_residue_info = '';
      $attr_locgroup = 0;
      $attr_fmin_partial = 'f';
      $attr_fmax_partial = 'f';
      $attr_is_obsolete = 'f';
      $attr_is_analysis = 'f';
      $attr_others =[];
      $residues = '';

      // the organism to which a feature belongs can be set in the GFF
      // file using the 'organism' attribute.  By default we
      // set the $feature_organism variable to the default organism for the landmark
      $attr_organism = '';
      $feature_organism = $organism;

      foreach ($attrs as $attr) {
        $attr = rtrim($attr);
        $attr = ltrim($attr);
        if (strcmp($attr, '') == 0) {
          continue;
        }
        if (!preg_match('/^[^\=]+\=.+$/', $attr)) {
          tripal_report_error('tripal_feature', TRIPAL_ERROR, 'Attribute is not correctly formatted on line %line_num: %attr', 
          array('%line_num' => $line_num, '%attr' => $attr));
          return '';
        }

        // break apart each tag
        $tag = preg_split("/=/", $attr, 2); // split by equals sign

        // multiple instances of an attribute are separated by commas
        $tag_name = $tag[0];
        if (!array_key_exists($tag_name, $tags)) {
          $tags[$tag_name] = array();
        }
        $tags[$tag_name] = array_merge($tags[$tag_name], explode(",", $tag[1])); // split by comma


        // replace the URL escape codes for each tag
        for ($i = 0; $i < count($tags[$tag_name]); $i++) {
          $tags[$tag_name][$i] = urldecode($tags[$tag_name][$i]);
        }

        // get the name and ID tags
        $skip_feature = 0; // if there is a problem with any of the attributes this variable gets set
        if (strcmp($tag_name, 'ID') == 0) {
          $attr_uniquename = urldecode($tag[1]);
        }
        elseif (strcmp($tag_name, 'Name') == 0) {
          $attr_name = urldecode($tag[1]);
        }
        elseif (strcmp($tag_name, 'organism') == 0) {
          $attr_organism = urldecode($tag[1]);
          $org_matches = array();
          if (preg_match('/^(.*?):(.*?)$/', $attr_organism, $org_matches)) {
            $values = array(
              'genus' => $org_matches[1],
              'species' => $org_matches[2],
            );
            $org = chado_select_record('organism', array("*"), $values);
            if (count($org) == 0) {
              if ($create_organism) {
                $feature_organism = (object) chado_insert_record('organism', $values);
                if (!$feature_organism) {
                  tripal_report_error('tripal_feature', TRIPAL_ERROR, "Could not add the organism, '%org', from line %line. Skipping this line. ", 
                  array('%org' => $attr_organism, '%line' => $line_num));
                  $skip_feature = 1;
                }
              }
              else {
                tripal_report_error('tripal_feature', TRIPAL_ERROR, "The organism attribute '%org' on line %line does not exist. Skipping this line. ", 
                array('%org' => $attr_organism, '%line' => $line_num));
                $skip_feature = 1;
              }
            }
            else {
              // We found the organism in the database so use it.
              $feature_organism = $org[0];
            }
          }
          else {
            tripal_report_error('tripal_feature', TRIPAL_ERROR, "The organism attribute '%org' on line %line is not properly formated. It " .
              "should be of the form: organism=Genus:species.  Skipping this line.", 
            array('%org' => $attr_organism, '%line' => $line_num));
            $skip_feature = 1;
          }
        }
        // Get the list of non-reserved attributes.
        elseif (strcmp($tag_name, 'Alias') != 0 and strcmp($tag_name, 'Parent') != 0 and 
          strcmp($tag_name, 'Target') != 0 and strcmp($tag_name, 'Gap') != 0 and 
          strcmp($tag_name, 'Derives_from') != 0 and strcmp($tag_name, 'Note') != 0 and 
          strcmp($tag_name, 'Dbxref') != 0 and strcmp($tag_name, 'Ontology_term') != 0 and 
          strcmp($tag_name, 'Is_circular') != 0 and strcmp($tag_name, 'target_organism') != 0 and 
          strcmp($tag_name, 'target_type') != 0 and strcmp($tag_name, 'organism' != 0)) {
          foreach ($tags[$tag_name] as $value) {
            $attr_others[$tag_name][] = $value;
          }
        }
      }
      // If neither name nor uniquename are provided then generate one.
      if (!$attr_uniquename and !$attr_name) {
        // Check if an alternate ID field is suggested, if so, then use
        // that for the name.
        if (array_key_exists($alt_id_attr, $tags)) {
          $attr_uniquename = $tags[$alt_id_attr][0];
          $attr_name = $attr_uniquename;
        }
        // If the row has a parent then generate a uniquename using the parent name
        // add the date to the name in the event there are more than one child with
        // the same parent.
        elseif (array_key_exists('Parent', $tags)) {
          $attr_uniquename = $tags['Parent'][0] . "-$type-$landmark-" . $date[0] . ":" . ($fmin + 1) . ".." . $fmax;
          $attr_name = $attr_uniquename;
        }
        // Generate a unique name based on the date, type and location
        // and set the name to simply be the type.
        else {
          $attr_uniquename = $date[0] . "-$type-$landmark:" . ($fmin + 1) . ".." . $fmax;
          $attr_name = $type;
        }
      }

      // If a name is not specified then use the unique name as the name
      if (strcmp($attr_name, '') == 0) {
        $attr_name = $attr_uniquename;
      }

      // If an ID attribute is not specified then we must generate a
      // unique ID. Do this by combining the attribute name with the date
      // and line number.
      if (!$attr_uniquename) {
        $attr_uniquename = $attr_name . '-' . $date[0] . '-' . $line_num;
      }

      // Make sure the landmark sequence exists in the database.  If the user
      // has not specified a landmark type (and it's not required in the GFF
      // format) then we don't know the type of the landmark so we'll hope
      // that it's unique across all types for the organism. Only do this
      // test if the landmark and the feature are different.
      if (!$remove and !(strcmp($landmark, $attr_uniquename) == 0 or strcmp($landmark, $attr_name) == 0) and !in_array($landmark, $landmark_lookup)) {

        $select = array(
          'organism_id' => $organism->organism_id,
          'uniquename' => $landmark,
        );
        $columns = array('count(*) as num_landmarks');
        if ($landmark_type) {
          $select['type_id'] = array(
            'name' => $landmark_type,
          );
        }
        $count = chado_select_record('feature', $columns, $select);
        if (!$count or count($count) == 0 or $count[0]->num_landmarks == 0) {
          // now look for the landmark using the name rather than uniquename.
          $select = array(
            'organism_id' => $organism->organism_id,
            'name' => $landmark,
          );
          $columns = array('count(*) as num_landmarks');
          if ($landmark_type) {
            $select['type_id'] = array(
              'name' => $landmark_type,
            );
          }
          $count = chado_select_record('feature', $columns, $select);
          if (!$count or count($count) == 0 or $count[0]->num_landmarks == 0) {
            tripal_report_error('tripal_feature', TRIPAL_ERROR, "The landmark '%landmark' cannot be found for this organism (%species) " .
              "Please add the landmark and then retry the import of this GFF3 " .
              "file", array('%landmark' => $landmark, '%species' => $organism->genus . " " . $organism->species));
            return '';
          }
          elseif ($count[0]->num_landmarks > 1) {
            tripal_report_error('tripal_feature', TRIPAL_ERROR, "The landmark '%landmark' has more than one entry for this organism (%species) " .
              "Cannot continue", array('%landmark' => $landmark, '%species' => $organism->genus . " " . $organism->species));
            return '';
          }

        }
        if ($count[0]->num_landmarks > 1) {
          tripal_report_error('tripal_feature', TRIPAL_ERROR, "The landmark '%landmark' is not unique for this organism. " .
            "The features cannot be associated", array('%landmark' => $landmark));
          return '';
        }

        // The landmark was found, remember it
        $landmark_lookup[] = $landmark;
      }
      /*
      // If the option is to remove or refresh then we want to remove
      // the feature from the database.
      if ($remove or $refresh) {
        // Next remove the feature itself.
        $sql = "DELETE FROM {feature}
                WHERE organism_id = %d and uniquename = '%s' and type_id = %d";
        $match = array(
          'organism_id' => $feature_organism->organism_id,
          'uniquename'  => $attr_uniquename,
          'type_id'     => $cvterm->cvterm_id
        );
        $result = chado_delete_record('feature', $match);
        if (!$result) {
          tripal_report_error('tripal_feature', TRIPAL_ERROR, "cannot delete feature %attr_uniquename",
            array('%attr_uniquename' => $attr_uniquename));
        }
        $feature = 0;
        unset($result);
      }
 */
      // Add or update the feature and all properties.
      if ($update or $refresh or $add_only) {

        // Add/update the feature.
        $feature = tripal_feature_load_gff3_feature($feature_organism, $analysis_id, $cvterm, 
        $attr_uniquename, $attr_name, $residues, $attr_is_analysis, 
        $attr_is_obsolete, $add_only, $score);

        if ($feature) {

          // Add a record for this feature to the tripal_gff_temp table for
          // later lookup.
          $values = array(
            'feature_id' => $feature->feature_id,
            'organism_id' => $feature->organism_id,
            'type_name' => $type,
            'uniquename' => $feature->uniquename
          );
          // make sure this record doesn't already exist in our temp table
          $results = chado_select_record('tripal_gff_temp', array('*'), $values);

          if (count($results) == 0) {
            $result = chado_insert_record('tripal_gff_temp', $values);
            if (!$result) {
              tripal_report_error('tripal_feature', TRIPAL_ERROR, "Cound not save record in temporary table, Cannot continue.", array());
              exit;
            }
          }
          // add/update the featureloc if the landmark and the ID are not the same
          // if they are the same then this entry in the GFF is probably a landmark identifier
          if (strcmp($landmark, $attr_uniquename) != 0) {
            tripal_feature_load_gff3_featureloc($feature, $organism, 
            $landmark, $fmin, $fmax, $strand, $phase, $attr_fmin_partial, 
            $attr_fmax_partial, $attr_residue_info, $attr_locgroup);
          }

          // add any aliases for this feature
          if (array_key_exists('Alias', $tags)) {
            tripal_feature_load_gff3_alias($feature, $tags['Alias']);
          }
          // add any dbxrefs for this feature
          if (array_key_exists('Dbxref', $tags)) {
            tripal_feature_load_gff3_dbxref($feature, $tags['Dbxref']);
          }
          // add any ontology terms for this feature
          if (array_key_exists('Ontology_term', $tags)) {
            tripal_feature_load_gff3_ontology($feature, $tags['Ontology_term']);
          }
          // add parent relationships
          if (array_key_exists('Parent', $tags)) {
            tripal_feature_load_gff3_parents($feature, $cvterm, $tags['Parent'], 
            $feature_organism->organism_id, $strand, $phase, $fmin, $fmax);
          }

          // add target relationships
          if (array_key_exists('Target', $tags)) {
            tripal_feature_load_gff3_target($feature, $tags, $target_organism_id, $target_type, $create_target, $attr_locgroup);
          }
          // add gap information.  This goes in simply as a property
          if (array_key_exists('Gap', $tags)) {
            foreach ($tags['Gap'] as $value) {
              tripal_feature_load_gff3_property($feature, 'Gap', $value);
            }
          }
          // add notes. This goes in simply as a property
          if (array_key_exists('Note', $tags)) {
            foreach ($tags['Note'] as $value) {
              tripal_feature_load_gff3_property($feature, 'Note', $value);
            }
          }
          // add the Derives_from relationship (e.g. polycistronic genes).
          if (array_key_exists('Derives_from', $tags)) {
            tripal_feature_load_gff3_derives_from($feature, $cvterm, $tags['Derives_from'][0], 
            $feature_organism, $fmin, $fmax);
          }
          // add in the GFF3_source dbxref so that GBrowse can find the feature using the source column
          $source_ref = array('GFF_source:' . $source);
          tripal_feature_load_gff3_dbxref($feature, $source_ref);
          // add any additional attributes
          if ($attr_others) {
            foreach ($attr_others as $tag_name => $values) {
              foreach ($values as $value) {
                tripal_feature_load_gff3_property($feature, $tag_name, $value);
              }
            }
          }

        }
      }
    }

    // Do some last bit of processing.
    if (!$remove) {

      // First, add any protein sequences if needed.
      $sql = "SELECT feature_id FROM {tripal_gffcds_temp} LIMIT 1 OFFSET 1";
      $has_cds = chado_query($sql)->fetchField();
      if ($has_cds) {
        print "\nAdding protein sequences if CDS exist and no proteins in GFF...\n";
        $sql = "
          SELECT F.feature_id, F.name, F.uniquename, TGCT.strand,
            CVT.cvterm_id, CVT.name as feature_type,
            min(TGCT.fmin) as fmin, max(TGCT.fmax) as fmax,
            TGPT.feature_id as protein_id, TGPT.fmin as protein_fmin,
            TGPT.fmax as protein_fmax, FLM.uniquename as landmark
          FROM {tripal_gffcds_temp} TGCT
            INNER JOIN {feature} F on F.feature_id = TGCT.parent_id
            INNER JOIN {cvterm} CVT on CVT.cvterm_id = F.type_id
            INNER JOIN {featureloc} L on F.feature_id = L.feature_id
            INNER JOIN {feature} FLM on L.srcfeature_id = FLM.feature_id
            LEFT JOIN {tripal_gffprotein_temp} TGPT on TGPT.parent_id = F.feature_id
          GROUP BY F.feature_id, F.name, F.uniquename, CVT.cvterm_id, CVT.name,
            TGPT.feature_id, TGPT.fmin, TGPT.fmax, TGCT.strand, FLM.uniquename
        ";
        $results = chado_query($sql);
        $protein_cvterm = tripal_get_cvterm(array(
          'name' => 'polypeptide',
          'cv_id' => array(
            'name' => 'sequence'
          )
        ));
        while ($result = $results->fetchObject()) {
          // If a protein exists with this same parent then don't add a new
          // protein.
          if (!$result->protein_id) {
            // Get details about this protein
            if ($re_mrna and $re_protein) {
              // We use a regex to generate protein name from mRNA name
              $uname = preg_replace("/$re_mrna/", $re_protein, $result->uniquename);
              $name = $result->name;
            }
            else {
              // No regex, use the default '-protein' suffix
              $uname = $result->uniquename . '-protein';
              $name = $result->name;
            }
            $values = array(
              'parent_id' => $result->feature_id,
              'fmin' => $result->fmin
            );
            $min_phase = chado_select_record('tripal_gffcds_temp', array('phase'), $values);
            $values = array(
              'parent_id' => $result->feature_id,
              'fmax' => $result->fmax
            );
            $max_phase = chado_select_record('tripal_gffcds_temp', array('phase'), $values);

            $pfmin = $result->fmin;
            $pfmax = $result->fmax;
            if ($result->strand == '-1') {
              $pfmax -= $max_phase[0]->phase;
            }
            else {
              $pfmin += $min_phase[0]->phase;
            }

            // Add the new protein record.
            $feature = tripal_feature_load_gff3_feature($organism, $analysis_id, 
            $protein_cvterm, $uname, $name, '', 'f', 'f', 1, 0);
            // Add the derives_from relationship.
            $cvterm = tripal_get_cvterm(array('cvterm_id' => $result->cvterm_id));
            tripal_feature_load_gff3_derives_from($feature, $cvterm, 
            $result->uniquename, $organism, $pfmin, $pfmax);
            // Add the featureloc record. Set the start of the protein to
            // be the start of the coding sequence minus the phase.
            tripal_feature_load_gff3_featureloc($feature, $organism, $result->landmark, 
            $pfmin, $pfmax, $result->strand, '', 'f', 'f', '', 0);
          }
        }
      }

      print "\nSetting ranks of children...\n";

      // Get features in a relationship that are also children of an alignment.
      $sql = "
        SELECT DISTINCT F.feature_id, F.organism_id, F.type_id,
          F.uniquename, FL.strand
        FROM {tripal_gff_temp} TGT
          INNER JOIN {feature} F                ON TGT.feature_id = F.feature_id
          INNER JOIN {feature_relationship} FR  ON FR.object_id   = TGT.feature_id
          INNER JOIN {cvterm} CVT               ON CVT.cvterm_id  = FR.type_id
          INNER JOIN {featureloc} FL            ON FL.feature_id  = F.feature_id
        WHERE CVT.name = 'part_of'
      ";
      $parents = chado_query($sql);

      // Build and prepare the SQL for selecting the children relationship.
      $sel_gffchildren_sql = "
        SELECT DISTINCT FR.feature_relationship_id, FL.fmin, FR.rank
        FROM {feature_relationship} FR
          INNER JOIN {featureloc} FL on FL.feature_id = FR.subject_id
          INNER JOIN {cvterm} CVT on CVT.cvterm_id = FR.type_id
        WHERE FR.object_id = :feature_id AND CVT.name = 'part_of'
        ORDER BY FL.fmin ASC
      ";

      // Now set the rank of any parent/child relationships.  The order is based
      // on the fmin.  The start rank is 1.  This allows features with other
      // relationships to be '0' (the default), and doesn't interfer with the
      // ordering defined here.
      $num_recs = $parents->rowCount();
      $i = 1;
      $interval = intval($num_recs * 0.0001);
      if ($interval == 0) {
        $interval = 1;
      }
      $percent = sprintf("%.2f", ($i / $num_recs) * 100);
      print "Setting $i of $num_recs (" . $percent . "%). Memory: " . number_format(memory_get_usage()) . " bytes.\r";

      while ($parent = $parents->fetchObject()) {

        if ($i % $interval == 0) {
          $percent = sprintf("%.2f", ($i / $num_recs) * 100);
          print "Setting $i of $num_recs (" . $percent . "%). Memory: " . number_format(memory_get_usage()) . " bytes.\r";
        }

        // get the children
        $result = chado_query($sel_gffchildren_sql, array(':feature_id' => $parent->feature_id));

        // build an array of the children
        $children = array();
        while ($child = $result->fetchObject()) {
          $children[] = $child;
        }

        // the children list comes sorted in ascending fmin
        // but if the parent is on the reverse strand we need to
        // reverse the order of the children.
        if ($parent->strand == -1) {
          arsort($children);
        }

        // first set the ranks to a negative number so that we don't
        // get a duplicate error message when we try to change any of them
        $rank = -1;
        foreach ($children as $child) {
          $match = array('feature_relationship_id' => $child->feature_relationship_id);
          $values = array('rank' => $rank);
          chado_update_record('feature_relationship', $match, $values);
          $rank--;
        }
        // now set the rank correctly. The rank should start at 0.
        $rank = 0;
        foreach ($children as $child) {
          $match = array('feature_relationship_id' => $child->feature_relationship_id);
          $values = array('rank' => $rank);
          //print "Was: " . $child->rank . " now $rank ($parent->strand)\n"     ;
          chado_update_record('feature_relationship', $match, $values);
          $rank++;
        }
        $i++;
      }
    }
  }
  catch (Exception $e) {
    print "\n"; // make sure we start errors on new line
    if ($use_transaction) {
      $transaction->rollback();
      print "FAILED: Rolling back database changes...\n";
    }
    else {
      print "FAILED\n";
    }
    watchdog_exception('tripal_feature', $e);
    return 0;
  }

  print "\nDone\n";
  return 1;
}

Tripal Search

function tripal_feature_load_gff3

Parameters

Related topics

File

Code